[Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests

* [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
@ 2014-11-05  7:12 Aravinda Prasad
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 1/4] target-ppc: Extend rtas-blob Aravinda Prasad
                   ` (4 more replies)
  0 siblings, 5 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05  7:12 UTC (permalink / raw)
  To: aik, qemu-ppc, qemu-devel; +Cc: benh, paulus

This patch series adds FWNMI support to QEMU for powerKVM guests.

Currently, upon a machine check exception, if the address in
error belongs to the guest, KVM invokes the guest's NMI interrupt
vector 0x200.

This patch series patches the guest's 0x200 interrupt vector
so that QEMU gets control. QEMU then builds an error log and
reports the error to the OS-registered machine check handlers
through RTAS space.
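
As a rough illustration of the record QEMU places in RTAS space, the sketch below mirrors the `rtas_mc_log` layout from patch 3/4, assuming a 64-bit guest so that `target_ulong` is 8 bytes. The names `mc_log_sketch`, `SKETCH_HANDLER_ARG_OFFSET` and `sketch_handler_arg` are hypothetical. The offsets are chosen to match the loads in the patch 4/4 trampoline (`ld 4,0(3)`, `ld 4,8(3)`, `ld 4,16(3)`, `addi 3,3,24`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTAS_ERRLOG_OFFSET 0x800 /* from patch 1/4 */

/* Hypothetical mirror of struct rtas_mc_log, assuming 64-bit target_ulong. */
struct mc_log_sketch {
    uint64_t srr0;       /* saved SRR0 (NIP at the time of the exception) */
    uint64_t srr1;       /* saved SRR1 (MSR at the time of the exception) */
    uint64_t crf;        /* saved condition register bits */
    uint64_t r3;         /* original r3, start of what the OS handler sees */
    uint8_t  err_log[8]; /* fixed 8-byte header of the RTAS error log */
};

/* Number of bytes the trampoline skips with "addi 3,3,24". */
enum { SKETCH_HANDLER_ARG_OFFSET = 24 };

static_assert(offsetof(struct mc_log_sketch, r3) == SKETCH_HANDLER_ARG_OFFSET,
              "trampoline must skip srr0, srr1 and crf");

/* Address the OS-registered handler would receive in r3. */
static uint64_t sketch_handler_arg(uint64_t rtas_addr)
{
    return rtas_addr + RTAS_ERRLOG_OFFSET + SKETCH_HANDLER_ARG_OFFSET;
}
```

With a hypothetical rtas-base of 0x2000000, `sketch_handler_arg()` yields 0x2000818, that is, the base plus 0x800 plus the 24 bytes of saved state.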

Apart from this, the patch series also takes care of synchronization
when multiple processors encounter a machine check at or about the
same time.

The patch set was tested by simulating a machine check error in
the guest.

Changes in v3:
    - Incorporated review comments
    - Byte codes in patch 4/4 are now moved to
      pc-bios/spapr-rtas/spapr-rtas.S as instructions.
    - Defined the RTAS blob in-memory layout.
    - FIX: save and restore cr register in the trampoline

Changes in v2:
    - Re-based to github.com/agraf/qemu.git  branch: ppc-next
    - Merged patches 4 and 5.
    - Incorporated other review comments

---

Aravinda Prasad (4):
      target-ppc: Extend rtas-blob
      target-ppc: Register and handle HCALL to receive updated RTAS region
      target-ppc: Build error log
      target-ppc: Handle ibm,nmi-register RTAS call

 hw/ppc/spapr.c                  |    7 +
 hw/ppc/spapr_hcall.c            |  183 +++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_rtas.c             |   93 ++++++++++++++++++++
 include/hw/ppc/spapr.h          |   27 +++++-
 pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++
 5 files changed, 346 insertions(+), 2 deletions(-)

-- 
Aravinda Prasad


* [Qemu-devel] [PATCH v3 1/4] target-ppc: Extend rtas-blob
  2014-11-05  7:12 [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests Aravinda Prasad
@ 2014-11-05  7:12 ` Aravinda Prasad
  2014-11-05  8:11   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 2/4] target-ppc: Register and handle HCALL to receive updated RTAS region Aravinda Prasad
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05  7:12 UTC (permalink / raw)
  To: aik, qemu-ppc, qemu-devel; +Cc: benh, paulus

Extend rtas-blob to accommodate the error log. The error log
structure is saved in RTAS space upon a machine check
exception.
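
The resize amounts to rounding the image size up to a page boundary. The following is a minimal sketch of that arithmetic, assuming a 4 KiB target page (the real `TARGET_PAGE_ALIGN` depends on the target configuration). `SKETCH_PAGE_SIZE`, `sketch_page_align` and `sketch_errlog_fits` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096u /* assumption: 4 KiB target page */
#define RTAS_ERRLOG_OFFSET 0x800 /* from this patch */

/* Round size up to the next multiple of the (assumed) page size. */
static size_t sketch_page_align(size_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) & ~(size_t)(SKETCH_PAGE_SIZE - 1);
}

/* Nonzero when a log of log_size bytes placed at RTAS_ERRLOG_OFFSET
 * still fits inside the page-aligned blob. */
static int sketch_errlog_fits(size_t image_size, size_t log_size)
{
    return RTAS_ERRLOG_OFFSET + log_size <= sketch_page_align(image_size);
}
```

Under these assumptions a 20-byte image grows to 4096 bytes, which leaves 0x800 bytes after `RTAS_ERRLOG_OFFSET` for the log.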

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
---
 hw/ppc/spapr.c         |    7 +++++++
 include/hw/ppc/spapr.h |    5 +++++
 2 files changed, 12 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 30de25d..38e26af 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1431,6 +1431,13 @@ static void ppc_spapr_init(MachineState *machine)
 
     filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
     spapr->rtas_size = get_image_size(filename);
+
+    /*
+     * Resize blob to accommodate error log. The layout of the rtas
+     * blob is defined in include/hw/ppc/spapr.h
+     */
+    spapr->rtas_size = TARGET_PAGE_ALIGN(spapr->rtas_size);
+
     spapr->rtas_blob = g_malloc(spapr->rtas_size);
     if (load_image_size(filename, spapr->rtas_blob, spapr->rtas_size) < 0) {
         hw_error("qemu: could not load LPAR rtas '%s'\n", filename);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 749daf4..d08fcc2 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -480,4 +480,9 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       sPAPRTCETable *tcet);
 
+/* RTAS Blob layout in memory */
+#define RTAS_ENTRY_OFFSET        0
+#define RTAS_TRAMPOLINE_OFFSET   0x200
+#define RTAS_ERRLOG_OFFSET       0x800
+
 #endif /* !defined (__HW_SPAPR_H__) */


* [Qemu-devel] [PATCH v3 2/4] target-ppc: Register and handle HCALL to receive updated RTAS region
  2014-11-05  7:12 [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests Aravinda Prasad
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 1/4] target-ppc: Extend rtas-blob Aravinda Prasad
@ 2014-11-05  7:12 ` Aravinda Prasad
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 3/4] target-ppc: Build error log Aravinda Prasad
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05  7:12 UTC (permalink / raw)
  To: aik, qemu-ppc, qemu-devel; +Cc: benh, paulus

Receive updates from SLOF about the updated rtas-base.
A separate patch for SLOF [1] adds functionality to invoke
a private HCALL whenever the OS issues instantiate-rtas with
a new rtas-base.

This is required because QEMU needs to know the updated
rtas-base: it allocates the error reporting structure in RTAS
space upon a machine check exception.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-August/120386.html
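
For readers unfamiliar with private hcalls, the registration and dispatch pattern can be sketched as a small table indexed by opcode. This is an illustration only, not QEMU's implementation: the `sketch_*` names are hypothetical, and the base value 0xf000 and the return codes 0 (success) and -2 (unsupported function) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HCALL_BASE    0xf000u /* assumed value of KVMPPC_HCALL_BASE */
#define SKETCH_H_RTAS_UPDATE (SKETCH_HCALL_BASE + 0x3)
#define SKETCH_HCALL_MAX     SKETCH_H_RTAS_UPDATE
#define SKETCH_H_SUCCESS     0
#define SKETCH_H_FUNCTION    (-2)    /* assumed "unsupported hcall" code */

struct sketch_env {
    uint64_t rtas_addr; /* last rtas-base reported by the firmware */
};

typedef int64_t (*sketch_hcall_fn)(struct sketch_env *env,
                                   const uint64_t *args);

static sketch_hcall_fn sketch_table[SKETCH_HCALL_MAX - SKETCH_HCALL_BASE + 1];

/* Install a handler for one private hcall number. */
static void sketch_register(uint32_t opcode, sketch_hcall_fn fn)
{
    assert(opcode >= SKETCH_HCALL_BASE && opcode <= SKETCH_HCALL_MAX);
    sketch_table[opcode - SKETCH_HCALL_BASE] = fn;
}

/* Handler: remember the new rtas-base passed as the first argument. */
static int64_t sketch_rtas_update(struct sketch_env *env, const uint64_t *args)
{
    env->rtas_addr = args[0];
    return SKETCH_H_SUCCESS;
}

/* Look up and run the handler, or report an unsupported hcall. */
static int64_t sketch_dispatch(struct sketch_env *env, uint32_t opcode,
                               const uint64_t *args)
{
    if (opcode < SKETCH_HCALL_BASE || opcode > SKETCH_HCALL_MAX ||
        sketch_table[opcode - SKETCH_HCALL_BASE] == NULL) {
        return SKETCH_H_FUNCTION;
    }
    return sketch_table[opcode - SKETCH_HCALL_BASE](env, args);
}
```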

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
---
 hw/ppc/spapr_hcall.c   |    8 ++++++++
 include/hw/ppc/spapr.h |    3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 8651447..01650ba 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -579,6 +579,13 @@ static target_ulong h_rtas(PowerPCCPU *cpu, sPAPREnvironment *spapr,
                            nret, rtas_r3 + 12 + 4*nargs);
 }
 
+static target_ulong h_rtas_update(PowerPCCPU *cpu, sPAPREnvironment *spapr,
+                                  target_ulong opcode, target_ulong *args)
+{
+    spapr->rtas_addr = args[0];
+    return 0;
+}
+
 static target_ulong h_logical_load(PowerPCCPU *cpu, sPAPREnvironment *spapr,
                                    target_ulong opcode, target_ulong *args)
 {
@@ -1003,6 +1010,7 @@ static void hypercall_register_types(void)
 
     /* qemu/KVM-PPC specific hcalls */
     spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
+    spapr_register_hypercall(KVMPPC_H_RTAS_UPDATE, h_rtas_update);
 
     spapr_register_hypercall(H_SET_MODE, h_set_mode);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d08fcc2..ccf67ba 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -308,7 +308,8 @@ typedef struct sPAPREnvironment {
 #define KVMPPC_H_LOGICAL_MEMOP  (KVMPPC_HCALL_BASE + 0x1)
 /* Client Architecture support */
 #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
-#define KVMPPC_HCALL_MAX        KVMPPC_H_CAS
+#define KVMPPC_H_RTAS_UPDATE    (KVMPPC_HCALL_BASE + 0x3)
+#define KVMPPC_HCALL_MAX        KVMPPC_H_RTAS_UPDATE
 
 extern sPAPREnvironment *spapr;
 


* [Qemu-devel] [PATCH v3 3/4] target-ppc: Build error log
  2014-11-05  7:12 [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests Aravinda Prasad
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 1/4] target-ppc: Extend rtas-blob Aravinda Prasad
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 2/4] target-ppc: Register and handle HCALL to receive updated RTAS region Aravinda Prasad
@ 2014-11-05  7:12 ` Aravinda Prasad
  2014-11-05  7:13 ` [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call Aravinda Prasad
  2014-11-11  3:24 ` [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests David Gibson
  4 siblings, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05  7:12 UTC (permalink / raw)
  To: aik, qemu-ppc, qemu-devel; +Cc: benh, paulus

Whenever there is a physical memory error due to bit
flips that cannot be corrected by hardware, the error
is passed on to the kernel. If the memory address in
error belongs to the guest address space, then the guest
kernel is responsible for taking action. Hence the error is
passed on to the guest via KVM by invoking the 0x200 NMI vector.

However, the guest OS, as per PAPR, expects an error log
upon such an error. This patch registers a new hcall,
which is issued from the 0x200 interrupt vector. The hcall
builds the error log, copies it to RTAS space and
passes the address of the error log to the guest.
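
The fixed header of the error log is packed from a handful of small fields. The sketch below works through that packing with the constants used in this patch. The struct and function names (`sketch_elog_hdr`, `sketch_build_hdr`) are hypothetical, and the DSISR value is passed in as a plain integer rather than read from CPU state.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined in this patch. */
#define RTAS_ELOG_SEVERITY_SHIFT     0x5
#define RTAS_ELOG_DISPOSITION_SHIFT  0x3
#define RTAS_ELOG_INITIATOR_SHIFT    0x4
#define RTAS_SEVERITY_ERROR_SYNC     0x3
#define RTAS_DISP_NOT_RECOVERED      0x2
#define RTAS_INITIATOR_MEMORY        0x4
#define RTAS_TARGET_MEMORY           0x4
#define RTAS_TYPE_ECC_UNCORR         0x09
#define PPC_BIT(bit)                 (0x8000000000000000ULL >> (bit))
#define P7_DSISR_MC_UE               (PPC_BIT(48))

/* Hypothetical view of the first four header bytes. */
struct sketch_elog_hdr {
    uint8_t byte0; /* architectural version */
    uint8_t byte1; /* severity (3 bits) | disposition (2 bits) | ... */
    uint8_t byte2; /* initiator (4 bits) | target (4 bits) */
    uint8_t byte3; /* event type */
};

static struct sketch_elog_hdr sketch_build_hdr(uint64_t dsisr)
{
    struct sketch_elog_hdr h;

    h.byte0 = 0x00;
    h.byte1 = (uint8_t)((RTAS_SEVERITY_ERROR_SYNC << RTAS_ELOG_SEVERITY_SHIFT) |
                        (RTAS_DISP_NOT_RECOVERED << RTAS_ELOG_DISPOSITION_SHIFT));
    h.byte2 = (uint8_t)((RTAS_INITIATOR_MEMORY << RTAS_ELOG_INITIATOR_SHIFT) |
                        RTAS_TARGET_MEMORY);
    /* Only an uncorrected memory error gets a specific event type. */
    h.byte3 = (dsisr & P7_DSISR_MC_UE) ? RTAS_TYPE_ECC_UNCORR : 0x00;
    return h;
}
```

With these constants, `byte1` works out to 0x70 (0x3 << 5 is 0x60, 0x2 << 3 is 0x10) and `byte2` to 0x44. `P7_DSISR_MC_UE` is bit 48 in IBM bit numbering, which is 0x8000 as a plain mask.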

The KVM enhancement that performs the above functionality
is already in the upstream kernel.

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
---
 hw/ppc/spapr_hcall.c   |  159 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |    4 +
 2 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 01650ba..8f16160 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -14,6 +14,89 @@ struct SPRSyncState {
     target_ulong mask;
 };
 
+#define RTAS_ELOG_SEVERITY_SHIFT         0x5
+#define RTAS_ELOG_DISPOSITION_SHIFT      0x3
+#define RTAS_ELOG_INITIATOR_SHIFT        0x4
+
+/*
+ * Only required RTAS event severity, disposition, initiator
+ * target and type are copied from arch/powerpc/include/asm/rtas.h
+ */
+
+/* RTAS event severity */
+#define RTAS_SEVERITY_ERROR_SYNC    0x3
+
+/* RTAS event disposition */
+#define RTAS_DISP_NOT_RECOVERED     0x2
+
+/* RTAS event initiator */
+#define RTAS_INITIATOR_MEMORY       0x4
+
+/* RTAS event target */
+#define RTAS_TARGET_MEMORY          0x4
+
+/* RTAS event type */
+#define RTAS_TYPE_ECC_UNCORR        0x09
+
+/*
+ * Currently KVM only passes on the uncorrected machine
+ * check memory error to guest. Other machine check errors
+ * such as SLB multi-hit and TLB multi-hit are recovered
+ * in KVM and are not passed on to guest.
+ *
+ * DSISR Bit for uncorrected machine check error. Based
+ * on arch/powerpc/include/asm/mce.h
+ */
+#define PPC_BIT(bit)                (0x8000000000000000ULL >> (bit))
+#define P7_DSISR_MC_UE              (PPC_BIT(48))  /* P8 too */
+
+/* Adopted from kernel source arch/powerpc/include/asm/rtas.h */
+struct rtas_error_log {
+    /* Byte 0 */
+    uint8_t     byte0;          /* Architectural version */
+
+    /* Byte 1 */
+    uint8_t     byte1;
+    /* XXXXXXXX
+     * XXX      3: Severity level of error
+     *    XX    2: Degree of recovery
+     *      X   1: Extended log present?
+     *       XX 2: Reserved
+     */
+
+    /* Byte 2 */
+    uint8_t     byte2;
+    /* XXXXXXXX
+     * XXXX     4: Initiator of event
+     *     XXXX 4: Target of failed operation
+     */
+    uint8_t     byte3;          /* General event or error*/
+    __be32      extended_log_length;    /* length in bytes */
+    unsigned char   buffer[1];      /* Start of extended log */
+                                /* Variable length.      */
+};
+
+/*
+ * Data format in RTAS-Blob
+ *
+ * This structure contains error information related to Machine
+ * Check exception. This is filled up and copied to rtas-blob
+ * upon machine check exception.
+ */
+struct rtas_mc_log {
+    target_ulong srr0;
+    target_ulong srr1;
+    target_ulong crf;
+    /*
+     * Beginning of error log address. This is properly
+     * populated and passed on to OS registered machine
+     * check notification routine upon machine check
+     * exception
+     */
+    target_ulong r3;
+    struct rtas_error_log err_log;
+};
+
 static void do_spr_sync(void *arg)
 {
     struct SPRSyncState *s = arg;
@@ -586,6 +669,81 @@ static target_ulong h_rtas_update(PowerPCCPU *cpu, sPAPREnvironment *spapr,
     return 0;
 }
 
+static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
+                                 target_ulong opcode, target_ulong *args)
+{
+    struct rtas_mc_log mc_log;
+    CPUPPCState *env = &cpu->env;
+
+    cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
+
+    /*
+     * We save the original r3 register in SPRG2 in 0x200 vector,
+     * which is patched during call to ibm,nmi-register. Original
+     * r3 is required to be included in error log
+     */
+    mc_log.r3 = env->spr[SPR_SPRG2];
+
+    /*
+     * SRR0 and SRR1, containing nip and msr at the time of exception,
+     * are clobbered when we return from this hcall. Hence they
+     * need to be properly saved and restored. We save srr0
+     * and srr1 in rtas blob and restore it in 0x200 vector
+     * before branching to OS registered machine check handler
+     */
+    mc_log.srr0 = env->spr[SPR_SRR0];
+    mc_log.srr1 = env->spr[SPR_SRR1];
+    mc_log.crf = ((target_ulong) env->crf[0]) << 32;
+
+    /* Set error log fields */
+    mc_log.err_log.byte0 = 0x00;
+    mc_log.err_log.byte1 =
+        (RTAS_SEVERITY_ERROR_SYNC << RTAS_ELOG_SEVERITY_SHIFT);
+    mc_log.err_log.byte1 |=
+        (RTAS_DISP_NOT_RECOVERED << RTAS_ELOG_DISPOSITION_SHIFT);
+    mc_log.err_log.byte2 =
+        (RTAS_INITIATOR_MEMORY << RTAS_ELOG_INITIATOR_SHIFT);
+    mc_log.err_log.byte2 |= RTAS_TARGET_MEMORY;
+
+    if (env->spr[SPR_DSISR] & P7_DSISR_MC_UE) {
+        mc_log.err_log.byte3 = RTAS_TYPE_ECC_UNCORR;
+    } else {
+        mc_log.err_log.byte3 = 0x0;
+    }
+
+    /* Handle all Host/Guest LE/BE combinations */
+    if (env->msr & (1ULL << MSR_LE)) {
+        mc_log.srr0 = cpu_to_le64(mc_log.srr0);
+        mc_log.srr1 = cpu_to_le64(mc_log.srr1);
+        mc_log.crf = cpu_to_le64(mc_log.crf);
+        mc_log.r3 = cpu_to_le64(mc_log.r3);
+    } else {
+        mc_log.srr0 = cpu_to_be64(mc_log.srr0);
+        mc_log.srr1 = cpu_to_be64(mc_log.srr1);
+        mc_log.crf = cpu_to_be64(mc_log.crf);
+        mc_log.r3 = cpu_to_be64(mc_log.r3);
+    }
+
+    cpu_physical_memory_write(spapr->rtas_addr + RTAS_ERRLOG_OFFSET,
+                              &mc_log, sizeof(mc_log));
+
+    /*
+     * spapr->rtas_addr + RTAS_ERRLOG_OFFSET now contains srr0, srr1,
+     * cr, original r3, followed by the error log structure. The address
+     * of the error log should be passed on to guest's machine check
+     * notification routine. As this hcall is directly called from
+     * 0x200 interrupt vector and returns to assembly routine, we
+     * return (spapr->rtas_addr + RTAS_ERRLOG_OFFSET) instead of
+     * H_SUCCESS. Upon return, we restore srr0 and srr1, increment
+     * r3 to point to the error log and branch to machine check
+     * notification routine in 0x200. r3 containing the error log
+     * address is now the argument to OS registered machine check
+     * notification routine. This way we also avoid clobbering
+     * additional registers in 0x200 vector.
+     */
+    return spapr->rtas_addr + RTAS_ERRLOG_OFFSET;
+}
+
 static target_ulong h_logical_load(PowerPCCPU *cpu, sPAPREnvironment *spapr,
                                    target_ulong opcode, target_ulong *args)
 {
@@ -1011,6 +1169,7 @@ static void hypercall_register_types(void)
     /* qemu/KVM-PPC specific hcalls */
     spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
     spapr_register_hypercall(KVMPPC_H_RTAS_UPDATE, h_rtas_update);
+    spapr_register_hypercall(KVMPPC_H_REPORT_MC_ERR, h_report_mc_err);
 
     spapr_register_hypercall(H_SET_MODE, h_set_mode);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ccf67ba..a2d67e9 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -309,7 +309,9 @@ typedef struct sPAPREnvironment {
 /* Client Architecture support */
 #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
 #define KVMPPC_H_RTAS_UPDATE    (KVMPPC_HCALL_BASE + 0x3)
-#define KVMPPC_HCALL_MAX        KVMPPC_H_RTAS_UPDATE
+/* Report Machine Check error */
+#define KVMPPC_H_REPORT_MC_ERR  (KVMPPC_HCALL_BASE + 0x4)
+#define KVMPPC_HCALL_MAX        KVMPPC_H_REPORT_MC_ERR
 
 extern sPAPREnvironment *spapr;
 


* [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-05  7:12 [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests Aravinda Prasad
                   ` (2 preceding siblings ...)
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 3/4] target-ppc: Build error log Aravinda Prasad
@ 2014-11-05  7:13 ` Aravinda Prasad
  2014-11-05  8:32   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
  2014-11-11  3:16   ` [Qemu-devel] " David Gibson
  2014-11-11  3:24 ` [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests David Gibson
  4 siblings, 2 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05  7:13 UTC (permalink / raw)
  To: aik, qemu-ppc, qemu-devel; +Cc: benh, paulus

This patch adds FWNMI support in QEMU for powerKVM
guests by handling the ibm,nmi-register RTAS call.
Whenever the OS issues the ibm,nmi-register RTAS call, the
machine check notification address is saved and the
machine check interrupt vector 0x200 is patched to
issue a private hcall.
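
The patching step boils down to OR-ing an immediate into two instruction templates. The sketch below shows that encoding under stated assumptions. The `sketch_*` names are hypothetical, and the address check is deliberately stricter than the `BRANCH_INST_MASK` test in the diff: it also rejects targets with the low two bits set (the AA and LK bits) and targets at or above 0x02000000, because the LI field of an absolute branch is sign-extended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ORI_R3_R3   0x60630000u /* ori r3,r3,0 */
#define SKETCH_BA          0x48000002u /* ba 0 (opcode 18, AA=1, LK=0) */
/* Bits of a target that "ba" can encode without sign extension. */
#define SKETCH_BA_TARGET   0x01FFFFFCu

/* Fill the 16-bit immediate of "ori r3,r3,imm" with the hcall number. */
static uint32_t sketch_patch_ori(uint16_t hcall)
{
    return SKETCH_ORI_R3_R3 | hcall;
}

/* Encode "ba target"; fails if the target cannot be represented. */
static bool sketch_patch_ba(uint64_t target, uint32_t *inst)
{
    if (target & ~(uint64_t)SKETCH_BA_TARGET) {
        return false;
    }
    *inst = SKETCH_BA | (uint32_t)target;
    return true;
}
```

For example, a hypothetical hcall number 0xf004 gives 0x6063f004, and a handler at 0x4000 gives 0x48004002.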

This patch also handles the case where multiple processors
experience a machine check at or about the same time.
As per PAPR, subsequent processors serialize, waiting
for the first processor to issue the ibm,nmi-interlock call.
The second processor retries if the first processor, which
received a machine check, is still reading the error log
and is yet to issue the ibm,nmi-interlock call.
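
The retry protocol described above can be sketched as a try-acquire and release pair. This is not the code in the diff, which uses a plain `bool`. The sketch swaps in a C11 atomic compare-and-exchange so that the check and the set happen as one step, and whether that is needed depends on how QEMU serializes hcall handlers, which is not assumed here. All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_mc_state {
    atomic_bool in_progress; /* true while one VCPU owns the error log */
};

/*
 * Models the private hcall: returns the error log address to the first
 * caller, and 0 to any later caller so that the trampoline retries.
 */
static uint64_t sketch_report_mc_err(struct sketch_mc_state *s,
                                     uint64_t errlog_addr)
{
    bool expected = false;

    if (!atomic_compare_exchange_strong(&s->in_progress, &expected, true)) {
        return 0; /* another VCPU is still handling its machine check */
    }
    return errlog_addr;
}

/* Models ibm,nmi-interlock: the owner is done reading the log. */
static void sketch_nmi_interlock(struct sketch_mc_state *s)
{
    atomic_store(&s->in_progress, false);
}
```

A second caller keeps receiving 0 until the first caller has run `sketch_nmi_interlock()`, which matches the `cmpdi`/`beq` retry loop in the trampoline.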

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
---
 hw/ppc/spapr_hcall.c            |   16 +++++++
 hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h          |   17 +++++++
 pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
 4 files changed, 163 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 8f16160..eceb5e5 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -97,6 +97,9 @@ struct rtas_mc_log {
     struct rtas_error_log err_log;
 };
 
+/* Whether machine check handling is in progress by any CPU */
+bool mc_in_progress;
+
 static void do_spr_sync(void *arg)
 {
     struct SPRSyncState *s = arg;
@@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
     cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
 
     /*
+     * Only one VCPU can process machine check NMI at a time. Hence
+     * set the lock mc_in_progress. Once the VCPU finishes processing
+     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
+     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
+     * NMI we return 0 asking the VCPU to retry h_report_mc_err
+     */
+    if (mc_in_progress == 1) {
+        return 0;
+    }
+
+    mc_in_progress = 1;
+
+    /*
      * We save the original r3 register in SPRG2 in 0x200 vector,
      * which is patched during call to ibm,nmi-register. Original
      * r3 is required to be included in error log
diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index 2ec2a8e..71c7662 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -36,6 +36,9 @@
 
 #include <libfdt.h>
 
+#define BRANCH_INST_MASK  0xFC000000
+extern bool mc_in_progress;
+
 static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
                                    uint32_t token, uint32_t nargs,
                                    target_ulong args,
@@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
     rtas_st(rets, 0, ret);
 }
 
+static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
+                                  sPAPREnvironment *spapr,
+                                  uint32_t token, uint32_t nargs,
+                                  target_ulong args,
+                                  uint32_t nret, target_ulong rets)
+{
+    int i;
+    uint32_t ori_inst = 0x60630000;
+    uint32_t branch_inst = 0x48000002;
+    target_ulong guest_machine_check_addr;
+    uint32_t trampoline[TRAMPOLINE_INSTS];
+    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
+    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
+
+    /* Store the system reset and machine check address */
+    guest_machine_check_addr = rtas_ld(args, 1);
+
+    /*
+     * Read the trampoline instructions from RTAS Blob and patch
+     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
+     * machine check address before copying to 0x200 vector
+     */
+    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
+                             trampoline, sizeof(trampoline));
+
+    /* Safety Check */
+    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
+
+    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
+    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
+    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
+            sizeof(ori_inst));
+
+    /*
+     * Sanity check guest_machine_check_addr to prevent clobbering
+     * operator value in branch instruction
+     */
+    if (guest_machine_check_addr & BRANCH_INST_MASK) {
+        fprintf(stderr, "Unable to register ibm,nmi-register: "
+                "Invalid machine check handler address\n");
+        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
+        return;
+    }
+
+    /*
+     * Update the branch instruction in trampoline
+     * with the absolute machine check address requested by OS.
+     */
+    branch_inst |= guest_machine_check_addr;
+    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
+            sizeof(branch_inst));
+
+    /* Handle all Host/Guest LE/BE combinations */
+    if ((*pcc->interrupts_big_endian)(cpu)) {
+        for (i = 0; i < total_inst; i++) {
+            trampoline[i] = cpu_to_be32(trampoline[i]);
+        }
+    } else {
+        for (i = 0; i < total_inst; i++) {
+            trampoline[i] = cpu_to_le32(trampoline[i]);
+        }
+    }
+
+    /* Patch 0x200 NMI interrupt vector memory area of guest */
+    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
+                              sizeof(trampoline));
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+}
+
+static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
+                                   sPAPREnvironment *spapr,
+                                   uint32_t token, uint32_t nargs,
+                                   target_ulong args,
+                                   uint32_t nret, target_ulong rets)
+{
+    /*
+     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
+     * hence unset mc_in_progress.
+     */
+    mc_in_progress = 0;
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+}
+
 static struct rtas_call {
     const char *name;
     spapr_rtas_fn fn;
@@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
                         rtas_ibm_set_system_parameter);
     spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
                         rtas_ibm_os_term);
+    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
+                        "ibm,nmi-register",
+                        rtas_ibm_nmi_register);
+    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
+                        "ibm,nmi-interlock",
+                        rtas_ibm_nmi_interlock);
 }
 
 type_init(core_rtas_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index a2d67e9..98d0a6c 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
 #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
 #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
+#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
+#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
@@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
 #define RTAS_TRAMPOLINE_OFFSET   0x200
 #define RTAS_ERRLOG_OFFSET       0x800
 
+/* Machine Check Trampoline related macros
+ *
+ * These macros must correspond to the code we
+ * have in pc-bios/spapr-rtas/spapr-rtas.S
+ */
+#define TRAMPOLINE_INSTS           17
+#define TRAMPOLINE_ORI_INST_INDEX  2
+#define TRAMPOLINE_BR_INST_INDEX   15
+
+/* Machine Check Interrupt related macros */
+#define MC_INTERRUPT_VECTOR           0x200
+#define MC_INTERRUPT_VECTOR_SIZE      0x100
+
 #endif /* !defined (__HW_SPAPR_H__) */
diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
index 903bec2..c315332 100644
--- a/pc-bios/spapr-rtas/spapr-rtas.S
+++ b/pc-bios/spapr-rtas/spapr-rtas.S
@@ -35,3 +35,41 @@ _start:
 	ori	3,3,KVMPPC_H_RTAS@l
 	sc	1
 	blr
+	. = 0x200
+	/*
+	 * Trampoline saves r3 in sprg2 and issues private hcall
+	 * to request qemu to build error log. QEMU builds the
+	 * error log, copies to rtas-blob and returns the address.
+	 * The initial 24 bytes at the returned address consist of the
+	 * saved srr0, srr1 and cr, which we restore before passing the
+	 * actual error log address to the OS-registered machine check
+	 * notification routine.
+	 *
+	 * All the below instructions are copied to interrupt vector
+	 * 0x200 at the time of handling ibm,nmi-register rtas call.
+	 */
+	mtsprg  2,3
+	li      3,0
+	/*
+	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
+	 * value is patched below
+	 */
+1:	ori     3,3,0
+	sc      1               /* Issue H_CALL */
+	cmpdi   cr0,3,0
+	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
+	mtsprg  2,4
+	ld      4,0(3)
+	mtsrr0  4               /* Restore srr0 */
+	ld      4,8(3)
+	mtsrr1  4               /* Restore srr1 */
+	ld      4,16(3)
+	mtcrf   0,4             /* Restore cr */
+	addi    3,3,24
+	mfsprg  4,2
+	/*
+	 * Branch to address registered by OS. The branch address is
+	 * patched in the ibm,nmi-register rtas call.
+	 */
+	ba      0x0
+	b       .


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 1/4] target-ppc: Extend rtas-blob
  2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 1/4] target-ppc: Extend rtas-blob Aravinda Prasad
@ 2014-11-05  8:11   ` Alexander Graf
  2014-11-05  8:46     ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: Alexander Graf @ 2014-11-05  8:11 UTC (permalink / raw)
  To: Aravinda Prasad, aik, qemu-ppc, qemu-devel; +Cc: benh, paulus



On 05.11.14 08:12, Aravinda Prasad wrote:
> Extend rtas-blob to accommodate error log. Error log
> structure is saved in rtas space upon a machine check
> exception.
> 
> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
> ---
>  hw/ppc/spapr.c         |    7 +++++++
>  include/hw/ppc/spapr.h |    5 +++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 30de25d..38e26af 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1431,6 +1431,13 @@ static void ppc_spapr_init(MachineState *machine)
>  
>      filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
>      spapr->rtas_size = get_image_size(filename);
> +
> +    /*
> +     * Resize blob to accommodate error log. The layout of the rtas
> +     * blob is defined in include/hw/ppc/spapr.h
> +     */
> +    spapr->rtas_size = TARGET_PAGE_ALIGN(spapr->rtas_size);

How big is the error log? You could just extend the RTAS blob to include
space for it if it's not too big.

> +
>      spapr->rtas_blob = g_malloc(spapr->rtas_size);
>      if (load_image_size(filename, spapr->rtas_blob, spapr->rtas_size) < 0) {
>          hw_error("qemu: could not load LPAR rtas '%s'\n", filename);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 749daf4..d08fcc2 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -480,4 +480,9 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        sPAPRTCETable *tcet);
>  
> +/* RTAS Blob layout in memory */
> +#define RTAS_ENTRY_OFFSET        0
> +#define RTAS_TRAMPOLINE_OFFSET   0x200
> +#define RTAS_ERRLOG_OFFSET       0x800

I thought we agreed that these offsets should've been defined by the
blob itself?


Alex


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-05  7:13 ` [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call Aravinda Prasad
@ 2014-11-05  8:32   ` Alexander Graf
  2014-11-05 10:37     ` Aravinda Prasad
  2014-11-05 15:46     ` Tom Musta
  2014-11-11  3:16   ` [Qemu-devel] " David Gibson
  1 sibling, 2 replies; 66+ messages in thread
From: Alexander Graf @ 2014-11-05  8:32 UTC (permalink / raw)
  To: Aravinda Prasad, aik, qemu-ppc, qemu-devel; +Cc: benh, paulus



On 05.11.14 08:13, Aravinda Prasad wrote:
> This patch adds FWNMI support in qemu for powerKVM
> guests by handling the ibm,nmi-register rtas call.
> Whenever OS issues ibm,nmi-register RTAS call, the
> machine check notification address is saved and the
> machine check interrupt vector 0x200 is patched to
> issue a private hcall.
> 
> This patch also handles the cases when multi-processors
> experience machine check at or about the same time.
> As per PAPR, subsequent processors serialize waiting
> for the first processor to issue the ibm,nmi-interlock call.
> The second processor retries if the first processor which
> received a machine check is still reading the error log
> and is yet to issue ibm,nmi-interlock call.
> 
> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
> ---
>  hw/ppc/spapr_hcall.c            |   16 +++++++
>  hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h          |   17 +++++++
>  pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>  4 files changed, 163 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 8f16160..eceb5e5 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>      struct rtas_error_log err_log;
>  };
>  
> +/* Whether machine check handling is in progress by any CPU */
> +bool mc_in_progress;
> +
>  static void do_spr_sync(void *arg)
>  {
>      struct SPRSyncState *s = arg;
> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>      cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>  
>      /*
> +     * Only one VCPU can process machine check NMI at a time. Hence
> +     * set the lock mc_in_progress. Once the VCPU finishes processing
> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
> +     */
> +    if (mc_in_progress == 1) {

Please don't depend on bools being numbers. Use true / false. For if()s,
just don't use == at all - it makes it more readable.

> +        return 0;
> +    }
> +
> +    mc_in_progress = 1;
> +
> +    /*
>       * We save the original r3 register in SPRG2 in 0x200 vector,
>       * which is patched during call to ibm.nmi-register. Original
>       * r3 is required to be included in error log
> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
> index 2ec2a8e..71c7662 100644
> --- a/hw/ppc/spapr_rtas.c
> +++ b/hw/ppc/spapr_rtas.c
> @@ -36,6 +36,9 @@
>  
>  #include <libfdt.h>
>  
> +#define BRANCH_INST_MASK  0xFC000000
> +extern bool mc_in_progress;

Please put this into the spapr struct.
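
Taken together, the two comments above might look roughly like the following. This is only a sketch of the requested style: `sketch_spapr` stands in for the real sPAPR environment struct, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the machine state; only the new flag is shown. */
struct sketch_spapr {
    bool mc_in_progress; /* a VCPU is currently handling a machine check */
};

/* Returns true if the caller may proceed, false if it has to retry. */
static bool sketch_try_begin_mc(struct sketch_spapr *spapr)
{
    if (spapr->mc_in_progress) { /* no "== 1" comparison on a bool */
        return false;
    }
    spapr->mc_in_progress = true;
    return true;
}

/* Called once the owning VCPU has issued ibm,nmi-interlock. */
static void sketch_end_mc(struct sketch_spapr *spapr)
{
    spapr->mc_in_progress = false;
}
```

Keeping the flag in the per-machine struct also removes the need for the `extern` declaration in spapr_rtas.c.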

> +
>  static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>                                     uint32_t token, uint32_t nargs,
>                                     target_ulong args,
> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>      rtas_st(rets, 0, ret);
>  }
>  
> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
> +                                  sPAPREnvironment *spapr,
> +                                  uint32_t token, uint32_t nargs,
> +                                  target_ulong args,
> +                                  uint32_t nret, target_ulong rets)
> +{
> +    int i;
> +    uint32_t ori_inst = 0x60630000;
> +    uint32_t branch_inst = 0x48000002;
> +    target_ulong guest_machine_check_addr;
> +    uint32_t trampoline[TRAMPOLINE_INSTS];
> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);

ARRAY_SIZE(trampoline), though I don't quite understand why you need a
variable that contains the same value as a constant (TRAMPOLINE_INSTS).

But since you're moving all of those bits into variable fields on the
rtas blob itself as we discussed in the last version, I guess this code
will go away anyways ;).

> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
> +
> +    /* Store the system reset and machine check address */
> +    guest_machine_check_addr = rtas_ld(args, 1);

Load or Store? I don't find the comment particularly useful either ;).

> +
> +    /*
> +     * Read the trampoline instructions from RTAS Blob and patch
> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
> +     * machine check address before copying to 0x200 vector
> +     */
> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
> +                             trampoline, sizeof(trampoline));
> +
> +    /* Safety Check */

Same for this comment.

> +    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
> +
> +    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
> +    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
> +    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
> +            sizeof(ori_inst));

Why memcpy a u32 into a u32 array?
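
Since both sides are uint32_t, a plain assignment is enough. A small
sketch under that assumption; patch_ori_imm() is a hypothetical helper
and not part of the patch, only the ori encoding is taken from it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ORI_R3_R3_0 0x60630000u   /* "ori r3,r3,0", as in the patch */

/* Hypothetical helper: OR a 16-bit immediate into the ori at insts[index]. */
static void patch_ori_imm(uint32_t *insts, size_t index, uint16_t imm)
{
    assert(insts != NULL);
    insts[index] = ORI_R3_R3_0 | imm;   /* direct store, no memcpy needed */
}
```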

> +
> +    /*
> +     * Sanity check guest_machine_check_addr to prevent clobbering
> +     * operator value in branch instruction
> +     */
> +    if (guest_machine_check_addr & BRANCH_INST_MASK) {
> +        fprintf(stderr, "Unable to register ibm,nmi_register: "
> +                "Invalid machine check handler address\n");

In general, printf's in guest triggerable code aren't a great idea,
since the guest could flood our host logs with this. I can't say we're
doing a great job at it already though, so it probably doesn't matter much.
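
One generic way to bound such output is to report a given condition only
once. The sketch below is plain C with a hypothetical warn_once() helper;
it is not an existing QEMU function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: print msg the first time it is called for a given
 * flag and stay silent afterwards. Returns true if the message was printed.
 */
static bool warn_once(bool *already_warned, const char *msg)
{
    assert(already_warned != NULL && msg != NULL);
    if (*already_warned) {
        return false;
    }
    *already_warned = true;
    fprintf(stderr, "%s\n", msg);
    return true;
}
```

A caller would keep one flag per message, for example in per-machine
state, so a misbehaving guest can trigger each line at most once.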

> +        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
> +        return;
> +    }
> +
> +    /*
> +     * Update the branch instruction in trampoline
> +     * with the absolute machine check address requested by OS.
> +     */
> +    branch_inst |= guest_machine_check_addr;
> +    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
> +            sizeof(branch_inst));
> +
> +    /* Handle all Host/Guest LE/BE combinations */
> +    if ((*pcc->interrupts_big_endian)(cpu)) {
> +        for (i = 0; i < total_inst; i++) {
> +            trampoline[i] = cpu_to_be32(trampoline[i]);
> +        }
> +    } else {
> +        for (i = 0; i < total_inst; i++) {
> +            trampoline[i] = cpu_to_le32(trampoline[i]);
> +        }
> +    }
> +
> +    /* Patch 0x200 NMI interrupt vector memory area of guest */
> +    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
> +                              sizeof(trampoline));
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +}
> +
> +static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
> +                                   sPAPREnvironment *spapr,
> +                                   uint32_t token, uint32_t nargs,
> +                                   target_ulong args,
> +                                   uint32_t nret, target_ulong rets)
> +{
> +    /*
> +     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
> +     * hence unset mc_in_progress.
> +     */
> +    mc_in_progress = 0;
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +}
> +
>  static struct rtas_call {
>      const char *name;
>      spapr_rtas_fn fn;
> @@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
>                          rtas_ibm_set_system_parameter);
>      spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>                          rtas_ibm_os_term);
> +    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
> +                        "ibm,nmi-register",
> +                        rtas_ibm_nmi_register);
> +    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
> +                        "ibm,nmi-interlock",
> +                        rtas_ibm_nmi_interlock);
>  }
>  
>  type_init(core_rtas_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index a2d67e9..98d0a6c 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>  #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>  #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
> +#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
> +#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> @@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>  #define RTAS_TRAMPOLINE_OFFSET   0x200
>  #define RTAS_ERRLOG_OFFSET       0x800
>  
> +/* Machine Check Trampoline related macros
> + *
> + * These macros should co-relate to the code we
> + * have in pc-bios/spapr-rtas/spapr-rtas.S
> + */
> +#define TRAMPOLINE_INSTS           17
> +#define TRAMPOLINE_ORI_INST_INDEX  2
> +#define TRAMPOLINE_BR_INST_INDEX   15
> +
> +/* Machine Check Interrupt related macros */
> +#define MC_INTERRUPT_VECTOR           0x200
> +#define MC_INTERRUPT_VECTOR_SIZE      0x100
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
> index 903bec2..c315332 100644
> --- a/pc-bios/spapr-rtas/spapr-rtas.S
> +++ b/pc-bios/spapr-rtas/spapr-rtas.S

Please add #defines at the top of the file for the register names:

  #define r0 0
  #define r1 1
  ...

That way the code below will get much more readable :)

Also, you want a jump table here as we discussed in the last review
round. That table would tell you

  a) Entry address for RTAS
  b) Offset of the NMI code
  c) To-be-patched offsets of the instructions inside the NMI code

Then we have all offsets automatically generated inside a single file
and don't have to maintain fragile relationships between random headers
with offset defines and the .S file.


Alex

> @@ -35,3 +35,41 @@ _start:
>  	ori	3,3,KVMPPC_H_RTAS@l
>  	sc	1
>  	blr
> +	. = 0x200
> +	/*
> +	 * Trampoline saves r3 in sprg2 and issues private hcall
> +	 * to request qemu to build error log. QEMU builds the
> +	 * error log, copies to rtas-blob and returns the address.
> +	 * The initial 16 bytes in return adress consist of saved
> +	 * srr0 and srr1 which we restore and pass on the actual error
> +	 * log address to OS handled mcachine check notification
> +	 * routine
> +	 *
> +	 * All the below instructions are copied to interrupt vector
> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
> +	 */
> +	mtsprg  2,3
> +	li      3,0
> +	/*
> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
> +	 * value is patched below
> +	 */
> +1:	ori     3,3,0
> +	sc      1               /* Issue H_CALL */
> +	cmpdi   cr0,3,0
> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
> +	mtsprg  2,4
> +	ld      4,0(3)
> +	mtsrr0  4               /* Restore srr0 */
> +	ld      4,8(3)
> +	mtsrr1  4               /* Restore srr1 */
> +	ld      4,16(3)
> +	mtcrf   0,4             /* Restore cr */
> +	addi    3,3,24
> +	mfsprg  4,2
> +	/*
> +	 * Branch to address registered by OS. The branch address is
> +	 * patched in the ibm,nmi-register rtas call.
> +	 */
> +	ba      0x0
> +	b       .
> 
> 


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 1/4] target-ppc: Extend rtas-blob
  2014-11-05  8:11   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
@ 2014-11-05  8:46     ` Aravinda Prasad
  2014-11-05  9:00       ` Alexander Graf
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05  8:46 UTC (permalink / raw)
  To: Alexander Graf; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On Wednesday 05 November 2014 01:41 PM, Alexander Graf wrote:
> 
> 
> On 05.11.14 08:12, Aravinda Prasad wrote:
>> Extend rtas-blob to accommodate error log. Error log
>> structure is saved in rtas space upon a machine check
>> exception.
>>
>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>> ---
>>  hw/ppc/spapr.c         |    7 +++++++
>>  include/hw/ppc/spapr.h |    5 +++++
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 30de25d..38e26af 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1431,6 +1431,13 @@ static void ppc_spapr_init(MachineState *machine)
>>  
>>      filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
>>      spapr->rtas_size = get_image_size(filename);
>> +
>> +    /*
>> +     * Resize blob to accommodate error log. The layout of the rtas
>> +     * blob is defined in include/hw/ppc/spapr.h
>> +     */
>> +    spapr->rtas_size = TARGET_PAGE_ALIGN(spapr->rtas_size);
> 
> How big is the error log? You could just extend the RTAS blob to include
> space for it if it's not too big.

Error log is around 10 bytes and requires additional 24 bytes to store
saved srr0/srr1.

Hmm.. yes it can be included in RTAS blob itself.


> 
>> +
>>      spapr->rtas_blob = g_malloc(spapr->rtas_size);
>>      if (load_image_size(filename, spapr->rtas_blob, spapr->rtas_size) < 0) {
>>          hw_error("qemu: could not load LPAR rtas '%s'\n", filename);
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 749daf4..d08fcc2 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -480,4 +480,9 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                        sPAPRTCETable *tcet);
>>  
>> +/* RTAS Blob layout in memory */
>> +#define RTAS_ENTRY_OFFSET        0
>> +#define RTAS_TRAMPOLINE_OFFSET   0x200
>> +#define RTAS_ERRLOG_OFFSET       0x800
> 
> I thought we agreed that these offsets should've been defined by the
> blob itself?
>

I think I got it wrong.

I will include these indexes at the entry of RTAS blob. With that we
will have something like this:

RTAS_ENTRY_OFFSET  =      *(spapr->rtas_addr)
RTAS_TRAMPOLINE_OFFSET =  *(spapr->rtas_addr+8)
RTAS_ERRLOG_OFFSET =      *(spapr->rtas_addr+16)

I will fix this.

Regards,
Aravinda

> 
> Alex
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 1/4] target-ppc: Extend rtas-blob
  2014-11-05  8:46     ` Aravinda Prasad
@ 2014-11-05  9:00       ` Alexander Graf
  2014-11-05  9:07         ` Alexander Graf
  0 siblings, 1 reply; 66+ messages in thread
From: Alexander Graf @ 2014-11-05  9:00 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On 05.11.14 09:46, Aravinda Prasad wrote:
> 
> 
> On Wednesday 05 November 2014 01:41 PM, Alexander Graf wrote:
>>
>>
>> On 05.11.14 08:12, Aravinda Prasad wrote:
>>> Extend rtas-blob to accommodate error log. Error log
>>> structure is saved in rtas space upon a machine check
>>> exception.
>>>
>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>> ---
>>>  hw/ppc/spapr.c         |    7 +++++++
>>>  include/hw/ppc/spapr.h |    5 +++++
>>>  2 files changed, 12 insertions(+)
>>>
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 30de25d..38e26af 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -1431,6 +1431,13 @@ static void ppc_spapr_init(MachineState *machine)
>>>  
>>>      filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
>>>      spapr->rtas_size = get_image_size(filename);
>>> +
>>> +    /*
>>> +     * Resize blob to accommodate error log. The layout of the rtas
>>> +     * blob is defined in include/hw/ppc/spapr.h
>>> +     */
>>> +    spapr->rtas_size = TARGET_PAGE_ALIGN(spapr->rtas_size);
>>
>> How big is the error log? You could just extend the RTAS blob to include
>> space for it if it's not too big.
> 
> Error log is around 10 bytes and requires additional 24 bytes to store
> saved srr0/srr1.
> 
> Hmm.. yes it can be included in RTAS blob itself.
> 
> 
>>
>>> +
>>>      spapr->rtas_blob = g_malloc(spapr->rtas_size);
>>>      if (load_image_size(filename, spapr->rtas_blob, spapr->rtas_size) < 0) {
>>>          hw_error("qemu: could not load LPAR rtas '%s'\n", filename);
>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>> index 749daf4..d08fcc2 100644
>>> --- a/include/hw/ppc/spapr.h
>>> +++ b/include/hw/ppc/spapr.h
>>> @@ -480,4 +480,9 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>                        sPAPRTCETable *tcet);
>>>  
>>> +/* RTAS Blob layout in memory */
>>> +#define RTAS_ENTRY_OFFSET        0
>>> +#define RTAS_TRAMPOLINE_OFFSET   0x200
>>> +#define RTAS_ERRLOG_OFFSET       0x800
>>
>> I thought we agreed that these offsets should've been defined by the
>> blob itself?
>>
> 
> I think I got it wrong.
> 
> I will include these indexes at the entry of RTAS blob. With that we
> will have something like this:
> 
> RTAS_ENTRY_OFFSET  =      *(spapr->rtas_addr)
> RTAS_TRAMPOLINE_OFFSET =  *(spapr->rtas_addr+8)
> RTAS_ERRLOG_OFFSET =      *(spapr->rtas_addr+16)
> 
> I will fix this.

Cool :). Just store the offsets inside of a helper struct that you for
example store in the spapr struct, then we don't need to read volatile
guest memory for the offsets.


Alex


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 1/4] target-ppc: Extend rtas-blob
  2014-11-05  9:00       ` Alexander Graf
@ 2014-11-05  9:07         ` Alexander Graf
  2014-11-05 10:41           ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: Alexander Graf @ 2014-11-05  9:07 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On 05.11.14 10:00, Alexander Graf wrote:
> 
> 
> On 05.11.14 09:46, Aravinda Prasad wrote:
>>
>>
>> On Wednesday 05 November 2014 01:41 PM, Alexander Graf wrote:
>>>
>>>
>>> On 05.11.14 08:12, Aravinda Prasad wrote:
>>>> Extend rtas-blob to accommodate error log. Error log
>>>> structure is saved in rtas space upon a machine check
>>>> exception.
>>>>
>>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>>> ---
>>>>  hw/ppc/spapr.c         |    7 +++++++
>>>>  include/hw/ppc/spapr.h |    5 +++++
>>>>  2 files changed, 12 insertions(+)
>>>>
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index 30de25d..38e26af 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -1431,6 +1431,13 @@ static void ppc_spapr_init(MachineState *machine)
>>>>  
>>>>      filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
>>>>      spapr->rtas_size = get_image_size(filename);
>>>> +
>>>> +    /*
>>>> +     * Resize blob to accommodate error log. The layout of the rtas
>>>> +     * blob is defined in include/hw/ppc/spapr.h
>>>> +     */
>>>> +    spapr->rtas_size = TARGET_PAGE_ALIGN(spapr->rtas_size);
>>>
>>> How big is the error log? You could just extend the RTAS blob to include
>>> space for it if it's not too big.
>>
>> Error log is around 10 bytes and requires additional 24 bytes to store
>> saved srr0/srr1.
>>
>> Hmm.. yes it can be included in RTAS blob itself.
>>
>>
>>>
>>>> +
>>>>      spapr->rtas_blob = g_malloc(spapr->rtas_size);
>>>>      if (load_image_size(filename, spapr->rtas_blob, spapr->rtas_size) < 0) {
>>>>          hw_error("qemu: could not load LPAR rtas '%s'\n", filename);
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index 749daf4..d08fcc2 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -480,4 +480,9 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>>                        sPAPRTCETable *tcet);
>>>>  
>>>> +/* RTAS Blob layout in memory */
>>>> +#define RTAS_ENTRY_OFFSET        0
>>>> +#define RTAS_TRAMPOLINE_OFFSET   0x200
>>>> +#define RTAS_ERRLOG_OFFSET       0x800
>>>
>>> I thought we agreed that these offsets should've been defined by the
>>> blob itself?
>>>
>>
>> I think I got it wrong.
>>
>> I will include these indexes at the entry of RTAS blob. With that we
>> will have something like this:
>>
>> RTAS_ENTRY_OFFSET  =      *(spapr->rtas_addr)
>> RTAS_TRAMPOLINE_OFFSET =  *(spapr->rtas_addr+8)
>> RTAS_ERRLOG_OFFSET =      *(spapr->rtas_addr+16)
>>
>> I will fix this.
> 
> Cool :). Just store the offsets inside of a helper struct that you for
> example store in the spapr struct, then we don't need to read volatile
> guest memory for the offsets.

I just reread what I wrote and realized it was not exactly clear. What I
meant was that you read the offsets into a struct once, when the blob is
loaded. Then when working with the offsets, you only use the cached ones
from the struct.

That way when the guest for whatever reason modifies the RTAS blob in
memory, we would still use the old offsets and ensure that we don't end
up overwriting memory that we never intended to overwrite ;).
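
A rough sketch of the idea, in plain C. It assumes the blob starts with
three big-endian 64-bit offsets at bytes 0, 8 and 16, as proposed earlier
in this thread; RtasLayout and rtas_layout_load() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache of the offsets, e.g. kept in the spapr struct. */
typedef struct RtasLayout {
    uint64_t entry;
    uint64_t trampoline;
    uint64_t errlog;
} RtasLayout;

/* Read a big-endian 64-bit value regardless of host endianness. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Parse the offsets once, at load time. Returns false if the header is
 * truncated or any offset points outside the blob.
 */
static bool rtas_layout_load(const uint8_t *blob, size_t size, RtasLayout *out)
{
    RtasLayout l;

    assert(blob != NULL && out != NULL);
    if (size < 24) {
        return false;
    }
    l.entry = load_be64(blob);
    l.trampoline = load_be64(blob + 8);
    l.errlog = load_be64(blob + 16);
    if (l.entry >= size || l.trampoline >= size || l.errlog >= size) {
        return false;
    }
    *out = l;   /* later code uses only this copy, never guest memory */
    return true;
}
```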


Alex


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm,nmi-register RTAS call
  2014-11-05  8:32   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
@ 2014-11-05 10:37     ` Aravinda Prasad
  2014-11-05 11:07       ` Alexander Graf
  2014-11-05 15:46     ` Tom Musta
  1 sibling, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05 10:37 UTC (permalink / raw)
  To: Alexander Graf; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On Wednesday 05 November 2014 02:02 PM, Alexander Graf wrote:
> 
> 
> On 05.11.14 08:13, Aravinda Prasad wrote:
>> This patch adds FWNMI support in qemu for powerKVM
>> guests by handling the ibm,nmi-register rtas call.
>> Whenever OS issues ibm,nmi-register RTAS call, the
>> machine check notification address is saved and the
>> machine check interrupt vector 0x200 is patched to
>> issue a private hcall.
>>
>> This patch also handles the cases when multi-processors
>> experience machine check at or about the same time.
>> As per PAPR, subsequent processors serialize waiting
>> for the first processor to issue the ibm,nmi-interlock call.
>> The second processor retries if the first processor which
>> received a machine check is still reading the error log
>> and is yet to issue ibm,nmi-interlock call.
>>
>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>> ---
>>  hw/ppc/spapr_hcall.c            |   16 +++++++
>>  hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>>  include/hw/ppc/spapr.h          |   17 +++++++
>>  pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>>  4 files changed, 163 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index 8f16160..eceb5e5 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>>      struct rtas_error_log err_log;
>>  };
>>  
>> +/* Whether machine check handling is in progress by any CPU */
>> +bool mc_in_progress;
>> +
>>  static void do_spr_sync(void *arg)
>>  {
>>      struct SPRSyncState *s = arg;
>> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>      cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>>  
>>      /*
>> +     * Only one VCPU can process machine check NMI at a time. Hence
>> +     * set the lock mc_in_progress. Once the VCPU finishes processing
>> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
>> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
>> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
>> +     */
>> +    if (mc_in_progress == 1) {
> 
> Please don't depend on bools being numbers. Use true / false. For if()s,
> just don't use == at all - it makes it more readable.

ok

> 
>> +        return 0;
>> +    }
>> +
>> +    mc_in_progress = 1;
>> +
>> +    /*
>>       * We save the original r3 register in SPRG2 in 0x200 vector,
>>       * which is patched during call to ibm.nmi-register. Original
>>       * r3 is required to be included in error log
>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>> index 2ec2a8e..71c7662 100644
>> --- a/hw/ppc/spapr_rtas.c
>> +++ b/hw/ppc/spapr_rtas.c
>> @@ -36,6 +36,9 @@
>>  
>>  #include <libfdt.h>
>>  
>> +#define BRANCH_INST_MASK  0xFC000000
>> +extern bool mc_in_progress;
> 
> Please put this into the spapr struct.

ok

> 
>> +
>>  static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>                                     uint32_t token, uint32_t nargs,
>>                                     target_ulong args,
>> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>>      rtas_st(rets, 0, ret);
>>  }
>>  
>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>> +                                  sPAPREnvironment *spapr,
>> +                                  uint32_t token, uint32_t nargs,
>> +                                  target_ulong args,
>> +                                  uint32_t nret, target_ulong rets)
>> +{
>> +    int i;
>> +    uint32_t ori_inst = 0x60630000;
>> +    uint32_t branch_inst = 0x48000002;
>> +    target_ulong guest_machine_check_addr;
>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
> 
> ARRAY_SIZE(trampoline), though I don't quite understand why you need a
> variable that contains the same value as a constant (TRAMPOLINE_INSTS).
> 
> But since you're moving all of those bits into variable fields on the
> rtas blob itself as we discussed in the last version, I guess this code
> will go away anyways ;).

I think we still need this. We need to patch the KVMPPC_H_REPORT_MC_ERR
number and the branch address into the trampoline. Also, depending on
whether the guest is running in LE or BE mode, we may need to byte-swap
the trampoline instructions before copying them to the 0x200 machine
check vector.

As the rtas-blob is part of guest memory, I do not want to patch these in
the rtas-blob itself. Hence I copy the trampoline from the rtas-blob to an
array, modify it accordingly and then move it to the 0x200 machine check
vector.
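
The copy, patch and byte-swap sequence described above could look roughly
like the following in plain C. The indices and instruction encodings are
taken from the patch; build_trampoline() and store32() are hypothetical,
the template is assumed to be host-order instruction words, and the
alignment check on the low two bits is an addition that is not in the
patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRAMPOLINE_INSTS  17   /* values taken from the patch */
#define ORI_INST_INDEX    2
#define BR_INST_INDEX     15

/* Store one instruction word in the requested byte order. */
static void store32(uint8_t *p, uint32_t v, bool big_endian)
{
    int i;

    for (i = 0; i < 4; i++) {
        int shift = big_endian ? 24 - 8 * i : 8 * i;
        p[i] = (uint8_t)(v >> shift);
    }
}

/*
 * Patch a private copy of the template and emit it as bytes for the
 * 0x200 vector. The template itself is never modified.
 */
static bool build_trampoline(const uint32_t *tmpl, uint16_t hcall,
                             uint32_t handler, bool big_endian,
                             uint8_t *out)
{
    uint32_t work[TRAMPOLINE_INSTS];
    size_t i;

    assert(tmpl != NULL && out != NULL);
    /* Reject addresses that would clobber the opcode or the AA/LK bits. */
    if (handler & 0xFC000003u) {
        return false;
    }
    memcpy(work, tmpl, sizeof(work));
    work[ORI_INST_INDEX] = 0x60630000u | hcall;    /* ori r3,r3,hcall */
    work[BR_INST_INDEX] = 0x48000002u | handler;   /* ba handler */
    for (i = 0; i < TRAMPOLINE_INSTS; i++) {
        store32(out + 4 * i, work[i], big_endian);
    }
    return true;
}
```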

> 
>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
>> +
>> +    /* Store the system reset and machine check address */
>> +    guest_machine_check_addr = rtas_ld(args, 1);
> 
> Load or Store? I don't find the comment particularly useful either ;).

Will reword it, or maybe delete it.

> 
>> +
>> +    /*
>> +     * Read the trampoline instructions from RTAS Blob and patch
>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>> +     * machine check address before copying to 0x200 vector
>> +     */
>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>> +                             trampoline, sizeof(trampoline));
>> +
>> +    /* Safety Check */
> 
> Same for this comment.

We have only 0x100 bytes available at 0x200. If the trampoline is larger
than that, the code of the next interrupt vector is overwritten. Hence the
compile-time safety check, to make sure the trampoline fits within 0x100
bytes.
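
For illustration, the same kind of compile-time check in standalone C11,
using static_assert from <assert.h> as a stand-in for QEMU_BUILD_BUG_ON.
The two constants are the ones from the patch; trampoline_bytes() is a
hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRAMPOLINE_INSTS          17
#define MC_INTERRUPT_VECTOR_SIZE  0x100

/* Fails the build if the trampoline would spill into the next vector. */
static_assert(TRAMPOLINE_INSTS * sizeof(uint32_t) <= MC_INTERRUPT_VECTOR_SIZE,
              "trampoline does not fit in the 0x200 vector");

/* Size in bytes of the trampoline copied to the vector. */
static size_t trampoline_bytes(void)
{
    return TRAMPOLINE_INSTS * sizeof(uint32_t);
}
```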

> 
>> +    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
>> +
>> +    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
>> +    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
>> +    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
>> +            sizeof(ori_inst));
> 
> Why memcpy a u32 into a u32 array?

Not required. I forgot to remove it while transitioning from the earlier
patch, where the trampoline was a char *.

> 
>> +
>> +    /*
>> +     * Sanity check guest_machine_check_addr to prevent clobbering
>> +     * operator value in branch instruction
>> +     */
>> +    if (guest_machine_check_addr & BRANCH_INST_MASK) {
>> +        fprintf(stderr, "Unable to register ibm,nmi_register: "
>> +                "Invalid machine check handler address\n");
> 
> In general, printf's in guest triggerable code aren't a great idea,
> since the guest could flood our host logs with this. I can't say we're
> doing a great job at it already though, so it probably doesn't matter much.

noted

> 
>> +        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Update the branch instruction in trampoline
>> +     * with the absolute machine check address requested by OS.
>> +     */
>> +    branch_inst |= guest_machine_check_addr;
>> +    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
>> +            sizeof(branch_inst));
>> +
>> +    /* Handle all Host/Guest LE/BE combinations */
>> +    if ((*pcc->interrupts_big_endian)(cpu)) {
>> +        for (i = 0; i < total_inst; i++) {
>> +            trampoline[i] = cpu_to_be32(trampoline[i]);
>> +        }
>> +    } else {
>> +        for (i = 0; i < total_inst; i++) {
>> +            trampoline[i] = cpu_to_le32(trampoline[i]);
>> +        }
>> +    }
>> +
>> +    /* Patch 0x200 NMI interrupt vector memory area of guest */
>> +    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
>> +                              sizeof(trampoline));
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +}
>> +
>> +static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
>> +                                   sPAPREnvironment *spapr,
>> +                                   uint32_t token, uint32_t nargs,
>> +                                   target_ulong args,
>> +                                   uint32_t nret, target_ulong rets)
>> +{
>> +    /*
>> +     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
>> +     * hence unset mc_in_progress.
>> +     */
>> +    mc_in_progress = 0;
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +}
>> +
>>  static struct rtas_call {
>>      const char *name;
>>      spapr_rtas_fn fn;
>> @@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
>>                          rtas_ibm_set_system_parameter);
>>      spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>>                          rtas_ibm_os_term);
>> +    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
>> +                        "ibm,nmi-register",
>> +                        rtas_ibm_nmi_register);
>> +    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
>> +                        "ibm,nmi-interlock",
>> +                        rtas_ibm_nmi_interlock);
>>  }
>>  
>>  type_init(core_rtas_register_types)
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index a2d67e9..98d0a6c 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>>  #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>>  #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
>> +#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
>> +#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
>>  
>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
>>  
>>  /* RTAS ibm,get-system-parameter token values */
>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>> @@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>  #define RTAS_TRAMPOLINE_OFFSET   0x200
>>  #define RTAS_ERRLOG_OFFSET       0x800
>>  
>> +/* Machine Check Trampoline related macros
>> + *
>> + * These macros should co-relate to the code we
>> + * have in pc-bios/spapr-rtas/spapr-rtas.S
>> + */
>> +#define TRAMPOLINE_INSTS           17
>> +#define TRAMPOLINE_ORI_INST_INDEX  2
>> +#define TRAMPOLINE_BR_INST_INDEX   15
>> +
>> +/* Machine Check Interrupt related macros */
>> +#define MC_INTERRUPT_VECTOR           0x200
>> +#define MC_INTERRUPT_VECTOR_SIZE      0x100
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
>> index 903bec2..c315332 100644
>> --- a/pc-bios/spapr-rtas/spapr-rtas.S
>> +++ b/pc-bios/spapr-rtas/spapr-rtas.S
> 
> Please add #defines at the top of the file for the register names:
> 
>   #define r0 0
>   #define r1 1
>   ...
> 
> That way the code below will get much more readable :)

hmm will do that

> 
> Also, you want a jump table here as we discussed in the last review
> round. That table would tell you
> 
>   a) Entry address for RTAS
>   b) Offset of the NMI code
>   c) To-be-patched offsets of the instructions inside the NMI code
> 
> Then we have all offsets automatically generated inside a single file
> and don't have to maintain fragile relationships between random headers
> with offset defines and the .S file.

I think I got this wrong last time. I will fix this.

> 
> 
> Alex
> 
>> @@ -35,3 +35,41 @@ _start:
>>  	ori	3,3,KVMPPC_H_RTAS@l
>>  	sc	1
>>  	blr
>> +	. = 0x200
>> +	/*
>> +	 * Trampoline saves r3 in sprg2 and issues private hcall
>> +	 * to request qemu to build error log. QEMU builds the
>> +	 * error log, copies to rtas-blob and returns the address.
>> +	 * The initial 16 bytes in return adress consist of saved
>> +	 * srr0 and srr1 which we restore and pass on the actual error
>> +	 * log address to OS handled mcachine check notification
>> +	 * routine
>> +	 *
>> +	 * All the below instructions are copied to interrupt vector
>> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
>> +	 */
>> +	mtsprg  2,3
>> +	li      3,0
>> +	/*
>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>> +	 * value is patched below
>> +	 */
>> +1:	ori     3,3,0
>> +	sc      1               /* Issue H_CALL */
>> +	cmpdi   cr0,3,0
>> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
>> +	mtsprg  2,4
>> +	ld      4,0(3)
>> +	mtsrr0  4               /* Restore srr0 */
>> +	ld      4,8(3)
>> +	mtsrr1  4               /* Restore srr1 */
>> +	ld      4,16(3)
>> +	mtcrf   0,4             /* Restore cr */
>> +	addi    3,3,24
>> +	mfsprg  4,2
>> +	/*
>> +	 * Branch to address registered by OS. The branch address is
>> +	 * patched in the ibm,nmi-register rtas call.
>> +	 */
>> +	ba      0x0
>> +	b       .
>>
>>
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 1/4] target-ppc: Extend rtas-blob
  2014-11-05  9:07         ` Alexander Graf
@ 2014-11-05 10:41           ` Aravinda Prasad
  0 siblings, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05 10:41 UTC (permalink / raw)
  To: Alexander Graf; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On Wednesday 05 November 2014 02:37 PM, Alexander Graf wrote:
> 
> 
> On 05.11.14 10:00, Alexander Graf wrote:
>>
>>
>> On 05.11.14 09:46, Aravinda Prasad wrote:
>>>
>>>
>>> On Wednesday 05 November 2014 01:41 PM, Alexander Graf wrote:
>>>>
>>>>
>>>> On 05.11.14 08:12, Aravinda Prasad wrote:
>>>>> Extend rtas-blob to accommodate error log. Error log
>>>>> structure is saved in rtas space upon a machine check
>>>>> exception.
>>>>>
>>>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>>>> ---
>>>>>  hw/ppc/spapr.c         |    7 +++++++
>>>>>  include/hw/ppc/spapr.h |    5 +++++
>>>>>  2 files changed, 12 insertions(+)
>>>>>
>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>> index 30de25d..38e26af 100644
>>>>> --- a/hw/ppc/spapr.c
>>>>> +++ b/hw/ppc/spapr.c
>>>>> @@ -1431,6 +1431,13 @@ static void ppc_spapr_init(MachineState *machine)
>>>>>  
>>>>>      filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
>>>>>      spapr->rtas_size = get_image_size(filename);
>>>>> +
>>>>> +    /*
>>>>> +     * Resize blob to accommodate error log. The layout of the rtas
>>>>> +     * blob is defined in include/hw/ppc/spapr.h
>>>>> +     */
>>>>> +    spapr->rtas_size = TARGET_PAGE_ALIGN(spapr->rtas_size);
>>>>
>>>> How big is the error log? You could just extend the RTAS blob to include
>>>> space for it if it's not too big.
>>>
>>> Error log is around 10 bytes and requires additional 24 bytes to store
>>> saved srr0/srr1.
>>>
>>> Hmm.. yes it can be included in RTAS blob itself.
>>>
>>>
>>>>
>>>>> +
>>>>>      spapr->rtas_blob = g_malloc(spapr->rtas_size);
>>>>>      if (load_image_size(filename, spapr->rtas_blob, spapr->rtas_size) < 0) {
>>>>>          hw_error("qemu: could not load LPAR rtas '%s'\n", filename);
>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>>> index 749daf4..d08fcc2 100644
>>>>> --- a/include/hw/ppc/spapr.h
>>>>> +++ b/include/hw/ppc/spapr.h
>>>>> @@ -480,4 +480,9 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>>>                        sPAPRTCETable *tcet);
>>>>>  
>>>>> +/* RTAS Blob layout in memory */
>>>>> +#define RTAS_ENTRY_OFFSET        0
>>>>> +#define RTAS_TRAMPOLINE_OFFSET   0x200
>>>>> +#define RTAS_ERRLOG_OFFSET       0x800
>>>>
>>>> I thought we agreed that these offsets should've been defined by the
>>>> blob itself?
>>>>
>>>
>>> I think I got it wrong.
>>>
>>> I will include these indexes at the entry of RTAS blob. With that we
>>> will have something like this:
>>>
>>> RTAS_ENTRY_OFFSET  =      *(spapr->rtas_addr)
>>> RTAS_TRAMPOLINE_OFFSET =  *(spapr->rtas_addr+8)
>>> RTAS_ERRLOG_OFFSET =      *(spapr->rtas_addr+16)
>>>
>>> I will fix this.
>>
>> Cool :). Just store the offsets inside of a helper struct that you for
>> example store in the spapr struct, then we don't need to read volatile
>> guest memory for the offsets.
> 
> I just reread what I wrote and figured it's not exactly verbose. What I
> meant was that you read them on load into a struct. Then when working
> with the offsets, you only use the cached ones from the struct.
> 
> That way when the guest for whatever reason modifies the RTAS blob in
> memory, we would still use the old offsets and ensure that we don't end
> up overwriting memory that we never intended to overwrite ;).

sure
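
A rough sketch of how that could look, assuming the blob starts with
three big-endian 64-bit words (entry, trampoline and error log offsets,
matching the +0/+8/+16 layout above). The struct and function names are
made up for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache of the layout words; field names are illustrative. */
struct rtas_layout {
    uint64_t entry_offset;
    uint64_t trampoline_offset;
    uint64_t errlog_offset;
};

/* Read one big-endian 64-bit word from a byte buffer. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Parse the three layout words once, from the host-side copy of the
 * blob. Returns 0 on success, -1 if the blob is too small or an offset
 * points outside of it.
 */
static int rtas_layout_parse(const uint8_t *blob, size_t size,
                             struct rtas_layout *out)
{
    assert(blob != NULL && out != NULL);

    if (size < 24) {
        return -1;
    }

    out->entry_offset = load_be64(blob);
    out->trampoline_offset = load_be64(blob + 8);
    out->errlog_offset = load_be64(blob + 16);

    if (out->entry_offset >= size ||
        out->trampoline_offset >= size ||
        out->errlog_offset >= size) {
        return -1;
    }
    return 0;
}
```

The parse would run once at load time; later code would only consult
the cached struct and never re-read the offsets from guest memory.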

> 
> 
> Alex
> 

-- 
Regards,
Aravinda

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm,nmi-register RTAS call
  2014-11-05 10:37     ` Aravinda Prasad
@ 2014-11-05 11:07       ` Alexander Graf
  2014-11-05 11:24         ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: Alexander Graf @ 2014-11-05 11:07 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On 05.11.14 11:37, Aravinda Prasad wrote:
> 
> 
> On Wednesday 05 November 2014 02:02 PM, Alexander Graf wrote:
>>
>>
>> On 05.11.14 08:13, Aravinda Prasad wrote:
>>> This patch adds FWNMI support in qemu for powerKVM
>>> guests by handling the ibm,nmi-register rtas call.
>>> Whenever OS issues ibm,nmi-register RTAS call, the
>>> machine check notification address is saved and the
>>> machine check interrupt vector 0x200 is patched to
>>> issue a private hcall.
>>>
>>> This patch also handles the cases when multi-processors
>>> experience machine check at or about the same time.
>>> As per PAPR, subsequent processors serialize waiting
>>> for the first processor to issue the ibm,nmi-interlock call.
>>> The second processor retries if the first processor which
>>> received a machine check is still reading the error log
>>> and is yet to issue ibm,nmi-interlock call.
>>>
>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>> ---
>>>  hw/ppc/spapr_hcall.c            |   16 +++++++
>>>  hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>>>  include/hw/ppc/spapr.h          |   17 +++++++
>>>  pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>>>  4 files changed, 163 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>> index 8f16160..eceb5e5 100644
>>> --- a/hw/ppc/spapr_hcall.c
>>> +++ b/hw/ppc/spapr_hcall.c
>>> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>>>      struct rtas_error_log err_log;
>>>  };
>>>  
>>> +/* Whether machine check handling is in progress by any CPU */
>>> +bool mc_in_progress;
>>> +
>>>  static void do_spr_sync(void *arg)
>>>  {
>>>      struct SPRSyncState *s = arg;
>>> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>      cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>>>  
>>>      /*
>>> +     * Only one VCPU can process machine check NMI at a time. Hence
>>> +     * set the lock mc_in_progress. Once the VCPU finishes processing
>>> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
>>> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
>>> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
>>> +     */
>>> +    if (mc_in_progress == 1) {
>>
>> Please don't depend on bools being numbers. Use true / false. For if()s,
>> just don't use == at all - it makes it more readable.
> 
> ok
> 
>>
>>> +        return 0;
>>> +    }
>>> +
>>> +    mc_in_progress = 1;
>>> +
>>> +    /*
>>>       * We save the original r3 register in SPRG2 in 0x200 vector,
>>>       * which is patched during call to ibm.nmi-register. Original
>>>       * r3 is required to be included in error log
>>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>>> index 2ec2a8e..71c7662 100644
>>> --- a/hw/ppc/spapr_rtas.c
>>> +++ b/hw/ppc/spapr_rtas.c
>>> @@ -36,6 +36,9 @@
>>>  
>>>  #include <libfdt.h>
>>>  
>>> +#define BRANCH_INST_MASK  0xFC000000
>>> +extern bool mc_in_progress;
>>
>> Please put this into the spapr struct.
> 
> ok
> 
>>
>>> +
>>>  static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>                                     uint32_t token, uint32_t nargs,
>>>                                     target_ulong args,
>>> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>>>      rtas_st(rets, 0, ret);
>>>  }
>>>  
>>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>>> +                                  sPAPREnvironment *spapr,
>>> +                                  uint32_t token, uint32_t nargs,
>>> +                                  target_ulong args,
>>> +                                  uint32_t nret, target_ulong rets)
>>> +{
>>> +    int i;
>>> +    uint32_t ori_inst = 0x60630000;
>>> +    uint32_t branch_inst = 0x48000002;
>>> +    target_ulong guest_machine_check_addr;
>>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
>>
>> ARRAY_SIZE(trampoline), though I don't quite understand why you need a
>> variable that contains the same value as a constant (TRAMPOLINE_INSTS).
>>
>> But since you're moving all of those bits into variable fields on the
>> rtas blob itself as we discussed in the last version, I guess this code
>> will go away anyways ;).
> 
> I think we still need this. We need to patch the KVMPPC_H_REPORT_MC_ERR
> number and branch address in the trampoline and also, depending on
> whether the guest running in LE/BE we may need to flip the bits in the
> trampoline before copying it to 0x200 machine check vector.
> 
> As rtas-blob is part of the guest memory I do not want to patch these in
> rtas-blob, hence I copy the trampoline from the rtas-blob to an array,
> modify accordingly and then move it to 0x200 machine check vector.

Yes, you will still need the array. But the array should be dynamically
sized based on spapr->rtas_info->fwnmi_size which you extract from the
blob on load.

That way you wouldn't need the "total_inst" variable anymore ;).
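
For illustration only, a sketch of what a dynamically sized copy could
look like. It assumes the size and the two patch indexes were read from
the blob at load time; here they are plain parameters, and malloc()
stands in for g_malloc() so the example is standalone. The function
name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy a trampoline template of `size` bytes into a freshly allocated
 * buffer and patch two instruction words in the copy. The template
 * itself is never modified. Returns NULL on bad arguments or if the
 * allocation fails; the caller frees the result.
 */
static uint32_t *trampoline_build(const uint32_t *tmpl, size_t size,
                                  size_t ori_index, uint32_t ori_inst,
                                  size_t br_index, uint32_t br_inst)
{
    size_t ninsts = size / sizeof(uint32_t);
    uint32_t *copy;

    assert(tmpl != NULL);

    /* The size must describe a whole number of 4-byte instructions. */
    if (size == 0 || size % sizeof(uint32_t) != 0 ||
        ori_index >= ninsts || br_index >= ninsts) {
        return NULL;
    }

    copy = malloc(size);
    if (copy == NULL) {
        return NULL;
    }

    memcpy(copy, tmpl, size);
    copy[ori_index] = ori_inst;
    copy[br_index] = br_inst;
    return copy;
}
```

The instruction count falls out of the size, so no separate constant or
counter variable has to be kept in sync with the .S file.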

> 
>>
>>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
>>> +
>>> +    /* Store the system reset and machine check address */
>>> +    guest_machine_check_addr = rtas_ld(args, 1);
>>
>> Load or Store? I don't find the comment particularly useful either ;).
> 
> will reword it or may delete it.
> 
>>
>>> +
>>> +    /*
>>> +     * Read the trampoline instructions from RTAS Blob and patch
>>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>>> +     * machine check address before copying to 0x200 vector
>>> +     */
>>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>>> +                             trampoline, sizeof(trampoline));
>>> +
>>> +    /* Safety Check */
>>
>> Same for this comment.
> 
> we have only 0x100 bytes that can be copied at 0x200. If the trampoline
> size exceeds then the next interrupt vector code is overwritten. Hence a
> safety check during compile time to make sure trampoline is within 0x100
> bytes.

I think the check is fine, the comment is just redundant with
QEMU_BUILD_BUG_ON. Either be more verbose in the comment or remove it
;). But something a la

  /* check for failure */
  BUG_ON(foo);

is useless redundancy, because everyone already knows that BUG_ON checks
for failure.

The interesting bit is the why. Also, as a general rule of thumb, if you
need a comment explaining the "what" of what you're doing, your function
and/or variable names are probably not well chosen ;). But I don't think
this is a problem here.

Thanks for the patches btw :)


Alex

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm,nmi-register RTAS call
  2014-11-05 11:07       ` Alexander Graf
@ 2014-11-05 11:24         ` Aravinda Prasad
  2014-11-05 11:27           ` Alexander Graf
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-05 11:24 UTC (permalink / raw)
  To: Alexander Graf; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On Wednesday 05 November 2014 04:37 PM, Alexander Graf wrote:
> 
> 
> On 05.11.14 11:37, Aravinda Prasad wrote:
>>
>>
>> On Wednesday 05 November 2014 02:02 PM, Alexander Graf wrote:
>>>
>>>
>>> On 05.11.14 08:13, Aravinda Prasad wrote:
>>>> This patch adds FWNMI support in qemu for powerKVM
>>>> guests by handling the ibm,nmi-register rtas call.
>>>> Whenever OS issues ibm,nmi-register RTAS call, the
>>>> machine check notification address is saved and the
>>>> machine check interrupt vector 0x200 is patched to
>>>> issue a private hcall.
>>>>
>>>> This patch also handles the cases when multi-processors
>>>> experience machine check at or about the same time.
>>>> As per PAPR, subsequent processors serialize waiting
>>>> for the first processor to issue the ibm,nmi-interlock call.
>>>> The second processor retries if the first processor which
>>>> received a machine check is still reading the error log
>>>> and is yet to issue ibm,nmi-interlock call.
>>>>
>>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>>> ---
>>>>  hw/ppc/spapr_hcall.c            |   16 +++++++
>>>>  hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/ppc/spapr.h          |   17 +++++++
>>>>  pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>>>>  4 files changed, 163 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>> index 8f16160..eceb5e5 100644
>>>> --- a/hw/ppc/spapr_hcall.c
>>>> +++ b/hw/ppc/spapr_hcall.c
>>>> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>>>>      struct rtas_error_log err_log;
>>>>  };
>>>>  
>>>> +/* Whether machine check handling is in progress by any CPU */
>>>> +bool mc_in_progress;
>>>> +
>>>>  static void do_spr_sync(void *arg)
>>>>  {
>>>>      struct SPRSyncState *s = arg;
>>>> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>>      cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>>>>  
>>>>      /*
>>>> +     * Only one VCPU can process machine check NMI at a time. Hence
>>>> +     * set the lock mc_in_progress. Once the VCPU finishes processing
>>>> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
>>>> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
>>>> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
>>>> +     */
>>>> +    if (mc_in_progress == 1) {
>>>
>>> Please don't depend on bools being numbers. Use true / false. For if()s,
>>> just don't use == at all - it makes it more readable.
>>
>> ok
>>
>>>
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    mc_in_progress = 1;
>>>> +
>>>> +    /*
>>>>       * We save the original r3 register in SPRG2 in 0x200 vector,
>>>>       * which is patched during call to ibm.nmi-register. Original
>>>>       * r3 is required to be included in error log
>>>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>>>> index 2ec2a8e..71c7662 100644
>>>> --- a/hw/ppc/spapr_rtas.c
>>>> +++ b/hw/ppc/spapr_rtas.c
>>>> @@ -36,6 +36,9 @@
>>>>  
>>>>  #include <libfdt.h>
>>>>  
>>>> +#define BRANCH_INST_MASK  0xFC000000
>>>> +extern bool mc_in_progress;
>>>
>>> Please put this into the spapr struct.
>>
>> ok
>>
>>>
>>>> +
>>>>  static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>>                                     uint32_t token, uint32_t nargs,
>>>>                                     target_ulong args,
>>>> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>>>>      rtas_st(rets, 0, ret);
>>>>  }
>>>>  
>>>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>>>> +                                  sPAPREnvironment *spapr,
>>>> +                                  uint32_t token, uint32_t nargs,
>>>> +                                  target_ulong args,
>>>> +                                  uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    int i;
>>>> +    uint32_t ori_inst = 0x60630000;
>>>> +    uint32_t branch_inst = 0x48000002;
>>>> +    target_ulong guest_machine_check_addr;
>>>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>>>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
>>>
>>> ARRAY_SIZE(trampoline), though I don't quite understand why you need a
>>> variable that contains the same value as a constant (TRAMPOLINE_INSTS).
>>>
>>> But since you're moving all of those bits into variable fields on the
>>> rtas blob itself as we discussed in the last version, I guess this code
>>> will go away anyways ;).
>>
>> I think we still need this. We need to patch the KVMPPC_H_REPORT_MC_ERR
>> number and branch address in the trampoline and also, depending on
>> whether the guest running in LE/BE we may need to flip the bits in the
>> trampoline before copying it to 0x200 machine check vector.
>>
>> As rtas-blob is part of the guest memory I do not want to patch these in
>> rtas-blob, hence I copy the trampoline from the rtas-blob to an array,
>> modify accordingly and then move it to 0x200 machine check vector.
> 
> Yes, you will still need the array. But the array should be dynamically
> sized based on spapr->rtas_info->fwnmi_size which you extract from the
> blob on load.
> 
> That way you wouldn't need the "total_inst" variable anymore ;).

Yes, I will fix it that way.

> 
>>
>>>
>>>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
>>>> +
>>>> +    /* Store the system reset and machine check address */
>>>> +    guest_machine_check_addr = rtas_ld(args, 1);
>>>
>>> Load or Store? I don't find the comment particularly useful either ;).
>>
>> will reword it or may delete it.
>>
>>>
>>>> +
>>>> +    /*
>>>> +     * Read the trampoline instructions from RTAS Blob and patch
>>>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>>>> +     * machine check address before copying to 0x200 vector
>>>> +     */
>>>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>>>> +                             trampoline, sizeof(trampoline));
>>>> +
>>>> +    /* Safety Check */
>>>
>>> Same for this comment.
>>
>> we have only 0x100 bytes that can be copied at 0x200. If the trampoline
>> size exceeds then the next interrupt vector code is overwritten. Hence a
>> safety check during compile time to make sure trampoline is within 0x100
>> bytes.
> 
> I think the check is fine, the comment is just redundant with
> QEMU_BUILD_BUG_ON. Either be more verbose in the comment or remove it

I will add above lines as comment.

> ;). But something a la
> 
>   /* check for failure */
>   BUG_ON(foo);
> 
> is useless redundancy, because everyone already knows that BUG_ON checks
> for failure.
> 
> The interesting bit is the why. Also, as a general rule of thumb, if you
> need a comment explaining the "what" of what you're doing, your function
> and/or variable names are probably not well chosen ;). But I don't think
> this is a problem here.
> 
> Thanks for the patches btw :)
> 
> 
> Alex
> 

-- 
Regards,
Aravinda

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm,nmi-register RTAS call
  2014-11-05 11:24         ` Aravinda Prasad
@ 2014-11-05 11:27           ` Alexander Graf
  0 siblings, 0 replies; 66+ messages in thread
From: Alexander Graf @ 2014-11-05 11:27 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus



On 05.11.14 12:24, Aravinda Prasad wrote:
> 
> 
> On Wednesday 05 November 2014 04:37 PM, Alexander Graf wrote:
>>
>>
>> On 05.11.14 11:37, Aravinda Prasad wrote:
>>>
>>>
>>> On Wednesday 05 November 2014 02:02 PM, Alexander Graf wrote:
>>>>
>>>>
>>>> On 05.11.14 08:13, Aravinda Prasad wrote:
>>>>> This patch adds FWNMI support in qemu for powerKVM
>>>>> guests by handling the ibm,nmi-register rtas call.
>>>>> Whenever OS issues ibm,nmi-register RTAS call, the
>>>>> machine check notification address is saved and the
>>>>> machine check interrupt vector 0x200 is patched to
>>>>> issue a private hcall.
>>>>>
>>>>> This patch also handles the cases when multi-processors
>>>>> experience machine check at or about the same time.
>>>>> As per PAPR, subsequent processors serialize waiting
>>>>> for the first processor to issue the ibm,nmi-interlock call.
>>>>> The second processor retries if the first processor which
>>>>> received a machine check is still reading the error log
>>>>> and is yet to issue ibm,nmi-interlock call.
>>>>>
>>>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>>>> ---
>>>>>  hw/ppc/spapr_hcall.c            |   16 +++++++
>>>>>  hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>>>>>  include/hw/ppc/spapr.h          |   17 +++++++
>>>>>  pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>>>>>  4 files changed, 163 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>>> index 8f16160..eceb5e5 100644
>>>>> --- a/hw/ppc/spapr_hcall.c
>>>>> +++ b/hw/ppc/spapr_hcall.c
>>>>> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>>>>>      struct rtas_error_log err_log;
>>>>>  };
>>>>>  
>>>>> +/* Whether machine check handling is in progress by any CPU */
>>>>> +bool mc_in_progress;
>>>>> +
>>>>>  static void do_spr_sync(void *arg)
>>>>>  {
>>>>>      struct SPRSyncState *s = arg;
>>>>> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>>>      cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>>>>>  
>>>>>      /*
>>>>> +     * Only one VCPU can process machine check NMI at a time. Hence
>>>>> +     * set the lock mc_in_progress. Once the VCPU finishes processing
>>>>> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
>>>>> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
>>>>> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
>>>>> +     */
>>>>> +    if (mc_in_progress == 1) {
>>>>
>>>> Please don't depend on bools being numbers. Use true / false. For if()s,
>>>> just don't use == at all - it makes it more readable.
>>>
>>> ok
>>>
>>>>
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    mc_in_progress = 1;
>>>>> +
>>>>> +    /*
>>>>>       * We save the original r3 register in SPRG2 in 0x200 vector,
>>>>>       * which is patched during call to ibm.nmi-register. Original
>>>>>       * r3 is required to be included in error log
>>>>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>>>>> index 2ec2a8e..71c7662 100644
>>>>> --- a/hw/ppc/spapr_rtas.c
>>>>> +++ b/hw/ppc/spapr_rtas.c
>>>>> @@ -36,6 +36,9 @@
>>>>>  
>>>>>  #include <libfdt.h>
>>>>>  
>>>>> +#define BRANCH_INST_MASK  0xFC000000
>>>>> +extern bool mc_in_progress;
>>>>
>>>> Please put this into the spapr struct.
>>>
>>> ok
>>>
>>>>
>>>>> +
>>>>>  static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>>>                                     uint32_t token, uint32_t nargs,
>>>>>                                     target_ulong args,
>>>>> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>>>>>      rtas_st(rets, 0, ret);
>>>>>  }
>>>>>  
>>>>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>>>>> +                                  sPAPREnvironment *spapr,
>>>>> +                                  uint32_t token, uint32_t nargs,
>>>>> +                                  target_ulong args,
>>>>> +                                  uint32_t nret, target_ulong rets)
>>>>> +{
>>>>> +    int i;
>>>>> +    uint32_t ori_inst = 0x60630000;
>>>>> +    uint32_t branch_inst = 0x48000002;
>>>>> +    target_ulong guest_machine_check_addr;
>>>>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>>>>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
>>>>
>>>> ARRAY_SIZE(trampoline), though I don't quite understand why you need a
>>>> variable that contains the same value as a constant (TRAMPOLINE_INSTS).
>>>>
>>>> But since you're moving all of those bits into variable fields on the
>>>> rtas blob itself as we discussed in the last version, I guess this code
>>>> will go away anyways ;).
>>>
>>> I think we still need this. We need to patch the KVMPPC_H_REPORT_MC_ERR
>>> number and branch address in the trampoline and also, depending on
>>> whether the guest running in LE/BE we may need to flip the bits in the
>>> trampoline before copying it to 0x200 machine check vector.
>>>
>>> As rtas-blob is part of the guest memory I do not want to patch these in
>>> rtas-blob, hence I copy the trampoline from the rtas-blob to an array,
>>> modify accordingly and then move it to 0x200 machine check vector.
>>
>> Yes, you will still need the array. But the array should be dynamically
>> sized based on spapr->rtas_info->fwnmi_size which you extract from the

spapr->rtas_info.fwnmi_size of course ;). No need for yet another
allocation to keep track of.

>> blob on load.
>>
>> That way you wouldn't need the "total_inst" variable anymore ;).
> 
> Yes, I will fix it that way.
> 
>>
>>>
>>>>
>>>>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
>>>>> +
>>>>> +    /* Store the system reset and machine check address */
>>>>> +    guest_machine_check_addr = rtas_ld(args, 1);
>>>>
>>>> Load or Store? I don't find the comment particularly useful either ;).
>>>
>>> will reword it or may delete it.
>>>
>>>>
>>>>> +
>>>>> +    /*
>>>>> +     * Read the trampoline instructions from RTAS Blob and patch
>>>>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>>>>> +     * machine check address before copying to 0x200 vector
>>>>> +     */
>>>>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>>>>> +                             trampoline, sizeof(trampoline));
>>>>> +
>>>>> +    /* Safety Check */
>>>>
>>>> Same for this comment.
>>>
>>> we have only 0x100 bytes that can be copied at 0x200. If the trampoline
>>> size exceeds then the next interrupt vector code is overwritten. Hence a
>>> safety check during compile time to make sure trampoline is within 0x100
>>> bytes.
>>
>> I think the check is fine, the comment is just redundant with
>> QEMU_BUILD_BUG_ON. Either be more verbose in the comment or remove it
> 
> I will add above lines as comment.

Awesome. You can also move the check to rtas load time, since the size
will be defined by the blob with your next version ;).
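
Roughly like this, as a sketch. MC_INTERRUPT_VECTOR_SIZE is the name
from the patch; its value of 0x100 is an assumption taken from the
explanation quoted above, and fwnmi_size_ok() is a made-up helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes available at the 0x200 vector (assumed value). */
#define MC_INTERRUPT_VECTOR_SIZE 0x100

/*
 * Validate the trampoline size declared by the blob, once, at load
 * time. The trampoline is later copied to the 0x200 vector; anything
 * larger than the vector would spill into the next interrupt vector
 * and overwrite its code, so such a blob must be rejected up front.
 */
static bool fwnmi_size_ok(size_t fwnmi_size)
{
    return fwnmi_size != 0 &&
           fwnmi_size % 4 == 0 &&   /* whole 4-byte instructions only */
           fwnmi_size <= MC_INTERRUPT_VECTOR_SIZE;
}
```

With the check done at load time, the ibm,nmi-register handler can
trust the cached size and does not need to recheck it per call.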


Alex

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm,nmi-register RTAS call
  2014-11-05  8:32   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
  2014-11-05 10:37     ` Aravinda Prasad
@ 2014-11-05 15:46     ` Tom Musta
  2014-11-06 10:00       ` Aravinda Prasad
  1 sibling, 1 reply; 66+ messages in thread
From: Tom Musta @ 2014-11-05 15:46 UTC (permalink / raw)
  To: Alexander Graf, Aravinda Prasad, aik, qemu-ppc, qemu-devel; +Cc: benh, paulus

On 11/5/2014 2:32 AM, Alexander Graf wrote:
> 
> 
> On 05.11.14 08:13, Aravinda Prasad wrote:
>> This patch adds FWNMI support in qemu for powerKVM
>> guests by handling the ibm,nmi-register rtas call.
>> Whenever OS issues ibm,nmi-register RTAS call, the
>> machine check notification address is saved and the
>> machine check interrupt vector 0x200 is patched to
>> issue a private hcall.
>>
>> This patch also handles the cases when multi-processors
>> experience machine check at or about the same time.
>> As per PAPR, subsequent processors serialize waiting
>> for the first processor to issue the ibm,nmi-interlock call.
>> The second processor retries if the first processor which
>> received a machine check is still reading the error log
>> and is yet to issue ibm,nmi-interlock call.
>>
>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>> ---
>>  hw/ppc/spapr_hcall.c            |   16 +++++++
>>  hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>>  include/hw/ppc/spapr.h          |   17 +++++++
>>  pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>>  4 files changed, 163 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index 8f16160..eceb5e5 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>>      struct rtas_error_log err_log;
>>  };
>>  
>> +/* Whether machine check handling is in progress by any CPU */
>> +bool mc_in_progress;
>> +
>>  static void do_spr_sync(void *arg)
>>  {
>>      struct SPRSyncState *s = arg;
>> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>      cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>>  
>>      /*
>> +     * Only one VCPU can process machine check NMI at a time. Hence
>> +     * set the lock mc_in_progress. Once the VCPU finishes processing
>> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
>> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
>> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
>> +     */
>> +    if (mc_in_progress == 1) {
> 
> Please don't depend on bools being numbers. Use true / false. For if()s,
> just don't use == at all - it makes it more readable.
> 
>> +        return 0;
>> +    }
>> +
>> +    mc_in_progress = 1;
>> +
>> +    /*
>>       * We save the original r3 register in SPRG2 in 0x200 vector,
>>       * which is patched during call to ibm.nmi-register. Original
>>       * r3 is required to be included in error log
>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>> index 2ec2a8e..71c7662 100644
>> --- a/hw/ppc/spapr_rtas.c
>> +++ b/hw/ppc/spapr_rtas.c
>> @@ -36,6 +36,9 @@
>>  
>>  #include <libfdt.h>
>>  
>> +#define BRANCH_INST_MASK  0xFC000000
>> +extern bool mc_in_progress;
> 
> Please put this into the spapr struct.
> 
>> +
>>  static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>                                     uint32_t token, uint32_t nargs,
>>                                     target_ulong args,
>> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>>      rtas_st(rets, 0, ret);
>>  }
>>  
>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>> +                                  sPAPREnvironment *spapr,
>> +                                  uint32_t token, uint32_t nargs,
>> +                                  target_ulong args,
>> +                                  uint32_t nret, target_ulong rets)
>> +{
>> +    int i;
>> +    uint32_t ori_inst = 0x60630000;
>> +    uint32_t branch_inst = 0x48000002;
>> +    target_ulong guest_machine_check_addr;
>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
> 
> ARRAY_SIZE(trampoline), though I don't quite understand why you need a
> variable that contains the same value as a constant (TRAMPOLINE_INSTS).
> 
> But since you're moving all of those bits into variable fields on the
> rtas blob itself as we discussed in the last version, I guess this code
> will go away anyways ;).
> 
>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
>> +
>> +    /* Store the system reset and machine check address */
>> +    guest_machine_check_addr = rtas_ld(args, 1);
> 
> Load or Store? I don't find the comment particularly useful either ;).
> 
>> +
>> +    /*
>> +     * Read the trampoline instructions from RTAS Blob and patch
>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>> +     * machine check address before copying to 0x200 vector
>> +     */
>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>> +                             trampoline, sizeof(trampoline));
>> +
>> +    /* Safety Check */
> 
> Same for this comment.
> 
>> +    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
>> +
>> +    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
>> +    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
>> +    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
>> +            sizeof(ori_inst));
> 
> Why memcpy a u32 into a u32 array?

Additionally, I don't see the need for the ori_inst *variable*; the instruction is known at compile time.
So why not just do:

  trampoline[TRAMPOLINE_ORI_INST_INDEX] = 0x60630000 | KVMPPC_H_REPORT_MC_ERR;

Likewise for the branch_inst variable.

Also see my comment in the trampoline code below.
> 
>> +
>> +    /*
>> +     * Sanity check guest_machine_check_addr to prevent clobbering
>> +     * operator value in branch instruction
>> +     */
>> +    if (guest_machine_check_addr & BRANCH_INST_MASK) {
>> +        fprintf(stderr, "Unable to register ibm,nmi_register: "
>> +                "Invalid machine check handler address\n");
> 
> In general, printf's in guest triggerable code aren't a great idea,
> since the guest could flood our host logs with this. I can't say we're
> doing a great job at it already though, so it probably doesn't matter much.
> 
>> +        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);

NIT:  Shouldn't this be RTAS_OUT_PARAM_ERROR?  That is what SPAPR says (both are implemented to be -3).

>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Update the branch instruction in trampoline
>> +     * with the absolute machine check address requested by OS.
>> +     */
>> +    branch_inst |= guest_machine_check_addr;
>> +    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
>> +            sizeof(branch_inst));
>> +
>> +    /* Handle all Host/Guest LE/BE combinations */
>> +    if ((*pcc->interrupts_big_endian)(cpu)) {
>> +        for (i = 0; i < total_inst; i++) {
>> +            trampoline[i] = cpu_to_be32(trampoline[i]);
>> +        }
>> +    } else {
>> +        for (i = 0; i < total_inst; i++) {
>> +            trampoline[i] = cpu_to_le32(trampoline[i]);
>> +        }
>> +    }
>> +
>> +    /* Patch 0x200 NMI interrupt vector memory area of guest */
>> +    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
>> +                              sizeof(trampoline));
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +}
>> +
>> +static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
>> +                                   sPAPREnvironment *spapr,
>> +                                   uint32_t token, uint32_t nargs,
>> +                                   target_ulong args,
>> +                                   uint32_t nret, target_ulong rets)
>> +{
>> +    /*
>> +     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
>> +     * hence unset mc_in_progress.
>> +     */
>> +    mc_in_progress = 0;
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +}
>> +
>>  static struct rtas_call {
>>      const char *name;
>>      spapr_rtas_fn fn;
>> @@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
>>                          rtas_ibm_set_system_parameter);
>>      spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>>                          rtas_ibm_os_term);
>> +    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
>> +                        "ibm,nmi-register",
>> +                        rtas_ibm_nmi_register);
>> +    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
>> +                        "ibm,nmi-interlock",
>> +                        rtas_ibm_nmi_interlock);
>>  }
>>  
>>  type_init(core_rtas_register_types)
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index a2d67e9..98d0a6c 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>>  #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>>  #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
>> +#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
>> +#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
>>  
>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
>>  
>>  /* RTAS ibm,get-system-parameter token values */
>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>> @@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>  #define RTAS_TRAMPOLINE_OFFSET   0x200
>>  #define RTAS_ERRLOG_OFFSET       0x800
>>  
>> +/* Machine Check Trampoline related macros
>> + *
>> + * These macros should co-relate to the code we
>> + * have in pc-bios/spapr-rtas/spapr-rtas.S
>> + */
>> +#define TRAMPOLINE_INSTS           17
>> +#define TRAMPOLINE_ORI_INST_INDEX  2
>> +#define TRAMPOLINE_BR_INST_INDEX   15
>> +
>> +/* Machine Check Interrupt related macros */
>> +#define MC_INTERRUPT_VECTOR           0x200
>> +#define MC_INTERRUPT_VECTOR_SIZE      0x100
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
>> index 903bec2..c315332 100644
>> --- a/pc-bios/spapr-rtas/spapr-rtas.S
>> +++ b/pc-bios/spapr-rtas/spapr-rtas.S
> 
> Please add #defines at the top of the file for the register names:
> 
>   #define r0 0
>   #define r1 1
>   ...
> 
> That way the code below will get much more readable :)
> 
> Also, you want a jump table here as we discussed in the last review
> round. That table would tell you
> 
>   a) Entry address for RTAS
>   b) Offset of the NMI code
>   c) To-be-patched offsets of the instructions inside the NMI code
> 
> Then we have all offsets automatically generated inside a single file
> and don't have to maintain fragile relationships between random headers
> with offset defines and the .S file.
> 
> 
> Alex
> 
>> @@ -35,3 +35,41 @@ _start:
>>  	ori	3,3,KVMPPC_H_RTAS@l
>>  	sc	1
>>  	blr
>> +	. = 0x200
>> +	/*
>> +	 * Trampoline saves r3 in sprg2 and issues private hcall
>> +	 * to request qemu to build error log. QEMU builds the
>> +	 * error log, copies to rtas-blob and returns the address.
>> +	 * The initial 16 bytes in return adress consist of saved
>> +	 * srr0 and srr1 which we restore and pass on the actual error
>> +	 * log address to OS handled mcachine check notification
>> +	 * routine
>> +	 *
>> +	 * All the below instructions are copied to interrupt vector
>> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
>> +	 */
>> +	mtsprg  2,3
>> +	li      3,0
>> +	/*
>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>> +	 * value is patched below
>> +	 */
>> +1:	ori     3,3,0

Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")

And, perhaps this was discussed in an earlier patch, but couldn't you just do:

	li 3,KVMPPC_H_REPORT_MC_ERR

here and avoid the patching altogether?



>> +	sc      1               /* Issue H_CALL */
>> +	cmpdi   cr0,3,0
>> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
>> +	mtsprg  2,4
>> +	ld      4,0(3)
>> +	mtsrr0  4               /* Restore srr0 */
>> +	ld      4,8(3)
>> +	mtsrr1  4               /* Restore srr1 */
>> +	ld      4,16(3)
>> +	mtcrf   0,4             /* Restore cr */
>> +	addi    3,3,24
>> +	mfsprg  4,2
>> +	/*
>> +	 * Branch to address registered by OS. The branch address is
>> +	 * patched in the ibm,nmi-register rtas call.
>> +	 */
>> +	ba      0x0
>> +	b       .
>>
>>
> 


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-05 15:46     ` Tom Musta
@ 2014-11-06 10:00       ` Aravinda Prasad
  2014-11-06 10:29         ` Alexander Graf
  2014-11-11  3:19         ` David Gibson
  0 siblings, 2 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-06 10:00 UTC (permalink / raw)
  To: Tom Musta; +Cc: benh, aik, Alexander Graf, qemu-devel, qemu-ppc, paulus



On Wednesday 05 November 2014 09:16 PM, Tom Musta wrote:
> On 11/5/2014 2:32 AM, Alexander Graf wrote:
>>
>>
>> On 05.11.14 08:13, Aravinda Prasad wrote:
>>> This patch adds FWNMI support in qemu for powerKVM
>>> guests by handling the ibm,nmi-register rtas call.
>>> Whenever OS issues ibm,nmi-register RTAS call, the
>>> machine check notification address is saved and the
>>> machine check interrupt vector 0x200 is patched to
>>> issue a private hcall.
>>>
>>> This patch also handles the cases when multi-processors
>>> experience machine check at or about the same time.
>>> As per PAPR, subsequent processors serialize waiting
>>> for the first processor to issue the ibm,nmi-interlock call.
>>> The second processor retries if the first processor which
>>> received a machine check is still reading the error log
>>> and is yet to issue ibm,nmi-interlock call.
>>>
>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>> ---
>>>  hw/ppc/spapr_hcall.c            |   16 +++++++
>>>  hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>>>  include/hw/ppc/spapr.h          |   17 +++++++
>>>  pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>>>  4 files changed, 163 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>> index 8f16160..eceb5e5 100644
>>> --- a/hw/ppc/spapr_hcall.c
>>> +++ b/hw/ppc/spapr_hcall.c
>>> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>>>      struct rtas_error_log err_log;
>>>  };
>>>  
>>> +/* Whether machine check handling is in progress by any CPU */
>>> +bool mc_in_progress;
>>> +
>>>  static void do_spr_sync(void *arg)
>>>  {
>>>      struct SPRSyncState *s = arg;
>>> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>      cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>>>  
>>>      /*
>>> +     * Only one VCPU can process machine check NMI at a time. Hence
>>> +     * set the lock mc_in_progress. Once the VCPU finishes processing
>>> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
>>> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
>>> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
>>> +     */
>>> +    if (mc_in_progress == 1) {
>>
>> Please don't depend on bools being numbers. Use true / false. For if()s,
>> just don't use == at all - it makes it more readable.
>>
>>> +        return 0;
>>> +    }
>>> +
>>> +    mc_in_progress = 1;
>>> +
>>> +    /*
>>>       * We save the original r3 register in SPRG2 in 0x200 vector,
>>>       * which is patched during call to ibm.nmi-register. Original
>>>       * r3 is required to be included in error log
>>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>>> index 2ec2a8e..71c7662 100644
>>> --- a/hw/ppc/spapr_rtas.c
>>> +++ b/hw/ppc/spapr_rtas.c
>>> @@ -36,6 +36,9 @@
>>>  
>>>  #include <libfdt.h>
>>>  
>>> +#define BRANCH_INST_MASK  0xFC000000
>>> +extern bool mc_in_progress;
>>
>> Please put this into the spapr struct.
>>
>>> +
>>>  static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>                                     uint32_t token, uint32_t nargs,
>>>                                     target_ulong args,
>>> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>>>      rtas_st(rets, 0, ret);
>>>  }
>>>  
>>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>>> +                                  sPAPREnvironment *spapr,
>>> +                                  uint32_t token, uint32_t nargs,
>>> +                                  target_ulong args,
>>> +                                  uint32_t nret, target_ulong rets)
>>> +{
>>> +    int i;
>>> +    uint32_t ori_inst = 0x60630000;
>>> +    uint32_t branch_inst = 0x48000002;
>>> +    target_ulong guest_machine_check_addr;
>>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
>>
>> ARRAY_SIZE(trampoline), though I don't quite understand why you need a
>> variable that contains the same value as a constant (TRAMPOLINE_INSTS).
>>
>> But since you're moving all of those bits into variable fields on the
>> rtas blob itself as we discussed in the last version, I guess this code
>> will go away anyways ;).
>>
>>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
>>> +
>>> +    /* Store the system reset and machine check address */
>>> +    guest_machine_check_addr = rtas_ld(args, 1);
>>
>> Load or Store? I don't find the comment particularly useful either ;).
>>
>>> +
>>> +    /*
>>> +     * Read the trampoline instructions from RTAS Blob and patch
>>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>>> +     * machine check address before copying to 0x200 vector
>>> +     */
>>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>>> +                             trampoline, sizeof(trampoline));
>>> +
>>> +    /* Safety Check */
>>
>> Same for this comment.
>>
>>> +    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
>>> +
>>> +    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
>>> +    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
>>> +    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
>>> +            sizeof(ori_inst));
>>
>> Why memcpy a u32 into a u32 array?
> 
> Additionally, I don't see the need for the ori_inst *variable* .... the instruction is known at compile time.
> So why not just do
> 
>   trampoline[TRAMPOLINE_ORI_INST_INDEX] = 0x60630000 | KVMPPC_H_REPORT_MC_ERR;

I can directly do

  trampoline[TRAMPOLINE_ORI_INST_INDEX] |= KVMPPC_H_REPORT_MC_ERR;

as trampoline[TRAMPOLINE_ORI_INST_INDEX] already contains 0x60630000.
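
An in-place OR depends on the word read back from the blob already
holding the ori template in the byte order the host expects. A hedged
sketch, assuming the RTAS blob stores its instructions big-endian; the
helpers load_be32(), store_be32() and patch_imm16_be() are hypothetical
and stand in for whatever accessors QEMU provides:

```c
#include <stdint.h>

/* Read one 32-bit big-endian word, independent of host endianness */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write one 32-bit big-endian word, independent of host endianness */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* OR a 16-bit immediate into instruction number 'index' of the blob */
static void patch_imm16_be(uint8_t *blob, unsigned index, uint16_t imm)
{
    uint8_t *p = blob + 4u * index;

    store_be32(p, load_be32(p) | imm);
}
```

Working on the raw bytes this way keeps the unpatched instructions
untouched, so no separate byte-swapping pass over the whole array is
needed in this sketch.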

> 
> Likewise for the branch_inst variable.
> 
> Also see my comment in the trampoline code below.
>>
>>> +
>>> +    /*
>>> +     * Sanity check guest_machine_check_addr to prevent clobbering
>>> +     * operator value in branch instruction
>>> +     */
>>> +    if (guest_machine_check_addr & BRANCH_INST_MASK) {
>>> +        fprintf(stderr, "Unable to register ibm,nmi_register: "
>>> +                "Invalid machine check handler address\n");
>>
>> In general, printf's in guest triggerable code aren't a great idea,
>> since the guest could flood our host logs with this. I can't say we're
>> doing a great job at it already though, so it probably doesn't matter much.
>>
>>> +        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
> 
> NIT:  Shouldn't this be RTAS_OUT_PARAM_ERR?  That is what SPAPR says (both are implemented to be -3).

Yes, SPAPR says -3 Parameter Error. I think RTAS_OUT_PARAM_ERR is the
better choice, to be consistent with SPAPR.
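
A sketch of the address check returning the parameter error. The status
values are local stand-ins taken from this discussion (-3 for the
parameter error), and nmi_handler_addr_status() and BA_TARGET_MASK are
hypothetical names. The mask is stricter than BRANCH_INST_MASK, which is
an assumption layered on the patch: it also rejects low bits that would
land in the AA/LK positions of the branch encoding, bits above 32, and
bit 25, because an absolute branch sign-extends its 26-bit target.

```c
#include <stdint.h>

#define RTAS_OUT_SUCCESS      0
#define RTAS_OUT_PARAM_ERR  (-3)

/*
 * Bits of a 'ba' instruction that can hold a word-aligned,
 * non-negative target: the LI field without its sign bit.
 */
#define BA_TARGET_MASK  0x01FFFFFCull

/* Return the RTAS status for a guest-supplied handler address */
static int nmi_handler_addr_status(uint64_t addr)
{
    if (addr & ~BA_TARGET_MASK) {
        return RTAS_OUT_PARAM_ERR;
    }
    return RTAS_OUT_SUCCESS;
}
```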

> 
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * Update the branch instruction in trampoline
>>> +     * with the absolute machine check address requested by OS.
>>> +     */
>>> +    branch_inst |= guest_machine_check_addr;
>>> +    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
>>> +            sizeof(branch_inst));
>>> +
>>> +    /* Handle all Host/Guest LE/BE combinations */
>>> +    if ((*pcc->interrupts_big_endian)(cpu)) {
>>> +        for (i = 0; i < total_inst; i++) {
>>> +            trampoline[i] = cpu_to_be32(trampoline[i]);
>>> +        }
>>> +    } else {
>>> +        for (i = 0; i < total_inst; i++) {
>>> +            trampoline[i] = cpu_to_le32(trampoline[i]);
>>> +        }
>>> +    }
>>> +
>>> +    /* Patch 0x200 NMI interrupt vector memory area of guest */
>>> +    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
>>> +                              sizeof(trampoline));
>>> +
>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>> +}
>>> +
>>> +static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
>>> +                                   sPAPREnvironment *spapr,
>>> +                                   uint32_t token, uint32_t nargs,
>>> +                                   target_ulong args,
>>> +                                   uint32_t nret, target_ulong rets)
>>> +{
>>> +    /*
>>> +     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
>>> +     * hence unset mc_in_progress.
>>> +     */
>>> +    mc_in_progress = 0;
>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>> +}
>>> +
>>>  static struct rtas_call {
>>>      const char *name;
>>>      spapr_rtas_fn fn;
>>> @@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
>>>                          rtas_ibm_set_system_parameter);
>>>      spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>>>                          rtas_ibm_os_term);
>>> +    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
>>> +                        "ibm,nmi-register",
>>> +                        rtas_ibm_nmi_register);
>>> +    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
>>> +                        "ibm,nmi-interlock",
>>> +                        rtas_ibm_nmi_interlock);
>>>  }
>>>  
>>>  type_init(core_rtas_register_types)
>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>> index a2d67e9..98d0a6c 100644
>>> --- a/include/hw/ppc/spapr.h
>>> +++ b/include/hw/ppc/spapr.h
>>> @@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>>  #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>>>  #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>>>  #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
>>> +#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
>>> +#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
>>>  
>>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
>>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
>>>  
>>>  /* RTAS ibm,get-system-parameter token values */
>>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>>> @@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>  #define RTAS_TRAMPOLINE_OFFSET   0x200
>>>  #define RTAS_ERRLOG_OFFSET       0x800
>>>  
>>> +/* Machine Check Trampoline related macros
>>> + *
>>> + * These macros should co-relate to the code we
>>> + * have in pc-bios/spapr-rtas/spapr-rtas.S
>>> + */
>>> +#define TRAMPOLINE_INSTS           17
>>> +#define TRAMPOLINE_ORI_INST_INDEX  2
>>> +#define TRAMPOLINE_BR_INST_INDEX   15
>>> +
>>> +/* Machine Check Interrupt related macros */
>>> +#define MC_INTERRUPT_VECTOR           0x200
>>> +#define MC_INTERRUPT_VECTOR_SIZE      0x100
>>> +
>>>  #endif /* !defined (__HW_SPAPR_H__) */
>>> diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
>>> index 903bec2..c315332 100644
>>> --- a/pc-bios/spapr-rtas/spapr-rtas.S
>>> +++ b/pc-bios/spapr-rtas/spapr-rtas.S
>>
>> Please add #defines at the top of the file for the register names:
>>
>>   #define r0 0
>>   #define r1 1
>>   ...
>>
>> That way the code below will get much more readable :)
>>
>> Also, you want a jump table here as we discussed in the last review
>> round. That table would tell you
>>
>>   a) Entry address for RTAS
>>   b) Offset of the NMI code
>>   c) To-be-patched offsets of the instructions inside the NMI code
>>
>> Then we have all offsets automatically generated inside a single file
>> and don't have to maintain fragile relationships between random headers
>> with offset defines and the .S file.
>>
>>
>> Alex
>>
>>> @@ -35,3 +35,41 @@ _start:
>>>  	ori	3,3,KVMPPC_H_RTAS@l
>>>  	sc	1
>>>  	blr
>>> +	. = 0x200
>>> +	/*
>>> +	 * Trampoline saves r3 in sprg2 and issues private hcall
>>> +	 * to request qemu to build error log. QEMU builds the
>>> +	 * error log, copies to rtas-blob and returns the address.
>>> +	 * The initial 16 bytes in return adress consist of saved
>>> +	 * srr0 and srr1 which we restore and pass on the actual error
>>> +	 * log address to OS handled mcachine check notification
>>> +	 * routine
>>> +	 *
>>> +	 * All the below instructions are copied to interrupt vector
>>> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
>>> +	 */
>>> +	mtsprg  2,3
>>> +	li      3,0
>>> +	/*
>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>>> +	 * value is patched below
>>> +	 */
>>> +1:	ori     3,3,0
> 
> Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")

I remember first trying li r3,X but running into some problem (I cannot
recall exactly what it was), maybe because I am not familiar with ppc
assembly.

I will fix this.
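
One possible explanation for the earlier problem, offered as a guess and
not a diagnosis: li 3,X is addi 3,0,X, and addi sign-extends its 16-bit
immediate, while ori zero-extends it. If the hcall number has bit 15
set, a plain li would not load the intended value. The C model below
illustrates the difference; 0xf002 is a placeholder, not the real
KVMPPC_H_REPORT_MC_ERR, and ppc_li()/ppc_ori() are hypothetical helpers.

```c
#include <stdint.h>

/* li rT,SI == addi rT,0,SI: the 16-bit immediate is sign-extended */
static uint64_t ppc_li(uint16_t si)
{
    uint64_t v = si;

    if (v & 0x8000u) {
        v |= 0xFFFFFFFFFFFF0000ull;
    }
    return v;
}

/* ori rA,rS,UI: the 16-bit immediate is zero-extended before the OR */
static uint64_t ppc_ori(uint64_t rs, uint16_t ui)
{
    return rs | (uint64_t)ui;
}
```

With these models, ppc_ori(ppc_li(0), 0xf002) yields 0xf002, while
ppc_li(0xf002) yields 0xfffffffffffff002. Under that assumption the
li/ori pair (or lis/ori) remains the safe form for such values.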

> 
> And, perhaps this was discussed in an earlier patch, but couldn't you just do:
> 
> 	li 3,KVMPPC_H_REPORT_MC_ERR
> 
> here and avoid the patching altogether?

The KVMPPC_H_REPORT_MC_ERR definition is not visible in spapr-rtas.S. I
can either define it in spapr-rtas.S, as is already done for
KVMPPC_H_RTAS, or patch it in the ibm,nmi-register call.

It is very unlikely that KVMPPC_H_REPORT_MC_ERR will change, but I
prefer to patch it to avoid maintaining it in two places. What do you
think?

> 
> 
> 
>>> +	sc      1               /* Issue H_CALL */
>>> +	cmpdi   cr0,3,0
>>> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
>>> +	mtsprg  2,4
>>> +	ld      4,0(3)
>>> +	mtsrr0  4               /* Restore srr0 */
>>> +	ld      4,8(3)
>>> +	mtsrr1  4               /* Restore srr1 */
>>> +	ld      4,16(3)
>>> +	mtcrf   0,4             /* Restore cr */
>>> +	addi    3,3,24
>>> +	mfsprg  4,2
>>> +	/*
>>> +	 * Branch to address registered by OS. The branch address is
>>> +	 * patched in the ibm,nmi-register rtas call.
>>> +	 */
>>> +	ba      0x0
>>> +	b       .
>>>
>>>
>>
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-06 10:00       ` Aravinda Prasad
@ 2014-11-06 10:29         ` Alexander Graf
  2014-11-06 10:36           ` Aravinda Prasad
  2014-11-11  3:19         ` David Gibson
  1 sibling, 1 reply; 66+ messages in thread
From: Alexander Graf @ 2014-11-06 10:29 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: Tom Musta, benh, aik, qemu-devel, qemu-ppc, paulus




> On 06.11.2014 at 11:00, Aravinda Prasad <aravinda@linux.vnet.ibm.com> wrote:
> 
> 
> 
>> On Wednesday 05 November 2014 09:16 PM, Tom Musta wrote:
>>> On 11/5/2014 2:32 AM, Alexander Graf wrote:
>>> 
>>> 
>>>> On 05.11.14 08:13, Aravinda Prasad wrote:
>>>> This patch adds FWNMI support in qemu for powerKVM
>>>> guests by handling the ibm,nmi-register rtas call.
>>>> Whenever OS issues ibm,nmi-register RTAS call, the
>>>> machine check notification address is saved and the
>>>> machine check interrupt vector 0x200 is patched to
>>>> issue a private hcall.
>>>> 
>>>> This patch also handles the cases when multi-processors
>>>> experience machine check at or about the same time.
>>>> As per PAPR, subsequent processors serialize waiting
>>>> for the first processor to issue the ibm,nmi-interlock call.
>>>> The second processor retries if the first processor which
>>>> received a machine check is still reading the error log
>>>> and is yet to issue ibm,nmi-interlock call.
>>>> 
>>>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>>>> ---
>>>> hw/ppc/spapr_hcall.c            |   16 +++++++
>>>> hw/ppc/spapr_rtas.c             |   93 +++++++++++++++++++++++++++++++++++++++
>>>> include/hw/ppc/spapr.h          |   17 +++++++
>>>> pc-bios/spapr-rtas/spapr-rtas.S |   38 ++++++++++++++++
>>>> 4 files changed, 163 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>> index 8f16160..eceb5e5 100644
>>>> --- a/hw/ppc/spapr_hcall.c
>>>> +++ b/hw/ppc/spapr_hcall.c
>>>> @@ -97,6 +97,9 @@ struct rtas_mc_log {
>>>>     struct rtas_error_log err_log;
>>>> };
>>>> 
>>>> +/* Whether machine check handling is in progress by any CPU */
>>>> +bool mc_in_progress;
>>>> +
>>>> static void do_spr_sync(void *arg)
>>>> {
>>>>     struct SPRSyncState *s = arg;
>>>> @@ -678,6 +681,19 @@ static target_ulong h_report_mc_err(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>>     cpu_synchronize_state(CPU(ppc_env_get_cpu(env)));
>>>> 
>>>>     /*
>>>> +     * Only one VCPU can process machine check NMI at a time. Hence
>>>> +     * set the lock mc_in_progress. Once the VCPU finishes processing
>>>> +     * NMI, it executes ibm,nmi-interlock and mc_in_progress is unset
>>>> +     * in ibm,nmi-interlock handler. Meanwhile if other VCPUs encounter
>>>> +     * NMI we return 0 asking the VCPU to retry h_report_mc_err
>>>> +     */
>>>> +    if (mc_in_progress == 1) {
>>> 
>>> Please don't depend on bools being numbers. Use true / false. For if()s,
>>> just don't use == at all - it makes it more readable.
>>> 
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    mc_in_progress = 1;
>>>> +
>>>> +    /*
>>>>      * We save the original r3 register in SPRG2 in 0x200 vector,
>>>>      * which is patched during call to ibm.nmi-register. Original
>>>>      * r3 is required to be included in error log
>>>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>>>> index 2ec2a8e..71c7662 100644
>>>> --- a/hw/ppc/spapr_rtas.c
>>>> +++ b/hw/ppc/spapr_rtas.c
>>>> @@ -36,6 +36,9 @@
>>>> 
>>>> #include <libfdt.h>
>>>> 
>>>> +#define BRANCH_INST_MASK  0xFC000000
>>>> +extern bool mc_in_progress;
>>> 
>>> Please put this into the spapr struct.
>>> 
>>>> +
>>>> static void rtas_display_character(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>>                                    uint32_t token, uint32_t nargs,
>>>>                                    target_ulong args,
>>>> @@ -290,6 +293,90 @@ static void rtas_ibm_os_term(PowerPCCPU *cpu,
>>>>     rtas_st(rets, 0, ret);
>>>> }
>>>> 
>>>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>>>> +                                  sPAPREnvironment *spapr,
>>>> +                                  uint32_t token, uint32_t nargs,
>>>> +                                  target_ulong args,
>>>> +                                  uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    int i;
>>>> +    uint32_t ori_inst = 0x60630000;
>>>> +    uint32_t branch_inst = 0x48000002;
>>>> +    target_ulong guest_machine_check_addr;
>>>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>>>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
>>> 
>>> ARRAY_SIZE(trampoline), though I don't quite understand why you need a
>>> variable that contains the same value as a constant (TRAMPOLINE_INSTS).
>>> 
>>> But since you're moving all of those bits into variable fields on the
>>> rtas blob itself as we discussed in the last version, I guess this code
>>> will go away anyways ;).
>>> 
>>>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
>>>> +
>>>> +    /* Store the system reset and machine check address */
>>>> +    guest_machine_check_addr = rtas_ld(args, 1);
>>> 
>>> Load or Store? I don't find the comment particularly useful either ;).
>>> 
>>>> +
>>>> +    /*
>>>> +     * Read the trampoline instructions from RTAS Blob and patch
>>>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>>>> +     * machine check address before copying to 0x200 vector
>>>> +     */
>>>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>>>> +                             trampoline, sizeof(trampoline));
>>>> +
>>>> +    /* Safety Check */
>>> 
>>> Same for this comment.
>>> 
>>>> +    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
>>>> +
>>>> +    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
>>>> +    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
>>>> +    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
>>>> +            sizeof(ori_inst));
>>> 
>>> Why memcpy a u32 into a u32 array?
>> 
>> Additionally, I don't see the need for the ori_inst *variable* .... the instruction is known at compile time.
>> So why not just do
>> 
>>  trampoline[TRAMPOLINE_ORI_INST_INDEX] = 0x60630000 | KVMPPC_H_REPORT_MC_ERR;
> 
> I can directly do trampoline[TRAMPOLINE_ORI_INST_INDEX] |=
> KVMPPC_H_REPORT_MC_ERR;
> 
> as trampoline[TRAMPOLINE_ORI_INST_INDEX] already contains 0x60630000
> 
>> 
>> Likewise for the branch_inst variable.
>> 
>> Also see my comment in the trampoline code below.
>>> 
>>>> +
>>>> +    /*
>>>> +     * Sanity check guest_machine_check_addr to prevent clobbering
>>>> +     * operator value in branch instruction
>>>> +     */
>>>> +    if (guest_machine_check_addr & BRANCH_INST_MASK) {
>>>> +        fprintf(stderr, "Unable to register ibm,nmi_register: "
>>>> +                "Invalid machine check handler address\n");
>>> 
>>> In general, printf's in guest triggerable code aren't a great idea,
>>> since the guest could flood our host logs with this. I can't say we're
>>> doing a great job at it already though, so it probably doesn't matter much.
>>> 
>>>> +        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
>> 
>> NIT:  Shouldn't this be RTAS_OUT_PARAM_ERR?  That is what SPAPR says (both are implemented to be -3).
> 
> Yes, SPAPR says -3 Parameter Error. I think RTAS_OUT_PARAM_ERR is the
> better choice, to be consistent with SPAPR.
> 
>> 
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Update the branch instruction in trampoline
>>>> +     * with the absolute machine check address requested by OS.
>>>> +     */
>>>> +    branch_inst |= guest_machine_check_addr;
>>>> +    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
>>>> +            sizeof(branch_inst));
>>>> +
>>>> +    /* Handle all Host/Guest LE/BE combinations */
>>>> +    if ((*pcc->interrupts_big_endian)(cpu)) {
>>>> +        for (i = 0; i < total_inst; i++) {
>>>> +            trampoline[i] = cpu_to_be32(trampoline[i]);
>>>> +        }
>>>> +    } else {
>>>> +        for (i = 0; i < total_inst; i++) {
>>>> +            trampoline[i] = cpu_to_le32(trampoline[i]);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Patch 0x200 NMI interrupt vector memory area of guest */
>>>> +    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
>>>> +                              sizeof(trampoline));
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +}
>>>> +
>>>> +static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
>>>> +                                   sPAPREnvironment *spapr,
>>>> +                                   uint32_t token, uint32_t nargs,
>>>> +                                   target_ulong args,
>>>> +                                   uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    /*
>>>> +     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
>>>> +     * hence unset mc_in_progress.
>>>> +     */
>>>> +    mc_in_progress = 0;
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +}
>>>> +
>>>> static struct rtas_call {
>>>>     const char *name;
>>>>     spapr_rtas_fn fn;
>>>> @@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
>>>>                         rtas_ibm_set_system_parameter);
>>>>     spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>>>>                         rtas_ibm_os_term);
>>>> +    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
>>>> +                        "ibm,nmi-register",
>>>> +                        rtas_ibm_nmi_register);
>>>> +    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
>>>> +                        "ibm,nmi-interlock",
>>>> +                        rtas_ibm_nmi_interlock);
>>>> }
>>>> 
>>>> type_init(core_rtas_register_types)
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index a2d67e9..98d0a6c 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>>> #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>>>> #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>>>> #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
>>>> +#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
>>>> +#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
>>>> 
>>>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
>>>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
>>>> 
>>>> /* RTAS ibm,get-system-parameter token values */
>>>> #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>>>> @@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>> #define RTAS_TRAMPOLINE_OFFSET   0x200
>>>> #define RTAS_ERRLOG_OFFSET       0x800
>>>> 
>>>> +/* Machine Check Trampoline related macros
>>>> + *
>>>> + * These macros should co-relate to the code we
>>>> + * have in pc-bios/spapr-rtas/spapr-rtas.S
>>>> + */
>>>> +#define TRAMPOLINE_INSTS           17
>>>> +#define TRAMPOLINE_ORI_INST_INDEX  2
>>>> +#define TRAMPOLINE_BR_INST_INDEX   15
>>>> +
>>>> +/* Machine Check Interrupt related macros */
>>>> +#define MC_INTERRUPT_VECTOR           0x200
>>>> +#define MC_INTERRUPT_VECTOR_SIZE      0x100
>>>> +
>>>> #endif /* !defined (__HW_SPAPR_H__) */
>>>> diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
>>>> index 903bec2..c315332 100644
>>>> --- a/pc-bios/spapr-rtas/spapr-rtas.S
>>>> +++ b/pc-bios/spapr-rtas/spapr-rtas.S
>>> 
>>> Please add #defines at the top of the file for the register names:
>>> 
>>>  #define r0 0
>>>  #define r1 1
>>>  ...
>>> 
>>> That way the code below will get much more readable :)
>>> 
>>> Also, you want a jump table here as we discussed in the last review
>>> round. That table would tell you
>>> 
>>>  a) Entry address for RTAS
>>>  b) Offset of the NMI code
>>>  c) To-be-patched offsets of the instructions inside the NMI code
>>> 
>>> Then we have all offsets automatically generated inside a single file
>>> and don't have to maintain fragile relationships between random headers
>>> with offset defines and the .S file.
>>> 
>>> 
>>> Alex
>>> 
>>>> @@ -35,3 +35,41 @@ _start:
>>>>    ori    3,3,KVMPPC_H_RTAS@l
>>>>    sc    1
>>>>    blr
>>>> +    . = 0x200
>>>> +    /*
>>>> +     * Trampoline saves r3 in sprg2 and issues private hcall
>>>> +     * to request qemu to build error log. QEMU builds the
>>>> +     * error log, copies to rtas-blob and returns the address.
>>>> +     * The initial 16 bytes in return address consist of saved
>>>> +     * srr0 and srr1 which we restore and pass on the actual error
>>>> +     * log address to OS handled machine check notification
>>>> +     * routine
>>>> +     *
>>>> +     * All the below instructions are copied to interrupt vector
>>>> +     * 0x200 at the time of handling ibm,nmi-register rtas call.
>>>> +     */
>>>> +    mtsprg  2,3
>>>> +    li      3,0
>>>> +    /*
>>>> +     * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>>>> +     * value is patched below
>>>> +     */
>>>> +1:    ori     3,3,0
>> 
>> Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")
> 
> I remember I first tried doing li r3,X but faced some problem (but not
> able to exactly recall what was the problem) may be due to not familiar
> with ppc assembly.
> 
> I will fix this.
> 
>> 
>> And, perhaps this was discussed in an earlier patch, but couldn't you just do:
>> 
>>    li 3,KVMPPC_H_REPORT_MC_ERR
>> 
>> here and avoid the patching altogether?
> 
> The KVMPPC_H_REPORT_MC_ERR definition is not visible in spapr-rtas.S; either I can
> define it in spapr-rtas.S as already done for KVMPPC_H_RTAS or patch it
> in ibm,nmi-register call.

Could you include the header?

> 
> It is very unlikely that the KVMPPC_H_REPORT_MC_ERR will be changed, but
> I prefer to patch it to avoid maintaining it in both places. What do you
> think?

Hypercall numbers need to be stable anyway in case we migrate from an older qemu version, so they must not change.


Alex

> 
>> 
>> 
>> 
>>>> +    sc      1               /* Issue H_CALL */
>>>> +    cmpdi   cr0,3,0
>>>> +    beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
>>>> +    mtsprg  2,4
>>>> +    ld      4,0(3)
>>>> +    mtsrr0  4               /* Restore srr0 */
>>>> +    ld      4,8(3)
>>>> +    mtsrr1  4               /* Restore srr1 */
>>>> +    ld      4,16(3)
>>>> +    mtcrf   0,4             /* Restore cr */
>>>> +    addi    3,3,24
>>>> +    mfsprg  4,2
>>>> +    /*
>>>> +     * Branch to address registered by OS. The branch address is
>>>> +     * patched in the ibm,nmi-register rtas call.
>>>> +     */
>>>> +    ba      0x0
>>>> +    b       .
> 
> -- 
> Regards,
> Aravinda
> 


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-06 10:29         ` Alexander Graf
@ 2014-11-06 10:36           ` Aravinda Prasad
  0 siblings, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-06 10:36 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Tom Musta, benh, aik, qemu-devel, qemu-ppc, paulus



On Thursday 06 November 2014 03:59 PM, Alexander Graf wrote:
> 
> 
> 
>> Am 06.11.2014 um 11:00 schrieb Aravinda Prasad <aravinda@linux.vnet.ibm.com>:
>>
>>
>>

[...]

>>>
>>> And, perhaps this was discussed in an earlier patch, but couldn't you just do:
>>>
>>>    li 3,KVMPPC_H_REPORT_MC_ERR
>>>
>>> here and avoid the patching altogether?
>>
>> The KVMPPC_H_REPORT_MC_ERR definition is not visible in spapr-rtas.S; either I can
>> define it in spapr-rtas.S as already done for KVMPPC_H_RTAS or patch it
>> in ibm,nmi-register call.
> 
> Could you include the header?

hmm. ok.

> 
>>
>> It is very unlikely that the KVMPPC_H_REPORT_MC_ERR will be changed, but
>> I prefer to patch it to avoid maintaining it in both places. What do you
>> think?
> 
> Hypercall numbers need to be stable anyway in case we migrate from an older qemu version, so it must not change.

ok

> 
> 
> Alex
> 
>>
>>>
>>>
>>>
>>>>> +    sc      1               /* Issue H_CALL */
>>>>> +    cmpdi   cr0,3,0
>>>>> +    beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
>>>>> +    mtsprg  2,4
>>>>> +    ld      4,0(3)
>>>>> +    mtsrr0  4               /* Restore srr0 */
>>>>> +    ld      4,8(3)
>>>>> +    mtsrr1  4               /* Restore srr1 */
>>>>> +    ld      4,16(3)
>>>>> +    mtcrf   0,4             /* Restore cr */
>>>>> +    addi    3,3,24
>>>>> +    mfsprg  4,2
>>>>> +    /*
>>>>> +     * Branch to address registered by OS. The branch address is
>>>>> +     * patched in the ibm,nmi-register rtas call.
>>>>> +     */
>>>>> +    ba      0x0
>>>>> +    b       .
>>
>> -- 
>> Regards,
>> Aravinda
>>
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-05  7:13 ` [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call Aravinda Prasad
  2014-11-05  8:32   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
@ 2014-11-11  3:16   ` David Gibson
  2014-11-11  6:44     ` Aravinda Prasad
  1 sibling, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-11  3:16 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus


On Wed, Nov 05, 2014 at 12:43:15PM +0530, Aravinda Prasad wrote:
> This patch adds FWNMI support in qemu for powerKVM
> guests by handling the ibm,nmi-register rtas call.
> Whenever OS issues ibm,nmi-register RTAS call, the
> machine check notification address is saved and the
> machine check interrupt vector 0x200 is patched to
> issue a private hcall.
> 
> This patch also handles the cases when multi-processors
> experience machine check at or about the same time.
> As per PAPR, subsequent processors serialize waiting
> for the first processor to issue the ibm,nmi-interlock call.
> The second processor retries if the first processor which
> received a machine check is still reading the error log
> and is yet to issue ibm,nmi-interlock call.
> 
> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>

[snip]
> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
> +                                  sPAPREnvironment *spapr,
> +                                  uint32_t token, uint32_t nargs,
> +                                  target_ulong args,
> +                                  uint32_t nret, target_ulong rets)
> +{
> +    int i;
> +    uint32_t ori_inst = 0x60630000;
> +    uint32_t branch_inst = 0x48000002;
> +    target_ulong guest_machine_check_addr;
> +    uint32_t trampoline[TRAMPOLINE_INSTS];
> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);

You should sanity check the RTAS arguments before doing anything - in
particular verify that nargs and nrets have the expected values.
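
Roughly what I have in mind, as a standalone sketch rather than real
QEMU code. The helper names and status values below are made up for
illustration, and the expected counts (two inputs, one status output
for ibm,nmi-register) are an assumption, chosen to match the patch
reading args[1].

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative status codes; the values are assumptions for this sketch. */
enum {
    SKETCH_RTAS_OUT_SUCCESS     = 0,
    SKETCH_RTAS_OUT_PARAM_ERROR = -3,
};

/*
 * Hypothetical helper: compare the argument/return counts the guest
 * passed against the counts the call expects.
 */
int sketch_rtas_check_counts(uint32_t nargs, uint32_t nret,
                             uint32_t want_nargs, uint32_t want_nret)
{
    if (nargs != want_nargs || nret != want_nret) {
        return SKETCH_RTAS_OUT_PARAM_ERROR;
    }
    return SKETCH_RTAS_OUT_SUCCESS;
}

/* Assumed shape of ibm,nmi-register: two inputs, one status output. */
int sketch_nmi_register_check(uint32_t nargs, uint32_t nret)
{
    return sketch_rtas_check_counts(nargs, nret, 2, 1);
}
```

The point is only that the check runs first, before any guest memory
is read or written, and that a mismatch returns an error status
instead of continuing.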

> +
> +    /* Store the system reset and machine check address */
> +    guest_machine_check_addr = rtas_ld(args, 1);
> +
> +    /*
> +     * Read the trampoline instructions from RTAS Blob and patch
> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
> +     * machine check address before copying to 0x200 vector
> +     */
> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
> +                             trampoline, sizeof(trampoline));
> +
> +    /* Safety Check */
> +    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
> +
> +    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
> +    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
> +    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
> +            sizeof(ori_inst));

Given that we already code the KVMPPC_H_RTAS value directly into the
.S file, I don't think it's worth the trouble of patching the
H_REPORT_MC_ERR value.  As Alex says, it has to stay the same for
migration anyway.

> +    /*
> +     * Sanity check guest_machine_check_addr to prevent clobbering
> +     * operator value in branch instruction
> +     */
> +    if (guest_machine_check_addr & BRANCH_INST_MASK) {
> +        fprintf(stderr, "Unable to register ibm,nmi_register: "
> +                "Invalid machine check handler address\n");
> +        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
> +        return;
> +    }
> +
> +    /*
> +     * Update the branch instruction in trampoline
> +     * with the absolute machine check address requested by OS.
> +     */
> +    branch_inst |= guest_machine_check_addr;
> +    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
> +            sizeof(branch_inst));
> +
> +    /* Handle all Host/Guest LE/BE combinations */
> +    if ((*pcc->interrupts_big_endian)(cpu)) {
> +        for (i = 0; i < total_inst; i++) {
> +            trampoline[i] = cpu_to_be32(trampoline[i]);
> +        }
> +    } else {
> +        for (i = 0; i < total_inst; i++) {
> +            trampoline[i] = cpu_to_le32(trampoline[i]);
> +        }
> +    }
> +
> +    /* Patch 0x200 NMI interrupt vector memory area of guest */
> +    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
> +                              sizeof(trampoline));
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +}
> +
> +static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
> +                                   sPAPREnvironment *spapr,
> +                                   uint32_t token, uint32_t nargs,
> +                                   target_ulong args,
> +                                   uint32_t nret, target_ulong rets)
> +{

Again you should sanity check the arguments - at least check nargs and
nrets.

> +    /*
> +     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
> +     * hence unset mc_in_progress.
> +     */
> +    mc_in_progress = 0;
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +}
> +
>  static struct rtas_call {
>      const char *name;
>      spapr_rtas_fn fn;
> @@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
>                          rtas_ibm_set_system_parameter);
>      spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>                          rtas_ibm_os_term);
> +    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
> +                        "ibm,nmi-register",
> +                        rtas_ibm_nmi_register);
> +    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
> +                        "ibm,nmi-interlock",
> +                        rtas_ibm_nmi_interlock);
>  }
>  
>  type_init(core_rtas_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index a2d67e9..98d0a6c 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>  #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>  #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
> +#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
> +#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> @@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>  #define RTAS_TRAMPOLINE_OFFSET   0x200
>  #define RTAS_ERRLOG_OFFSET       0x800
>  
> +/* Machine Check Trampoline related macros
> + *
> + * These macros should co-relate to the code we
> + * have in pc-bios/spapr-rtas/spapr-rtas.S
> + */
> +#define TRAMPOLINE_INSTS           17
> +#define TRAMPOLINE_ORI_INST_INDEX  2
> +#define TRAMPOLINE_BR_INST_INDEX   15
> +
> +/* Machine Check Interrupt related macros */
> +#define MC_INTERRUPT_VECTOR           0x200
> +#define MC_INTERRUPT_VECTOR_SIZE      0x100
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
> index 903bec2..c315332 100644
> --- a/pc-bios/spapr-rtas/spapr-rtas.S
> +++ b/pc-bios/spapr-rtas/spapr-rtas.S
> @@ -35,3 +35,41 @@ _start:
>  	ori	3,3,KVMPPC_H_RTAS@l
>  	sc	1
>  	blr
> +	. = 0x200
> +	/*
> +	 * Trampoline saves r3 in sprg2 and issues private hcall
> +	 * to request qemu to build error log. QEMU builds the
> +	 * error log, copies to rtas-blob and returns the address.
> +	 * The initial 16 bytes in return address consist of saved
> +	 * srr0 and srr1 which we restore and pass on the actual error
> +	 * log address to OS handled machine check notification
> +	 * routine
> +	 *
> +	 * All the below instructions are copied to interrupt vector
> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
> +	 */
> +	mtsprg  2,3
> +	li      3,0
> +	/*
> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
> +	 * value is patched below
> +	 */
> +1:	ori     3,3,0
> +	sc      1               /* Issue H_CALL */
> +	cmpdi   cr0,3,0
> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */

Having to retry the hcall from here seems very awkward.  This is a
private hcall, so you can define it to do whatever retries are
necessary internally (and I don't think your current implementation
can fail anyway).

> +	mtsprg  2,4

Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above?

> +	ld      4,0(3)
> +	mtsrr0  4               /* Restore srr0 */
> +	ld      4,8(3)
> +	mtsrr1  4               /* Restore srr1 */
> +	ld      4,16(3)
> +	mtcrf   0,4             /* Restore cr */

mtcrf?  aren't you restoring the whole CR?
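
For reference, here is how I understand the field mask to work, as a
small C model (the function name is made up, this is not QEMU code).
FXM is an 8-bit mask with one bit per 4-bit CR field, and the most
significant bit selects cr0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of "mtcrf FXM,RS": for each bit i set in FXM (bit 0 being the
 * most significant of the 8 mask bits), CR field i is replaced by the
 * corresponding 4 bits of the low word of RS. Other fields are kept.
 */
uint32_t sketch_mtcrf(uint32_t cr, uint8_t fxm, uint32_t rs_low)
{
    uint32_t mask = 0;

    for (int i = 0; i < 8; i++) {
        if (fxm & (0x80u >> i)) {
            mask |= 0xf0000000u >> (4 * i);
        }
    }
    return (cr & ~mask) | (rs_low & mask);
}
```

Under this model a mask of 0 selects no field at all, 0x80 selects
only cr0 and 0xff replaces the whole CR, so it is worth double
checking which of those the trampoline is meant to use.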

> +	addi    3,3,24
> +	mfsprg  4,2
> +	/*
> +	 * Branch to address registered by OS. The branch address is
> +	 * patched in the ibm,nmi-register rtas call.
> +	 */
> +	ba      0x0
> +	b       .

The branch to self is pointless.  Even if the instruction above is
not patched, or patched incorrectly, it's a ba, so you're not likely
to end up at the instruction underneath.

Actually, what would probably make more sense would be to just have a
"b ." *instead* of the ba, and have the qemu patching replace it with
the correct ba instruction.  That will limit the damage if it somehow
gets executed without being patched.
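
To make that concrete, a rough sketch of the qemu side (the names and
the range limit are illustrative, not taken from the patch). The blob
would ship 0x48000000, i.e. "b .", and registration would overwrite it
with a ba built like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_B_SELF   0x48000000u /* "b ."  : relative branch, offset 0 */
#define SKETCH_BA_BASE  0x48000002u /* "ba 0" : opcode 18, AA=1, LK=0 */
#define SKETCH_BA_LIMIT 0x02000000u /* from here up LI sign-extends negative */

/*
 * Build a "ba target" instruction. Fails if the target is not word
 * aligned or does not fit the positive range of the 24-bit LI field,
 * in which case *insn is left untouched (still "b ." if the caller
 * initialised it that way).
 */
bool sketch_encode_ba(uint64_t target, uint32_t *insn)
{
    if ((target & 0x3) != 0 || target >= SKETCH_BA_LIMIT) {
        return false;
    }
    *insn = SKETCH_BA_BASE | (uint32_t)target;
    return true;
}
```

If the encode step fails, the slot keeps the harmless branch to self
and the RTAS call can return an error to the guest.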

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-06 10:00       ` Aravinda Prasad
  2014-11-06 10:29         ` Alexander Graf
@ 2014-11-11  3:19         ` David Gibson
  2014-11-11  5:48           ` Aravinda Prasad
  1 sibling, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-11  3:19 UTC (permalink / raw)
  To: Aravinda Prasad
  Cc: Tom Musta, benh, aik, Alexander Graf, qemu-devel, qemu-ppc, paulus


On Thu, Nov 06, 2014 at 03:30:01PM +0530, Aravinda Prasad wrote:
> On Wednesday 05 November 2014 09:16 PM, Tom Musta wrote:
> > On 11/5/2014 2:32 AM, Alexander Graf wrote:
> >> On 05.11.14 08:13, Aravinda Prasad wrote:

[snip]
> >>> +	/*
> >>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
> >>> +	 * value is patched below
> >>> +	 */
> >>> +1:	ori     3,3,0
> > 
> > Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")
> 
> I remember I first tried doing li r3,X but faced some problem (but not
> able to exactly recall what was the problem) may be due to not familiar
> with ppc assembly.

This would be because with the offset to the private hcalls, the
actual hcall number is 0xf003, which means an li instruction will sign
extend it incorrectly. So you will need two instructions to load the
number.
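
A quick C model of the difference, in case it helps (the function
names are made up for illustration). li is addi with a sign-extended
16-bit immediate, whereas ori zero-extends its immediate:

```c
#include <assert.h>
#include <stdint.h>

/* "li rT,imm" == "addi rT,0,imm": the 16-bit field is sign-extended. */
uint64_t sketch_li(uint16_t imm_field)
{
    uint64_t v = imm_field;

    if (v & 0x8000) {
        v |= 0xffffffffffff0000ull;
    }
    return v;
}

/* "ori rA,rS,imm": the 16-bit field is zero-extended, then ORed in. */
uint64_t sketch_ori(uint64_t rs, uint16_t imm_field)
{
    return rs | imm_field;
}

/* The two-instruction sequence from the trampoline: li 3,0 ; ori 3,3,imm */
uint64_t sketch_li_ori(uint16_t imm_field)
{
    return sketch_ori(sketch_li(0), imm_field);
}
```

With the immediate field set to 0xf003, the single li leaves
0xfffffffffffff003 in the register, while the li/ori pair leaves the
intended 0xf003.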

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-05  7:12 [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests Aravinda Prasad
                   ` (3 preceding siblings ...)
  2014-11-05  7:13 ` [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call Aravinda Prasad
@ 2014-11-11  3:24 ` David Gibson
  2014-11-11  7:15   ` Aravinda Prasad
  2014-11-19  5:48   ` Aravinda Prasad
  4 siblings, 2 replies; 66+ messages in thread
From: David Gibson @ 2014-11-11  3:24 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus


On Wed, Nov 05, 2014 at 12:42:03PM +0530, Aravinda Prasad wrote:
> This series of patches add support for fwnmi in powerKVM guests.
> 
> Currently upon machine check exception, if the address in
> error belongs to guest then KVM invokes guest's NMI interrupt
> vector 0x200.
> 
> This patch series adds functionality where the guest's 0x200
> interrupt vector is patched such that QEMU gets control. QEMU
> then builds error log and reports the error to OS registered
> machine check handlers through RTAS space.
> 
> Apart from this, the patch series also takes care of synchronization
> when multiple processors encounter machine check at or about the
> same time.
> 
> The patch set was tested by simulating a machine check error in
> the guest.
> 
> Changes in v3:
>     - Incorporated review comments
>     - Byte codes in patch 4/4 are now moved to
>       pc-bios/spapr-rtas/spapr-rtas.S as instructions.
>     - Defined the RTAS blob in-memory layout.
>     - FIX: save and restore cr register in the trampoline
> 
> Changes in v2:
>     - Re-based to github.com/agraf/qemu.git  branch: ppc-next
>     - Merged patches 4 and 5.
>     - Incorporated other review comments

So, this may not still be possible depending on whether the KVM side
of this is already merged, but it occurs to me that there's a simpler
way.

Rather than mucking about with having to update the hypervisor on the
RTAS location, then have qemu copy the code out of RTAS, patch it and
copy it back into the vector, you could instead do this:

  1. Make KVM, instead of immediately delivering a 0x200 for a guest
machine check, cause a special exit to qemu.

  2. Have the ibm,nmi-register RTAS call store the guest side MC handler
address in the spapr structure, but perform no actual guest code
patching.

  3. Allocate the error log buffer independently from the RTAS blob,
so qemu always knows where it is.

  4. When qemu gets the MC exit condition, instead of going via a
patched 0x200 vector, just directly set the guest register state and
jump straight into the guest side MC handler.
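
Very roughly, steps 2 to 4 could look like the following. This is a
standalone sketch with made-up structure and field names, not the real
spapr or CPU state, and exactly which registers need setting up
(srr0/srr1 here) is an assumption that would have to be checked
against PAPR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down view of one vcpu's register state. */
typedef struct {
    uint64_t nip, msr;
    uint64_t srr0, srr1;
    uint64_t gpr[32];
} SketchVcpu;

/* Hypothetical machine-wide state filled in by ibm,nmi-register. */
typedef struct {
    bool     registered;        /* guest has called ibm,nmi-register */
    uint64_t guest_mc_handler;  /* step 2: address stored, no patching */
    uint64_t errlog_addr;       /* step 3: buffer allocated by qemu */
} SketchSpapr;

/*
 * Step 4: on the machine check exit, point the vcpu at the guest's
 * handler. Returns false if nothing is registered, so the caller can
 * fall back to the ordinary 0x200 delivery.
 */
bool sketch_deliver_mc(const SketchSpapr *s, SketchVcpu *cpu)
{
    if (!s->registered) {
        return false;
    }
    cpu->srr0 = cpu->nip;           /* assumed: interrupted pc */
    cpu->srr1 = cpu->msr;           /* assumed: interrupted msr */
    cpu->gpr[3] = s->errlog_addr;   /* error log pointer in r3 */
    cpu->nip = s->guest_mc_handler; /* resume in the guest handler */
    return true;
}
```

No guest memory at 0x200 is touched in this scheme, which is the
attraction.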

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-11  3:19         ` David Gibson
@ 2014-11-11  5:48           ` Aravinda Prasad
  2014-11-11  6:11             ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-11  5:48 UTC (permalink / raw)
  To: David Gibson; +Cc: Tom Musta, benh, aik, qemu-devel, qemu-ppc, paulus



On Tuesday 11 November 2014 08:49 AM, David Gibson wrote:
> On Thu, Nov 06, 2014 at 03:30:01PM +0530, Aravinda Prasad wrote:
>> On Wednesday 05 November 2014 09:16 PM, Tom Musta wrote:
>>> On 11/5/2014 2:32 AM, Alexander Graf wrote:
>>>> On 05.11.14 08:13, Aravinda Prasad wrote:
> 
> [snip]
>>>>> +	/*
>>>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>>>>> +	 * value is patched below
>>>>> +	 */
>>>>> +1:	ori     3,3,0
>>>
>>> Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")
>>
>> I remember I first tried doing li r3,X but faced some problem (but not
>> able to exactly recall what was the problem) may be due to not familiar
>> with ppc assembly.
> 
> This would be because with the offset to the private hcalls, the
> actual hcall number is 0xf003, which means an li instruction will sign
> extend it incorrectly. So you will need two instructions to load the
> number.
> 

hmm.. ok

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-11  5:48           ` Aravinda Prasad
@ 2014-11-11  6:11             ` David Gibson
  2014-11-11  6:51               ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-11  6:11 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: Tom Musta, benh, aik, qemu-devel, qemu-ppc, paulus


On Tue, Nov 11, 2014 at 11:18:05AM +0530, Aravinda Prasad wrote:
> 
> 
> On Tuesday 11 November 2014 08:49 AM, David Gibson wrote:
> > On Thu, Nov 06, 2014 at 03:30:01PM +0530, Aravinda Prasad wrote:
> >> On Wednesday 05 November 2014 09:16 PM, Tom Musta wrote:
> >>> On 11/5/2014 2:32 AM, Alexander Graf wrote:
> >>>> On 05.11.14 08:13, Aravinda Prasad wrote:
> > 
> > [snip]
> >>>>> +	/*
> >>>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
> >>>>> +	 * value is patched below
> >>>>> +	 */
> >>>>> +1:	ori     3,3,0
> >>>
> >>> Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")
> >>
> >> I remember I first tried doing li r3,X but faced some problem (but not
> >> able to exactly recall what was the problem) may be due to not familiar
> >> with ppc assembly.
> > 
> > This would be because with the offset to the private hcalls, the
> > actual hcall number is 0xf003, which means an li instruction will sign
> > extend it incorrectly. So you will need two instructions to load the
> > number.
> 
> hmm.. ok

At least, I think you'll need two instructions.  I don't remember for
certain if 'ori 3,0,X' will OR X with literal 0 or the contents of r0.
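
As far as I can tell from the ISA, the "register 0 reads as literal 0"
rule applies to addi (and so to li) but not to ori. A tiny C model of
that reading, with made-up names:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t gpr[32];
} SketchRegs;

/* addi rT,rA,SI: rA == 0 means the literal value 0, not r0. */
void sketch_addi(SketchRegs *r, unsigned rt, unsigned ra, int16_t si)
{
    uint64_t base = (ra == 0) ? 0 : r->gpr[ra];

    r->gpr[rt] = base + (uint64_t)(int64_t)si;
}

/* ori rA,rS,UI: rS is always a register, including r0. */
void sketch_ori(SketchRegs *r, unsigned ra, unsigned rs, uint16_t ui)
{
    r->gpr[ra] = r->gpr[rs] | ui;
}
```

So under this model 'ori 3,0,X' picks up whatever is in r0, and the
li 3,0 in front of the ori is needed.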

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-11  3:16   ` [Qemu-devel] " David Gibson
@ 2014-11-11  6:44     ` Aravinda Prasad
  2014-11-13  3:52       ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-11  6:44 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus



On Tuesday 11 November 2014 08:46 AM, David Gibson wrote:
> On Wed, Nov 05, 2014 at 12:43:15PM +0530, Aravinda Prasad wrote:
>> This patch adds FWNMI support in qemu for powerKVM
>> guests by handling the ibm,nmi-register rtas call.
>> Whenever OS issues ibm,nmi-register RTAS call, the
>> machine check notification address is saved and the
>> machine check interrupt vector 0x200 is patched to
>> issue a private hcall.
>>
>> This patch also handles the cases when multi-processors
>> experience machine check at or about the same time.
>> As per PAPR, subsequent processors serialize waiting
>> for the first processor to issue the ibm,nmi-interlock call.
>> The second processor retries if the first processor which
>> received a machine check is still reading the error log
>> and is yet to issue ibm,nmi-interlock call.
>>
>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
> 
> [snip]
>> +static void rtas_ibm_nmi_register(PowerPCCPU *cpu,
>> +                                  sPAPREnvironment *spapr,
>> +                                  uint32_t token, uint32_t nargs,
>> +                                  target_ulong args,
>> +                                  uint32_t nret, target_ulong rets)
>> +{
>> +    int i;
>> +    uint32_t ori_inst = 0x60630000;
>> +    uint32_t branch_inst = 0x48000002;
>> +    target_ulong guest_machine_check_addr;
>> +    uint32_t trampoline[TRAMPOLINE_INSTS];
>> +    int total_inst = sizeof(trampoline) / sizeof(uint32_t);
>> +    PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cpu);
> 
> You should sanity check the RTAS arguments before doing anything - in
> particular verify that nargs and nrets have the expected values.

ok.

> 
>> +
>> +    /* Store the system reset and machine check address */
>> +    guest_machine_check_addr = rtas_ld(args, 1);
>> +
>> +    /*
>> +     * Read the trampoline instructions from RTAS Blob and patch
>> +     * the KVMPPC_H_REPORT_MC_ERR hcall number and the guest
>> +     * machine check address before copying to 0x200 vector
>> +     */
>> +    cpu_physical_memory_read(spapr->rtas_addr + RTAS_TRAMPOLINE_OFFSET,
>> +                             trampoline, sizeof(trampoline));
>> +
>> +    /* Safety Check */
>> +    QEMU_BUILD_BUG_ON(sizeof(trampoline) > MC_INTERRUPT_VECTOR_SIZE);
>> +
>> +    /* Update the KVMPPC_H_REPORT_MC_ERR value in trampoline */
>> +    ori_inst |= KVMPPC_H_REPORT_MC_ERR;
>> +    memcpy(&trampoline[TRAMPOLINE_ORI_INST_INDEX], &ori_inst,
>> +            sizeof(ori_inst));
> 
> Given that we already code the KVMPPC_H_RTAS value directly into the
> .S file, I don't think it's worth the trouble of patching the
> H_REPORT_MC_ERR value.  As Alex says, it has to stay the same for
> migration anyway.

Yes, I will drop the patching of the KVMPPC_H_REPORT_MC_ERR value.

> 
>> +    /*
>> +     * Sanity check guest_machine_check_addr to prevent clobbering
>> +     * operator value in branch instruction
>> +     */
>> +    if (guest_machine_check_addr & BRANCH_INST_MASK) {
>> +        fprintf(stderr, "Unable to register ibm,nmi_register: "
>> +                "Invalid machine check handler address\n");
>> +        rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Update the branch instruction in trampoline
>> +     * with the absolute machine check address requested by OS.
>> +     */
>> +    branch_inst |= guest_machine_check_addr;
>> +    memcpy(&trampoline[TRAMPOLINE_BR_INST_INDEX], &branch_inst,
>> +            sizeof(branch_inst));
>> +
>> +    /* Handle all Host/Guest LE/BE combinations */
>> +    if ((*pcc->interrupts_big_endian)(cpu)) {
>> +        for (i = 0; i < total_inst; i++) {
>> +            trampoline[i] = cpu_to_be32(trampoline[i]);
>> +        }
>> +    } else {
>> +        for (i = 0; i < total_inst; i++) {
>> +            trampoline[i] = cpu_to_le32(trampoline[i]);
>> +        }
>> +    }
>> +
>> +    /* Patch 0x200 NMI interrupt vector memory area of guest */
>> +    cpu_physical_memory_write(MC_INTERRUPT_VECTOR, trampoline,
>> +                              sizeof(trampoline));
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +}
>> +
>> +static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
>> +                                   sPAPREnvironment *spapr,
>> +                                   uint32_t token, uint32_t nargs,
>> +                                   target_ulong args,
>> +                                   uint32_t nret, target_ulong rets)
>> +{
> 
> Again you should sanity check the arguments - at least check nargs and
> nrets.

will do

> 
>> +    /*
>> +     * VCPU issuing ibm,nmi-interlock is done with NMI handling,
>> +     * hence unset mc_in_progress.
>> +     */
>> +    mc_in_progress = 0;
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +}
>> +
>>  static struct rtas_call {
>>      const char *name;
>>      spapr_rtas_fn fn;
>> @@ -419,6 +506,12 @@ static void core_rtas_register_types(void)
>>                          rtas_ibm_set_system_parameter);
>>      spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>>                          rtas_ibm_os_term);
>> +    spapr_rtas_register(RTAS_IBM_NMI_REGISTER,
>> +                        "ibm,nmi-register",
>> +                        rtas_ibm_nmi_register);
>> +    spapr_rtas_register(RTAS_IBM_NMI_INTERLOCK,
>> +                        "ibm,nmi-interlock",
>> +                        rtas_ibm_nmi_interlock);
>>  }
>>  
>>  type_init(core_rtas_register_types)
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index a2d67e9..98d0a6c 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -384,8 +384,10 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>>  #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>>  #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
>> +#define RTAS_IBM_NMI_REGISTER                   (RTAS_TOKEN_BASE + 0x20)
>> +#define RTAS_IBM_NMI_INTERLOCK                  (RTAS_TOKEN_BASE + 0x21)
>>  
>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x22)
>>  
>>  /* RTAS ibm,get-system-parameter token values */
>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>> @@ -488,4 +490,17 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>  #define RTAS_TRAMPOLINE_OFFSET   0x200
>>  #define RTAS_ERRLOG_OFFSET       0x800
>>  
>> +/* Machine Check Trampoline related macros
>> + *
>> + * These macros should co-relate to the code we
>> + * have in pc-bios/spapr-rtas/spapr-rtas.S
>> + */
>> +#define TRAMPOLINE_INSTS           17
>> +#define TRAMPOLINE_ORI_INST_INDEX  2
>> +#define TRAMPOLINE_BR_INST_INDEX   15
>> +
>> +/* Machine Check Interrupt related macros */
>> +#define MC_INTERRUPT_VECTOR           0x200
>> +#define MC_INTERRUPT_VECTOR_SIZE      0x100
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/pc-bios/spapr-rtas/spapr-rtas.S b/pc-bios/spapr-rtas/spapr-rtas.S
>> index 903bec2..c315332 100644
>> --- a/pc-bios/spapr-rtas/spapr-rtas.S
>> +++ b/pc-bios/spapr-rtas/spapr-rtas.S
>> @@ -35,3 +35,41 @@ _start:
>>  	ori	3,3,KVMPPC_H_RTAS@l
>>  	sc	1
>>  	blr
>> +	. = 0x200
>> +	/*
>> +	 * Trampoline saves r3 in sprg2 and issues private hcall
>> +	 * to request qemu to build error log. QEMU builds the
>> +	 * error log, copies to rtas-blob and returns the address.
>> +	 * The initial 16 bytes in return address consist of saved
>> +	 * srr0 and srr1 which we restore and pass on the actual error
>> +	 * log address to OS handled machine check notification
>> +	 * routine
>> +	 *
>> +	 * All the below instructions are copied to interrupt vector
>> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
>> +	 */
>> +	mtsprg  2,3
>> +	li      3,0
>> +	/*
>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>> +	 * value is patched below
>> +	 */
>> +1:	ori     3,3,0
>> +	sc      1               /* Issue H_CALL */
>> +	cmpdi   cr0,3,0
>> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
> 
> Having to retry the hcall from here seems very awkward.  This is a
> private hcall, so you can define it to do whatever retries are
> necessary internally (and I don't think your current implementation
> can fail anyway).

Retrying is required in the cases when multi-processors experience
machine check at or about the same time. As per PAPR, subsequent
processors should serialize and wait for the first processor to issue
the ibm,nmi-interlock call. The second processor retries if the first
processor which received a machine check is still reading the error log
and is yet to issue ibm,nmi-interlock call.

Retrying cannot be done internally in h_report_mc_err hcall: only one
thread can succeed entering qemu upon parallel hcall and hence retrying
inside the hcall will not allow the ibm,nmi-interlock from first CPU to
succeed.

> 
>> +	mtsprg  2,4
> 
> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.

The r3 saved in SPRG2 is moved to rtas area in the private hcall and
hence it is fine to clobber r3 here

> 
>> +	ld      4,0(3)
>> +	mtsrr0  4               /* Restore srr0 */
>> +	ld      4,8(3)
>> +	mtsrr1  4               /* Restore srr1 */
>> +	ld      4,16(3)
>> +	mtcrf   0,4             /* Restore cr */
> 
> mtcrf?  aren't you restoring the whole CR?

No. I am moving only cr0. The 0 in mtcrf 0,4 represents the cr field
mask that is replaced.

> 
>> +	addi    3,3,24
>> +	mfsprg  4,2
>> +	/*
>> +	 * Branch to address registered by OS. The branch address is
>> +	 * patched in the ibm,nmi-register rtas call.
>> +	 */
>> +	ba      0x0
>> +	b       .
> 
> The branch to self is pointless.  Even if the instruction above is
> not patched, or patched incorrectly, it's a ba, so you're not likely
> to end up at the instruction underneath.

I added it to avoid speculative execution. Based on how it is used in
arch/powerpc/kernel/exceptions-64s.S

> 
> Actually, what would probably make more sense would be to just have a
> "b ." *instead* of the ba, and have the qemu patching replace it with
> the correct ba instruction.  That will limit the damage if it somehow
> gets executed without being patched.

good idea. Will do that.
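
For reference, a rough C sketch of what the patching step could look like once the placeholder is "b .". This is not the code from patch 4/4: patch_trampoline() and the PPC_INST_* names are made up for illustration, the instruction indexes are the ones from the spapr.h hunk, and 0xf003 is the hcall number mentioned elsewhere in this thread. Byte-order conversion when writing into guest memory is left out.

```c
#include <stdint.h>

/* Values taken from the quoted hw/ppc/spapr.h hunk. */
#define TRAMPOLINE_INSTS           17
#define TRAMPOLINE_ORI_INST_INDEX  2
#define TRAMPOLINE_BR_INST_INDEX   15

/* I-form branch encodings (opcode 18). Names are illustrative only. */
#define PPC_INST_B_SELF  0x48000000u  /* "b ."  : LI = 0, AA = 0, LK = 0 */
#define PPC_INST_BA      0x48000002u  /* "ba 0" : LI = 0, AA = 1, LK = 0 */

/*
 * Hypothetical helper: patch the hcall number into the ori immediate and
 * replace the "b ." placeholder with "ba handler".
 * Returns 0 on success, -1 if the handler address cannot be encoded
 * (not word aligned, or outside the positive range of the LI field).
 * On failure the placeholder is left alone, so the CPU spins in place.
 */
static int patch_trampoline(uint32_t *insts, uint16_t hcall, uint64_t handler)
{
    if ((handler & 0x3) || handler > 0x01fffffcu) {
        return -1;
    }

    /* Keep opcode and register fields, replace the 16-bit UI field. */
    insts[TRAMPOLINE_ORI_INST_INDEX] =
        (insts[TRAMPOLINE_ORI_INST_INDEX] & 0xffff0000u) | hcall;

    /* The word-aligned absolute address sits directly in the LI field. */
    insts[TRAMPOLINE_BR_INST_INDEX] = PPC_INST_BA | (uint32_t)handler;
    return 0;
}
```

With "ori 3,3,0" encoded as 0x60630000, patching in 0xf003 and a handler at 0x4000 would give 0x6063f003 and 0x48004002 respectively.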

> 

-- 
Regards,
Aravinda

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-11  6:11             ` David Gibson
@ 2014-11-11  6:51               ` Aravinda Prasad
  2014-11-11 11:30                 ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-11  6:51 UTC (permalink / raw)
  To: David Gibson; +Cc: Tom Musta, benh, aik, qemu-devel, qemu-ppc, paulus



On Tuesday 11 November 2014 11:41 AM, David Gibson wrote:
> On Tue, Nov 11, 2014 at 11:18:05AM +0530, Aravinda Prasad wrote:
>>
>>
>> On Tuesday 11 November 2014 08:49 AM, David Gibson wrote:
>>> On Thu, Nov 06, 2014 at 03:30:01PM +0530, Aravinda Prasad wrote:
>>>> On Wednesday 05 November 2014 09:16 PM, Tom Musta wrote:
>>>>> On 11/5/2014 2:32 AM, Alexander Graf wrote:
>>>>>> On 05.11.14 08:13, Aravinda Prasad wrote:
>>>
>>> [snip]
>>>>>>> +	/*
>>>>>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>>>>>>> +	 * value is patched below
>>>>>>> +	 */
>>>>>>> +1:	ori     3,3,0
>>>>>
>>>>> Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")
>>>>
>>>> I remember I first tried doing li r3,X but faced some problem (but not
>>>> able to exactly recall what was the problem) may be due to not familiar
>>>> with ppc assembly.
>>>
>>> This would be because with the offset to the private hcalls, the
>>> actual hcall number is 0xf003, which means an li instruction will sign
>>> extend it incorrectly. So you will need two instructions to load the
>>> number.
>>
>> hmm.. ok
> 
> At least, I think you'll need two instructions.  I don't remember for
> certain if 'ori 3,0,X' will OR X with literal 0 or the contents of r0.

It is ORed with r0

From ISA:

ori RA,RS,UI

(RA) <- (RS) | (0 || UI)

The contents of register RS are ORed with 0 || UI and
the result is placed into register RA.

> 

-- 
Regards,
Aravinda

* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-11  3:24 ` [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests David Gibson
@ 2014-11-11  7:15   ` Aravinda Prasad
  2014-11-13  3:57     ` David Gibson
  2014-11-19  5:48   ` Aravinda Prasad
  1 sibling, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-11  7:15 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, benh, qemu-ppc, qemu-devel, paulus



On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> On Wed, Nov 05, 2014 at 12:42:03PM +0530, Aravinda Prasad wrote:
>> This series of patches add support for fwnmi in powerKVM guests.
>>
>> Currently upon machine check exception, if the address in
>> error belongs to guest then KVM invokes guest's NMI interrupt
>> vector 0x200.
>>
>> This patch series adds functionality where the guest's 0x200
>> interrupt vector is patched such that QEMU gets control. QEMU
>> then builds error log and reports the error to OS registered
>> machine check handlers through RTAS space.
>>
>> Apart from this, the patch series also takes care of synchronization
>> when multiple processors encounter machine check at or about the
>> same time.
>>
>> The patch set was tested by simulating a machine check error in
>> the guest.
>>
>> Changes in v3:
>>     - Incorporated review comments
>>     - Byte codes in patch 4/4 are now moved to
>>       pc-bios/spapr-rtas/spapr-rtas.S as instructions.
>>     - Defined the RTAS blob in-memory layout.
>>     - FIX: save and restore cr register in the trampoline
>>
>> Changes in v2:
>>     - Re-based to github.com/agraf/qemu.git  branch: ppc-next
>>     - Merged patches 4 and 5.
>>     - Incorporated other review comments
> 
> So, this may not still be possible depending on whether the KVM side
> of this is already merged, but it occurs to me that there's a simpler
> way.

The KVM part is already merged. Commit ID: 74845bc

> 
> Rather than mucking about with having to update the hypervisor on the
> RTAS location, then have qemu copy the code out of RTAS, patch it and
> copy it back into the vector, you could instead do this:

Though this is possible, I have a couple of comments below

> 
>   1. Make KVM instead of immediately delivering a 0x200 for a guest
> machine check, cause a special exit to qemu.
> 
>   2. Have the register-nmi RTAS call store the guest side MC handler
> address in the spapr structure, but perform no actual guest code
> patching.
> 
>   3. Allocate the error log buffer independently from the RTAS blob,
> so qemu always knows where it is.

As per PAPR, the error log buffer should be part of RTAS blob and the
guest kernel explicitly checks if error log is inside RTAS blob.
This requires qemu to know the updated RTAS location by the OS which is
handled in patch 2/4.
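
Roughly, the kind of containment check meant here, as a standalone C sketch. errlog_in_rtas() is a hypothetical helper, not the guest kernel's actual code, and the addresses used with it are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if [log_addr, log_addr + log_len) lies entirely
 * inside the RTAS region [rtas_base, rtas_base + rtas_size).
 * Written to avoid overflow in the address arithmetic.
 */
static bool errlog_in_rtas(uint64_t rtas_base, uint64_t rtas_size,
                           uint64_t log_addr, uint64_t log_len)
{
    if (log_len > rtas_size) {
        return false;
    }
    return log_addr >= rtas_base &&
           log_addr - rtas_base <= rtas_size - log_len;
}
```

If the OS relocates RTAS, rtas_base changes, which is why qemu has to learn the new location before it can place a log that passes such a check.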

> 
>   4. When qemu gets the MC exit condition, instead of going via a
> patched 0x200 vector, just directly set the guest register state and
> jump straight into the guest side MC handler.

PAPR mentions:

"R1–7.3.14–8: Once the OS has registered for NMI notification, the
platform firmware must intercept all System Reset Interrupts on all of
the OS’s processors."

So do we need to go via 0x200?

-- 
Regards,
Aravinda

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-11  6:51               ` Aravinda Prasad
@ 2014-11-11 11:30                 ` David Gibson
  0 siblings, 0 replies; 66+ messages in thread
From: David Gibson @ 2014-11-11 11:30 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: Tom Musta, benh, aik, qemu-devel, qemu-ppc, paulus

On Tue, Nov 11, 2014 at 12:21:49PM +0530, Aravinda Prasad wrote:
> 
> 
> On Tuesday 11 November 2014 11:41 AM, David Gibson wrote:
> > On Tue, Nov 11, 2014 at 11:18:05AM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Tuesday 11 November 2014 08:49 AM, David Gibson wrote:
> >>> On Thu, Nov 06, 2014 at 03:30:01PM +0530, Aravinda Prasad wrote:
> >>>> On Wednesday 05 November 2014 09:16 PM, Tom Musta wrote:
> >>>>> On 11/5/2014 2:32 AM, Alexander Graf wrote:
> >>>>>> On 05.11.14 08:13, Aravinda Prasad wrote:
> >>>
> >>> [snip]
> >>>>>>> +	/*
> >>>>>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
> >>>>>>> +	 * value is patched below
> >>>>>>> +	 */
> >>>>>>> +1:	ori     3,3,0
> >>>>>
> >>>>> Why do "li 3,0" followed by "ori 3,3,X"?  Isn't this just "li 3,X" ?  (aka "addi 3,0,X")
> >>>>
> >>>> I remember I first tried doing li r3,X but faced some problem (but not
> >>>> able to exactly recall what was the problem) may be due to not familiar
> >>>> with ppc assembly.
> >>>
> >>> This would be because with the offset to the private hcalls, the
> >>> actual hcall number is 0xf003, which means an li instruction will sign
> >>> extend it incorrectly. So you will need two instructions to load the
> >>> number.
> >>
> >> hmm.. ok
> > 
> > At least, I think you'll need two instructions.  I don't remember for
> > certain if 'ori 3,0,X' will OR X with literal 0 or the contents of r0.
> 
> It is ORed with r0
> 
> From ISA:
> 
> ori RA,RS,UI
> 
> (RA) <- (RS) | (0 || UI)
> 
> The contents of register RS are ORed with 0 || UI and
> the result is placed into register RA.

Right, thought so.  So unless there's some instruction I haven't
thought of, you'll need two instructions to load the value.
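
To spell out the arithmetic, a small C model of the two instructions. emu_li() and emu_ori() are illustrative names, not QEMU functions, and 0xf003 is the hcall number quoted above.

```c
#include <stdint.h>

/*
 * "li RT,SI" is "addi RT,0,SI": the 16-bit immediate is sign-extended
 * to 64 bits before being placed in the register.
 */
static uint64_t emu_li(uint16_t si)
{
    uint64_t value = si;
    if (si & 0x8000u) {
        value |= ~(uint64_t)0xffff;   /* replicate the sign bit upwards */
    }
    return value;
}

/*
 * "ori RA,RS,UI": the immediate is zero-extended and ORed with the
 * contents of register RS (RS = 0 means r0, not a literal zero).
 */
static uint64_t emu_ori(uint64_t rs, uint16_t ui)
{
    return rs | (uint64_t)ui;
}
```

In this model emu_li(0xf003) yields 0xfffffffffffff003, while emu_ori(emu_li(0), 0xf003) yields 0xf003, which is the value the trampoline wants in r3.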

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-11  6:44     ` Aravinda Prasad
@ 2014-11-13  3:52       ` David Gibson
  2014-11-13  5:58         ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-13  3:52 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus

On Tue, Nov 11, 2014 at 12:14:31PM +0530, Aravinda Prasad wrote:
> On Tuesday 11 November 2014 08:46 AM, David Gibson wrote:
> > On Wed, Nov 05, 2014 at 12:43:15PM +0530, Aravinda Prasad wrote:
[snip]
> >> +	. = 0x200
> >> +	/*
> >> +	 * Trampoline saves r3 in sprg2 and issues private hcall
> >> +	 * to request qemu to build error log. QEMU builds the
> >> +	 * error log, copies to rtas-blob and returns the address.
> >> +	 * The initial 16 bytes in return address consist of saved
> >> +	 * srr0 and srr1 which we restore and pass on the actual error
> >> +	 * log address to OS handled machine check notification
> >> +	 * routine
> >> +	 *
> >> +	 * All the below instructions are copied to interrupt vector
> >> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
> >> +	 */
> >> +	mtsprg  2,3
> >> +	li      3,0
> >> +	/*
> >> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
> >> +	 * value is patched below
> >> +	 */
> >> +1:	ori     3,3,0
> >> +	sc      1               /* Issue H_CALL */
> >> +	cmpdi   cr0,3,0
> >> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
> > 
> > Having to retry the hcall from here seems very awkward.  This is a
> > private hcall, so you can define it to do whatever retries are
> > necessary internally (and I don't think your current implementation
> > can fail anyway).
> 
> Retrying is required in the cases when multi-processors experience
> machine check at or about the same time. As per PAPR, subsequent
> processors should serialize and wait for the first processor to issue
> the ibm,nmi-interlock call. The second processor retries if the first
> processor which received a machine check is still reading the error log
> and is yet to issue ibm,nmi-interlock call.

Hmm.. ok.  But I don't see any mechanism in the patches by which
H_REPORT_MC_ERR will report failure if another CPU has an MC in
progress.

> Retrying cannot be done internally in h_report_mc_err hcall: only one
> thread can succeed entering qemu upon parallel hcall and hence retrying
> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
> succeed.

It's possible, but would require some fiddling inside the h_call to
unlock and wait for the other CPUs to finish, so yes, it might be more
trouble than it's worth.

> 
> > 
> >> +	mtsprg  2,4
> > 
> > Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
> 
> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
> hence it is fine to clobber r3 here

Ok, if you're going to do some magic register saving inside the HCALL,
why not do the SRR[01] and CR restoration inside there as well.

> > 
> >> +	ld      4,0(3)
> >> +	mtsrr0  4               /* Restore srr0 */
> >> +	ld      4,8(3)
> >> +	mtsrr1  4               /* Restore srr1 */
> >> +	ld      4,16(3)
> >> +	mtcrf   0,4             /* Restore cr */
> > 
> > mtcrf?  aren't you restoring the whole CR?
> 
> No. I am moving only cr0. The 0 in mtcrf 0,4 represents the cr field
> mask that is replaced.

Uh, yes it is.  In which case a value of 0 means *no* condition
register fields are transferred.
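
A quick C model of the mtcrf field mask, for illustration only (emu_mtcrf() is a made-up name, and the CR values used with it are arbitrary):

```c
#include <stdint.h>

/*
 * Model of "mtcrf FXM,RS": each of the 8 FXM bits selects one 4-bit CR
 * field. The most significant FXM bit (0x80) selects cr0, which is the
 * most significant nibble of the 32-bit CR.
 */
static uint32_t emu_mtcrf(uint32_t cr, uint8_t fxm, uint32_t rs_low32)
{
    uint32_t mask = 0;

    for (int i = 0; i < 8; i++) {
        if (fxm & (0x80u >> i)) {
            mask |= 0xf0000000u >> (4 * i);
        }
    }
    /* Selected fields come from RS, all others keep their old value. */
    return (rs_low32 & mask) | (cr & ~mask);
}
```

With FXM = 0 the mask is empty and CR is unchanged; FXM = 0x80 replaces only cr0; FXM = 0xff replaces the whole CR.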

> 
> > 
> >> +	addi    3,3,24
> >> +	mfsprg  4,2
> >> +	/*
> >> +	 * Branch to address registered by OS. The branch address is
> >> +	 * patched in the ibm,nmi-register rtas call.
> >> +	 */
> >> +	ba      0x0
> >> +	b       .
> > 
> > The branch to self is pointless.  Even if the instruction above is
> > not patched, or patched incorrectly, it's a ba, so you're not likely
> > to end up at the instruction underneath.
> 
> I added it to avoid speculative execution. Based on how it is used in
> arch/powerpc/kernel/exceptions-64s.S

Ah, I guess that makes sense, although surely any ba instruction would
also have to inhibit speculative execution.

> > Actually, what would probably make more sense would be to just have a
> > "b ." *instead* of the ba, and have the qemu patching replace it with
> > the correct ba instruction.  That will limit the damage if it somehow
> > gets executed without being patched.
> 
> good idea. Will do that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-11  7:15   ` Aravinda Prasad
@ 2014-11-13  3:57     ` David Gibson
  2014-11-13  6:10       ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-13  3:57 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: aik, benh, qemu-ppc, qemu-devel, paulus

On Tue, Nov 11, 2014 at 12:45:05PM +0530, Aravinda Prasad wrote:
> 
> 
> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> > On Wed, Nov 05, 2014 at 12:42:03PM +0530, Aravinda Prasad wrote:
> >> This series of patches add support for fwnmi in powerKVM guests.
> >>
> >> Currently upon machine check exception, if the address in
> >> error belongs to guest then KVM invokes guest's NMI interrupt
> >> vector 0x200.
> >>
> >> This patch series adds functionality where the guest's 0x200
> >> interrupt vector is patched such that QEMU gets control. QEMU
> >> then builds error log and reports the error to OS registered
> >> machine check handlers through RTAS space.
> >>
> >> Apart from this, the patch series also takes care of synchronization
> >> when multiple processors encounter machine check at or about the
> >> same time.
> >>
> >> The patch set was tested by simulating a machine check error in
> >> the guest.
> >>
> >> Changes in v3:
> >>     - Incorporated review comments
> >>     - Byte codes in patch 4/4 are now moved to
> >>       pc-bios/spapr-rtas/spapr-rtas.S as instructions.
> >>     - Defined the RTAS blob in-memory layout.
> >>     - FIX: save and restore cr register in the trampoline
> >>
> >> Changes in v2:
> >>     - Re-based to github.com/agraf/qemu.git  branch: ppc-next
> >>     - Merged patches 4 and 5.
> >>     - Incorporated other review comments
> > 
> > So, this may not still be possible depending on whether the KVM side
> > of this is already merged, but it occurs to me that there's a simpler
> > way.
> 
> The KVM part is already merged. Commit ID: 74845bc

Ok, that makes life harder, though I guess without the qemu code
merged, no-one would be using it yet, so it's not impossible to change still.

> > Rather than mucking about with having to update the hypervisor on the
> > RTAS location, then have qemu copy the code out of RTAS, patch it and
> > copy it back into the vector, you could instead do this:
> 
> Though this is possible, I have a couple of comments below
> 
> > 
> >   1. Make KVM instead of immediately delivering a 0x200 for a guest
> > machine check, cause a special exit to qemu.
> > 
> >   2. Have the register-nmi RTAS call store the guest side MC handler
> > address in the spapr structure, but perform no actual guest code
> > patching.
> > 
> >   3. Allocate the error log buffer independently from the RTAS blob,
> > so qemu always knows where it is.
> 
> As per PAPR, the error log buffer should be part of RTAS blob and the
> guest kernel explicitly checks if error log is inside RTAS blob.
> This requires qemu to know the updated RTAS location by the OS which is
> handled in patch 2/4.

Ugh, ok.  That's a pretty stupid interface requirement, even by PAPR
standards, but I guess we're stuck with it.

> >   4. When qemu gets the MC exit condition, instead of going via a
> > patched 0x200 vector, just directly set the guest register state and
> > jump straight into the guest side MC handler.
> 
> PAPR mentions:
> 
> "R1–7.3.14–8: Once the OS has registered for NMI notification, the
> platform firmware must intercept all System Reset Interrupts on all of
> the OS’s processors."
> 
> So do we need to go via 0x200?

I don't see why.  The hypervisor is already intercepting system resets
and machine checks because it's a hypervisor, and from the PAPR
guest's point of view, all it cares about is that you enter its
registered handler with the expected information available.

I don't see that the guest cares whether you bounce via a vector in
guest space or directly enter the guest supplied handler using
hypervisor magic.  Patching the guest's vector actually seems a pretty
awful hack that would only be necessary to work around limitations in
the virtualization capabilities which I don't think we have as of POWER8.

Btw, isn't a "System Reset Interrupt" vector 0x100, not vector 0x200?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-13  3:52       ` David Gibson
@ 2014-11-13  5:58         ` Aravinda Prasad
  2014-11-13 10:32           ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-13  5:58 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus



On Thursday 13 November 2014 09:22 AM, David Gibson wrote:
> On Tue, Nov 11, 2014 at 12:14:31PM +0530, Aravinda Prasad wrote:
>> On Tuesday 11 November 2014 08:46 AM, David Gibson wrote:
>>> On Wed, Nov 05, 2014 at 12:43:15PM +0530, Aravinda Prasad wrote:
> [snip]
>>>> +	. = 0x200
>>>> +	/*
>>>> +	 * Trampoline saves r3 in sprg2 and issues private hcall
>>>> +	 * to request qemu to build error log. QEMU builds the
>>>> +	 * error log, copies to rtas-blob and returns the address.
>>>> +	 * The initial 16 bytes in return address consist of saved
>>>> +	 * srr0 and srr1 which we restore and pass on the actual error
>>>> +	 * log address to OS handled machine check notification
>>>> +	 * routine
>>>> +	 *
>>>> +	 * All the below instructions are copied to interrupt vector
>>>> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
>>>> +	 */
>>>> +	mtsprg  2,3
>>>> +	li      3,0
>>>> +	/*
>>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>>>> +	 * value is patched below
>>>> +	 */
>>>> +1:	ori     3,3,0
>>>> +	sc      1               /* Issue H_CALL */
>>>> +	cmpdi   cr0,3,0
>>>> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
>>>
>>> Having to retry the hcall from here seems very awkward.  This is a
>>> private hcall, so you can define it to do whatever retries are
>>> necessary internally (and I don't think your current implementation
>>> can fail anyway).
>>
>> Retrying is required in the cases when multi-processors experience
>> machine check at or about the same time. As per PAPR, subsequent
>> processors should serialize and wait for the first processor to issue
>> the ibm,nmi-interlock call. The second processor retries if the first
>> processor which received a machine check is still reading the error log
>> and is yet to issue ibm,nmi-interlock call.
> 
> Hmm.. ok.  But I don't see any mechanism in the patches by which
> H_REPORT_MC_ERR will report failure if another CPU has an MC in
> progress.

h_report_mc_err returns 0 if another VCPU is processing machine check
and in that case we retry. h_report_mc_err returns error log address if
no other VCPU is processing machine check.
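
As a rough illustration of that return-value protocol, a C sketch using C11 atomics. This is not the code from patch 3/4; the function names, the mc_in_progress flag and ERRLOG_ADDR are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest address of the error log inside the RTAS blob. */
#define ERRLOG_ADDR 0x800u

/* Set while one VCPU owns the error log; zero-initialised, so clear. */
static atomic_bool mc_in_progress;

/*
 * Model of the private hcall: returns the error log address if this
 * VCPU claimed the log, or 0 if another VCPU is still processing a
 * machine check (the trampoline then retries).
 */
static uint64_t report_mc_err(void)
{
    bool expected = false;

    if (!atomic_compare_exchange_strong(&mc_in_progress, &expected, true)) {
        return 0;
    }
    return ERRLOG_ADDR;
}

/* Model of ibm,nmi-interlock: the owning VCPU releases the error log. */
static void nmi_interlock(void)
{
    atomic_store(&mc_in_progress, false);
}
```

Because the hcall returns immediately instead of waiting, the first VCPU is free to enter qemu for ibm,nmi-interlock while the second one spins in the trampoline.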

> 
>> Retrying cannot be done internally in h_report_mc_err hcall: only one
>> thread can succeed entering qemu upon parallel hcall and hence retrying
>> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
>> succeed.
> 
> It's possible, but would require some fiddling inside the h_call to
> unlock and wait for the other CPUs to finish, so yes, it might be more
> trouble than it's worth.
> 
>>
>>>
>>>> +	mtsprg  2,4
>>>
>>> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
>>
>> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
>> hence it is fine to clobber r3 here
> 
> Ok, if you're going to do some magic register saving inside the HCALL,
> why not do the SRR[01] and CR restoration inside there as well.

SRR0/1 are clobbered while returning from HCALL and hence cannot be
restored in HCALL. For CR, we need to do the restoration here as we
clobber CR after returning from HCALL (the instruction checking the
return value of hcall clobbers CR).

> 
>>>
>>>> +	ld      4,0(3)
>>>> +	mtsrr0  4               /* Restore srr0 */
>>>> +	ld      4,8(3)
>>>> +	mtsrr1  4               /* Restore srr1 */
>>>> +	ld      4,16(3)
>>>> +	mtcrf   0,4             /* Restore cr */
>>>
>>> mtcrf?  aren't you restoring the whole CR?
>>
>> No. I am moving only cr0. The 0 in mtcrf 0,4 represents the cr field
>> mask that is replaced.
> 
> Uh, yes it is.  In which case a value of 0 means *no* condition
> register fields are transferred.

Hmm.. yes.. I will fix this.

Regards,
Aravinda

> 
>>
>>>
>>>> +	addi    3,3,24
>>>> +	mfsprg  4,2
>>>> +	/*
>>>> +	 * Branch to address registered by OS. The branch address is
>>>> +	 * patched in the ibm,nmi-register rtas call.
>>>> +	 */
>>>> +	ba      0x0
>>>> +	b       .
>>>
>>> The branch to self is pointless.  Even if the instruction above is
>>> not patched, or patched incorrectly, it's a ba, so you're not likely
>>> to end up at the instruction underneath.
>>
>> I added it to avoid speculative execution. Based on how it is used in
>> arch/powerpc/kernel/exceptions-64s.S
> 
> Ah, I guess that makes sense, although surely any ba instruction would
> also have to inhibit speculative execution.
> 
>>> Actually, what would probably make more sense would be to just have a
>>> "b ." *instead* of the ba, and have the qemu patching replace it with
>>> the correct ba instruction.  That will limit the damage if it somehow
>>> gets executed without being patched.
>>
>> good idea. Will do that.
> 

-- 
Regards,
Aravinda

* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-13  3:57     ` David Gibson
@ 2014-11-13  6:10       ` Aravinda Prasad
  0 siblings, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-13  6:10 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, benh, qemu-ppc, qemu-devel, paulus



On Thursday 13 November 2014 09:27 AM, David Gibson wrote:
> On Tue, Nov 11, 2014 at 12:45:05PM +0530, Aravinda Prasad wrote:
>>
>>
>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>> On Wed, Nov 05, 2014 at 12:42:03PM +0530, Aravinda Prasad wrote:
>>>> This series of patches add support for fwnmi in powerKVM guests.
>>>>
>>>> Currently upon machine check exception, if the address in
>>>> error belongs to guest then KVM invokes guest's NMI interrupt
>>>> vector 0x200.
>>>>
>>>> This patch series adds functionality where the guest's 0x200
>>>> interrupt vector is patched such that QEMU gets control. QEMU
>>>> then builds error log and reports the error to OS registered
>>>> machine check handlers through RTAS space.
>>>>
>>>> Apart from this, the patch series also takes care of synchronization
>>>> when multiple processors encounter machine check at or about the
>>>> same time.
>>>>
>>>> The patch set was tested by simulating a machine check error in
>>>> the guest.
>>>>
>>>> Changes in v3:
>>>>     - Incorporated review comments
>>>>     - Byte codes in patch 4/4 are now moved to
>>>>       pc-bios/spapr-rtas/spapr-rtas.S as instructions.
>>>>     - Defined the RTAS blob in-memory layout.
>>>>     - FIX: save and restore cr register in the trampoline
>>>>
>>>> Changes in v2:
>>>>     - Re-based to github.com/agraf/qemu.git  branch: ppc-next
>>>>     - Merged patches 4 and 5.
>>>>     - Incorporated other review comments
>>>
>>> So, this may not still be possible depending on whether the KVM side
>>> of this is already merged, but it occurs to me that there's a simpler
>>> way.
>>
>> The KVM part is already merged. Commit ID: 74845bc
> 
> Ok, that makes life harder, though I guess without the qemu code
> merged, no-one would be using it yet, so it's not impossible to change still.
> 
>>> Rather than mucking about with having to update the hypervisor on the
>>> RTAS location, then have qemu copy the code out of RTAS, patch it and
>>> copy it back into the vector, you could instead do this:
>>
>> Though this is possible, I have a couple of comments below
>>
>>>
>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
>>> machine check, cause a special exit to qemu.
>>>
>>>   2. Have the register-nmi RTAS call store the guest side MC handler
>>> address in the spapr structure, but perform no actual guest code
>>> patching.
>>>
>>>   3. Allocate the error log buffer independently from the RTAS blob,
>>> so qemu always knows where it is.
>>
>> As per PAPR, the error log buffer should be part of RTAS blob and the
>> guest kernel explicitly checks if error log is inside RTAS blob.
>> This requires qemu to know the updated RTAS location by the OS which is
>> handled in patch 2/4.
> 
> Ugh, ok.  That's a pretty stupid interface requirement, even by PAPR
> standards, but I guess we're stuck with it.
> 
>>>   4. When qemu gets the MC exit condition, instead of going via a
>>> patched 0x200 vector, just directly set the guest register state and
>>> jump straight into the guest side MC handler.
>>
>> PAPR mentions:
>>
>> "R1–7.3.14–8: Once the OS has registered for NMI notification, the
>> platform firmware must intercept all System Reset Interrupts on all of
>> the OS’s processors."
>>
>> So do we need to go via 0x200?
> 
> I don't see why.  The hypervisor is already intercepting system resets
> and machine checks because it's a hypervisor, and from the PAPR
> guest's point of view, all it cares about is that you enter its
> registered handler with the expected information available.
> 
> I don't see that the guest cares whether you bounce via a vector in
> guest space or directly enter the guest supplied handler using
> hypervisor magic.  Patching the guest's vector actually seems a pretty
> awful hack that would only be necessary to work around limitations in
> the virtualization capabilities which I don't think we have as of POWER8.
> 

Agree.

> Btw, isn't a "System Reset Interrupt" vector 0x100, not vector 0x200?

"System Reset Interrupt" vector is 0x100. Machine Check Interrupt
is 0x200. The above "R1–7.3.14–8" extract was for System Reset. We have
one for Machine Check in R1–7.3.14–10.

> 

-- 
Regards,
Aravinda

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-13  5:58         ` Aravinda Prasad
@ 2014-11-13 10:32           ` David Gibson
  2014-11-13 11:48             ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-13 10:32 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus

On Thu, Nov 13, 2014 at 11:28:30AM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 13 November 2014 09:22 AM, David Gibson wrote:
> > On Tue, Nov 11, 2014 at 12:14:31PM +0530, Aravinda Prasad wrote:
> >> On Tuesday 11 November 2014 08:46 AM, David Gibson wrote:
> >>> On Wed, Nov 05, 2014 at 12:43:15PM +0530, Aravinda Prasad wrote:
> > [snip]
> >>>> +	. = 0x200
> >>>> +	/*
> >>>> +	 * Trampoline saves r3 in sprg2 and issues private hcall
> >>>> +	 * to request qemu to build error log. QEMU builds the
> >>>> +	 * error log, copies to rtas-blob and returns the address.
> >>>> +	 * The initial 16 bytes in return address consist of saved
> >>>> +	 * srr0 and srr1 which we restore and pass on the actual error
> >>>> +	 * log address to OS handled machine check notification
> >>>> +	 * routine
> >>>> +	 *
> >>>> +	 * All the below instructions are copied to interrupt vector
> >>>> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
> >>>> +	 */
> >>>> +	mtsprg  2,3
> >>>> +	li      3,0
> >>>> +	/*
> >>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
> >>>> +	 * value is patched below
> >>>> +	 */
> >>>> +1:	ori     3,3,0
> >>>> +	sc      1               /* Issue H_CALL */
> >>>> +	cmpdi   cr0,3,0
> >>>> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
> >>>
> >>> Having to retry the hcall from here seems very awkward.  This is a
> >>> private hcall, so you can define it to do whatever retries are
> >>> necessary internally (and I don't think your current implementation
> >>> can fail anyway).
> >>
> >> Retrying is required in the cases when multi-processors experience
> >> machine check at or about the same time. As per PAPR, subsequent
> >> processors should serialize and wait for the first processor to issue
> >> the ibm,nmi-interlock call. The second processor retries if the first
> >> processor which received a machine check is still reading the error log
> >> and is yet to issue ibm,nmi-interlock call.
> > 
> > Hmm.. ok.  But I don't see any mechanism in the patches by which
> > H_REPORT_MC_ERR will report failure if another CPU has an MC in
> > progress.
> 
> h_report_mc_err returns 0 if another VCPU is processing machine check
> and in that case we retry. h_report_mc_err returns error log address if
> no other VCPU is processing machine check.

Uh.. how?  I'm only seeing one return statement in the implementation
in 3/4.

> >> Retrying cannot be done internally in h_report_mc_err hcall: only one
> >> thread can succeed entering qemu upon parallel hcall and hence retrying
> >> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
> >> succeed.
> > 
> > It's possible, but would require some fiddling inside the h_call to
> > unlock and wait for the other CPUs to finish, so yes, it might be more
> > trouble than it's worth.
> > 
> >>
> >>>
> >>>> +	mtsprg  2,4
> >>>
> >>> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
> >>
> >> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
> >> hence it is fine to clobber r3 here
> > 
> > Ok, if you're going to do some magic register saving inside the HCALL,
> > why not do the SRR[01] and CR restoration inside there as well.
> 
> SRR0/1 is clobbered while returning from HCALL and hence cannot be
> restored in HCALL. For CR, we need to do the restoration here as we
> clobber CR after returning from HCALL (the instruction checking the
> return value of hcall clobbers CR).

Hrm.  AFAICT SRR0/1 shouldn't be clobbered when returning from an
hcall.  You're right about CR though.  Or more precisely, you can't
both restore CR and use it to return the success state of the hcall.
Well.. actually, you could use crN to return the hcall success, since
you're only restoring cr0 anyway (although only restoring cr0 seems
odd to me in the first place).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-13 10:32           ` David Gibson
@ 2014-11-13 11:48             ` Aravinda Prasad
  2014-11-13 12:44               ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-13 11:48 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus



On Thursday 13 November 2014 04:02 PM, David Gibson wrote:
> On Thu, Nov 13, 2014 at 11:28:30AM +0530, Aravinda Prasad wrote:
>>
>>
>> On Thursday 13 November 2014 09:22 AM, David Gibson wrote:
>>> On Tue, Nov 11, 2014 at 12:14:31PM +0530, Aravinda Prasad wrote:
>>>> On Tuesday 11 November 2014 08:46 AM, David Gibson wrote:
>>>>> On Wed, Nov 05, 2014 at 12:43:15PM +0530, Aravinda Prasad wrote:
>>> [snip]
>>>>>> +	. = 0x200
>>>>>> +	/*
>>>>>> +	 * Trampoline saves r3 in sprg2 and issues private hcall
>>>>>> +	 * to request qemu to build error log. QEMU builds the
>>>>>> +	 * error log, copies to rtas-blob and returns the address.
>>>>>> +	 * The initial 16 bytes in return address consist of saved
>>>>>> +	 * srr0 and srr1 which we restore and pass on the actual error
>>>>>> +	 * log address to OS handled machine check notification
>>>>>> +	 * routine
>>>>>> +	 *
>>>>>> +	 * All the below instructions are copied to interrupt vector
>>>>>> +	 * 0x200 at the time of handling ibm,nmi-register rtas call.
>>>>>> +	 */
>>>>>> +	mtsprg  2,3
>>>>>> +	li      3,0
>>>>>> +	/*
>>>>>> +	 * ori r3,r3,KVMPPC_H_REPORT_MC_ERR. The KVMPPC_H_REPORT_MC_ERR
>>>>>> +	 * value is patched below
>>>>>> +	 */
>>>>>> +1:	ori     3,3,0
>>>>>> +	sc      1               /* Issue H_CALL */
>>>>>> +	cmpdi   cr0,3,0
>>>>>> +	beq     cr0,1b          /* retry KVMPPC_H_REPORT_MC_ERR */
>>>>>
>>>>> Having to retry the hcall from here seems very awkward.  This is a
>>>>> private hcall, so you can define it to do whatever retries are
>>>>> necessary internally (and I don't think your current implementation
>>>>> can fail anyway).
>>>>
>>>> Retrying is required in the cases when multi-processors experience
>>>> machine check at or about the same time. As per PAPR, subsequent
>>>> processors should serialize and wait for the first processor to issue
>>>> the ibm,nmi-interlock call. The second processor retries if the first
>>>> processor which received a machine check is still reading the error log
>>>> and is yet to issue ibm,nmi-interlock call.
>>>
>>> Hmm.. ok.  But I don't see any mechanism in the patches by which
>>> H_REPORT_MC_ERR will report failure if another CPU has an MC in
>>> progress.
>>
>> h_report_mc_err returns 0 if another VCPU is processing machine check
>> and in that case we retry. h_report_mc_err returns error log address if
>> no other VCPU is processing machine check.
> 
> Uh.. how?  I'm only seeing one return statement in the implementation
> in 3/4.

This part is in 4/4, which handles the ibm,nmi-interlock call; the check
is in h_report_mc_err():

+    if (mc_in_progress == 1) {
+        return 0;
+    }
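
To make the serialization concrete, here is a minimal stand-alone sketch of
the interlock described above. It is a model only, not the code from the
patch: report_mc_err(), nmi_interlock() and ERRLOG_ADDR are hypothetical
names, and the real state lives inside QEMU. The first caller gets the error
log address; any other caller gets 0 until the owner issues the interlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in: a made-up guest address for the error log. */
#define ERRLOG_ADDR 0x2000ULL

/* Set while one VCPU owns the error log. */
static bool mc_in_progress;

/* Model of h_report_mc_err(): returns 0 to mean "busy, retry". */
static uint64_t report_mc_err(void)
{
    if (mc_in_progress) {
        return 0;
    }
    mc_in_progress = true;
    return ERRLOG_ADDR;
}

/* Model of ibm,nmi-interlock: the owning VCPU releases the log. */
static void nmi_interlock(void)
{
    assert(mc_in_progress);   /* only meaningful while a log is owned */
    mc_in_progress = false;
}
```

In this model the guest-side loop at 0x200 corresponds to calling
report_mc_err() repeatedly until it returns a non-zero address.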


> 
>>>> Retrying cannot be done internally in h_report_mc_err hcall: only one
>>>> thread can succeed entering qemu upon parallel hcall and hence retrying
>>>> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
>>>> succeed.
>>>
>>> It's possible, but would require some fiddling inside the h_call to
>>> unlock and wait for the other CPUs to finish, so yes, it might be more
>>> trouble than it's worth.
>>>
>>>>
>>>>>
>>>>>> +	mtsprg  2,4
>>>>>
>>>>> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
>>>>
>>>> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
>>>> hence it is fine to clobber r3 here
>>>
>>> Ok, if you're going to do some magic register saving inside the HCALL,
>>> why not do the SRR[01] and CR restoration inside there as well.
>>
>> SRR0/1 is clobbered while returning from HCALL and hence cannot be
>> restored in HCALL. For CR, we need to do the restoration here as we
>> clobber CR after returning from HCALL (the instruction checking the
>> return value of hcall clobbers CR).
> 
> Hrm.  AFAICT SRR0/1 shouldn't be clobbered when returning from an

As an hcall is an interrupt, SRR0 is set to the NIP and SRR1 to the MSR
just before rfid is executed.

> hcall.  You're right about CR though.  Or more precisely, you can't
> both restore CR and use it to return the success state of the hcall.
> Well.. actually, you could use crN to return the hcall success, since
> you're only restoring cr0 anyway (although only restoring cr0 seems
> odd to me in the first place).
> 

Yes, it is possible to return the hcall success state through crN and
restore the entire CR register after checking it.
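
As a rough illustration of why this works, the sketch below models the CR as
eight 4-bit fields, with cr0 in the most significant nibble. It is a toy
model in C rather than trampoline code; cmpdi_zero() and cr_field() are
hypothetical helper names, and the SO bit (which real hardware copies from
XER) is ignored. A compare that targets crN rewrites only that field, so the
other fields survive the check, and the saved full CR can be written back
afterwards.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values within one 4-bit CR field. */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

/* Model of "cmpdi crN,value,0": rewrites only field n of cr. */
static uint32_t cmpdi_zero(uint32_t cr, unsigned n, int64_t value)
{
    assert(n < 8);
    unsigned shift = 4 * (7 - n);   /* cr0 is the top nibble */
    uint32_t bits = value < 0 ? CR_LT : value > 0 ? CR_GT : CR_EQ;
    return (cr & ~(0xFu << shift)) | (bits << shift);
}

/* Extract field n of cr. */
static unsigned cr_field(uint32_t cr, unsigned n)
{
    assert(n < 8);
    return (cr >> (4 * (7 - n))) & 0xFu;
}
```

Restoring the whole register then amounts to writing the saved 32-bit value
back, which covers cr0 and crN alike.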


-- 
Regards,
Aravinda

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-13 11:48             ` Aravinda Prasad
@ 2014-11-13 12:44               ` David Gibson
  2014-11-13 14:36                 ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-13 12:44 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus

[-- Attachment #1: Type: text/plain, Size: 3250 bytes --]

On Thu, Nov 13, 2014 at 05:18:16PM +0530, Aravinda Prasad wrote:
> On Thursday 13 November 2014 04:02 PM, David Gibson wrote:
> > On Thu, Nov 13, 2014 at 11:28:30AM +0530, Aravinda Prasad wrote:
[snip]
> >>>>> Having to retry the hcall from here seems very awkward.  This is a
> >>>>> private hcall, so you can define it to do whatever retries are
> >>>>> necessary internally (and I don't think your current implementation
> >>>>> can fail anyway).
> >>>>
> >>>> Retrying is required in the cases when multi-processors experience
> >>>> machine check at or about the same time. As per PAPR, subsequent
> >>>> processors should serialize and wait for the first processor to issue
> >>>> the ibm,nmi-interlock call. The second processor retries if the first
> >>>> processor which received a machine check is still reading the error log
> >>>> and is yet to issue ibm,nmi-interlock call.
> >>>
> >>> Hmm.. ok.  But I don't see any mechanism in the patches by which
> >>> H_REPORT_MC_ERR will report failure if another CPU has an MC in
> >>> progress.
> >>
> >> h_report_mc_err returns 0 if another VCPU is processing machine check
> >> and in that case we retry. h_report_mc_err returns error log address if
> >> no other VCPU is processing machine check.
> > 
> > Uh.. how?  I'm only seeing one return statement in the implementation
> > in 3/4.
> 
> This part is in 4/4 which handles ibm,nmi-interlock call in
> h_report_mc_err()
> 
> +    if (mc_in_progress == 1) {
> +        return 0;
> +    }

Ah, right, missed the change to h_report_mc_err() in the later patch.

> >>>> Retrying cannot be done internally in h_report_mc_err hcall: only one
> >>>> thread can succeed entering qemu upon parallel hcall and hence retrying
> >>>> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
> >>>> succeed.
> >>>
> >>> It's possible, but would require some fiddling inside the h_call to
> >>> unlock and wait for the other CPUs to finish, so yes, it might be more
> >>> trouble than it's worth.
> >>>
> >>>>
> >>>>>
> >>>>>> +	mtsprg  2,4
> >>>>>
> >>>>> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
> >>>>
> >>>> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
> >>>> hence it is fine to clobber r3 here
> >>>
> >>> Ok, if you're going to do some magic register saving inside the HCALL,
> >>> why not do the SRR[01] and CR restoration inside there as well.
> >>
> >> SRR0/1 is clobbered while returning from HCALL and hence cannot be
> >> restored in HCALL. For CR, we need to do the restoration here as we
> >> clobber CR after returning from HCALL (the instruction checking the
> >> return value of hcall clobbers CR).
> > 
> > Hrm.  AFAICT SRR0/1 shouldn't be clobbered when returning from an
> 
> As hcall is an interrupt, SRR0 is set to nip and SRR1 to msr just before
> executing rfid.

AFAICT the return path from the hypervisor - including for hcalls -
uses HSRR0/1 and hrfid, so ordinary SRR0/SRR1 should be ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-13 12:44               ` David Gibson
@ 2014-11-13 14:36                 ` Aravinda Prasad
  2014-11-14  0:42                   ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-13 14:36 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus



On Thursday 13 November 2014 06:14 PM, David Gibson wrote:
> On Thu, Nov 13, 2014 at 05:18:16PM +0530, Aravinda Prasad wrote:
>> On Thursday 13 November 2014 04:02 PM, David Gibson wrote:
>>> On Thu, Nov 13, 2014 at 11:28:30AM +0530, Aravinda Prasad wrote:
> [snip]
>>>>>>> Having to retry the hcall from here seems very awkward.  This is a
>>>>>>> private hcall, so you can define it to do whatever retries are
>>>>>>> necessary internally (and I don't think your current implementation
>>>>>>> can fail anyway).
>>>>>>
>>>>>> Retrying is required in the cases when multi-processors experience
>>>>>> machine check at or about the same time. As per PAPR, subsequent
>>>>>> processors should serialize and wait for the first processor to issue
>>>>>> the ibm,nmi-interlock call. The second processor retries if the first
>>>>>> processor which received a machine check is still reading the error log
>>>>>> and is yet to issue ibm,nmi-interlock call.
>>>>>
>>>>> Hmm.. ok.  But I don't see any mechanism in the patches by which
>>>>> H_REPORT_MC_ERR will report failure if another CPU has an MC in
>>>>> progress.
>>>>
>>>> h_report_mc_err returns 0 if another VCPU is processing machine check
>>>> and in that case we retry. h_report_mc_err returns error log address if
>>>> no other VCPU is processing machine check.
>>>
>>> Uh.. how?  I'm only seeing one return statement in the implementation
>>> in 3/4.
>>
>> This part is in 4/4 which handles ibm,nmi-interlock call in
>> h_report_mc_err()
>>
>> +    if (mc_in_progress == 1) {
>> +        return 0;
>> +    }
> 
> Ah, right, missed the change to h_report_mc_err() in the later patch.
> 
>>>>>> Retrying cannot be done internally in h_report_mc_err hcall: only one
>>>>>> thread can succeed entering qemu upon parallel hcall and hence retrying
>>>>>> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
>>>>>> succeed.
>>>>>
>>>>> It's possible, but would require some fiddling inside the h_call to
>>>>> unlock and wait for the other CPUs to finish, so yes, it might be more
>>>>> trouble than it's worth.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +	mtsprg  2,4
>>>>>>>
>>>>>>> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
>>>>>>
>>>>>> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
>>>>>> hence it is fine to clobber r3 here
>>>>>
>>>>> Ok, if you're going to do some magic register saving inside the HCALL,
>>>>> why not do the SRR[01] and CR restoration inside there as well.
>>>>
>>>> SRR0/1 is clobbered while returning from HCALL and hence cannot be
>>>> restored in HCALL. For CR, we need to do the restoration here as we
>>>> clobber CR after returning from HCALL (the instruction checking the
>>>> return value of hcall clobbers CR).
>>>
>>> Hrm.  AFAICT SRR0/1 shouldn't be clobbered when returning from an
>>
>> As hcall is an interrupt, SRR0 is set to nip and SRR1 to msr just before
>> executing rfid.
> 
> AFAICT the return path from the hypervisor - including for hcalls -
> uses HSRR0/1 and hrfid, so ordinary SRR0/SRR1 should be ok.

I see SRR0 and SRR1 clobbered when the HCALL from the guest returns.
The previous discussion on this is at the link below:

http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg01148.html

Further, I searched the QEMU source code but could not determine whether
it uses rfid or hrfid. However, the ISA description of the sc instruction
states that SRR0 and SRR1 are modified.


-- 
Regards,
Aravinda

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-13 14:36                 ` Aravinda Prasad
@ 2014-11-14  0:42                   ` David Gibson
  2014-11-14  8:24                     ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-14  0:42 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus

[-- Attachment #1: Type: text/plain, Size: 4475 bytes --]

On Thu, Nov 13, 2014 at 08:06:55PM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 13 November 2014 06:14 PM, David Gibson wrote:
> > On Thu, Nov 13, 2014 at 05:18:16PM +0530, Aravinda Prasad wrote:
> >> On Thursday 13 November 2014 04:02 PM, David Gibson wrote:
> >>> On Thu, Nov 13, 2014 at 11:28:30AM +0530, Aravinda Prasad wrote:
> > [snip]
> >>>>>>> Having to retry the hcall from here seems very awkward.  This is a
> >>>>>>> private hcall, so you can define it to do whatever retries are
> >>>>>>> necessary internally (and I don't think your current implementation
> >>>>>>> can fail anyway).
> >>>>>>
> >>>>>> Retrying is required in the cases when multi-processors experience
> >>>>>> machine check at or about the same time. As per PAPR, subsequent
> >>>>>> processors should serialize and wait for the first processor to issue
> >>>>>> the ibm,nmi-interlock call. The second processor retries if the first
> >>>>>> processor which received a machine check is still reading the error log
> >>>>>> and is yet to issue ibm,nmi-interlock call.
> >>>>>
> >>>>> Hmm.. ok.  But I don't see any mechanism in the patches by which
> >>>>> H_REPORT_MC_ERR will report failure if another CPU has an MC in
> >>>>> progress.
> >>>>
> >>>> h_report_mc_err returns 0 if another VCPU is processing machine check
> >>>> and in that case we retry. h_report_mc_err returns error log address if
> >>>> no other VCPU is processing machine check.
> >>>
> >>> Uh.. how?  I'm only seeing one return statement in the implementation
> >>> in 3/4.
> >>
> >> This part is in 4/4 which handles ibm,nmi-interlock call in
> >> h_report_mc_err()
> >>
> >> +    if (mc_in_progress == 1) {
> >> +        return 0;
> >> +    }
> > 
> > Ah, right, missed the change to h_report_mc_err() in the later patch.
> > 
> >>>>>> Retrying cannot be done internally in h_report_mc_err hcall: only one
> >>>>>> thread can succeed entering qemu upon parallel hcall and hence retrying
> >>>>>> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
> >>>>>> succeed.
> >>>>>
> >>>>> It's possible, but would require some fiddling inside the h_call to
> >>>>> unlock and wait for the other CPUs to finish, so yes, it might be more
> >>>>> trouble than it's worth.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> +	mtsprg  2,4
> >>>>>>>
> >>>>>>> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
> >>>>>>
> >>>>>> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
> >>>>>> hence it is fine to clobber r3 here
> >>>>>
> >>>>> Ok, if you're going to do some magic register saving inside the HCALL,
> >>>>> why not do the SRR[01] and CR restoration inside there as well.
> >>>>
> >>>> SRR0/1 is clobbered while returning from HCALL and hence cannot be
> >>>> restored in HCALL. For CR, we need to do the restoration here as we
> >>>> clobber CR after returning from HCALL (the instruction checking the
> >>>> return value of hcall clobbers CR).
> >>>
> >>> Hrm.  AFAICT SRR0/1 shouldn't be clobbered when returning from an
> >>
> >> As hcall is an interrupt, SRR0 is set to nip and SRR1 to msr just before
> >> executing rfid.
> > 
> > AFAICT the return path from the hypervisor - including for hcalls -
> > uses HSRR0/1 and hrfid, so ordinary SRR0/SRR1 should be ok.
> 
> I see SRR0 and SRR1 clobbered when the HCALL from guest returns.
> Previous discussions on this is in the link below:
> 
> http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg01148.html

Hrm.  Well, I guess if it happened it happened, but Alex's explanation
for why doesn't make sense to me.

Did you execute cpu_synchronize_state() *before* attempting to set
SRR0/1 in the hcall?

> Further I searched QEMU source code but could not find whether it is
> using rfid/hrfid. However, ISA for sc instruction mentions that SRR0 and
> SRR1 are modified.

Well, of course it isn't in the qemu source: the low-level return to the
guest is within the host kernel, specifically fast_guest_return in
arch/powerpc/kvm/book3s_hv_rmhandlers.S, which uses hrfid.

If I'm reading the ISA correctly then yes, SRR0/1 are clobbered on
entry, but that's on *entry* so can be overwritten by the hcall
handler itself.
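
To spell out that reading, here is a toy model of the sequence. It only
encodes the argument made here and is not a claim about what KVM actually
does (the observation reported earlier in the thread suggests the real
behaviour may differ). struct vcpu and the three functions are hypothetical
names, and copying SRR0/1 into HSRR0/1 is a simplification of how the host
tracks the guest's resume state.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file; field names mirror the SPRs under discussion. */
struct vcpu {
    uint64_t nip, msr;
    uint64_t srr0, srr1;     /* written by the sc interrupt on entry */
    uint64_t hsrr0, hsrr1;   /* consumed by hrfid on return */
};

/* sc entry: SRR0/1 capture the state to resume after the hcall. */
static void sc_entry(struct vcpu *v)
{
    assert(v);
    v->srr0 = v->nip + 4;
    v->srr1 = v->msr;
}

/* Hypothetical hcall handler that puts machine-check values in SRR0/1. */
static void hcall_restore_srr(struct vcpu *v, uint64_t mc_srr0,
                              uint64_t mc_srr1)
{
    v->hsrr0 = v->srr0;      /* where the guest resumes */
    v->hsrr1 = v->srr1;
    v->srr0 = mc_srr0;       /* overwrite happens after entry */
    v->srr1 = mc_srr1;
}

/* hrfid-style return: loads NIP/MSR from HSRR0/1, leaves SRR0/1 alone. */
static void hrfid_return(struct vcpu *v)
{
    v->nip = v->hsrr0;
    v->msr = v->hsrr1;
}
```

Under these assumptions the values written by the handler are still in
SRR0/1 once the guest is running again.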


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm, nmi-register RTAS call
  2014-11-14  0:42                   ` David Gibson
@ 2014-11-14  8:24                     ` Aravinda Prasad
  0 siblings, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-14  8:24 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, benh, aik, qemu-devel, paulus



On Friday 14 November 2014 06:12 AM, David Gibson wrote:
> On Thu, Nov 13, 2014 at 08:06:55PM +0530, Aravinda Prasad wrote:
>>
>>
>> On Thursday 13 November 2014 06:14 PM, David Gibson wrote:
>>> On Thu, Nov 13, 2014 at 05:18:16PM +0530, Aravinda Prasad wrote:
>>>> On Thursday 13 November 2014 04:02 PM, David Gibson wrote:
>>>>> On Thu, Nov 13, 2014 at 11:28:30AM +0530, Aravinda Prasad wrote:
>>> [snip]
>>>>>>>>> Having to retry the hcall from here seems very awkward.  This is a
>>>>>>>>> private hcall, so you can define it to do whatever retries are
>>>>>>>>> necessary internally (and I don't think your current implementation
>>>>>>>>> can fail anyway).
>>>>>>>>
>>>>>>>> Retrying is required in the cases when multi-processors experience
>>>>>>>> machine check at or about the same time. As per PAPR, subsequent
>>>>>>>> processors should serialize and wait for the first processor to issue
>>>>>>>> the ibm,nmi-interlock call. The second processor retries if the first
>>>>>>>> processor which received a machine check is still reading the error log
>>>>>>>> and is yet to issue ibm,nmi-interlock call.
>>>>>>>
>>>>>>> Hmm.. ok.  But I don't see any mechanism in the patches by which
>>>>>>> H_REPORT_MC_ERR will report failure if another CPU has an MC in
>>>>>>> progress.
>>>>>>
>>>>>> h_report_mc_err returns 0 if another VCPU is processing machine check
>>>>>> and in that case we retry. h_report_mc_err returns error log address if
>>>>>> no other VCPU is processing machine check.
>>>>>
>>>>> Uh.. how?  I'm only seeing one return statement in the implementation
>>>>> in 3/4.
>>>>
>>>> This part is in 4/4 which handles ibm,nmi-interlock call in
>>>> h_report_mc_err()
>>>>
>>>> +    if (mc_in_progress == 1) {
>>>> +        return 0;
>>>> +    }
>>>
>>> Ah, right, missed the change to h_report_mc_err() in the later patch.
>>>
>>>>>>>> Retrying cannot be done internally in h_report_mc_err hcall: only one
>>>>>>>> thread can succeed entering qemu upon parallel hcall and hence retrying
>>>>>>>> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
>>>>>>>> succeed.
>>>>>>>
>>>>>>> It's possible, but would require some fiddling inside the h_call to
>>>>>>> unlock and wait for the other CPUs to finish, so yes, it might be more
>>>>>>> trouble than it's worth.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +	mtsprg  2,4
>>>>>>>>>
>>>>>>>>> Um.. doesn't this clobber the value of r3 you saved in SPRG2 just above.
>>>>>>>>
>>>>>>>> The r3 saved in SPRG2 is moved to rtas area in the private hcall and
>>>>>>>> hence it is fine to clobber r3 here
>>>>>>>
>>>>>>> Ok, if you're going to do some magic register saving inside the HCALL,
>>>>>>> why not do the SRR[01] and CR restoration inside there as well.
>>>>>>
>>>>>> SRR0/1 is clobbered while returning from HCALL and hence cannot be
>>>>>> restored in HCALL. For CR, we need to do the restoration here as we
>>>>>> clobber CR after returning from HCALL (the instruction checking the
>>>>>> return value of hcall clobbers CR).
>>>>>
>>>>> Hrm.  AFAICT SRR0/1 shouldn't be clobbered when returning from an
>>>>
>>>> As hcall is an interrupt, SRR0 is set to nip and SRR1 to msr just before
>>>> executing rfid.
>>>
>>> AFAICT the return path from the hypervisor - including for hcalls -
>>> uses HSRR0/1 and hrfid, so ordinary SRR0/SRR1 should be ok.
>>
>> I see SRR0 and SRR1 clobbered when the HCALL from guest returns.
>> Previous discussions on this is in the link below:
>>
>> http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg01148.html
> 
> Hrm.  Well, I guess if it happened it happened, but Alex's explanation
> for why doesn't make sense to me.
> 
> Did you execute cpu_synchronize_state() *before* attempting to set
> SRR0/1 in the hcall?

Yes I did.

> 
>> Further I searched QEMU source code but could not find whether it is
>> using rfid/hrfid. However, ISA for sc instruction mentions that SRR0 and
>> SRR1 are modified.
> 
> Well of course it isn't in the qemu source, the low-level return to
> guest is within the host kernel, specifically fast_guest_return in
> arch/powerpc/kvm/book3s_hv_rmhandlers.S which uses hrfid.
> 
> If I'm reading the ISA correctly then yes, SRR0/1 are clobbered on
> entry, but that's on *entry* so can be overwritten by the hcall
> handler itself.

Hmm.. ok. I need to look into this in detail.

> 
> 

-- 
Regards,
Aravinda

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-11  3:24 ` [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests David Gibson
  2014-11-11  7:15   ` Aravinda Prasad
@ 2014-11-19  5:48   ` Aravinda Prasad
  2014-11-19 10:32     ` Alexander Graf
  2015-04-02  4:28     ` [Qemu-devel] [Qemu-ppc] " Alexey Kardashevskiy
  1 sibling, 2 replies; 66+ messages in thread
From: Aravinda Prasad @ 2014-11-19  5:48 UTC (permalink / raw)
  To: David Gibson, benh, aik, Alexander Graf; +Cc: paulus, qemu-ppc, qemu-devel



On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:

[..]

> 
> So, this may not still be possible depending on whether the KVM side
> of this is already merged, but it occurs to me that there's a simpler
> way.
> 
> Rather than mucking about with having to update the hypervisor on the
> RTAS location, then have qemu copy the code out of RTAS, patch it and
> copy it back into the vector, you could instead do this:
> 
>   1. Make KVM instead of immediately delivering a 0x200 for a guest
> machine check, cause a special exit to qemu.
> 
>   2. Have the register-nmi RTAS call store the guest side MC handler
> address in the spapr structure, but perform no actual guest code
> patching.
> 
>   3. Allocate the error log buffer independently from the RTAS blob,
> so qemu always knows where it is.
> 
>   4. When qemu gets the MC exit condition, instead of going via a
> patched 0x200 vector, just directly set the guest register state and
> jump straight into the guest side MC handler.
>

Before I proceed further, I would like to know what others think about
the approach proposed above (except for step 3: as per PAPR, the error
log buffer should be part of the RTAS blob, and hence we cannot have an
error log buffer independent of the RTAS blob).

Alex, Alexey, Ben: Any thoughts?
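
For reference, steps 2 and 4 of the proposal quoted above could look roughly
like the sketch below, with the error log kept inside the RTAS blob. This is
a stand-alone model under stated assumptions, not QEMU code: struct machine,
struct guest_regs and both functions are hypothetical, and passing the error
log address in r3 follows the trampoline comment in patch 4/4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-machine state; field names are illustrative only. */
struct machine {
    uint64_t guest_mc_handler;  /* stored by ibm,nmi-register (step 2) */
    uint64_t errlog_addr;       /* error log location inside the RTAS blob */
};

/* Hypothetical subset of the guest register state. */
struct guest_regs {
    uint64_t nip;
    uint64_t r3;
};

/* Step 2: remember the handler address; no guest code is patched. */
static void nmi_register(struct machine *m, uint64_t handler)
{
    assert(m);
    m->guest_mc_handler = handler;
}

/*
 * Step 4: on a machine-check exit, enter the guest handler directly.
 * Returns 0 on delivery, -1 if no handler has been registered.
 */
static int deliver_machine_check(const struct machine *m,
                                 struct guest_regs *r)
{
    if (m->guest_mc_handler == 0) {
        return -1;
    }
    r->r3 = m->errlog_addr;        /* error log address for the handler */
    r->nip = m->guest_mc_handler;  /* jump straight into the handler */
    return 0;
}
```

What to do when no handler is registered is left open here; the -1 return
only marks that case.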

-- 
Regards,
Aravinda

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-19  5:48   ` Aravinda Prasad
@ 2014-11-19 10:32     ` Alexander Graf
  2014-11-19 11:44       ` David Gibson
  2015-04-02  4:28     ` [Qemu-devel] [Qemu-ppc] " Alexey Kardashevskiy
  1 sibling, 1 reply; 66+ messages in thread
From: Alexander Graf @ 2014-11-19 10:32 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: benh, aik, qemu-devel, qemu-ppc, paulus, David Gibson




> On 19.11.2014 at 06:48, Aravinda Prasad <aravinda@linux.vnet.ibm.com> wrote:
> 
> 
> 
> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> 
> [..]
> 
>> 
>> So, this may not still be possible depending on whether the KVM side
>> of this is already merged, but it occurs to me that there's a simpler
>> way.
>> 
>> Rather than mucking about with having to update the hypervisor on the
>> RTAS location, then have qemu copy the code out of RTAS, patch it and
>> copy it back into the vector, you could instead do this:
>> 
>>  1. Make KVM instead of immediately delivering a 0x200 for a guest
>> machine check, cause a special exit to qemu.
>> 
>>  2. Have the register-nmi RTAS call store the guest side MC handler
>> address in the spapr structure, but perform no actual guest code
>> patching.
>> 
>>  3. Allocate the error log buffer independently from the RTAS blob,
>> so qemu always knows where it is.
>> 
>>  4. When qemu gets the MC exit condition, instead of going via a
>> patched 0x200 vector, just directly set the guest register state and
>> jump straight into the guest side MC handler.
> 
> Before I proceed further I would like to know what others think about
> the approach proposed above (except for step 3 - as per PAPR the error
> log buffer should be part of RTAS blob and hence we cannot have error
> log buffer independent of RTAS blob).
> 
> Alex, Alexey, Ben: Any thoughts?

If in doubt, stick to PAPR please.

Alex

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-19 10:32     ` Alexander Graf
@ 2014-11-19 11:44       ` David Gibson
  2014-11-19 12:22         ` Alexander Graf
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2014-11-19 11:44 UTC (permalink / raw)
  To: Alexander Graf; +Cc: benh, aik, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

[-- Attachment #1: Type: text/plain, Size: 1998 bytes --]

On Wed, Nov 19, 2014 at 11:32:56AM +0100, Alexander Graf wrote:
> 
> 
> 
> > On 19.11.2014 at 06:48, Aravinda Prasad <aravinda@linux.vnet.ibm.com> wrote:
> > 
> > 
> > 
> > On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> > 
> > [..]
> > 
> >> 
> >> So, this may not still be possible depending on whether the KVM side
> >> of this is already merged, but it occurs to me that there's a simpler
> >> way.
> >> 
> >> Rather than mucking about with having to update the hypervisor on the
> >> RTAS location, then have qemu copy the code out of RTAS, patch it and
> >> copy it back into the vector, you could instead do this:
> >> 
> >>  1. Make KVM instead of immediately delivering a 0x200 for a guest
> >> machine check, cause a special exit to qemu.
> >> 
> >>  2. Have the register-nmi RTAS call store the guest side MC handler
> >> address in the spapr structure, but perform no actual guest code
> >> patching.
> >> 
> >>  3. Allocate the error log buffer independently from the RTAS blob,
> >> so qemu always knows where it is.
> >> 
> >>  4. When qemu gets the MC exit condition, instead of going via a
> >> patched 0x200 vector, just directly set the guest register state and
> >> jump straight into the guest side MC handler.
> > 
> > Before I proceed further I would like to know what others think about
> > the approach proposed above (except for step 3 - as per PAPR the error
> > log buffer should be part of RTAS blob and hence we cannot have error
> > log buffer independent of RTAS blob).
> > 
> > Alex, Alexey, Ben: Any thoughts?
> 
> If in doubt, stick to PAPR please.

Apart from (3), which was a misunderstanding on my part, this doesn't
diverge from PAPR - it's just a question of how we're implementing the
PAPR behaviour.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-19 11:44       ` David Gibson
@ 2014-11-19 12:22         ` Alexander Graf
  2014-11-19 12:42           ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
  2014-11-19 12:57           ` [Qemu-devel] " David Gibson
  0 siblings, 2 replies; 66+ messages in thread
From: Alexander Graf @ 2014-11-19 12:22 UTC (permalink / raw)
  To: David Gibson; +Cc: benh, aik, qemu-devel, qemu-ppc, Aravinda Prasad, paulus




> On 19.11.2014 at 12:44, David Gibson <david@gibson.dropbear.id.au> wrote:
> 
>> On Wed, Nov 19, 2014 at 11:32:56AM +0100, Alexander Graf wrote:
>> 
>> 
>> 
>>> On 19.11.2014 at 06:48, Aravinda Prasad <aravinda@linux.vnet.ibm.com> wrote:
>>> 
>>> 
>>> 
>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>> 
>>> [..]
>>> 
>>>> 
>>>> So, this may not still be possible depending on whether the KVM side
>>>> of this is already merged, but it occurs to me that there's a simpler
>>>> way.
>>>> 
>>>> Rather than mucking about with having to update the hypervisor on the
>>>> RTAS location, then have qemu copy the code out of RTAS, patch it and
>>>> copy it back into the vector, you could instead do this:
>>>> 
>>>> 1. Make KVM instead of immediately delivering a 0x200 for a guest
>>>> machine check, cause a special exit to qemu.
>>>> 
>>>> 2. Have the register-nmi RTAS call store the guest side MC handler
>>>> address in the spapr structure, but perform no actual guest code
>>>> patching.
>>>> 
>>>> 3. Allocate the error log buffer independently from the RTAS blob,
>>>> so qemu always knows where it is.
>>>> 
>>>> 4. When qemu gets the MC exit condition, instead of going via a
>>>> patched 0x200 vector, just directly set the guest register state and
>>>> jump straight into the guest side MC handler.
>>> 
>>> Before I proceed further I would like to know what others think about
>>> the approach proposed above (except for step 3 - as per PAPR the error
>>> log buffer should be part of RTAS blob and hence we cannot have error
>>> log buffer independent of RTAS blob).
>>> 
>>> Alex, Alexey, Ben: Any thoughts?
>> 
>> If in doubt, stick to PAPR please.
> 
> Apart from (3), which was a misunderstanding on my part, this doesn't
> diverge from PAPR - it's just a question of how we're implementing the
> PAPR behaviour.

Do we need a guest handler at all? Couldn't we make MCs a new exit type and handle it all straight from QEMU?


Alex

> 
> -- 
> David Gibson            | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au    | minimalist, thank you.  NOT _the_ _other_
>                | _way_ _around_!
> http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-19 12:22         ` Alexander Graf
@ 2014-11-19 12:42           ` Alexander Graf
  2014-11-19 12:57           ` [Qemu-devel] " David Gibson
  1 sibling, 0 replies; 66+ messages in thread
From: Alexander Graf @ 2014-11-19 12:42 UTC (permalink / raw)
  To: David Gibson; +Cc: benh, aik, qemu-devel, qemu-ppc, Aravinda Prasad, paulus




> On 19.11.2014 at 13:22, Alexander Graf <agraf@suse.de> wrote:
> 
> 
> 
> 
>>> On 19.11.2014 at 12:44, David Gibson <david@gibson.dropbear.id.au> wrote:
>>> 
>>> On Wed, Nov 19, 2014 at 11:32:56AM +0100, Alexander Graf wrote:
>>> 
>>> 
>>> 
>>>> On 19.11.2014 at 06:48, Aravinda Prasad <aravinda@linux.vnet.ibm.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>>> 
>>>> [..]
>>>> 
>>>>> 
>>>>> So, this may not still be possible depending on whether the KVM side
>>>>> of this is already merged, but it occurs to me that there's a simpler
>>>>> way.
>>>>> 
>>>>> Rather than mucking about with having to update the hypervisor on the
>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
>>>>> copy it back into the vector, you could instead do this:
>>>>> 
>>>>> 1. Make KVM instead of immediately delivering a 0x200 for a guest
>>>>> machine check, cause a special exit to qemu.
>>>>> 
>>>>> 2. Have the register-nmi RTAS call store the guest side MC handler
>>>>> address in the spapr structure, but perform no actual guest code
>>>>> patching.
>>>>> 
>>>>> 3. Allocate the error log buffer independently from the RTAS blob,
>>>>> so qemu always knows where it is.
>>>>> 
>>>>> 4. When qemu gets the MC exit condition, instead of going via a
>>>>> patched 0x200 vector, just directly set the guest register state and
>>>>> jump straight into the guest side MC handler.
>>>> 
>>>> Before I proceed further I would like to know what others think about
>>>> the approach proposed above (except for step 3 - as per PAPR the error
>>>> log buffer should be part of RTAS blob and hence we cannot have error
>>>> log buffer independent of RTAS blob).
>>>> 
>>>> Alex, Alexey, Ben: Any thoughts?
>>> 
>>> If in doubt, stick to PAPR please.
>> 
>> Apart from (3), which was a misunderstanding on my part, this doesn't
>> diverge from PAPR - it's just a question of how we're implementing the
>> PAPR behaviour.
> 
> Do we need a guest handler at all? Couldn't we make MCs a new exit type and handle it all straight from QEMU?

Ah, that was your proposal ;). Sure, works for me.

Alex


* Re: [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-19 12:22         ` Alexander Graf
  2014-11-19 12:42           ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
@ 2014-11-19 12:57           ` David Gibson
  1 sibling, 0 replies; 66+ messages in thread
From: David Gibson @ 2014-11-19 12:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: benh, aik, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

[-- Attachment #1: Type: text/plain, Size: 3041 bytes --]

On Wed, Nov 19, 2014 at 01:22:01PM +0100, Alexander Graf wrote:
> 
> 
> 
> > On 19.11.2014 at 12:44, David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> >> On Wed, Nov 19, 2014 at 11:32:56AM +0100, Alexander Graf wrote:
> >> 
> >> 
> >> 
> > >>> On 19.11.2014 at 06:48, Aravinda Prasad <aravinda@linux.vnet.ibm.com> wrote:
> >>> 
> >>> 
> >>> 
> >>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> >>> 
> >>> [..]
> >>> 
> >>>> 
> >>>> So, this may not still be possible depending on whether the KVM side
> >>>> of this is already merged, but it occurs to me that there's a simpler
> >>>> way.
> >>>> 
> >>>> Rather than mucking about with having to update the hypervisor on the
> >>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
> >>>> copy it back into the vector, you could instead do this:
> >>>> 
> >>>> 1. Make KVM instead of immediately delivering a 0x200 for a guest
> >>>> machine check, cause a special exit to qemu.
> >>>> 
> >>>> 2. Have the register-nmi RTAS call store the guest side MC handler
> >>>> address in the spapr structure, but perform no actual guest code
> >>>> patching.
> >>>> 
> >>>> 3. Allocate the error log buffer independently from the RTAS blob,
> >>>> so qemu always knows where it is.
> >>>> 
> >>>> 4. When qemu gets the MC exit condition, instead of going via a
> >>>> patched 0x200 vector, just directly set the guest register state and
> >>>> jump straight into the guest side MC handler.
> >>> 
> >>> Before I proceed further I would like to know what others think about
> >>> the approach proposed above (except for step 3 - as per PAPR the error
> >>> log buffer should be part of RTAS blob and hence we cannot have error
> >>> log buffer independent of RTAS blob).
> >>> 
> >>> Alex, Alexey, Ben: Any thoughts?
> >> 
> >> If in doubt, stick to PAPR please.
> > 
> > Apart from (3), which was a misunderstanding on my part, this doesn't
> > diverge from PAPR - it's just a question of how we're implementing the
> > PAPR behaviour.
> 
> Do we need a guest handler at all? Couldn't we make MCs a new exit
> type and handle it all straight from QEMU?

Well, PAPR allows the OS to register a handler, which existing guests
will expect to be able to do.  The registered handler expects various
information collated for it though, so it isn't a "raw" 0x200 vector.

IIUC, traditionally pHyp implemented this by patching the guest's 0x200
vector to collate the necessary information then jump to the supplied
handler.

I'm suggesting that instead we indeed make a new exit type, have qemu
collate the information internally then jump directly back into the
guest registered handler.

I'm not sure if that's quite what you were suggesting, but I think we
have pretty close to the same idea here.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2014-11-19  5:48   ` Aravinda Prasad
  2014-11-19 10:32     ` Alexander Graf
@ 2015-04-02  4:28     ` Alexey Kardashevskiy
  2015-04-02  4:46       ` David Gibson
  1 sibling, 1 reply; 66+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-02  4:28 UTC (permalink / raw)
  To: Aravinda Prasad, aik, Alexander Graf
  Cc: qemu-ppc, benh, paulus, qemu-devel, David Gibson

On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
>
>
> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>
> [..]
>
>>
>> So, this may not still be possible depending on whether the KVM side
>> of this is already merged, but it occurs to me that there's a simpler
>> way.
>>
>> Rather than mucking about with having to update the hypervisor on the
>> RTAS location, they have qemu copy the code out of RTAS, patch it and
>> copy it back into the vector, you could instead do this:
>>
>>    1. Make KVM instead of immediately delivering a 0x200 for a guest
>> machine check, cause a special exit to qemu.
>>
>>    2. Have the register-nmi RTAS call store the guest side MC handler
>> address in the spapr structure, but perform no actual guest code
>> patching.
>>
>>    3. Allocate the error log buffer independently from the RTAS blob,
>> so qemu always knows where it is.
>>
>>    4. When qemu gets the MC exit condition, instead of going via a
>> patched 0x200 vector, just directly set the guest register state and
>> jump straight into the guest side MC handler.
>>
>
> Before I proceed further I would like to know what others think about
> the approach proposed above (except for step 3 - as per PAPR the error
> log buffer should be part of RTAS blob and hence we cannot have error
> log buffer independent of RTAS blob).
>
> Alex, Alexey, Ben: Any thoughts?


Any updates about FWNMI? Thanks


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-04-02  4:28     ` [Qemu-devel] [Qemu-ppc] " Alexey Kardashevskiy
@ 2015-04-02  4:46       ` David Gibson
  2015-07-02  9:11         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2015-04-02  4:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: benh, aik, Alexander Graf, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

[-- Attachment #1: Type: text/plain, Size: 1870 bytes --]

On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
> >
> >
> >On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> >
> >[..]
> >
> >>
> >>So, this may not still be possible depending on whether the KVM side
> >>of this is already merged, but it occurs to me that there's a simpler
> >>way.
> >>
> >>Rather than mucking about with having to update the hypervisor on the
> >>RTAS location, they have qemu copy the code out of RTAS, patch it and
> >>copy it back into the vector, you could instead do this:
> >>
> >>   1. Make KVM instead of immediately delivering a 0x200 for a guest
> >>machine check, cause a special exit to qemu.
> >>
> >>   2. Have the register-nmi RTAS call store the guest side MC handler
> >>address in the spapr structure, but perform no actual guest code
> >>patching.
> >>
> >>   3. Allocate the error log buffer independently from the RTAS blob,
> >>so qemu always knows where it is.
> >>
> >>   4. When qemu gets the MC exit condition, instead of going via a
> >>patched 0x200 vector, just directly set the guest register state and
> >>jump straight into the guest side MC handler.
> >>
> >
> >Before I proceed further I would like to know what others think about
> >the approach proposed above (except for step 3 - as per PAPR the error
> >log buffer should be part of RTAS blob and hence we cannot have error
> >log buffer independent of RTAS blob).
> >
> >Alex, Alexey, Ben: Any thoughts?
> 
> 
> Any updates about FWNMI? Thanks

Huh.. I'd completely forgotten about this.  Aravinda, can you repost
your latest work on this?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-04-02  4:46       ` David Gibson
@ 2015-07-02  9:11         ` Alexey Kardashevskiy
  2015-07-03  6:01           ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-02  9:11 UTC (permalink / raw)
  To: David Gibson
  Cc: benh, aik, Alexander Graf, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

On 04/02/2015 03:46 PM, David Gibson wrote:
> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
>>>
>>>
>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>>
>>> [..]
>>>
>>>>
>>>> So, this may not still be possible depending on whether the KVM side
>>>> of this is already merged, but it occurs to me that there's a simpler
>>>> way.
>>>>
>>>> Rather than mucking about with having to update the hypervisor on the
>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
>>>> copy it back into the vector, you could instead do this:
>>>>
>>>>    1. Make KVM instead of immediately delivering a 0x200 for a guest
>>>> machine check, cause a special exit to qemu.
>>>>
>>>>    2. Have the register-nmi RTAS call store the guest side MC handler
>>>> address in the spapr structure, but perform no actual guest code
>>>> patching.
>>>>
>>>>    3. Allocate the error log buffer independently from the RTAS blob,
>>>> so qemu always knows where it is.
>>>>
>>>>    4. When qemu gets the MC exit condition, instead of going via a
>>>> patched 0x200 vector, just directly set the guest register state and
>>>> jump straight into the guest side MC handler.
>>>>
>>>
>>> Before I proceed further I would like to know what others think about
>>> the approach proposed above (except for step 3 - as per PAPR the error
>>> log buffer should be part of RTAS blob and hence we cannot have error
>>> log buffer independent of RTAS blob).
>>>
>>> Alex, Alexey, Ben: Any thoughts?
>>
>>
>> Any updates about FWNMI? Thanks
>
> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
> your latest work on this?


Aravinda disappeared...



-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-07-02  9:11         ` Alexey Kardashevskiy
@ 2015-07-03  6:01           ` David Gibson
  2015-07-08  8:28             ` Aravinda Prasad
  0 siblings, 1 reply; 66+ messages in thread
From: David Gibson @ 2015-07-03  6:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: benh, aik, Alexander Graf, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

[-- Attachment #1: Type: text/plain, Size: 2223 bytes --]

On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
> On 04/02/2015 03:46 PM, David Gibson wrote:
> >On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
> >>On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
> >>>
> >>>
> >>>On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> >>>
> >>>[..]
> >>>
> >>>>
> >>>>So, this may not still be possible depending on whether the KVM side
> >>>>of this is already merged, but it occurs to me that there's a simpler
> >>>>way.
> >>>>
> >>>>Rather than mucking about with having to update the hypervisor on the
> >>>>RTAS location, they have qemu copy the code out of RTAS, patch it and
> >>>>copy it back into the vector, you could instead do this:
> >>>>
> >>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
> >>>>machine check, cause a special exit to qemu.
> >>>>
> >>>>   2. Have the register-nmi RTAS call store the guest side MC handler
> >>>>address in the spapr structure, but perform no actual guest code
> >>>>patching.
> >>>>
> >>>>   3. Allocate the error log buffer independently from the RTAS blob,
> >>>>so qemu always knows where it is.
> >>>>
> >>>>   4. When qemu gets the MC exit condition, instead of going via a
> >>>>patched 0x200 vector, just directly set the guest register state and
> >>>>jump straight into the guest side MC handler.
> >>>>
> >>>
> >>>Before I proceed further I would like to know what others think about
> >>>the approach proposed above (except for step 3 - as per PAPR the error
> >>>log buffer should be part of RTAS blob and hence we cannot have error
> >>>log buffer independent of RTAS blob).
> >>>
> >>>Alex, Alexey, Ben: Any thoughts?
> >>
> >>
> >>Any updates about FWNMI? Thanks
> >
> >Huh.. I'd completely forgotten about this.  Aravinda, can you repost
> >your latest work on this?
> 
> 
> Aravinda disappeared...

Ok, well someone who cares about FWNMI is going to have to start
sending something, or it won't happen.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-07-03  6:01           ` David Gibson
@ 2015-07-08  8:28             ` Aravinda Prasad
  2015-08-07  3:37               ` Sam Bobroff
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2015-07-08  8:28 UTC (permalink / raw)
  To: David Gibson
  Cc: benh, aik, Alexey Kardashevskiy, Alexander Graf, qemu-devel,
	paulus, qemu-ppc



On Friday 03 July 2015 11:31 AM, David Gibson wrote:
> On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
>> On 04/02/2015 03:46 PM, David Gibson wrote:
>>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
>>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
>>>>>
>>>>>
>>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>>>>
>>>>> [..]
>>>>>
>>>>>>
>>>>>> So, this may not still be possible depending on whether the KVM side
>>>>>> of this is already merged, but it occurs to me that there's a simpler
>>>>>> way.
>>>>>>
>>>>>> Rather than mucking about with having to update the hypervisor on the
>>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
>>>>>> copy it back into the vector, you could instead do this:
>>>>>>
>>>>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
>>>>>> machine check, cause a special exit to qemu.
>>>>>>
>>>>>>   2. Have the register-nmi RTAS call store the guest side MC handler
>>>>>> address in the spapr structure, but perform no actual guest code
>>>>>> patching.
>>>>>>
>>>>>>   3. Allocate the error log buffer independently from the RTAS blob,
>>>>>> so qemu always knows where it is.
>>>>>>
>>>>>>   4. When qemu gets the MC exit condition, instead of going via a
>>>>>> patched 0x200 vector, just directly set the guest register state and
>>>>>> jump straight into the guest side MC handler.
>>>>>>
>>>>>
>>>>> Before I proceed further I would like to know what others think about
>>>>> the approach proposed above (except for step 3 - as per PAPR the error
>>>>> log buffer should be part of RTAS blob and hence we cannot have error
>>>>> log buffer independent of RTAS blob).
>>>>>
>>>>> Alex, Alexey, Ben: Any thoughts?
>>>>
>>>>
>>>> Any updates about FWNMI? Thanks
>>>
>>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
>>> your latest work on this?
>>
>>
>> Aravinda disappeared...
> 
> Ok, well someone who cares about FWNMI is going to have to start
> sending something, or it won't happen.

I am yet to work on the new approach proposed above. I will start
looking into that this week.

> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-07-08  8:28             ` Aravinda Prasad
@ 2015-08-07  3:37               ` Sam Bobroff
  2015-08-09 13:53                 ` Alexander Graf
  2015-09-01  6:21                 ` Aravinda Prasad
  0 siblings, 2 replies; 66+ messages in thread
From: Sam Bobroff @ 2015-08-07  3:37 UTC (permalink / raw)
  To: qemu-ppc
  Cc: benh, Alexey Kardashevskiy, qemu-devel, paulus, aravinda, David Gibson

Hello Aravinda and all,

On Wed, Jul 08, 2015 at 01:58:13PM +0530, Aravinda Prasad wrote:
> On Friday 03 July 2015 11:31 AM, David Gibson wrote:
> > On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
> >> On 04/02/2015 03:46 PM, David Gibson wrote:
> >>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
> >>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
> >>>>>
> >>>>>
> >>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> >>>>>
> >>>>> [..]
> >>>>>
> >>>>>>
> >>>>>> So, this may not still be possible depending on whether the KVM side
> >>>>>> of this is already merged, but it occurs to me that there's a simpler
> >>>>>> way.
> >>>>>>
> >>>>>> Rather than mucking about with having to update the hypervisor on the
> >>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
> >>>>>> copy it back into the vector, you could instead do this:
> >>>>>>
> >>>>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
> >>>>>> machine check, cause a special exit to qemu.
> >>>>>>
> >>>>>>   2. Have the register-nmi RTAS call store the guest side MC handler
> >>>>>> address in the spapr structure, but perform no actual guest code
> >>>>>> patching.
> >>>>>>
> >>>>>>   3. Allocate the error log buffer independently from the RTAS blob,
> >>>>>> so qemu always knows where it is.
> >>>>>>
> >>>>>>   4. When qemu gets the MC exit condition, instead of going via a
> >>>>>> patched 0x200 vector, just directly set the guest register state and
> >>>>>> jump straight into the guest side MC handler.
> >>>>>>
> >>>>>
> >>>>> Before I proceed further I would like to know what others think about
> >>>>> the approach proposed above (except for step 3 - as per PAPR the error
> >>>>> log buffer should be part of RTAS blob and hence we cannot have error
> >>>>> log buffer independent of RTAS blob).
> >>>>>
> >>>>> Alex, Alexey, Ben: Any thoughts?
> >>>>
> >>>>
> >>>> Any updates about FWNMI? Thanks
> >>>
> >>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
> >>> your latest work on this?
> >>
> >>
> >> Aravinda disappeared...
> > 
> > Ok, well someone who cares about FWNMI is going to have to start
> > sending something, or it won't happen.
> 
> I am yet to work on the new approach proposed above. I will start
> looking into that this week.

The RTAS call being discussed in this thread actually has two vectors to patch
(System Reset and Machine Check), and the patches so far only address the
Machine Check part. I've been looking at filling in the System Reset part and
that will mean basing my code on top of this set.  I would like to keep the
same style of solution for both vectors, so I'd like to get the discussion
started again :-)

So (1) do we use a trampoline in guest memory, and if so (2) how is the
trampoline code handled?

(1) It does seem simpler to me to deliver directly to the handler, but I'm
worried about a few things:

If a guest were to call ibm,nmi-register and then kexec to a new kernel that
does not call ibm,nmi-register, would the exception cause a jump to a stale
address?

Because we're adding a new exit condition, presumably an upgraded KVM would
require an upgraded QEMU: is this much of a problem?

From some investigation it looks like the current upstream KVM already
forwards (some) host machine checks to the guest by sending it directly to
0x200 and that Linux guests expect this, regardless of support in the host for
ibm,nmi-register (although they do call ibm,nmi-register if it's present).

(2) If we are using trampolines:

About the trampoline code in the v3 patches: I like producing the code using
the assembler, but I'm not sure that the spapr-rtas blob is the right place to
store it. The spapr-rtas blob is loaded into guest memory but it's only QEMU
that needs it. It seems messy to me and means that the guest could corrupt it.

Some other options might be:

(a) Create a new blob (spapr-rtas-trampoline?) just like the spapr-rtas one but
only load it when ibm,nmi-register is called, and only into QEMU not the guest
memory. There would be another "BIOS" blob to install, and it wouldn't really
be BIOS, but it seems like it would work easily.  Since we need a
second, different, trampoline for System Reset, I would then need to add yet
another blob for that... Still, this doesn't seem so bad. I suppose we could
add some structure to the blob (e.g. a table of contents at the start) and fit
both trampolines in, but that's inventing yet another file format... ugh.

(b) As above but assemble the trampoline code into an ELF dynamic library
rather than stripping it down to a raw binary: we could use known symbols to
find the trampolines, even the patch locations, so at least we wouldn't be
inventing our own format (using dlopen()/dlsym()... I wonder if this would be
OK for all platforms...).

(c) Assemble it (as above) but include it directly in the QEMU binary by
objcopying it in or hexdumping into a C string or something similar. This seems
fairly neat but I'm not sure how people would feel about including "binaries"
into QEMU this way.  Although it would take some work in the build system, it
seems like a fairly neat solution to me.

Cheers,
Sam.


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-08-07  3:37               ` Sam Bobroff
@ 2015-08-09 13:53                 ` Alexander Graf
  2015-08-10  4:05                   ` Sam Bobroff
  2015-09-03  2:02                   ` Paul Mackerras
  2015-09-01  6:21                 ` Aravinda Prasad
  1 sibling, 2 replies; 66+ messages in thread
From: Alexander Graf @ 2015-08-09 13:53 UTC (permalink / raw)
  To: Sam Bobroff, qemu-ppc; +Cc: aravinda, benh, paulus, qemu-devel, David Gibson



On 07.08.15 05:37, Sam Bobroff wrote:
> Hello Aravinda and all,
> 
> On Wed, Jul 08, 2015 at 01:58:13PM +0530, Aravinda Prasad wrote:
>> On Friday 03 July 2015 11:31 AM, David Gibson wrote:
>>> On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/02/2015 03:46 PM, David Gibson wrote:
>>>>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>>>>>>
>>>>>>> [..]
>>>>>>>
>>>>>>>>
>>>>>>>> So, this may not still be possible depending on whether the KVM side
>>>>>>>> of this is already merged, but it occurs to me that there's a simpler
>>>>>>>> way.
>>>>>>>>
>>>>>>>> Rather than mucking about with having to update the hypervisor on the
>>>>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
>>>>>>>> copy it back into the vector, you could instead do this:
>>>>>>>>
>>>>>>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
>>>>>>>> machine check, cause a special exit to qemu.
>>>>>>>>
>>>>>>>>   2. Have the register-nmi RTAS call store the guest side MC handler
>>>>>>>> address in the spapr structure, but perform no actual guest code
>>>>>>>> patching.
>>>>>>>>
>>>>>>>>   3. Allocate the error log buffer independently from the RTAS blob,
>>>>>>>> so qemu always knows where it is.
>>>>>>>>
>>>>>>>>   4. When qemu gets the MC exit condition, instead of going via a
>>>>>>>> patched 0x200 vector, just directly set the guest register state and
>>>>>>>> jump straight into the guest side MC handler.
>>>>>>>>
>>>>>>>
>>>>>>> Before I proceed further I would like to know what others think about
>>>>>>> the approach proposed above (except for step 3 - as per PAPR the error
>>>>>>> log buffer should be part of RTAS blob and hence we cannot have error
>>>>>>> log buffer independent of RTAS blob).
>>>>>>>
>>>>>>> Alex, Alexey, Ben: Any thoughts?
>>>>>>
>>>>>>
>>>>>> Any updates about FWNMI? Thanks
>>>>>
>>>>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
>>>>> your latest work on this?
>>>>
>>>>
>>>> Aravinda disappeared...
>>>
>>> Ok, well someone who cares about FWNMI is going to have to start
>>> sending something, or it won't happen.
>>
>> I am yet to work on the new approach proposed above. I will start
>> looking into that this week.
> 
> The RTAS call being discussed in this thread actually has two vectors to patch
> (System Reset and Machine Check), and the patches so far only address the
> Machine Check part. I've been looking at filling in the System Reset part and
> that will mean basing my code on top of this set.  I would like to keep the
> same style of solution for both vectors, so I'd like to get the discussion
> started again :-)
> 
> So (1) do we use a trampoline in guest memory, and if so (2) how is the
> trampoline code handled?
> 
> (1) It does seem simpler to me to deliver directly to the handler, but I'm
> worried about a few things:
> 
> If a guest were to call ibm,nmi-register and then kexec to a new kernel that
> does not call ibm,nmi-register, would the exception cause a jump to a stale
> address?

Probably - how does that get handled today with pHyp? Does pHyp just
override the actual exception vector code and thus the kexec'ed code
path gets overwritten?

I don't remember the original patch set fully, but if all we need is to
override 0x200, why can't we replace the code with

  mtsprg scratch, r0
  li r0, HCALL_KVM_MC
  sc 1

then there is no complexity in that code at all with dynamically patched
bits. Or am I missing the obvious?
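
For illustration, a small C sketch of how QEMU could emit those three
instructions as words to copy to 0x200.  The choice of SPRG2 as the
scratch register and the HCALL_KVM_MC value are placeholders picked for
the example, not values from the patch set; the helper names are
hypothetical too.

```c
#include <assert.h>
#include <stdint.h>

#define SPR_SPRG2     274u     /* assumed scratch SPR for this sketch */
#define HCALL_KVM_MC  0x7000u  /* hypothetical hcall number */

/* mtspr spr, rs: the 10-bit SPR number is stored as two swapped halves. */
static uint32_t ppc_mtspr(unsigned spr, unsigned rs)
{
    assert(spr < 1024 && rs < 32);
    return 0x7C0003A6u | (rs << 21) |
           ((spr & 0x1Fu) << 16) | (((spr >> 5) & 0x1Fu) << 11);
}

/* li rt, imm is addi rt, 0, imm; imm is sign-extended by the CPU. */
static uint32_t ppc_li(unsigned rt, uint16_t imm)
{
    assert(rt < 32 && imm <= 0x7FFF);  /* keep the immediate positive */
    return 0x38000000u | (rt << 21) | imm;
}

/* sc 1: system call with LEV=1, i.e. a hypercall. */
static uint32_t ppc_sc1(void)
{
    return 0x44000022u;
}

/* Fill out[] with the three-instruction stub, in execution order. */
static void build_mc_stub(uint32_t out[3])
{
    out[0] = ppc_mtspr(SPR_SPRG2, 0);   /* mtsprg scratch, r0 */
    out[1] = ppc_li(0, HCALL_KVM_MC);   /* li r0, HCALL_KVM_MC */
    out[2] = ppc_sc1();                 /* sc 1 */
}
```

Since the stub is fixed, nothing in it depends on the handler address
the guest registered, so there is nothing to patch at registration
time.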

> 
> Because we're adding a new exit condition, presumably an upgraded KVM would
> require an upgraded QEMU: is this much of a problem?

Well, you would keep default behavior identical. On nmi-register QEMU
would send an ioctl to KVM, telling it to route 0x200 to QEMU instead
(just like with breakpoints). So old QEMU would still work the same way
and new QEMU with old KVM would simply get non-working MC intercepts.
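
As a sketch of that compatibility argument, here is a tiny C model.  It
does not use the real KVM interface: struct kvm_model,
kvm_enable_mc_exit() and route_machine_check() are hypothetical names
standing in for whatever ioctl would be added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not the KVM or QEMU API. */
enum mc_route { MC_TO_GUEST_VECTOR, MC_TO_QEMU };

struct kvm_model {
    bool has_mc_exit;      /* does this KVM know the new exit type? */
    bool mc_exit_enabled;  /* has QEMU asked for machine checks? */
};

/*
 * What a new QEMU would do on ibm,nmi-register.  Fails on an old KVM,
 * in which case machine checks keep going to the guest's 0x200 vector.
 */
static bool kvm_enable_mc_exit(struct kvm_model *kvm)
{
    assert(kvm != NULL);
    if (!kvm->has_mc_exit) {
        return false;
    }
    kvm->mc_exit_enabled = true;
    return true;
}

/* Where KVM sends a guest machine check. */
static enum mc_route route_machine_check(const struct kvm_model *kvm)
{
    assert(kvm != NULL);
    return kvm->mc_exit_enabled ? MC_TO_QEMU : MC_TO_GUEST_VECTOR;
}
```

An old QEMU never calls kvm_enable_mc_exit(), so a new KVM keeps the
default route; a new QEMU on an old KVM sees the call fail and also
stays on the default route.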

> 
> From some investigation it looks like the current upstream KVM already
> forwards (some) host machine checks to the guest by sending it directly to
> 0x200 and that Linux guests expect this, regardless of support in the host for
> ibm,nmi-register (although they do call ibm,nmi-register if it's present).
> 
> (2) If we are using trampolines:
> 
> About the trampoline code in the v3 patches: I like producing the code using
> the assembler, but I'm not sure that the spapr-rtas blob is the right place to
> store it. The spapr-rtas blob is loaded into guest memory but it's only QEMU
> that needs it. It seems messy to me and means that the guest could corrupt it.

If you like, rename the blob. My original proposal was to just use
well-known offsets inside the blob that get indicated through a function
pointer table at the beginning/end/known location.

> 
> Some other other options might be:
> 
> (a) Create a new blob (spapr-rtas-trampoline?) just like the spapr-rtas one but
> only load it when ibm,nmi-register is called, and only into QEMU not the guest
> memory. There would be another "BIOS" blob to install, and it wouldn't really
> actually be BIOS but it seems like it would work easily.  Since we need a
> second, different, trampoline for System Reset, I would then need to add yet
> another blob for that... Still, this doesn't seem so bad. I suppose we could
> add some structure to the blob (e.g. a table of contents at the start) and fit
> both trampolines in, but that's inventing yet another file format... ugh.

Yes, I think inventing our own file format is the best way forward. It
shouldn't be too bad. Just reserve, say, ten 64-bit values somewhere and use
them as a function table.
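
A rough C sketch of such a lookup, under some assumptions that are mine
and not settled anywhere in this thread: the table sits at the very
start of the blob, each slot is a big-endian byte offset from the start
of the blob, and the slot assignments below are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 10  /* ten 64-bit values, as suggested above */

/* Hypothetical slot assignments. */
enum { SLOT_MC_TRAMPOLINE = 0, SLOT_SRESET_TRAMPOLINE = 1 };

/* Read a big-endian 64-bit value regardless of host endianness. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Return a pointer to the code for the given slot, or NULL if the slot
 * is out of range, unused (offset 0), or points into the table itself
 * or past the end of the blob.
 */
static const uint8_t *blob_entry(const uint8_t *blob, size_t size,
                                 unsigned slot)
{
    const size_t table_bytes = TABLE_SLOTS * sizeof(uint64_t);

    assert(blob != NULL);
    if (slot >= TABLE_SLOTS || size < table_bytes) {
        return NULL;
    }
    uint64_t off = load_be64(blob + slot * sizeof(uint64_t));
    if (off < table_bytes || off >= size) {
        return NULL;
    }
    return blob + off;
}
```

Unused slots leave room for a System Reset trampoline or later
additions without changing the layout again.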

> (b) As above but assemble the trampoline code into an ELF dynamic library
> rather than stripping it down to a raw binary: we could use known symbols to
> find the trampolines, even the patch locations, so at least we wouldn't be
> inventing our own format (using dlopen()/dlsym()... I wonder if this would be
> OK for all platforms...).

We have our own ELF loader in QEMU so it's workable, but I think it's
actually more complicated and harder at the end of the day than (a).

> 
> (c) Assemble it (as above) but include it directly in the QEMU binary by
> objcopying it in or hexdumping into a C string or something similar. This seems
> fairly neat but I'm not sure how people would feel about including "binaries"
> into QEMU this way.  Although it would take some work in the build system, it
> seems like a fairly neat solution to me.

We tried to move away from code as hex arrays in QEMU to make it easier
for people to patch things when they want to. But then again if we're
talking 3 instructions it might not be the worst option.
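
As a rough illustration of option (c) for a three-instruction trampoline
(mtsprg / li / sc 1), the words could be produced by small encoder helpers
rather than pasted in as opaque hex. This is only a sketch: the choice of
SPRG2 as the scratch register and the HCALL_KVM_MC_NUM value are assumptions,
and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SPR_SPRG2        274     /* assumption: SPRG2 is the scratch SPR */
#define HCALL_KVM_MC_NUM 0x7f00  /* hypothetical token, not a real hcall number */

/* mtspr SPR,RS: primary opcode 31, extended opcode 467, SPR halves swapped. */
static uint32_t ppc_mtspr(unsigned spr, unsigned rs)
{
    assert(spr < 1024 && rs < 32);
    uint32_t field = ((spr & 0x1f) << 5) | (spr >> 5);
    return 0x7c0003a6u | ((uint32_t)rs << 21) | (field << 11);
}

/* li RT,imm is addi RT,0,imm: primary opcode 14, 16-bit signed immediate. */
static uint32_t ppc_li(unsigned rt, int16_t imm)
{
    assert(rt < 32);
    return 0x38000000u | ((uint32_t)rt << 21) | (uint16_t)imm;
}

/* sc LEV: primary opcode 17; "sc 1" is the hypercall form. */
static uint32_t ppc_sc(unsigned lev)
{
    assert(lev < 128);
    return 0x44000002u | ((uint32_t)lev << 5);
}

/* Fill a three-word trampoline; words are host-order instruction values. */
static void build_mc_trampoline(uint32_t out[3])
{
    out[0] = ppc_mtspr(SPR_SPRG2, 0);      /* mtsprg 2, r0        */
    out[1] = ppc_li(0, HCALL_KVM_MC_NUM);  /* li r0, HCALL_KVM_MC */
    out[2] = ppc_sc(1);                    /* sc 1                */
}
```

The words would still have to be written to guest memory in the guest's byte
order; that step is left out here.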


Alex


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-08-09 13:53                 ` Alexander Graf
@ 2015-08-10  4:05                   ` Sam Bobroff
  2015-09-01 11:07                     ` Aravinda Prasad
  2015-09-03  2:02                   ` Paul Mackerras
  1 sibling, 1 reply; 66+ messages in thread
From: Sam Bobroff @ 2015-08-10  4:05 UTC (permalink / raw)
  To: Alexander Graf; +Cc: benh, qemu-devel, paulus, aravinda, qemu-ppc, David Gibson

On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
> 
> 
> On 07.08.15 05:37, Sam Bobroff wrote:
> > Hello Aravinda and all,
> > 
> > On Wed, Jul 08, 2015 at 01:58:13PM +0530, Aravinda Prasad wrote:
> >> On Friday 03 July 2015 11:31 AM, David Gibson wrote:
> >>> On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
> >>>> On 04/02/2015 03:46 PM, David Gibson wrote:
> >>>>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
> >>>>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> >>>>>>>
> >>>>>>> [..]
> >>>>>>>
> >>>>>>>>
> >>>>>>>> So, this may not still be possible depending on whether the KVM side
> >>>>>>>> of this is already merged, but it occurs to me that there's a simpler
> >>>>>>>> way.
> >>>>>>>>
> >>>>>>>> Rather than mucking about with having to update the hypervisor on the
> >>>>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
> >>>>>>>> copy it back into the vector, you could instead do this:
> >>>>>>>>
> >>>>>>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
> >>>>>>>> machine check, cause a special exit to qemu.
> >>>>>>>>
> >>>>>>>>   2. Have the register-nmi RTAS call store the guest side MC handler
> >>>>>>>> address in the spapr structure, but perform no actual guest code
> >>>>>>>> patching.
> >>>>>>>>
> >>>>>>>>   3. Allocate the error log buffer independently from the RTAS blob,
> >>>>>>>> so qemu always knows where it is.
> >>>>>>>>
> >>>>>>>>   4. When qemu gets the MC exit condition, instead of going via a
> >>>>>>>> patched 0x200 vector, just directly set the guest register state and
> >>>>>>>> jump straight into the guest side MC handler.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Before I proceed further I would like to know what others think about
> >>>>>>> the approach proposed above (except for step 3 - as per PAPR the error
> >>>>>>> log buffer should be part of RTAS blob and hence we cannot have error
> >>>>>>> log buffer independent of RTAS blob).
> >>>>>>>
> >>>>>>> Alex, Alexey, Ben: Any thoughts?
> >>>>>>
> >>>>>>
> >>>>>> Any updates about FWNMI? Thanks
> >>>>>
> >>>>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
> >>>>> your latest work on this?
> >>>>
> >>>>
> >>>> Aravinda disappeared...
> >>>
> >>> Ok, well someone who cares about FWNMI is going to have to start
> >>> sending something, or it won't happen.
> >>
> >> I am yet to work on the new approach proposed above. I will start
> >> looking into that this week.
> > 
> > The RTAS call being discussed in this thread actually has two vectors to patch
> > (System Reset and Machine Check), and the patches so far only address the
> > Machine Check part. I've been looking at filling in the System Reset part and
> > that will mean basing my code on top of this set.  I would like to keep the
> > same style of solution for both vectors, so I'd like to get the discussion
> > started again :-)
> > 
> > So (1) do we use a trampoline in guest memory, and if so (2) how is the
> > trampoline code handled?
> > 
> > (1) It does seem simpler to me to deliver directly to the handler, but I'm
> > worried about a few things:
> > 
> > If a guest were to call ibm,nmi-register and then kexec to a new kernel that
> > does not call ibm,nmi-register, would the exception cause a jump to a stale
> > address?
> 
> Probably - how does that get handled today with pHyp? Does pHyp just
> override the actual exception vector code and thus the kexec'ed code
> path gets overwritten?

Yes. According to PAPR, when ibm,nmi-register is called the guest
"relinquishes" the whole 256 bytes of vector at 0x100 and 0x200 to the
hypervisor. It never mentions a way to get them back but it does jump via the
vector so if a guest were to rewrite it, it should work the way we expect.

> I don't remember the original patch set fully, but if all we need is to
> override 0x200, why can't we replace the code with
> 
>   mtsprg scratch, r0
>   li r0, HCALL_KVM_MC
>   sc 1
> 
> then there is no complexity in that code at all with dynamically patched
> bits. Or am I missing the obvious?

There is more complexity in the patches because PAPR requires the hypervisor do
some work before invoking the guest's handler. The patch set does this by
writing a trampoline (roughly) like this:
* Call a new private hcall to set up the required state (re-trying if necessary).
* Return to the trampoline code.
* Jump via a patched branch instruction to the guest's handler.

So it's a bit roundabout but gets the job done.

If what you're suggesting is that we replace this by a single (new, private)
hcall that sets up the state and jumps the guest to the handler then I think it
might be a good compromise. It simplifies the trampoline code and doesn't
suffer from the kexec problem. The only issue would be that it would be an odd
hcall: rather than returning to the caller like a normal hcall, it would jump
out to some other address, and this jump would be by QEMU manipulating the
guest state.

If we followed this approach for both 0x100 and 0x200, maybe we should re-use
the hcall: it could either take a parameter or switch based on where it was
called from (since it's only going to be valid to call it from either the 0x100
or 0x200 vectors).

Maybe call it "HCALL_KVM_FWNMI_TRAMPOLINE"?
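
To make the "non-returning hcall" idea concrete, here is a minimal sketch of
the variant that switches on where it was called from. It is a toy model, not
QEMU code: struct guest_cpu, struct fwnmi_handlers and
fwnmi_trampoline_hcall() are hypothetical names, and the PAPR-required state
setup (error log, saved registers) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_SRESET 0x100u
#define VEC_MCHECK 0x200u
#define VEC_SIZE   0x100u     /* each vector spans 256 bytes */

struct guest_cpu {            /* hypothetical stand-in for real vCPU state */
    uint64_t nip;             /* address the guest resumes at */
};

struct fwnmi_handlers {       /* addresses recorded by ibm,nmi-register */
    uint64_t sreset_addr;
    uint64_t mcheck_addr;
};

static bool in_vector(uint64_t nip, uint64_t base)
{
    return nip >= base && nip < base + VEC_SIZE;
}

/*
 * Pick the handler from the address the hcall was issued at and redirect the
 * guest there instead of returning to the caller. Returns false, leaving nip
 * untouched, when called from outside the two vectors or when no handler has
 * been registered.
 */
static bool fwnmi_trampoline_hcall(struct guest_cpu *cpu,
                                   const struct fwnmi_handlers *h)
{
    uint64_t target;

    assert(cpu != NULL && h != NULL);
    if (in_vector(cpu->nip, VEC_SRESET)) {
        target = h->sreset_addr;
    } else if (in_vector(cpu->nip, VEC_MCHECK)) {
        target = h->mcheck_addr;
    } else {
        return false;
    }
    if (target == 0) {
        return false;
    }
    cpu->nip = target;        /* the "jump": resume the guest in its handler */
    return true;
}
```

Because the handler addresses live on the QEMU side in this model, nothing in
the guest's vector needs to be patched with a branch target.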

> > Because we're adding a new exit condition, presumably an upgraded KVM would
> > require an upgraded QEMU: is this much of a problem?
> 
> Well, you would keep default behavior identical. On nmi-register QEMU
> would send an ioctl to KVM, telling it to route 0x200 to QEMU instead
> (just like with breakpoints). So old QEMU would still work the same way
> and new QEMU with old KVM would simply get non-working MC intercepts.

Great, doesn't sound like a problem :-)

> > From some investigation it looks like the current upstream KVM already
> > forwards (some) host machine checks to the guest by sending it directly to
> > 0x200 and that Linux guests expect this, regardless of support in the host for
> > ibm,nmi-register (although they do call ibm,nmi-register if it's present).
> > 
> > (2) If we are using trampolines:
> > 
> > About the trampoline code in the v3 patches: I like producing the code using
> > the assembler, but I'm not sure that the spapr-rtas blob is the right place to
> > store it. The spapr-rtas blob is loaded into guest memory but it's only QEMU
> > that needs it. It seems messy to me and means that the guest could corrupt it.
> 
> If you like, rename the blob. My original proposal was to just use
> well-known offsets inside the blob that get indicated through a function
> pointer table at the beginning/end/known location.
> 

OK.

> > Some other options might be:
> > 
> > (a) Create a new blob (spapr-rtas-trampoline?) just like the spapr-rtas one but
> > only load it when ibm,nmi-register is called, and only into QEMU not the guest
> > memory. There would be another "BIOS" blob to install, and it wouldn't really
> > actually be BIOS but it seems like it would work easily.  Since we need a
> > second, different, trampoline for System Reset, I would then need to add yet
> > another blob for that... Still, this doesn't seem so bad. I suppose we could
> > add some structure to the blob (e.g. a table of contents at the start) and fit
> > both trampolines in, but that's inventing yet another file format... ugh.
> 
> Yes, I think inventing our own file format is the best way forward. It
> shouldn't be too bad. Just reserve, say, 10 64-bit values somewhere and use
> them as a function table.

OK.

> > (b) As above but assemble the trampoline code into an ELF dynamic library
> > rather than stripping it down to a raw binary: we could use known symbols to
> > find the trampolines, even the patch locations, so at least we wouldn't be
> > inventing our own format (using dlopen()/dlsym()... I wonder if this would be
> > OK for all platforms...).
> 
> We have our own ELF loader in QEMU so it's workable, but I think it's
> actually more complicated and harder at the end of the day than (a).

OK, fair enough.

> > (c) Assemble it (as above) but include it directly in the QEMU binary by
> > objcopying it in or hexdumping into a C string or something similar. This seems
> > fairly neat but I'm not sure how people would feel about including "binaries"
> > into QEMU this way.  Although it would take some work in the build system, it
> > seems like a fairly neat solution to me.
> 
> We tried to move away from code as hex arrays in QEMU to make it easier
> for people to patch things when they want to. But then again if we're
> talking 3 instructions it might not be the worst option.

Sounds sensible.

So, in summary, it sounds like a decent approach would be:
* store the guest's handlers in QEMU's spapr structure,
* simplify the trampolines down to a single, non-returning, hcall,
* implement the new hcall in QEMU, where it has the handler addresses,
* include the (now tiny) trampoline directly in the code,
* no new blob or blob code,
* no changes to KVM,
* no problems with kexec.

Aravinda: what do you think?

> Alex

Thanks Alex,
Sam.


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-08-07  3:37               ` Sam Bobroff
  2015-08-09 13:53                 ` Alexander Graf
@ 2015-09-01  6:21                 ` Aravinda Prasad
  1 sibling, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2015-09-01  6:21 UTC (permalink / raw)
  To: Sam Bobroff
  Cc: benh, Alexey Kardashevskiy, qemu-devel, paulus, qemu-ppc, David Gibson



On Friday 07 August 2015 09:07 AM, Sam Bobroff wrote:
> Hello Aravinda and all,
> 
> On Wed, Jul 08, 2015 at 01:58:13PM +0530, Aravinda Prasad wrote:
>> On Friday 03 July 2015 11:31 AM, David Gibson wrote:
>>> On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/02/2015 03:46 PM, David Gibson wrote:
>>>>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>>>>>>
>>>>>>> [..]
>>>>>>>
>>>>>>>>
>>>>>>>> So, this may not still be possible depending on whether the KVM side
>>>>>>>> of this is already merged, but it occurs to me that there's a simpler
>>>>>>>> way.
>>>>>>>>
>>>>>>>> Rather than mucking about with having to update the hypervisor on the
>>>>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
>>>>>>>> copy it back into the vector, you could instead do this:
>>>>>>>>
>>>>>>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
>>>>>>>> machine check, cause a special exit to qemu.
>>>>>>>>
>>>>>>>>   2. Have the register-nmi RTAS call store the guest side MC handler
>>>>>>>> address in the spapr structure, but perform no actual guest code
>>>>>>>> patching.
>>>>>>>>
>>>>>>>>   3. Allocate the error log buffer independently from the RTAS blob,
>>>>>>>> so qemu always knows where it is.
>>>>>>>>
>>>>>>>>   4. When qemu gets the MC exit condition, instead of going via a
>>>>>>>> patched 0x200 vector, just directly set the guest register state and
>>>>>>>> jump straight into the guest side MC handler.
>>>>>>>>
>>>>>>>
>>>>>>> Before I proceed further I would like to know what others think about
>>>>>>> the approach proposed above (except for step 3 - as per PAPR the error
>>>>>>> log buffer should be part of RTAS blob and hence we cannot have error
>>>>>>> log buffer independent of RTAS blob).
>>>>>>>
>>>>>>> Alex, Alexey, Ben: Any thoughts?
>>>>>>
>>>>>>
>>>>>> Any updates about FWNMI? Thanks
>>>>>
>>>>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
>>>>> your latest work on this?
>>>>
>>>>
>>>> Aravinda disappeared...
>>>
>>> Ok, well someone who cares about FWNMI is going to have to start
>>> sending something, or it won't happen.
>>
>> I am yet to work on the new approach proposed above. I will start
>> looking into that this week.
> 
> The RTAS call being discussed in this thread actually has two vectors to patch
> (System Reset and Machine Check), and the patches so far only address the
> Machine Check part. I've been looking at filling in the System Reset part and
> that will mean basing my code on top of this set.  I would like to keep the
> same style of solution for both vectors, so I'd like to get the discussion
> started again :-)
> 
> So (1) do we use a trampoline in guest memory, and if so (2) how is the
> trampoline code handled?
> 
> (1) It does seem simpler to me to deliver directly to the handler, but I'm
> worried about a few things:
> 
> If a guest were to call ibm,nmi-register and then kexec to a new kernel that
> does not call ibm,nmi-register, would the exception cause a jump to a stale
> address?

If a kexec kernel does not call ibm,nmi-register, then an exception can
lead to a jump to a stale address in the kexec kernel. This can also
happen with the v3 patches, i.e., even if we don't take the approach of
delivering directly to the handler. Or is there something else that I
am missing?

> 
> Because we're adding a new exit condition, presumably an upgraded KVM would
> require an upgraded QEMU: is this much of a problem?
> 
> From some investigation it looks like the current upstream KVM already
> forwards (some) host machine checks to the guest by sending it directly to
> 0x200 and that Linux guests expect this, regardless of support in the host for
> ibm,nmi-register (although they do call ibm,nmi-register if it's present).

Upstream KVM was modified to route MCEs to the guest's 0x200 vector as
part of the machine check handling work. AFAIR, MCEs were previously
delivered directly to QEMU.

Regards,
Aravinda

> 
> (2) If we are using trampolines:
> 
> About the trampoline code in the v3 patches: I like producing the code using
> the assembler, but I'm not sure that the spapr-rtas blob is the right place to
> store it. The spapr-rtas blob is loaded into guest memory but it's only QEMU
> that needs it. It seems messy to me and means that the guest could corrupt it.
> 
> Some other options might be:
> 
> (a) Create a new blob (spapr-rtas-trampoline?) just like the spapr-rtas one but
> only load it when ibm,nmi-register is called, and only into QEMU not the guest
> memory. There would be another "BIOS" blob to install, and it wouldn't really
> actually be BIOS but it seems like it would work easily.  Since we need a
> second, different, trampoline for System Reset, I would then need to add yet
> another blob for that... Still, this doesn't seem so bad. I suppose we could
> add some structure to the blob (e.g. a table of contents at the start) and fit
> both trampolines in, but that's inventing yet another file format... ugh.
> 
> (b) As above but assemble the trampoline code into an ELF dynamic library
> rather than stripping it down to a raw binary: we could use known symbols to
> find the trampolines, even the patch locations, so at least we wouldn't be
> inventing our own format (using dlopen()/dlsym()... I wonder if this would be
> OK for all platforms...).
> 
> (c) Assemble it (as above) but include it directly in the QEMU binary by
> objcopying it in or hexdumping into a C string or something similar. This seems
> fairly neat but I'm not sure how people would feel about including "binaries"
> into QEMU this way.  Although it would take some work in the build system, it
> seems like a fairly neat solution to me.
> 
> Cheers,
> Sam.
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-08-10  4:05                   ` Sam Bobroff
@ 2015-09-01 11:07                     ` Aravinda Prasad
  2015-09-02  6:34                       ` Sam Bobroff
  0 siblings, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2015-09-01 11:07 UTC (permalink / raw)
  To: Sam Bobroff
  Cc: benh, Alexander Graf, qemu-devel, paulus, qemu-ppc, David Gibson



On Monday 10 August 2015 09:35 AM, Sam Bobroff wrote:
> On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
>>
>>
>> On 07.08.15 05:37, Sam Bobroff wrote:
>>> Hello Aravinda and all,
>>>
>>> On Wed, Jul 08, 2015 at 01:58:13PM +0530, Aravinda Prasad wrote:
>>>> On Friday 03 July 2015 11:31 AM, David Gibson wrote:
>>>>> On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
>>>>>> On 04/02/2015 03:46 PM, David Gibson wrote:
>>>>>>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
>>>>>>>>>
>>>>>>>>> [..]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So, this may not still be possible depending on whether the KVM side
>>>>>>>>>> of this is already merged, but it occurs to me that there's a simpler
>>>>>>>>>> way.
>>>>>>>>>>
>>>>>>>>>> Rather than mucking about with having to update the hypervisor on the
>>>>>>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
>>>>>>>>>> copy it back into the vector, you could instead do this:
>>>>>>>>>>
>>>>>>>>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
>>>>>>>>>> machine check, cause a special exit to qemu.
>>>>>>>>>>
>>>>>>>>>>   2. Have the register-nmi RTAS call store the guest side MC handler
>>>>>>>>>> address in the spapr structure, but perform no actual guest code
>>>>>>>>>> patching.
>>>>>>>>>>
>>>>>>>>>>   3. Allocate the error log buffer independently from the RTAS blob,
>>>>>>>>>> so qemu always knows where it is.
>>>>>>>>>>
>>>>>>>>>>   4. When qemu gets the MC exit condition, instead of going via a
>>>>>>>>>> patched 0x200 vector, just directly set the guest register state and
>>>>>>>>>> jump straight into the guest side MC handler.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Before I proceed further I would like to know what others think about
>>>>>>>>> the approach proposed above (except for step 3 - as per PAPR the error
>>>>>>>>> log buffer should be part of RTAS blob and hence we cannot have error
>>>>>>>>> log buffer independent of RTAS blob).
>>>>>>>>>
>>>>>>>>> Alex, Alexey, Ben: Any thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>> Any updates about FWNMI? Thanks
>>>>>>>
>>>>>>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
>>>>>>> your latest work on this?
>>>>>>
>>>>>>
>>>>>> Aravinda disappeared...
>>>>>
>>>>> Ok, well someone who cares about FWNMI is going to have to start
>>>>> sending something, or it won't happen.
>>>>
>>>> I am yet to work on the new approach proposed above. I will start
>>>> looking into that this week.
>>>
>>> The RTAS call being discussed in this thread actually has two vectors to patch
>>> (System Reset and Machine Check), and the patches so far only address the
>>> Machine Check part. I've been looking at filling in the System Reset part and
>>> that will mean basing my code on top of this set.  I would like to keep the
>>> same style of solution for both vectors, so I'd like to get the discussion
>>> started again :-)
>>>
>>> So (1) do we use a trampoline in guest memory, and if so (2) how is the
>>> trampoline code handled?
>>>
>>> (1) It does seem simpler to me to deliver directly to the handler, but I'm
>>> worried about a few things:
>>>
>>> If a guest were to call ibm,nmi-register and then kexec to a new kernel that
>>> does not call ibm,nmi-register, would the exception cause a jump to a stale
>>> address?
>>
>> Probably - how does that get handled today with pHyp? Does pHyp just
>> override the actual exception vector code and thus the kexec'ed code
>> path gets overwritten?
> 
> Yes. According to PAPR, when ibm,nmi-register is called the guest
> "relinquishes" the whole 256 bytes of vector at 0x100 and 0x200 to the
> hypervisor. It never mentions a way to get them back but it does jump via the
> vector so if a guest were to rewrite it, it should work the way we expect.
> 
>> I don't remember the original patch set fully, but if all we need is to
>> override 0x200, why can't we replace the code with
>>
>>   mtsprg scratch, r0
>>   li r0, HCALL_KVM_MC
>>   sc 1
>>
>> then there is no complexity in that code at all with dynamically patched
>> bits. Or am I missing the obvious?
> 
> There is more complexity in the patches because PAPR requires the hypervisor do
> some work before invoking the guest's handler. The patch set does this by
> writing a trampoline (roughly) like this:
> * Call a new private hcall to set up the required state (re-trying if necessary).
> * Return to the trampoline code.
> * Jump via a patched branch instruction to the guest's handler.
> 
> So it's a bit roundabout but gets the job done.
> 
> If what you're suggesting is that we replace this by a single (new, private)
> hcall that sets up the state and jumps the guest to the handler then I think it
> might be a good compromise. It simplifies the trampoline code and doesn't
> suffer from the kexec problem. The only issue would be that it would be an odd
> hcall: rather than returning to the caller like a normal hcall, it would jump
> out to some other address, and this jump would be by QEMU manipulating the
> guest state.
> 
> If we followed this approach for both 0x100 and 0x200, maybe we should re-use
> the hcall: it could either take a parameter or switch based on where it was
> called from (since it's only going to be valid to call it from either the 0x100
> or 0x200 vectors).
> 
> Maybe call it "HCALL_KVM_FWNMI_TRAMPOLINE"?
> 
>>> Because we're adding a new exit condition, presumably an upgraded KVM would
>>> require an upgraded QEMU: is this much of a problem?
>>
>> Well, you would keep default behavior identical. On nmi-register QEMU
>> would send an ioctl to KVM, telling it to route 0x200 to QEMU instead
>> (just like with breakpoints). So old QEMU would still work the same way
>> and new QEMU with old KVM would simply get non-working MC intercepts.
> 
> Great, doesn't sound like a problem :-)
> 
>>> From some investigation it looks like the current upstream KVM already
>>> forwards (some) host machine checks to the guest by sending it directly to
>>> 0x200 and that Linux guests expect this, regardless of support in the host for
>>> ibm,nmi-register (although they do call ibm,nmi-register if it's present).
>>>
>>> (2) If we are using trampolines:
>>>
>>> About the trampoline code in the v3 patches: I like producing the code using
>>> the assembler, but I'm not sure that the spapr-rtas blob is the right place to
>>> store it. The spapr-rtas blob is loaded into guest memory but it's only QEMU
>>> that needs it. It seems messy to me and means that the guest could corrupt it.
>>
>> If you like, rename the blob. My original proposal was to just use
>> well-known offsets inside the blob that get indicated through a function
>> pointer table at the beginning/end/known location.
>>
> 
> OK.
> 
>>> Some other options might be:
>>>
>>> (a) Create a new blob (spapr-rtas-trampoline?) just like the spapr-rtas one but
>>> only load it when ibm,nmi-register is called, and only into QEMU not the guest
>>> memory. There would be another "BIOS" blob to install, and it wouldn't really
>>> actually be BIOS but it seems like it would work easily.  Since we need a
>>> second, different, trampoline for System Reset, I would then need to add yet
>>> another blob for that... Still, this doesn't seem so bad. I suppose we could
>>> add some structure to the blob (e.g. a table of contents at the start) and fit
>>> both trampolines in, but that's inventing yet another file format... ugh.
>>
>> Yes, I think inventing our own file format is the best way forward. It
>> shouldn't be too bad. Just reserve, say, 10 64-bit values somewhere and use
>> them as a function table.
> 
> OK.
> 
>>> (b) As above but assemble the trampoline code into an ELF dynamic library
>>> rather than stripping it down to a raw binary: we could use known symbols to
>>> find the trampolines, even the patch locations, so at least we wouldn't be
>>> inventing our own format (using dlopen()/dlsym()... I wonder if this would be
>>> OK for all platforms...).
>>
>> We have our own ELF loader in QEMU so it's workable, but I think it's
>> actually more complicated and harder at the end of the day than (a).
> 
> OK, fair enough.
> 
>>> (c) Assemble it (as above) but include it directly in the QEMU binary by
>>> objcopying it in or hexdumping into a C string or something similar. This seems
>>> fairly neat but I'm not sure how people would feel about including "binaries"
>>> into QEMU this way.  Although it would take some work in the build system, it
>>> seems like a fairly neat solution to me.
>>
>> We tried to move away from code as hex arrays in QEMU to make it easier
>> for people to patch things when they want to. But then again if we're
>> talking 3 instructions it might not be the worst option.
> 
> Sounds sensible.
> 
> So, in summary, it sounds like a decent approach would be:
> * store the guest's handlers in QEMU's spapr structure,
> * simplify the trampolines down to a single, non-returning, hcall,

However, other instructions, such as saving r3 and re-trying the hcall,
are still required.

> * implement the new hcall in QEMU, where it has the handler addresses,
> * include the (now tiny) trampoline directly in the code,

This was my first approach...

If we simplify the trampoline, we can include it directly in the code.
Assuming the new hcall, the trampoline should shrink by about half, but
it will still have 7 to 8 instructions.

> * no new blob or blob code,

Yes, no new blob or blob code. However, we still need to put the error log
in the RTAS blob as per PAPR.

> * no changes to KVM,
> * no problems with kexec.

Not sure if this solves kexec. What if the kexec kernel does not call
nmi-register? Upon system-reset/machine-check we still jump to the old
stale address.

> 
> Aravinda: what do you think?

Also, we need different trampolines for system reset and machine check
as retrying is not required for system reset.
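
The difference between the two paths can be sketched as follows, assuming the
retry exists because concurrent machine checks contend for one shared error
log buffer. Everything here is a simplified model with hypothetical names
(struct errlog_slot, mc_hcall(), sreset_hcall(), errlog_release()); it is not
the code from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum hcall_status { HC_DONE, HC_BUSY };

struct errlog_slot {          /* hypothetical: one shared error log buffer */
    bool in_use;
    int  owner_cpu;
};

/* Machine check path: claim the shared buffer or ask the caller to retry. */
static enum hcall_status mc_hcall(struct errlog_slot *s, int cpu)
{
    assert(s != NULL);
    if (s->in_use && s->owner_cpu != cpu) {
        return HC_BUSY;       /* trampoline loops and issues the hcall again */
    }
    s->in_use = true;
    s->owner_cpu = cpu;
    return HC_DONE;
}

/* Hypothetical release step, run once the guest has consumed the log. */
static void errlog_release(struct errlog_slot *s, int cpu)
{
    assert(s != NULL && s->in_use && s->owner_cpu == cpu);
    s->in_use = false;
}

/* System reset path: nothing shared to claim in this model, so no retry. */
static enum hcall_status sreset_hcall(int cpu)
{
    (void)cpu;
    return HC_DONE;
}
```

In this model only the machine check trampoline needs a loop around the
hcall, which is why the two trampolines would differ.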

Apart from that, does system reset require any error log to be passed on
to the guest?

This approach sounds good. I will modify the code and post it soon.

Regards,
Aravinda

> 
>> Alex
> 
> Thanks Alex,
> Sam.
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-01 11:07                     ` Aravinda Prasad
@ 2015-09-02  6:34                       ` Sam Bobroff
  2015-09-02 10:37                         ` Aravinda Prasad
  2015-09-02 23:53                         ` David Gibson
  0 siblings, 2 replies; 66+ messages in thread
From: Sam Bobroff @ 2015-09-02  6:34 UTC (permalink / raw)
  To: Aravinda Prasad
  Cc: benh, Alexander Graf, qemu-devel, paulus, qemu-ppc, David Gibson

On Tue, Sep 01, 2015 at 04:37:51PM +0530, Aravinda Prasad wrote:
> 
> 
> On Monday 10 August 2015 09:35 AM, Sam Bobroff wrote:
> > On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
> >>
> >>
> >> On 07.08.15 05:37, Sam Bobroff wrote:
> >>> Hello Aravinda and all,
> >>>
> >>> On Wed, Jul 08, 2015 at 01:58:13PM +0530, Aravinda Prasad wrote:
> >>>> On Friday 03 July 2015 11:31 AM, David Gibson wrote:
> >>>>> On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
> >>>>>> On 04/02/2015 03:46 PM, David Gibson wrote:
> >>>>>>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> >>>>>>>>>
> >>>>>>>>> [..]
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> So, this may not still be possible depending on whether the KVM side
> >>>>>>>>>> of this is already merged, but it occurs to me that there's a simpler
> >>>>>>>>>> way.
> >>>>>>>>>>
> >>>>>>>>>> Rather than mucking about with having to update the hypervisor on the
> >>>>>>>>>> RTAS location, they have qemu copy the code out of RTAS, patch it and
> >>>>>>>>>> copy it back into the vector, you could instead do this:
> >>>>>>>>>>
> >>>>>>>>>>   1. Make KVM instead of immediately delivering a 0x200 for a guest
> >>>>>>>>>> machine check, cause a special exit to qemu.
> >>>>>>>>>>
> >>>>>>>>>>   2. Have the register-nmi RTAS call store the guest side MC handler
> >>>>>>>>>> address in the spapr structure, but perform no actual guest code
> >>>>>>>>>> patching.
> >>>>>>>>>>
> >>>>>>>>>>   3. Allocate the error log buffer independently from the RTAS blob,
> >>>>>>>>>> so qemu always knows where it is.
> >>>>>>>>>>
> >>>>>>>>>>   4. When qemu gets the MC exit condition, instead of going via a
> >>>>>>>>>> patched 0x200 vector, just directly set the guest register state and
> >>>>>>>>>> jump straight into the guest side MC handler.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Before I proceed further I would like to know what others think about
> >>>>>>>>> the approach proposed above (except for step 3 - as per PAPR the error
> >>>>>>>>> log buffer should be part of RTAS blob and hence we cannot have error
> >>>>>>>>> log buffer independent of RTAS blob).
> >>>>>>>>>
> >>>>>>>>> Alex, Alexey, Ben: Any thoughts?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Any updates about FWNMI? Thanks
> >>>>>>>
> >>>>>>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
> >>>>>>> your latest work on this?
> >>>>>>
> >>>>>>
> >>>>>> Aravinda disappeared...
> >>>>>
> >>>>> Ok, well someone who cares about FWNMI is going to have to start
> >>>>> sending something, or it won't happen.
> >>>>
> >>>> I am yet to work on the new approach proposed above. I will start
> >>>> looking into that this week.
> >>>
> >>> The RTAS call being discussed in this thread actually has two vectors to patch
> >>> (System Reset and Machine Check), and the patches so far only address the
> >>> Machine Check part. I've been looking at filling in the System Reset part and
> >>> that will mean basing my code on top of this set.  I would like to keep the
> >>> same style of solution for both vectors, so I'd like to get the discussion
> >>> started again :-)
> >>>
> >>> So (1) do we use a trampoline in guest memory, and if so (2) how is the
> >>> trampoline code handled?
> >>>
> >>> (1) It does seem simpler to me to deliver directly to the handler, but I'm
> >>> worried about a few things:
> >>>
> >>> If a guest were to call ibm,nmi-register and then kexec to a new kernel that
> >>> does not call ibm,nmi-register, would the exception cause a jump to a stale
> >>> address?
> >>
> >> Probably - how does that get handled today with pHyp? Does pHyp just
> >> override the actual exception vector code and thus the kexec'ed code
> >> path gets overwritten?
> > 
> > Yes. According to PAPR, when ibm,nmi-register is called the guest
> > "relinquishes" the whole 256 bytes of vector at 0x100 and 0x200 to the
> > hypervisor. It never mentions a way to get them back but it does jump via the
> > vector so if a guest were to rewrite it, it should work the way we expect.
> > 
> >> I don't remember the original patch set fully, but if all we need is to
> >> override 0x200, why can't we replace the code with
> >>
> >>   mtsprg scratch, r0
> >>   li r0, HCALL_KVM_MC
> >>   sc 1
> >>
> >> then there is no complexity in that code at all with dynamically patched
> >> bits. Or am I missing the obvious?
> > 
> > There is more complexity in the patches because PAPR requires the hypervisor do
> > some work before invoking the guest's handler. The patch set does this by
> > writing a trampoline (roughly) like this:
> > * Call a new private hcall to set up the required state (re-trying if necessary).
> > * Return to the trampoline code.
> > * Jump via a patched branch instruction to the guest's handler.
> > 
> > So it's a bit roundabout but gets the job done.
> > 
> > If what you're suggesting is that we replace this by a single (new, private)
> > hcall that sets up the state and jumps the guest to the handler then I think it
> > might be a good compromise. It simplifies the trampoline code and doesn't
> > suffer from the kexec problem. The only issue would be that it would be an odd
> > hcall: rather than returning to the caller like a normal hcall, it would jump
> > out to some other address, and this jump would be by QEMU manipulating the
> > guest state.
> > 
> > If we followed this approach for both 0x100 and 0x200, maybe we should re-use
> > the hcall: it could either take a parameter or switch based on where it was
> > called from (since it's only going to be valid to call it from either the 0x100
> > or 0x200 vectors).
> > 
> > Maybe call it "HCALL_KVM_FWNMI_TRAMPOLINE"?
> > 
> >>> Because we're adding a new exit condition, presumably an upgraded KVM would
> >>> require an upgraded QEMU: is this much of a problem?
> >>
> >> Well, you would keep default behavior identical. On nmi-register QEMU
> >> would send an ioctl to KVM, telling it to route 0x200 to QEMU instead
> >> (just like with breakpoints). So old QEMU would still work the same way
> >> and new QEMU with old KVM would simply get non-working MC intercepts.
> > 
> > Great, doesn't sound like a problem :-)
> > 
> >>> From some investigation it looks like the current upstream KVM already
> >>> forwards (some) host machine checks to the guest by sending it directly to
> >>> 0x200 and that Linux guests expect this, regardless of support in the host for
> >>> ibm,nmi-register (although they do call ibm,nmi-register if it's present).
> >>>
> >>> (2) If we are using trampolines:
> >>>
> >>> About the trampoline code in the v3 patches: I like producing the code using
> >>> the assembler, but I'm not sure that the spapr-rtas blob is the right place to
> >>> store it. The spapr-rtas blob is loaded into guest memory but it's only QEMU
> >>> that needs it. It seems messy to me and means that the guest could corrupt it.
> >>
> >> If you like, rename the blob. My original proposal was to just use
> >> well-known offsets inside the blob that get indicated through a function
> >> pointer table at the beginning/end/known location.
> >>
> > 
> > OK.
> > 
> >>> Some other options might be:
> >>>
> >>> (a) Create a new blob (spapr-rtas-trampoline?) just like the spapr-rtas one but
> >>> only load it when ibm,nmi-register is called, and only into QEMU not the guest
> >>> memory. There would be another "BIOS" blob to install, and it wouldn't really
> >>> actually be BIOS but it seems like it would work easily.  Since we need a
> >>> second, different, trampoline for System Reset, I would then need to add yet
> >>> another blob for that... Still, this doesn't seem so bad. I suppose we could
> >>> add some structure to the blob (e.g. a table of contents at the start) and fit
> >>> both trampolines in, but that's inventing yet another file format... ugh.
> >>
> >> Yes, I think inventing our own file format is the best way forward. It
> >> shouldn't be too bad. Just reserve, say, 10 64-bit values somewhere and use
> >> them as a function table.
> > 
> > OK.
> > 
> >>> (b) As above but assemble the trampoline code into an ELF dynamic library
> >>> rather than stripping it down to a raw binary: we could use known symbols to
> >>> find the trampolines, even the patch locations, so at least we wouldn't be
> >>> inventing our own format (using dlopen()/dlsym()... I wonder if this would be
> >>> OK for all platforms...).
> >>
> >> We have our own ELF loader in QEMU so it's workable, but I think it's
> >> actually more complicated and harder at the end of the day than (a).
> > 
> > OK, fair enough.
> > 
> >>> (c) Assemble it (as above) but include it directly in the QEMU binary by
> >>> objcopying it in or hexdumping into a C string or something similar. This seems
> >>> fairly neat but I'm not sure how people would feel about including "binaries"
> >>> into QEMU this way.  Although it would take some work in the build system, it
> >>> seems like a fairly neat solution to me.
> >>
> >> We tried to move away from code as hex arrays in QEMU to make it easier
> >> for people to patch things when they want to. But then again if we're
> >> talking 3 instructions it might not be the worst option.
> > 
> > Sounds sensible.
> > 
> > So, in summary, it sounds like a decent approach would be:
> > * store the guest's handlers in QEMU's spapr structure,
> > * simplify the trampolines down to a single, non-returning, hcall,
> 
> However, other instructions such as saving r3 and re-trying hcall are
> still required.

Ah yes, that's true. I was thinking that the retrying could happen inside the
hcall but it can't.

> > * implement the new hcall in QEMU, where it has the handler addresses,
> > * include the (now tiny) trampoline directly in the code,
> 
> This was my first approach...
> 
> If we simplify the trampoline, we can directly include the trampoline in
> the code. With the assumption on the new hcall, trampoline should be
> reduced by half. But it will still have 7 to 8 instructions.

Hmm. That's not as small as I'd hoped.

> > * no new blob or blob code,
> 
> Yes. no new blob or blob code. However, we still need to put error log
> in RTAS blob as per PAPR.
> 
> > * no changes to KVM,
> > * no problems with kexec.
> 
> Not sure if this solves kexec. What if the kexec kernel does not call
> nmi-register? Upon system-reset/machine-check we still jump to the old
> stale address.

I thought that the kexec would overwrite the trampoline code with the new
kernel's default handler, making it safe to use. Is that not the case?

> > Aravinda: what do you think?
> 
> Also, we need different trampolines for system reset and machine check
> as retrying is not required for system reset.

Yes.

> Apart from that, does system reset require any error log to be passed on
> to the guest?

No it doesn't. It just needs to jump to the handler :-)

> This approach sounds good. I will modify the code and post it soon.

Since we can't get the trampoline code as small as I'd hoped, I wonder if we
could go the other way and implement all of h_report_mc_err() directly in the
vector? (The RTAS blob address could be patched in like the handler address.)

We would obviously need to generate this code from the assembler and probably
store it in a new blob, but we could remove the private hcall. Perhaps this
would be better overall. Is there some reason this won't work, and we need it
to be an hcall?

> Regards,
> Aravinda
> 
> > 
> >> Alex
> > 
> > Thanks Alex,
> > Sam.
> > 
> 
> -- 
> Regards,
> Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-02  6:34                       ` Sam Bobroff
@ 2015-09-02 10:37                         ` Aravinda Prasad
  2015-09-02 23:53                         ` David Gibson
  1 sibling, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2015-09-02 10:37 UTC (permalink / raw)
  To: Sam Bobroff
  Cc: benh, Alexander Graf, qemu-devel, qemu-ppc, paulus, David Gibson



On Wednesday 02 September 2015 12:04 PM, Sam Bobroff wrote:
> On Tue, Sep 01, 2015 at 04:37:51PM +0530, Aravinda Prasad wrote:

[...]

>>>
>>> So, in summary, it sounds like a decent approach would be:
>>> * store the guest's handlers in QEMU's spapr structure,
>>> * simplify the trampolines down to a single, non-returning, hcall,
>>
>> However, other instructions such as saving r3 and re-trying hcall are
>> still required.
> 
> Ah yes, that's true. I was thinking that the retrying could happen inside the
> hcall but it can't.
> 
>>> * implement the new hcall in QEMU, where it has the handler addresses,
>>> * include the (now tiny) trampoline directly in the code,
>>
>> This was my first approach...
>>
>> If we simplify the trampoline, we can directly include the trampoline in
>> the code. With the assumption on the new hcall, trampoline should be
>> reduced by half. But it will still have 7 to 8 instructions.
> 
> Hmm. That's not as small as I'd hoped.
> 
>>> * no new blob or blob code,
>>
>> Yes. no new blob or blob code. However, we still need to put error log
>> in RTAS blob as per PAPR.
>>
>>> * no changes to KVM,
>>> * no problems with kexec.
>>
>> Not sure if this solves kexec. What if the kexec kernel does not call
>> nmi-register? Upon system-reset/machine-check we still jump to the old
>> stale address.
> 
> I thought that the kexec would overwrite the trampoline code with the new
> kernel's default handler, making it safe to use. Is that not the case?

I am not sure. I need to check.

It should be fine if the kexec kernel overwrites the trampoline code and
calls ibm,nmi-register.

> 
>>> Aravinda: what do you think?
>>
>> Also, we need different trampolines for system reset and machine check
>> as retrying is not required for system reset.
> 
> Yes.

On second thought, a common trampoline can work for both; for system
reset the "retrying" path is simply never exercised.

> 
>> Apart from that, does system reset require any error log to be passed on
>> to the guest?
> 
> No it doesn't. It just needs to jump to the handler :-)

ok.

> 
>> This approach sounds good. I will modify the code and post it soon.
> 
> Since we can't get the trampoline code as small as I'd hoped, I wonder if we
> could go the other way and implement all of h_report_mc_err() directly in the
> vector? (The RTAS blob address could be patched in like the handler address.)

We can't do that, at least for machine-check, for two reasons. One is
that we have limited space (0x100 bytes, i.e., 64 instructions) in these
vectors. The second is that we still need a private hcall. Details below.

> 
> We would obviously need to generate this code from the assembler and probably
> store it in a new blob, but we could remove the private hcall. Perhaps this
> would be better overall. Is there some reason this won't work, and we need it
> to be an hcall?

Yes. We still need a private hcall. As per PAPR, QEMU should build the
error log and report it to the guest via the RTAS blob, and guests are
not supposed to access the RTAS blob directly. This requires
h_report_mc_err() to be part of QEMU.

As David mentioned earlier, there is a simpler way that eliminates both
the private hcall and the trampoline code: instead of delivering
machine-check or system-reset to 0x200 or 0x100 in the guest, KVM
performs a special exit to QEMU. Upon such an exit QEMU builds the error
log (for machine-check) and directly invokes the guest handler.
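
To make the direct-delivery idea concrete, here is a rough sketch in C.
It is only a toy model under stated assumptions: toy_vcpu, toy_spapr and
toy_deliver_machine_check() are hypothetical names, not QEMU or KVM
APIs, and passing the error log address in r3 is an assumption made for
illustration. It also keeps the old behaviour (deliver to 0x200) when
the guest has not called ibm,nmi-register.

```c
#include <stdint.h>

#define TOY_MC_VECTOR 0x200u   /* architected machine check vector */

/* Hypothetical, heavily reduced vCPU state: just what the sketch needs. */
struct toy_vcpu {
    uint64_t nip;       /* next instruction pointer */
    uint64_t gpr[32];   /* general purpose registers */
};

/* Hypothetical per-machine state; 0 means "no handler registered". */
struct toy_spapr {
    uint64_t guest_mc_handler;
};

/*
 * Called on the (assumed) special exit from KVM for a machine check.
 * If the guest registered a handler, hand it the error log address and
 * jump straight to it; otherwise fall back to the 0x200 vector.
 */
static void toy_deliver_machine_check(struct toy_vcpu *cpu,
                                      const struct toy_spapr *spapr,
                                      uint64_t errlog_addr)
{
    if (spapr->guest_mc_handler == 0) {
        cpu->nip = TOY_MC_VECTOR;          /* unchanged default behaviour */
        return;
    }
    cpu->gpr[3] = errlog_addr;             /* assumption: log address in r3 */
    cpu->nip = spapr->guest_mc_handler;    /* no trampoline in guest memory */
}
```

Nothing is written into guest memory here, which is the attraction of
this approach; the kexec question raised earlier still applies because
the stored handler address can go stale.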

Regards,
Aravinda

> 
>> Regards,
>> Aravinda
>>
>>>
>>>> Alex
>>>
>>> Thanks Alex,
>>> Sam.
>>>
>>
>> -- 
>> Regards,
>> Aravinda
> 
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-02  6:34                       ` Sam Bobroff
  2015-09-02 10:37                         ` Aravinda Prasad
@ 2015-09-02 23:53                         ` David Gibson
  2015-09-03  3:24                           ` Sam Bobroff
  1 sibling, 1 reply; 66+ messages in thread
From: David Gibson @ 2015-09-02 23:53 UTC (permalink / raw)
  To: Sam Bobroff
  Cc: benh, Alexander Graf, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

[-- Attachment #1: Type: text/plain, Size: 2142 bytes --]

On Wed, Sep 02, 2015 at 04:34:01PM +1000, Sam Bobroff wrote:
> On Tue, Sep 01, 2015 at 04:37:51PM +0530, Aravinda Prasad wrote:
> > 
> > 
> > On Monday 10 August 2015 09:35 AM, Sam Bobroff wrote:
> > > On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
> > >>
> > >>
> > >> On 07.08.15 05:37, Sam Bobroff wrote:
[snip]
> > >>> (c) Assemble it (as above) but include it directly in the QEMU binary by
> > >>> objcopying it in or hexdumping into a C string or something similar. This seems
> > >>> fairly neat but I'm not sure how people would feel about including "binaries"
> > >>> into QEMU this way.  Although it would take some work in the build system, it
> > >>> seems like a fairly neat solution to me.
> > >>
> > >> We tried to move away from code as hex arrays in QEMU to make it easier
> > >> for people to patch things when they want to. But then again if we're
> > >> talking 3 instructions it might not be the worst option.
> > > 
> > > Sounds sensible.
> > > 
> > > So, in summary, it sounds like a decent approach would be:
> > > * store the guest's handlers in QEMU's spapr structure,
> > > * simplify the trampolines down to a single, non-returning, hcall,
> > 
> > However, other instructions such as saving r3 and re-trying hcall are
> > still required.
> 
> Ah yes, that's true. I was thinking that the retrying could happen inside the
> hcall but it can't.

Sorry, I may have missed something here.  What does the code in the
vector need to retry?

Also, it looks like the vector will need at least one scratch register
(for the hcall number, if nothing else).  Does PAPR specify what SPRGs
the vector can clobber?  Obviously it can't be anything the guest
kernel uses.


Btw, does anyone know what happens with the VPA (and dispatch trace
log and so forth) on kexec() - it could be subject to the same stale
address problem, and rewriting vectors won't save us there.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-08-09 13:53                 ` Alexander Graf
  2015-08-10  4:05                   ` Sam Bobroff
@ 2015-09-03  2:02                   ` Paul Mackerras
  2015-09-03 17:49                     ` Aravinda Prasad
  1 sibling, 1 reply; 66+ messages in thread
From: Paul Mackerras @ 2015-09-03  2:02 UTC (permalink / raw)
  To: Alexander Graf
  Cc: benh, qemu-devel, qemu-ppc, aravinda, Sam Bobroff, David Gibson

On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
> 
> 
> On 07.08.15 05:37, Sam Bobroff wrote:
> > The RTAS call being discussed in this thread actually has two vectors to patch
> > (System Reset and Machine Check), and the patches so far only address the
> > Machine Check part. I've been looking at filling in the System Reset part and
> > that will mean basing my code on top of this set.  I would like to keep the
> > same style of solution for both vectors, so I'd like to get the discussion
> > started again :-)
> > 
> > So (1) do we use a trampoline in guest memory, and if so (2) how is the
> > trampoline code handled?
> > 
> > (1) It does seem simpler to me to deliver directly to the handler, but I'm
> > worried about a few things:
> > 
> > If a guest were to call ibm,nmi-register and then kexec to a new kernel that
> > does not call ibm,nmi-register, would the exception cause a jump to a stale
> > address?
> 
> Probably - how does that get handled today with pHyp? Does pHyp just
> override the actual exception vector code and thus the kexec'ed code
> path gets overwritten?
> 
> I don't remember the original patch set fully, but if all we need is to
> override 0x200, why can't we replace the code with
> 
>   mtsprg scratch, r0
>   li r0, HCALL_KVM_MC
>   sc 1
> 
> then there is no complexity in that code at all with dynamically patched
> bits. Or am I missing the obvious?

Well, sc 1 will overwrite SRR0/1, and as far as I can see SRR0/1 have
the only record of where the machine check occurred.  So we can't use
sc 1 unless we first save SRR0/1 somewhere.  We could instead use some
specific illegal instruction, which will cause a hypervisor emulation
assist interrupt using HSRR0/1.

Paul.


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-02 23:53                         ` David Gibson
@ 2015-09-03  3:24                           ` Sam Bobroff
  2015-09-03  5:05                             ` David Gibson
  0 siblings, 1 reply; 66+ messages in thread
From: Sam Bobroff @ 2015-09-03  3:24 UTC (permalink / raw)
  To: David Gibson
  Cc: benh, Alexander Graf, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

On Thu, Sep 03, 2015 at 09:53:20AM +1000, David Gibson wrote:
> On Wed, Sep 02, 2015 at 04:34:01PM +1000, Sam Bobroff wrote:
> > On Tue, Sep 01, 2015 at 04:37:51PM +0530, Aravinda Prasad wrote:
> > > 
> > > 
> > > On Monday 10 August 2015 09:35 AM, Sam Bobroff wrote:
> > > > On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
> > > >>
> > > >>
> > > >> On 07.08.15 05:37, Sam Bobroff wrote:
> [snip]
> > > >>> (c) Assemble it (as above) but include it directly in the QEMU binary by
> > > >>> objcopying it in or hexdumping into a C string or something similar. This seems
> > > >>> fairly neat but I'm not sure how people would feel about including "binaries"
> > > >>> into QEMU this way.  Although it would take some work in the build system, it
> > > >>> seems like a fairly neat solution to me.
> > > >>
> > > >> We tried to move away from code as hex arrays in QEMU to make it easier
> > > >> for people to patch things when they want to. But then again if we're
> > > >> talking 3 instructions it might not be the worst option.
> > > > 
> > > > Sounds sensible.
> > > > 
> > > > So, in summary, it sounds like a decent approach would be:
> > > > * store the guest's handlers in QEMU's spapr structure,
> > > > * simplify the trampolines down to a single, non-returning, hcall,
> > > 
> > > However, other instructions such as saving r3 and re-trying hcall are
> > > still required.
> > 
> > Ah yes, that's true. I was thinking that the retrying could happen inside the
> > hcall but it can't.
> 
> Sorry, I may have missed something here.  What does the code in the
> vector need to retry?

It's due to having to handle simultaneous machine checks and there being a single
shared buffer for reporting the error. PAPR isn't very specific but here is
what it says (from section 7.3.14):

Multiple processors of the same OS image may experience fatal events at, or
about, the same time. The first processor to enter the machine check handling
firmware reports the fatal error. Subsequent processors serialize waiting for
the first processor to issue the ibm,nmi-interlock call. These subsequent
processors report “fatal error previously reported”. If, after the firmware
makes a Machine Check call back, and before the OS issues the ibm,nmi-interlock
call, the same processor that is currently holding the storage containing the
error log structure receives another Machine Check NMI, the firmware has no
choice but to declare the condition fatal, log the result and execute the
partition’s reboot policy.

So it needs to retry setting up the error buffer until it succeeds.
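
As a rough illustration of that interlock (a sketch only, not code from
the patch set), the three outcomes in the PAPR text can be modelled in C
as below. The names toy_interlock, toy_mc_try_claim() and
toy_mc_interlock_release() are hypothetical, and the model is
single-threaded: real code would need locking around the shared state.

```c
#include <stdbool.h>

/* Hypothetical result codes for an attempt to claim the error log buffer. */
enum toy_mc_claim {
    TOY_MC_CLAIMED,   /* this CPU now owns the buffer and reports the error */
    TOY_MC_RETRY,     /* another CPU owns it: wait for its ibm,nmi-interlock */
    TOY_MC_FATAL      /* the owning CPU took a second machine check */
};

/* The single shared error log buffer, reduced to its ownership state. */
struct toy_interlock {
    bool held;        /* true while an error log is outstanding */
    int owner_cpu;    /* valid only while held is true */
};

/* Attempt made on behalf of a CPU that has just taken a machine check. */
static enum toy_mc_claim toy_mc_try_claim(struct toy_interlock *il, int cpu)
{
    if (!il->held) {
        il->held = true;
        il->owner_cpu = cpu;
        return TOY_MC_CLAIMED;
    }
    if (il->owner_cpu == cpu) {
        return TOY_MC_FATAL;   /* same CPU, log not yet released */
    }
    return TOY_MC_RETRY;       /* caller must try again later */
}

/* Models ibm,nmi-interlock: only the owner can release the buffer. */
static void toy_mc_interlock_release(struct toy_interlock *il, int cpu)
{
    if (il->held && il->owner_cpu == cpu) {
        il->held = false;
    }
}
```

The TOY_MC_RETRY case is the one the trampoline has to loop on: the
waiting CPU can only make progress once the owner has issued
ibm,nmi-interlock.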

> Also, it looks like the vector will need at least one scratch register
> (for the hcall number, if nothing else).  Does PAPR specify what SPRGs
> the vector can clobber?  Obviously it can't be anything the guest
> kernel uses.

PAPR only says SPRGs 0 to 3 are for software use, but the kernel (see
arch/powerpc/include/asm/reg.h) defines SPRG2 as an exception scratch register
so it should be the right one to use here.

> Btw, does anyone know what happens with the VPA (and dispatch trace
> log and so forth) on kexec() - it could be subject to the same stale
> address problem, and rewriting vectors won't save us there.

I asked Michael Ellerman this one and he thinks kexec probably frees and
re-allocates the VPA.

Sam.


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-03  3:24                           ` Sam Bobroff
@ 2015-09-03  5:05                             ` David Gibson
  2015-09-03  5:18                               ` Paul Mackerras
  2015-09-03  6:22                               ` Sam Bobroff
  0 siblings, 2 replies; 66+ messages in thread
From: David Gibson @ 2015-09-03  5:05 UTC (permalink / raw)
  To: Sam Bobroff
  Cc: benh, Alexander Graf, qemu-devel, qemu-ppc, Aravinda Prasad, paulus

[-- Attachment #1: Type: text/plain, Size: 4634 bytes --]

On Thu, Sep 03, 2015 at 01:24:21PM +1000, Sam Bobroff wrote:
> On Thu, Sep 03, 2015 at 09:53:20AM +1000, David Gibson wrote:
> > On Wed, Sep 02, 2015 at 04:34:01PM +1000, Sam Bobroff wrote:
> > > On Tue, Sep 01, 2015 at 04:37:51PM +0530, Aravinda Prasad wrote:
> > > > 
> > > > 
> > > > On Monday 10 August 2015 09:35 AM, Sam Bobroff wrote:
> > > > > On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
> > > > >>
> > > > >>
> > > > >> On 07.08.15 05:37, Sam Bobroff wrote:
> > [snip]
> > > > >>> (c) Assemble it (as above) but include it directly in the QEMU binary by
> > > > >>> objcopying it in or hexdumping into a C string or something similar. This seems
> > > > >>> fairly neat but I'm not sure how people would feel about including "binaries"
> > > > >>> into QEMU this way.  Although it would take some work in the build system, it
> > > > >>> seems like a fairly neat solution to me.
> > > > >>
> > > > >> We tried to move away from code as hex arrays in QEMU to make it easier
> > > > >> for people to patch things when they want to. But then again if we're
> > > > >> talking 3 instructions it might not be the worst option.
> > > > > 
> > > > > Sounds sensible.
> > > > > 
> > > > > So, in summary, it sounds like a decent approach would be:
> > > > > * store the guest's handlers in QEMU's spapr structure,
> > > > > * simplify the trampolines down to a single, non-returning, hcall,
> > > > 
> > > > However, other instructions such as saving r3 and re-trying hcall are
> > > > still required.
> > > 
> > > Ah yes, that's true. I was thinking that the retrying could happen inside the
> > > hcall but it can't.
> > 
> > Sorry, I may have missed something here.  What does the code in the
> > vector need to retry?
> 
> It's due to having to handle simultaneous machine checks and there being a single
> shared buffer for reporting the error. PAPR isn't very specific but here is
> what it says (from section 7.3.14):
> 
> Multiple processors of the same OS image may experience fatal events at, or
> about, the same time. The first processor to enter the machine check handling
> firmware reports the fatal error. Subsequent processors serialize waiting for
> the first processor to issue the ibm,nmi-interlock call. These subsequent
> processors report “fatal error previously reported”. If, after the firmware
> makes a Machine Check call back, and before the OS issues the ibm,nmi-interlock
> call, the same processor that is currently holding the storage containing the
> error log structure receives another Machine Check NMI, the firmware has no
> choice but to declare the condition fatal, log the result and execute the
> partition’s reboot policy.
> 
> So it needs to retry setting up the error buffer until it succeeds.

Hm.. so why can't the hypervisor code do the retrying?

> > Also, it looks like the vector will need at least one scratch register
> > (for the hcall number, if nothing else).  Does PAPR specify what SPRGs
> > the vector can clobber?  Obviously it can't be anything the guest
> > kernel uses.
> 
> PAPR only says SPRGs 0 to 3 are for software use, but the kernel (see
> arch/powerpc/include/asm/reg.h) defines SPRG2 as an exception scratch register
> so it should be the right one to use here.

Uh.. no.  If 0..3 are for software (i.e. OS) use, then this needs to
use a different one, since it's being used as a firmware resource
here.  Linux might treat SPRG2 as scratch, but another OS would be
within its rights to use it for something persistent.

Although, as paulus points out, sc 1 will clobber SRR0/1 anyway, and
if we use a special illegal instruction, then you no longer need a
scratch register.

> > Btw, does anyone know what happens with the VPA (and dispatch trace
> > log and so forth) on kexec() - it could be subject to the same stale
> > address problem, and rewriting vectors won't save us there.
> 
> I asked Michael Ellerman this one and he thinks kexec probably frees and
> re-allocates the VPA.

Ok.  So the question is: if an explicit deregister is good enough for
the VPA, is it also good enough for the FWNMI vector, in which case
doing it with just a qemu exit and not bouncing through the guest space
is back on the table.

I guess that's still problematic because there are existing guests
that assume a kexec() will magically wipe the fwnmi vectors away.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-03  5:05                             ` David Gibson
@ 2015-09-03  5:18                               ` Paul Mackerras
  2015-09-03  6:22                               ` Sam Bobroff
  1 sibling, 0 replies; 66+ messages in thread
From: Paul Mackerras @ 2015-09-03  5:18 UTC (permalink / raw)
  To: David Gibson
  Cc: benh, Alexander Graf, qemu-devel, qemu-ppc, Aravinda Prasad, Sam Bobroff

On Thu, Sep 03, 2015 at 03:05:21PM +1000, David Gibson wrote:
> On Thu, Sep 03, 2015 at 01:24:21PM +1000, Sam Bobroff wrote:
> > PAPR only says SPRGs 0 to 3 are for software use, but the kernel (see
> > arch/powerpc/include/asm/reg.h) defines SPRG2 as an exception scratch register
> > so it should be the right one to use here.
> 
> Uh.. no.  If 0..3 are for software (i.e. OS) use, then this needs to
> use a different one, since it's being used as a firmware resource
> here.  Linux might treat SPRG2 as scratch, but another OS would be
> within its rights to use it for something persistent.

PAPR says in requirement R1-14.1.2-3 "To avoid conflict with the
platform’s hypervisor, the OS must be prepared to share use of SPRG2
as the interrupt scratch register whenever an hcall() is made, or a
machine check or reset interrupt is taken."  So, SPRG2 is the one to
use here.

Paul.


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-03  5:05                             ` David Gibson
  2015-09-03  5:18                               ` Paul Mackerras
@ 2015-09-03  6:22                               ` Sam Bobroff
  2015-09-03 18:30                                 ` Aravinda Prasad
  2015-09-04  5:01                                 ` David Gibson
  1 sibling, 2 replies; 66+ messages in thread
From: Sam Bobroff @ 2015-09-03  6:22 UTC (permalink / raw)
  To: David Gibson; +Cc: Aravinda Prasad, benh, qemu-ppc, qemu-devel, paulus

On Thu, Sep 03, 2015 at 03:05:21PM +1000, David Gibson wrote:

[snip]

> Hm.. so why can't the hypervisor code do the retrying?

Aravinda replied to this earlier in the thread:

"Retrying cannot be done internally in h_report_mc_err hcall: only one
thread can succeed entering qemu upon parallel hcall and hence retrying
inside the hcall will not allow the ibm,nmi-interlock from first CPU to
succeed."

I assume that this means that the big QEMU lock is held while an hcall is
processed by QEMU, but I haven't checked the code myself. Actually, even if the
lock is normally held, I don't see why these particular hcalls couldn't release
the lock. I'll look into this.

> > > Also, it looks like the vector will need at least one scratch register
> > > (for the hcall number, if nothing else).  Does PAPR specify what SPRGs
> > > the vector can clobber?  Obviously it can't be anything the guest
> > > kernel uses.
> > 
> > PAPR only says SPRGs 0 to 3 are for software use, but the kernel (see
> > arch/powerpc/include/asm/reg.h) defines SPRG2 as an exception scratch register
> > so it should be the right one to use here.
> 
> Uh.. no.  If 0..3 are for software (i.e. OS) use, then this needs to
> use a different one, since it's being used as a firmware resource
> here.  Linux might treat SPRG2 as scratch, but another OS would be
> within its rights to use it for something persistent.
> 
> Although, as paulus points out, sc 1 will clobber SRR0/1 anyway, and
> if we use a special illegal instruction, then you no longer need a
> scratch register.
> 
> > > Btw, does anyone know what happens with the VPA (and dispatch trace
> > > log and so forth) on kexec() - it could be subject to the same stale
> > > address problem, and rewriting vectors won't save us there.
> > 
> > I asked Michael Ellerman this one and he thinks kexec probably frees and
> > re-allocates the VPA.
> 
> Ok.  So the question is: if an explicit deregister is good enough for
> the VPA, is it also good enough for the FWNMI vector, in which case
> doing it with just a qemu exit and not bouncing through the guest space
> is back on the table.
> 
> I guess that's still problematic because there are existing guests
> that assume a kexec() will magically wipe the fwnmi vectors away.

Yes, but I think we could handle this separately if necessary: even if we don't
need to write anything to the vector, we could still insert a magic value and
check for it later. If it's been clobbered by a kexec, go back to the old
method.
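
Something like the following is what I have in mind, as a sketch only.
The marker value and the function names are made up for illustration,
and guest_ram stands in for however QEMU would actually read and write
guest memory. The word is handled byte by byte in big-endian order so
the sketch does not depend on host endianness.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MC_VECTOR   0x200u        /* machine check vector offset */
#define TOY_FWNMI_MAGIC 0x46574e4du   /* hypothetical marker, ASCII "FWNM" */

/* At ibm,nmi-register time: drop the marker at the start of the vector. */
static void toy_fwnmi_mark_vector(uint8_t *guest_ram)
{
    for (unsigned i = 0; i < 4; i++) {
        guest_ram[TOY_MC_VECTOR + i] =
            (uint8_t)(TOY_FWNMI_MAGIC >> (24 - 8 * i));
    }
}

/*
 * At machine check time: if the marker is gone, something (for example a
 * kexec'ed kernel installing its own vectors) has rewritten 0x200, so the
 * registered handler address should no longer be trusted.
 */
static bool toy_fwnmi_vector_intact(const uint8_t *guest_ram)
{
    uint32_t word = 0;

    for (unsigned i = 0; i < 4; i++) {
        word = (word << 8) | guest_ram[TOY_MC_VECTOR + i];
    }
    return word == TOY_FWNMI_MAGIC;
}
```

If toy_fwnmi_vector_intact() returns false, delivery would fall back to
the old method of entering the guest at 0x200.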

Sam.


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-03  2:02                   ` Paul Mackerras
@ 2015-09-03 17:49                     ` Aravinda Prasad
  0 siblings, 0 replies; 66+ messages in thread
From: Aravinda Prasad @ 2015-09-03 17:49 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: benh, Alexander Graf, qemu-devel, qemu-ppc, Sam Bobroff, David Gibson



On Thursday 03 September 2015 07:32 AM, Paul Mackerras wrote:
> On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
>>
>>
>> On 07.08.15 05:37, Sam Bobroff wrote:
>>> The RTAS call being discussed in this thread actually has two vectors to patch
>>> (System Reset and Machine Check), and the patches so far only address the
>>> Machine Check part. I've been looking at filling in the System Reset part and
>>> that will mean basing my code on top of this set.  I would like to keep the
>>> same style of solution for both vectors, so I'd like to get the discussion
>>> started again :-)
>>>
>>> So (1) do we use a trampoline in guest memory, and if so (2) how is the
>>> trampoline code handled?
>>>
>>> (1) It does seem simpler to me to deliver directly to the handler, but I'm
>>> worried about a few things:
>>>
>>> If a guest were to call ibm,nmi-register and then kexec to a new kernel that
>>> does not call ibm,nmi-register, would the exception cause a jump to a stale
>>> address?
>>
>> Probably - how does that get handled today with pHyp? Does pHyp just
>> override the actual exception vector code and thus the kexec'ed code
>> path gets overwritten?
>>
>> I don't remember the original patch set fully, but if all we need is to
>> override 0x200, why can't we replace the code with
>>
>>   mtsprg scratch, r0
>>   li r0, HCALL_KVM_MC
>>   sc 1
>>
>> then there is no complexity in that code at all with dynamically patched
>> bits. Or am I missing the obvious?
> 
> Well, sc 1 will overwrite SRR0/1, and as far as I can see SRR0/1 have
> the only record of where the machine check occurred.  So we can't use
> sc 1 unless we first save SRR0/1 somewhere.  We could instead use some
> specific illegal instruction, which will cause a hypervisor emulation
> assist interrupt using HSRR0/1.

I now see that I am not saving SRR0/1, which hold the machine check
context (nip, msr), in the 0x200 vector.

Instead I am reading SRR0/1 in the private hcall h_report_mc_err(), which
is wrong:

...
+    CPUPPCState *env = &cpu->env;
...
+    mc_log.srr0 = env->spr[SPR_SRR0];
+    mc_log.srr1 = env->spr[SPR_SRR1];
...

SRR0/1 above contain the values at the time the private hcall is
invoked, not the values at the time the machine check exception occurred.
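
To make the ordering problem concrete, here is a minimal sketch under stated assumptions: struct vcpu_sprs, struct mc_log and every function below are hypothetical stand-ins and not QEMU or KVM code. It only models the point made above, that a snapshot of SRR0/1 has to be taken before sc 1 is issued, because sc 1 overwrites both registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the two SPRs discussed above. */
struct vcpu_sprs {
    uint64_t srr0; /* resume address recorded by the last interrupt */
    uint64_t srr1; /* MSR recorded by the last interrupt */
};

/* Hypothetical log record holding only the two fields from the hunk above. */
struct mc_log {
    uint64_t srr0;
    uint64_t srr1;
};

/* Machine check delivery: SRR0/1 capture the interrupted context. */
void deliver_machine_check(struct vcpu_sprs *s, uint64_t nip, uint64_t msr)
{
    s->srr0 = nip;
    s->srr1 = msr;
}

/* sc 1 at sc_addr: SRR0/1 are overwritten with the address of the
 * following instruction and the MSR in effect at the sc. */
void issue_sc1(struct vcpu_sprs *s, uint64_t sc_addr, uint64_t msr)
{
    s->srr0 = sc_addr + 4;
    s->srr1 = msr;
}

/* Copy whatever SRR0/1 hold right now into a log record. */
struct mc_log snapshot_srr(const struct vcpu_sprs *s)
{
    struct mc_log log = { .srr0 = s->srr0, .srr1 = s->srr1 };
    return log;
}

/* Walk through the sequence: snapshot early, then snapshot in the "hcall". */
void demo_srr_clobber(void)
{
    struct vcpu_sprs s = { 0, 0 };

    deliver_machine_check(&s, 0x1000, 0x9032);
    struct mc_log early = snapshot_srr(&s);   /* taken in the 0x200 vector */

    issue_sc1(&s, 0x208, 0x1000);
    struct mc_log late = snapshot_srr(&s);    /* taken inside the hcall */

    assert(early.srr0 == 0x1000);             /* faulting context kept */
    assert(late.srr0 == 0x20c);               /* only the sc return address */
}
```

The addresses and MSR values in demo_srr_clobber() are arbitrary. Where the early snapshot would live in a real guest (a scratch area, HSRR0/1 via an illegal instruction, or something else) is exactly what the thread is still deciding.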

Regards,
Aravinda

> 
> Paul.
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-03  6:22                               ` Sam Bobroff
@ 2015-09-03 18:30                                 ` Aravinda Prasad
  2015-09-04  5:02                                   ` David Gibson
  2015-09-04  5:01                                 ` David Gibson
  1 sibling, 1 reply; 66+ messages in thread
From: Aravinda Prasad @ 2015-09-03 18:30 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: paulus, benh, qemu-ppc, qemu-devel, David Gibson



On Thursday 03 September 2015 11:52 AM, Sam Bobroff wrote:
> On Thu, Sep 03, 2015 at 03:05:21PM +1000, David Gibson wrote:
> 
> [snip]
> 
>> Hm.. so why can't the hypervisor code do the retrying?
> 
> Aravinda replied to this earlier in the thread:
> 
> "Retrying cannot be done internally in h_report_mc_err hcall: only one
> thread can succeed entering qemu upon parallel hcall and hence retrying
> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
> succeed."
> 
> I assume that this means that the big QEMU lock is held while an hcall is
> processed by QEMU, but I haven't checked the code myself. Actually, even if the
> lock is normally held, I don't see why these particular hcalls couldn't release
> the lock. I'll look into this.

I am not sure whether we can release this lock inside an hcall. I need
to check.

> 
>>>> Also, it looks like the vector will need at least one scratch register
>>>> (for the hcall number, if nothing else).  Does PAPR specify what SPRGs
>>>> the vector can clobber?  Obviously it can't be anything the guest
>>>> kernel uses.
>>>
>>> PAPR only says SPRGs 0 to 3 are for software use, but the kernel (see
>>> arch/powerpc/include/asm/reg.h) defines SPRG2 as an exception scratch register
>>> so it should be the right one to use here.
>>
>> Uh.. no.  If 0..3 are for software (i.e. OS) use, then this needs to
>> use a different one, since it's being used as a firmware resource
>> here.  Linux might treat SPRG2 as scratch, but another OS would be
>> within its rights to use it for something persistent.
>>
>> Although, as paulus points out, sc 1 will clobber SRR0/1 anyway, and
>> if we use a special illegal instruction, then you no longer need a
>> scratch register.
>>
>>>> Btw, does anyone know what happens with the VPA (and dispatch trace
>>>> log and so forth) on kexec() - it could be subject to the same stale
>>>> address problem, and rewriting vectors won't save us there.
>>>
>>> I asked Michael Ellerman this one and he thinks kexec probably frees and
>>> re-allocates the VPA.
>>
>> Ok.  So the question is: if an explicit deregister is good enough for
>> the VPA, is it also good enough for the FWNMI vector, in which case
>> doing it with just a qemu exit and not bouncing through the guest space
>> is back on the table.
>>
>> I guess that's still problematic because there are existing guests
>> that assume a kexec() will magically wipe the fwnmi vectors away.
> 
> Yes, but I think we could handle this separately if necessary: even if we don't
> need to write anything to the vector, we could still insert a magic value and
> check for it later. If it's been clobbered by a kexec, go back to the old
> method.

"> check for it later" - But is QEMU informed, or does it otherwise get
to know, when kexec() is issued?

Regards,
Aravinda

> 
> Sam.
> 

-- 
Regards,
Aravinda


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-03  6:22                               ` Sam Bobroff
  2015-09-03 18:30                                 ` Aravinda Prasad
@ 2015-09-04  5:01                                 ` David Gibson
  1 sibling, 0 replies; 66+ messages in thread
From: David Gibson @ 2015-09-04  5:01 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: Aravinda Prasad, benh, qemu-ppc, qemu-devel, paulus

On Thu, Sep 03, 2015 at 04:22:22PM +1000, Sam Bobroff wrote:
> On Thu, Sep 03, 2015 at 03:05:21PM +1000, David Gibson wrote:
> 
> [snip]
> 
> > Hm.. so why can't the hypervisor code do the retrying?
> 
> Aravinda replied to this earlier in the thread:
> 
> "Retrying cannot be done internally in h_report_mc_err hcall: only one
> thread can succeed entering qemu upon parallel hcall and hence retrying
> inside the hcall will not allow the ibm,nmi-interlock from first CPU to
> succeed."
> 
> I assume that this means that the big QEMU lock is held while an hcall is
> processed by QEMU, but I haven't checked the code myself. Actually, even if the
> lock is normally held, I don't see why these particular hcalls couldn't release
> the lock. I'll look into this.

Yes, you should be able to release the BQL in the hcall in order to do
retries internally.  Thomas Huth's draft H_RANDOM implementation does
something similar, since it can block.

> > > > Also, it looks like the vector will need at least one scratch register
> > > > (for the hcall number, if nothing else).  Does PAPR specify what SPRGs
> > > > the vector can clobber?  Obviously it can't be anything the guest
> > > > kernel uses.
> > > 
> > > PAPR only says SPRGs 0 to 3 are for software use, but the kernel (see
> > > arch/powerpc/include/asm/reg.h) defines SPRG2 as an exception scratch register
> > > so it should be the right one to use here.
> > 
> > Uh.. no.  If 0..3 are for software (i.e. OS) use, then this needs to
> > use a different one, since it's being used as a firmware resource
> > here.  Linux might treat SPRG2 as scratch, but another OS would be
> > within its rights to use it for something persistent.
> > 
> > Although, as paulus points out, sc 1 will clobber SRR0/1 anyway, and
> > if we use a special illegal instruction, then you no longer need a
> > scratch register.
> > 
> > > > Btw, does anyone know what happens with the VPA (and dispatch trace
> > > > log and so forth) on kexec() - it could be subject to the same stale
> > > > address problem, and rewriting vectors won't save us there.
> > > 
> > > I asked Michael Ellerman this one and he thinks kexec probably frees and
> > > re-allocates the VPA.
> > 
> > Ok.  So the question is: if an explicit deregister is good enough for
> > the VPA, is it also good enough for the FWNMI vector, in which case
> > doing it with just a qemu exit and not bouncing through the guest space
> > is back on the table.
> > 
> > I guess that's still problematic because there are existing guests
> > that assume a kexec() will magically wipe the fwnmi vectors away.
> 
> Yes, but I think we could handle this separately if necessary: even if we don't
> need to write anything to the vector, we could still insert a magic value and
> check for it later. If it's been clobbered by a kexec, go back to the old
> method.

True.  Of course if you're going to do that, it makes sense to make
the value a distinguishable illegal instruction anyway.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
  2015-09-03 18:30                                 ` Aravinda Prasad
@ 2015-09-04  5:02                                   ` David Gibson
  0 siblings, 0 replies; 66+ messages in thread
From: David Gibson @ 2015-09-04  5:02 UTC (permalink / raw)
  To: Aravinda Prasad; +Cc: paulus, benh, qemu-ppc, qemu-devel, Sam Bobroff

On Fri, Sep 04, 2015 at 12:00:25AM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 03 September 2015 11:52 AM, Sam Bobroff wrote:
> > On Thu, Sep 03, 2015 at 03:05:21PM +1000, David Gibson wrote:
> > 
> > [snip]
> > 
> >> Hm.. so why can't the hypervisor code do the retrying?
> > 
> > Aravinda replied to this earlier in the thread:
> > 
> > "Retrying cannot be done internally in h_report_mc_err hcall: only one
> > thread can succeed entering qemu upon parallel hcall and hence retrying
> > inside the hcall will not allow the ibm,nmi-interlock from first CPU to
> > succeed."
> > 
> > I assume that this means that the big QEMU lock is held while an hcall is
> > processed by QEMU, but I haven't checked the code myself. Actually, even if the
> > lock is normally held, I don't see why these particular hcalls couldn't release
> > the lock. I'll look into this.
> 
> I am not sure whether we can release this lock inside an hcall. I need
> to check.

I don't see any reason that won't work, as long as you touch most qemu
data structures only while the lock is held, of course.

> 
> > 
> >>>> Also, it looks like the vector will need at least one scratch register
> >>>> (for the hcall number, if nothing else).  Does PAPR specify what SPRGs
> >>>> the vector can clobber?  Obviously it can't be anything the guest
> >>>> kernel uses.
> >>>
> >>> PAPR only says SPRGs 0 to 3 are for software use, but the kernel (see
> >>> arch/powerpc/include/asm/reg.h) defines SPRG2 as an exception scratch register
> >>> so it should be the right one to use here.
> >>
> >> Uh.. no.  If 0..3 are for software (i.e. OS) use, then this needs to
> >> use a different one, since it's being used as a firmware resource
> >> here.  Linux might treat SPRG2 as scratch, but another OS would be
> >> within its rights to use it for something persistent.
> >>
> >> Although, as paulus points out, sc 1 will clobber SRR0/1 anyway, and
> >> if we use a special illegal instruction, then you no longer need a
> >> scratch register.
> >>
> >>>> Btw, does anyone know what happens with the VPA (and dispatch trace
> >>>> log and so forth) on kexec() - it could be subject to the same stale
> >>>> address problem, and rewriting vectors won't save us there.
> >>>
> >>> I asked Michael Ellerman this one and he thinks kexec probably frees and
> >>> re-allocates the VPA.
> >>
> >> Ok.  So the question is: if an explicit deregister is good enough for
> >> the VPA, is it also good enough for the FWNMI vector, in which case
> >> doing it with just a qemu exit and not bouncing through the guest space
> >> is back on the table.
> >>
> >> I guess that's still problematic because there are existing guests
> >> that assume a kexec() will magically wipe the fwnmi vectors away.
> > 
> > Yes, but I think we could handle this separately if necessary: even if we don't
> > need to write anything to the vector, we could still insert a magic value and
> > check for it later. If it's been clobbered by a kexec, go back to the old
> > method.
> 
> "> check for it later" - But is QEMU informed, or does it otherwise get
> to know, when kexec() is issued?

No, but I think Sam is suggesting just rechecking the value when you
catch an MC exception.
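
A rough sketch of that recheck, under loud assumptions: FWNMI_MAGIC_INSN is an arbitrary placeholder value (not taken from this thread or from any spec), guest memory is modelled as a flat byte array, and none of the names below exist in QEMU. The idea is only that the decision is made at MC time by rereading the vector, so no kexec() notification is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MC_VECTOR_OFFSET 0x200u      /* machine check vector */
#define FWNMI_MAGIC_INSN 0x00DEAD00u /* hypothetical placeholder value */

enum mc_path { MC_PATH_FWNMI, MC_PATH_LEGACY };

/* Store the magic word big-endian at the 0x200 vector (done at
 * ibm,nmi-register time in this sketch). */
void fwnmi_install_magic(uint8_t *guest_mem)
{
    assert(guest_mem != NULL);
    uint8_t *p = guest_mem + MC_VECTOR_OFFSET;
    p[0] = (uint8_t)(FWNMI_MAGIC_INSN >> 24);
    p[1] = (uint8_t)(FWNMI_MAGIC_INSN >> 16);
    p[2] = (uint8_t)(FWNMI_MAGIC_INSN >> 8);
    p[3] = (uint8_t)FWNMI_MAGIC_INSN;
}

/* True if the vector still holds the magic word, i.e. nothing such as a
 * kexec'ed kernel has rewritten it since registration. */
bool fwnmi_vector_intact(const uint8_t *guest_mem)
{
    const uint8_t *p = guest_mem + MC_VECTOR_OFFSET;
    uint32_t insn = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    return insn == FWNMI_MAGIC_INSN;
}

/* Called when an MC exception is caught: use the registered handler only
 * if the guest registered one and the magic word is still in place;
 * otherwise go back to the old method. */
enum mc_path choose_mc_path(const uint8_t *guest_mem, bool registered)
{
    if (registered && fwnmi_vector_intact(guest_mem)) {
        return MC_PATH_FWNMI;
    }
    return MC_PATH_LEGACY;
}
```

If the magic word is also chosen to be a distinguishable illegal instruction, as suggested in the other reply, the same word could serve both as the trap into the hypervisor and as the marker checked here.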

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 66+ messages
2014-11-05  7:12 [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests Aravinda Prasad
2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 1/4] target-ppc: Extend rtas-blob Aravinda Prasad
2014-11-05  8:11   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
2014-11-05  8:46     ` Aravinda Prasad
2014-11-05  9:00       ` Alexander Graf
2014-11-05  9:07         ` Alexander Graf
2014-11-05 10:41           ` Aravinda Prasad
2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 2/4] target-ppc: Register and handle HCALL to receive updated RTAS region Aravinda Prasad
2014-11-05  7:12 ` [Qemu-devel] [PATCH v3 3/4] target-ppc: Build error log Aravinda Prasad
2014-11-05  7:13 ` [Qemu-devel] [PATCH v3 4/4] target-ppc: Handle ibm,nmi-register RTAS call Aravinda Prasad
2014-11-05  8:32   ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
2014-11-05 10:37     ` Aravinda Prasad
2014-11-05 11:07       ` Alexander Graf
2014-11-05 11:24         ` Aravinda Prasad
2014-11-05 11:27           ` Alexander Graf
2014-11-05 15:46     ` Tom Musta
2014-11-06 10:00       ` Aravinda Prasad
2014-11-06 10:29         ` Alexander Graf
2014-11-06 10:36           ` Aravinda Prasad
2014-11-11  3:19         ` David Gibson
2014-11-11  5:48           ` Aravinda Prasad
2014-11-11  6:11             ` David Gibson
2014-11-11  6:51               ` Aravinda Prasad
2014-11-11 11:30                 ` David Gibson
2014-11-11  3:16   ` [Qemu-devel] " David Gibson
2014-11-11  6:44     ` Aravinda Prasad
2014-11-13  3:52       ` David Gibson
2014-11-13  5:58         ` Aravinda Prasad
2014-11-13 10:32           ` David Gibson
2014-11-13 11:48             ` Aravinda Prasad
2014-11-13 12:44               ` David Gibson
2014-11-13 14:36                 ` Aravinda Prasad
2014-11-14  0:42                   ` David Gibson
2014-11-14  8:24                     ` Aravinda Prasad
2014-11-11  3:24 ` [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests David Gibson
2014-11-11  7:15   ` Aravinda Prasad
2014-11-13  3:57     ` David Gibson
2014-11-13  6:10       ` Aravinda Prasad
2014-11-19  5:48   ` Aravinda Prasad
2014-11-19 10:32     ` Alexander Graf
2014-11-19 11:44       ` David Gibson
2014-11-19 12:22         ` Alexander Graf
2014-11-19 12:42           ` [Qemu-devel] [Qemu-ppc] " Alexander Graf
2014-11-19 12:57           ` [Qemu-devel] " David Gibson
2015-04-02  4:28     ` [Qemu-devel] [Qemu-ppc] " Alexey Kardashevskiy
2015-04-02  4:46       ` David Gibson
2015-07-02  9:11         ` Alexey Kardashevskiy
2015-07-03  6:01           ` David Gibson
2015-07-08  8:28             ` Aravinda Prasad
2015-08-07  3:37               ` Sam Bobroff
2015-08-09 13:53                 ` Alexander Graf
2015-08-10  4:05                   ` Sam Bobroff
2015-09-01 11:07                     ` Aravinda Prasad
2015-09-02  6:34                       ` Sam Bobroff
2015-09-02 10:37                         ` Aravinda Prasad
2015-09-02 23:53                         ` David Gibson
2015-09-03  3:24                           ` Sam Bobroff
2015-09-03  5:05                             ` David Gibson
2015-09-03  5:18                               ` Paul Mackerras
2015-09-03  6:22                               ` Sam Bobroff
2015-09-03 18:30                                 ` Aravinda Prasad
2015-09-04  5:02                                   ` David Gibson
2015-09-04  5:01                                 ` David Gibson
2015-09-03  2:02                   ` Paul Mackerras
2015-09-03 17:49                     ` Aravinda Prasad
2015-09-01  6:21                 ` Aravinda Prasad
