All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH qemu v2] spapr: Kill SLOF
@ 2020-01-05 23:42 Alexey Kardashevskiy
  2020-01-06  4:19 ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-05 23:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson

The Petitboot bootloader is way more advanced than SLOF is ever going to
be as Petitboot comes with the full-featured Linux kernel with all
the drivers, and initramdisk with quite user friendly interface.
The problem with ditching SLOF is that an unmodified pseries kernel can
either start via:
1. kexec, this requires presence of RTAS and skips
ibm,client-architecture-support entirely;
2. normal boot, this heavily relies on the OF1275 client interface to
fetch the device tree and do early setup (claim memory).

This adds a new bios-less mode to the pseries machine: "bios=on|off".
When enabled, QEMU does not load SLOF and jumps to the kernel from
"-kernel".

The client interface is implemented exactly as RTAS - a 20 bytes blob,
right next after the RTAS blob. The entry point is passed to the kernel
via GPR5.

This implements a handful of client interface methods just to get going.
In particular, this implements the device tree fetching,
ibm,client-architecture-support and instantiate-rtas.

This implements changing FDT properties only for "linux,rtas-base" and
"linux,rtas-entry", just to get early boot going.

This assigns "phandles" to device tree nodes as there is no more SLOF
and OF nodes addresses of which served as phandle values.
This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
phandles are regenerated at every FDT rebuild.

When bios=off, this adds "/chosen" every time QEMU builds a tree.

This implements "claim" which the client (Linux) uses for memory
allocation; this is also  used by QEMU for claiming kernel/initrd images,
client interface entry point, RTAS and the initial stack.

While at this, add a "kernel-addr" machine parameter to allow moving
the kernel in memory. This is useful for debugging if the kernel is
loaded at @0, although not necessary.

This does not implement instances properly but in order to boot a VM,
we only really need a stdout instance and the "/" instance for
"call-method", we fake these.

The test command line:

qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline \
-nographic \
-vga none \
-kernel vmldbg \
-machine pseries,bios=off \
-m 4G \
-enable-kvm \
-initrd pb/rootfs.cpio.xz \
img/u1804-64le.qcow2 \
-snapshot \
-smp 8,threads=8 \
-L /home/aik/t/qemu-ppc64-bios/ \
-trace events=qemu_trace_events \
-d guest_errors \
-chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh37874 \
-mon chardev=SOCKET0,mode=control

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* fixed claim()
* added "setprop"
* cleaner client interface and RTAS blobs management
* boots to petitboot and further to the target system
* more trace points
---
 hw/ppc/Makefile.objs   |   1 +
 include/hw/loader.h    |   1 +
 include/hw/ppc/spapr.h |  25 ++-
 hw/ppc/spapr.c         | 231 ++++++++++++++++++--
 hw/ppc/spapr_client.c  | 464 +++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_hcall.c   |  49 +++--
 hw/ppc/trace-events    |   9 +
 7 files changed, 743 insertions(+), 37 deletions(-)
 create mode 100644 hw/ppc/spapr_client.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 101e9fc59185..ce31c0acd2fb 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
 obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
 obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
 obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
+obj-$(CONFIG_PSERIES) += spapr_client.o
 obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
 # IBM PowerNV
 obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
diff --git a/include/hw/loader.h b/include/hw/loader.h
index 48a96cd55904..a2f58077a47e 100644
--- a/include/hw/loader.h
+++ b/include/hw/loader.h
@@ -262,6 +262,7 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
 int rom_add_elf_program(const char *name, GMappedFile *mapped_file, void *data,
                         size_t datasize, size_t romsize, hwaddr addr,
                         AddressSpace *as);
+bool rom_intersect(uint64_t start, uint64_t size);
 int rom_check_and_register_reset(void);
 void rom_set_fw(FWCfgState *f);
 void rom_set_order_override(int order);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 61f005c6f686..876879d12029 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -105,6 +105,11 @@ struct SpaprCapabilities {
     uint8_t caps[SPAPR_CAP_NUM];
 };
 
+typedef struct {
+    uint64_t start;
+    uint64_t size;
+} spapr_of_claimed;
+
 /**
  * SpaprMachineClass:
  */
@@ -158,8 +163,13 @@ struct SpaprMachineState {
     uint32_t fdt_size;
     uint32_t fdt_initial_size;
     void *fdt_blob;
+    uint32_t rtas_base;
     long kernel_size;
     bool kernel_le;
+    uint64_t kernel_addr;
+    bool bios_enabled;
+    GArray *claimed; /* array of spapr_of_claimed */
+    uint64_t claimed_base;
     uint32_t initrd_base;
     long initrd_size;
     uint64_t rtc_offset; /* Now used only during incoming migration */
@@ -510,7 +520,8 @@ struct SpaprMachineState {
 /* Client Architecture support */
 #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
 #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
-#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
+#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
+#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
 
 /*
  * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
@@ -538,6 +549,12 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
 target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
                              target_ulong *args);
 
+target_ulong do_client_architecture_support(PowerPCCPU *cpu,
+                                            SpaprMachineState *spapr,
+                                            target_ulong addr,
+                                            target_ulong fdt_buf,
+                                            target_ulong fdt_bufsize);
+
 /* Virtual Processor Area structure constants */
 #define VPA_MIN_SIZE           640
 #define VPA_SIZE_OFFSET        0x4
@@ -769,6 +786,9 @@ struct SpaprEventLogEntry {
 void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
 void spapr_events_init(SpaprMachineState *sm);
 void spapr_dt_events(SpaprMachineState *sm, void *fdt);
+uint64_t spapr_do_client_claim(SpaprMachineState *sm, uint64_t virt,
+                               uint64_t size, uint64_t align);
+int spapr_h_client(SpaprMachineState *sm, target_ulong client_args);
 void close_htab_fd(SpaprMachineState *spapr);
 void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
 void spapr_free_hpt(SpaprMachineState *spapr);
@@ -891,4 +911,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
 #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
 
 void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
+
+void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
+
 #endif /* HW_SPAPR_H */
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e62c89b3dd40..1c97534a0aea 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -896,6 +896,51 @@ out:
     return ret;
 }
 
+/*
+ * Below is a compiled version of RTAS blob and OF client interface entry point.
+ *
+ * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
+ * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
+ *
+ *   .globl  _start
+ *   _start:
+ *           mr      4,3
+ *           lis     3,KVMPPC_H_RTAS@h
+ *           ori     3,3,KVMPPC_H_RTAS@l
+ *           sc      1
+ *           blr
+ *           mr      4,3
+ *           lis     3,KVMPPC_H_CLIENT@h
+ *           ori     3,3,KVMPPC_H_CLIENT@l
+ *           sc      1
+ *           blr
+ */
+static struct {
+    uint8_t rtas[20], client[20];
+} QEMU_PACKED rtas_client_blob = {
+    .rtas = {
+        0x7c, 0x64, 0x1b, 0x78,
+        0x3c, 0x60, 0x00, 0x00,
+        0x60, 0x63, 0xf0, 0x00,
+        0x44, 0x00, 0x00, 0x22,
+        0x4e, 0x80, 0x00, 0x20
+    },
+    .client = {
+        0x7c, 0x64, 0x1b, 0x78,
+        0x3c, 0x60, 0x00, 0x00,
+        0x60, 0x63, 0xf0, 0x05,
+        0x44, 0x00, 0x00, 0x22,
+        0x4e, 0x80, 0x00, 0x20
+    }
+};
+
+void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
+{
+    spapr->rtas_base = base;
+    cpu_physical_memory_write(base, rtas_client_blob.rtas,
+                              sizeof(rtas_client_blob.rtas));
+}
+
 static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
 {
     MachineState *ms = MACHINE(spapr);
@@ -980,6 +1025,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
     _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
                      lrdr_capacity, sizeof(lrdr_capacity)));
 
+    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-base", spapr->rtas_base));
+    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-entry", spapr->rtas_base));
+    _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
+                          sizeof(rtas_client_blob.rtas)));
+    _FDT(fdt_add_mem_rsv(fdt, spapr->rtas_base,
+                         sizeof(rtas_client_blob.rtas)));
+
     spapr_dt_rtas_tokens(fdt, rtas);
 }
 
@@ -1057,7 +1109,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
     }
 
     if (spapr->kernel_size) {
-        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
+        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
                               cpu_to_be64(spapr->kernel_size) };
 
         _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
@@ -1104,6 +1156,9 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
         _FDT(fdt_setprop_cell(fdt, chosen, "linux,pci-probe-only", 0));
     }
 
+    if (!spapr->bios_enabled) {
+        _FDT(fdt_setprop_cell(fdt, chosen, "cpu", 0));
+    }
     spapr_dt_ov5_platform_support(spapr, fdt, chosen);
 
     g_free(stdout_path);
@@ -1133,6 +1188,28 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
     }
 }
 
+static void spapr_fdt_add_phandles(void *fdt, int parent, uint32_t *phandle)
+{
+    int proplen = 0, node;
+    const void *prop;
+    char tmp[256] = "BAAD";
+
+    fdt_get_path(fdt, parent, tmp, sizeof(tmp));
+    prop = fdt_getprop_namelen(fdt, parent, "phandle", 7, &proplen);
+    if (!prop) {
+        _FDT(fdt_setprop_cell(fdt, parent, "phandle", *phandle));
+        while (1) {
+            ++*phandle;
+            if (fdt_node_offset_by_phandle(fdt, *phandle) < 0) {
+                break;
+            }
+        }
+    }
+    fdt_for_each_subnode(node, fdt, parent) {
+        spapr_fdt_add_phandles(fdt, node, phandle);
+    }
+}
+
 void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
 {
     MachineState *machine = MACHINE(spapr);
@@ -1245,7 +1322,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
     /* Build memory reserve map */
     if (reset) {
         if (spapr->kernel_size) {
-            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
+            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
+                                  spapr->kernel_size)));
         }
         if (spapr->initrd_size) {
             _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
@@ -1268,12 +1346,19 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
         }
     }
 
+    if (!spapr->bios_enabled) {
+        uint32_t phandle = 1;
+
+        spapr_fdt_add_phandles(fdt, fdt_path_offset(fdt, "/"), &phandle);
+    }
+
     return fdt;
 }
 
 static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
 {
-    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
+    SpaprMachineState *spapr = opaque;
+    return (addr & 0x0fffffff) + spapr->kernel_addr;
 }
 
 static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
@@ -1660,6 +1745,60 @@ static void spapr_machine_reset(MachineState *machine)
      */
     fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
 
+    /* Set up the entry state */
+    if (!spapr->bios_enabled) {
+        if (spapr->claimed) {
+            g_array_unref(spapr->claimed);
+        }
+        spapr->claimed = g_array_new(false, false, sizeof(spapr_of_claimed));
+        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
+
+        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr, fdt_addr);
+
+        if (spapr_do_client_claim(spapr, spapr->kernel_addr,
+                                  spapr->kernel_size, 0) == -1) {
+            error_report("Memory for kernel is in use");
+            exit(1);
+        }
+        if (spapr_do_client_claim(spapr, spapr->initrd_base,
+                                  spapr->initrd_size, 0) == -1) {
+            error_report("Memory for initramdisk is in use");
+            exit(1);
+        }
+        first_ppc_cpu->env.gpr[1] = spapr_do_client_claim(spapr, 0, 0x40000,
+                                                          0x10000);
+        if (first_ppc_cpu->env.gpr[1] == -1) {
+            error_report("Memory for stack is in use");
+            exit(1);
+        }
+
+        first_ppc_cpu->env.gpr[5] =
+            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
+                                  sizeof(rtas_client_blob.client));
+        if (first_ppc_cpu->env.gpr[5] == -1) {
+            error_report("Memory for OF client is in use");
+            exit(1);
+        }
+        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
+                                  rtas_client_blob.client,
+                                  sizeof(rtas_client_blob.client));
+
+        /* Allocate space for RTAS here so spapr_build_fdt() below picks it */
+        spapr->rtas_base =
+            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.rtas),
+                                  0x100);
+        if (spapr->rtas_base == -1) {
+            error_report("Memory for RTAS is in use");
+            exit(1);
+        }
+        cpu_physical_memory_write(spapr->rtas_base,
+                                  rtas_client_blob.rtas,
+                                  sizeof(rtas_client_blob.rtas));
+    } else {
+        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
+        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
+    }
+
     fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
 
     rc = fdt_pack(fdt);
@@ -1675,10 +1814,14 @@ static void spapr_machine_reset(MachineState *machine)
     spapr->fdt_initial_size = spapr->fdt_size;
     spapr->fdt_blob = fdt;
 
-    /* Set up the entry state */
-    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
-    first_ppc_cpu->env.gpr[5] = 0;
-
+    if (!spapr->bios_enabled) {
+        /*
+         * Claim FDT space so initramdisk/zImage are not unpacked over it.
+         * In practice Linux claim additional memory for FDT and fetches it
+         * using client interface's "getprop" call.
+         */
+        spapr_do_client_claim(spapr, fdt_addr, fdt_totalsize(fdt), 0);
+    }
     spapr->cas_reboot = false;
 }
 
@@ -2897,12 +3040,12 @@ static void spapr_machine_init(MachineState *machine)
         uint64_t lowaddr = 0;
 
         spapr->kernel_size = load_elf(kernel_filename, NULL,
-                                      translate_kernel_address, NULL,
+                                      translate_kernel_address, spapr,
                                       NULL, &lowaddr, NULL, 1,
                                       PPC_ELF_MACHINE, 0, 0);
         if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
             spapr->kernel_size = load_elf(kernel_filename, NULL,
-                                          translate_kernel_address, NULL, NULL,
+                                          translate_kernel_address, spapr, NULL,
                                           &lowaddr, NULL, 0, PPC_ELF_MACHINE,
                                           0, 0);
             spapr->kernel_le = spapr->kernel_size > 0;
@@ -2918,7 +3061,7 @@ static void spapr_machine_init(MachineState *machine)
             /* Try to locate the initrd in the gap between the kernel
              * and the firmware. Add a bit of space just in case
              */
-            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
+            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
                                   + 0x1ffff) & ~0xffff;
             spapr->initrd_size = load_image_targphys(initrd_filename,
                                                      spapr->initrd_base,
@@ -2932,20 +3075,22 @@ static void spapr_machine_init(MachineState *machine)
         }
     }
 
-    if (bios_name == NULL) {
-        bios_name = FW_FILE_NAME;
+    if (spapr->bios_enabled) {
+        if (bios_name == NULL) {
+            bios_name = FW_FILE_NAME;
+        }
+        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
+        if (!filename) {
+            error_report("Could not find LPAR firmware '%s'", bios_name);
+            exit(1);
+        }
+        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
+        if (fw_size <= 0) {
+            error_report("Could not load LPAR firmware '%s'", filename);
+            exit(1);
+        }
+        g_free(filename);
     }
-    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
-    if (!filename) {
-        error_report("Could not find LPAR firmware '%s'", bios_name);
-        exit(1);
-    }
-    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
-    if (fw_size <= 0) {
-        error_report("Could not load LPAR firmware '%s'", filename);
-        exit(1);
-    }
-    g_free(filename);
 
     /* FIXME: Should register things through the MachineState's qdev
      * interface, this is a legacy from the sPAPREnvironment structure
@@ -3162,6 +3307,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
     visit_type_uint32(v, name, (uint32_t *)opaque, errp);
 }
 
+static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
+}
+
+static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
+}
+
+static bool spapr_get_bios_enabled(Object *obj, Error **errp)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
+
+    return spapr->bios_enabled;
+}
+
+static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
+
+    spapr->bios_enabled = value;
+}
+
 static char *spapr_get_ic_mode(Object *obj, Error **errp)
 {
     SpaprMachineState *spapr = SPAPR_MACHINE(obj);
@@ -3267,6 +3438,20 @@ static void spapr_instance_init(Object *obj)
     object_property_add_bool(obj, "vfio-no-msix-emulation",
                              spapr_get_msix_emulation, NULL, NULL);
 
+    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
+                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
+                        &error_abort);
+    object_property_set_description(obj, "kernel-addr",
+                                    stringify(KERNEL_LOAD_ADDR)
+                                    " for -kernel is the default",
+                                    NULL);
+    spapr->kernel_addr = KERNEL_LOAD_ADDR;
+    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
+                            spapr_set_bios_enabled, NULL);
+    object_property_set_description(obj, "bios", "Conrols whether to load bios",
+                                    NULL);
+    spapr->bios_enabled = true;
+
     /* The machine class defines the default interrupt controller mode */
     spapr->irq = smc->irq;
     object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
diff --git a/hw/ppc/spapr_client.c b/hw/ppc/spapr_client.c
new file mode 100644
index 000000000000..a318eaae3c40
--- /dev/null
+++ b/hw/ppc/spapr_client.c
@@ -0,0 +1,464 @@
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qapi/error.h"
+#include "exec/memory.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_vio.h"
+#include "trace.h"
+
+struct prom_args {
+        __be32 service;
+        __be32 nargs;
+        __be32 nret;
+        __be32 args[10];
+};
+
+#define CLI_PH_MASK     0x0FFFFFFF
+#define CLI_INST_PREFIX 0x20000000
+
+#define readstr(pa, buf) cpu_physical_memory_read((pa), (buf), sizeof(buf))
+#define readuint32(pa) ({ \
+    uint32_t __tmp__; \
+    cpu_physical_memory_read((pa), &__tmp__, sizeof(__tmp__)); \
+    be32_to_cpu(__tmp__); })
+
+static bool _cmpservice(const char *s, size_t len,
+                        unsigned nargs, unsigned nret,
+                        const char *s1, size_t len1,
+                        unsigned nargscheck, unsigned nretcheck)
+{
+    if (strncmp(s, s1, MAX(len, len1))) {
+        return false;
+    }
+    if (nargscheck == 0 && nretcheck == 0) {
+        return true;
+    }
+    if (nargs != nargscheck || nret != nretcheck) {
+        trace_spapr_client_error_param(s, nargscheck, nretcheck, nargs, nret);
+        return false;
+    }
+
+    return true;
+}
+
+static uint32_t client_finddevice(const void *fdt, uint32_t nodeaddr)
+{
+    char node[256];
+    int ret;
+
+    readstr(nodeaddr, node);
+    ret = fdt_path_offset(fdt, node);
+    if (ret >= 0) {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return (uint32_t) ret;
+}
+
+static uint32_t client_getprop(const void *fdt, uint32_t nodeph, uint32_t pname,
+            uint32_t valaddr, uint32_t vallen)
+{
+    char propname[64];
+    uint32_t ret = 0;
+    int proplen = 0;
+    const void *prop;
+
+    readstr(pname, propname);
+
+    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
+                               propname, strlen(propname), &proplen);
+    if (prop) {
+        int cb = MIN(proplen, vallen);
+
+        cpu_physical_memory_write(valaddr, prop, cb);
+        ret = cb;
+    } else if (strncmp(propname, "stdout", 6) == 0 && vallen == sizeof(ret)) {
+        ret = cpu_to_be32(1);
+        cpu_physical_memory_write(valaddr, &ret, MIN(vallen, sizeof(ret)));
+    } else {
+        ret = -1;
+    }
+    trace_spapr_client_getprop(nodeph, propname, ret);
+
+    return ret;
+}
+
+static uint32_t client_setprop(SpaprMachineState *sm,
+                               uint32_t nodeph, uint32_t pname,
+                               uint32_t valaddr, uint32_t vallen)
+{
+    char propname[64];
+    uint32_t ret = -1;
+    int proplen = 0;
+    const void *prop;
+
+    readstr(pname, propname);
+    if (vallen == sizeof(uint32_t) &&
+        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
+         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
+
+        sm->rtas_base = readuint32(valaddr);
+        prop = fdt_getprop_namelen(sm->fdt_blob,
+                                   fdt_node_offset_by_phandle(sm->fdt_blob,
+                                                              nodeph),
+                                   propname, strlen(propname), &proplen);
+        if (proplen == vallen) {
+            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
+            ret = proplen;
+        }
+    }
+    trace_spapr_client_setprop(nodeph, propname, ret);
+
+    return ret;
+}
+
+static uint32_t client_getproplen(const void *fdt, uint32_t nodeph,
+                                  uint32_t pname)
+{
+    char propname[64];
+    uint32_t ret = 0;
+    int proplen = 0;
+    const void *prop;
+
+    readstr(pname, propname);
+
+    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
+                               propname, strlen(propname), &proplen);
+    if (prop) {
+        ret = proplen;
+    } else if (strncmp(propname, "stdout", 6) == 0) {
+        ret = 4;
+    } else {
+        ret = -1;
+    }
+
+    return ret;
+}
+
+static uint32_t client_peer(const void *fdt, uint32_t phandle)
+{
+    int ret;
+
+    if (phandle == 0) {
+        ret = fdt_path_offset(fdt, "/");
+    } else {
+        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
+    }
+
+    if (ret < 0) {
+        ret = 0;
+    } else {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return ret;
+}
+
+static uint32_t client_child(const void *fdt, uint32_t phandle)
+{
+    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
+
+    if (ret < 0) {
+        ret = 0;
+    } else {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return ret;
+}
+
+static uint32_t client_parent(const void *fdt, uint32_t phandle)
+{
+    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
+
+    if (ret < 0) {
+        ret = 0;
+    } else {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return ret;
+}
+
+static uint32_t client_open(uint32_t phandle)
+{
+    uint32_t ret = (phandle & CLI_PH_MASK) | CLI_INST_PREFIX;
+
+    return ret;
+}
+
+static uint32_t client_instance_to_path(SpaprMachineState *sm,
+                                        uint32_t instance, uint32_t buf,
+                                        uint32_t len)
+{
+    uint32_t ph = instance & CLI_PH_MASK;
+    char *stdout_path = spapr_vio_stdout_path(sm->vio_bus);
+    int stdout_ph = fdt_path_offset(sm->fdt_blob, stdout_path);
+
+    if (ph == stdout_ph) {
+        len = MIN(len, strlen(stdout_path));
+        cpu_physical_memory_write(buf, stdout_path, len);
+    } else  {
+        len = 0;
+    }
+    g_free(stdout_path);
+
+    return len;
+}
+
+static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
+                                       uint32_t buf, uint32_t len)
+{
+    char tmp[256];
+
+    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
+                          sizeof(tmp))) {
+        tmp[sizeof(tmp) - 1] = 0;
+        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
+    }
+    return len;
+}
+
+static uint32_t client_write(uint32_t instance, uint32_t buf, uint32_t len)
+{
+    char tmp[len + 1]; /* TODO: do a loop as len might be enourmous*/
+
+    readstr(buf, tmp);
+    tmp[len] = 0;
+    printf("%s", tmp);
+
+    return len;
+}
+
+static bool client_claim_avail(SpaprMachineState *sm, uint64_t virt,
+                               uint64_t size)
+{
+    int i;
+    spapr_of_claimed *c;
+
+    for (i = 0; i < sm->claimed->len; ++i) {
+        c = &g_array_index(sm->claimed, spapr_of_claimed, i);
+        if ((c->start <= virt && virt < c->start + c->size) ||
+            (virt <= c->start && c->start < virt + size)) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+uint64_t spapr_do_client_claim(SpaprMachineState *sm, uint64_t virt,
+                               uint64_t size, uint64_t align)
+{
+    uint32_t ret;
+    spapr_of_claimed newclaim;
+
+    if (align == 0) {
+        if (!client_claim_avail(sm, virt, size)) {
+            return -1;
+        }
+        ret = virt;
+    } else {
+        align = pow2ceil(align);
+        sm->claimed_base = (sm->claimed_base + align - 1) & ~(align - 1);
+        while (1) {
+            if (sm->claimed_base >= sm->rma_size) {
+                perror("Out of memory");
+                return -1;
+            }
+            if (client_claim_avail(sm, sm->claimed_base, size)) {
+                break;
+            }
+            sm->claimed_base += size;
+        }
+        ret = sm->claimed_base;
+    }
+
+    sm->claimed_base = MAX(sm->claimed_base, ret) + size;
+
+    newclaim.start = virt;
+    newclaim.size = size;
+    g_array_append_val(sm->claimed, newclaim);
+    trace_spapr_client_claim(virt, size, align, ret);
+
+    return ret;
+}
+
+static uint32_t client_claim(SpaprMachineState *sm, uint32_t virt,
+                             uint32_t size, uint32_t align)
+{
+    return spapr_do_client_claim(sm, virt, size, align);
+}
+
+static uint32_t client_nextprop(const void *fdt, uint32_t phandle,
+                                uint32_t prevaddr, uint32_t nameaddr)
+{
+    int namelen = 0;
+    int offset = fdt_node_offset_by_phandle(fdt, phandle);
+    char prev[256];
+    const struct fdt_property *prop;
+    const char *tmp;
+
+    readstr(prevaddr, prev);
+    for (offset = fdt_first_property_offset(fdt, offset);
+         (offset >= 0);
+         (offset = fdt_next_property_offset(fdt, offset))) {
+
+        prop = fdt_get_property_by_offset(fdt, offset, &namelen);
+        if (!prop) {
+            return 0;
+        }
+
+        tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
+        if (prev[0] == 0 ||
+            strncmp(prev, tmp, MAX(namelen, strlen(prev))) == 0) {
+
+            if (prev[0]) {
+                offset = fdt_next_property_offset(fdt, offset);
+                if (offset < 0) {
+                    return 0;
+                }
+            }
+            prop = fdt_get_property_by_offset(fdt, offset, &namelen);
+            if (!prop) {
+                return 0;
+            }
+
+            tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
+            cpu_physical_memory_write(nameaddr, tmp, namelen + 1);
+            return 1;
+        }
+    }
+
+    return 0;
+}
+
+static uint32_t client_call_method(SpaprMachineState *sm, uint32_t methodaddr,
+                                   uint32_t param1, uint32_t param2,
+                                   uint32_t *ret2)
+{
+    uint32_t ret = 0;
+    char method[256];
+
+    readstr(methodaddr, method);
+    if (strncmp(method, "ibm,client-architecture-support", 31) == 0) {
+
+#define FDT_MAX_SIZE            0x100000
+        ret = do_client_architecture_support(POWERPC_CPU(first_cpu), sm, param2,
+                                             0, FDT_MAX_SIZE);
+        *ret2 = 0;
+    } else if (strncmp(method, "instantiate-rtas", 16) == 0) {
+        uint32_t rtasbase = param2;
+
+        spapr_instantiate_rtas(sm, rtasbase);
+        *ret2 = rtasbase;
+    } else {
+        trace_spapr_client_error_unknown_method(method);
+        return -1;
+    }
+
+    trace_spapr_client_method(method, param1, param2, ret);
+
+    return ret;
+}
+
+static void client_quiesce(SpaprMachineState *sm)
+{
+}
+
+int spapr_h_client(SpaprMachineState *sm, target_ulong client_args)
+{
+    struct prom_args args = { 0 };
+    char service[64];
+    unsigned nargs, nret;
+    int i, servicelen;
+
+    cpu_physical_memory_read(client_args, &args, sizeof(args));
+    nargs = be32_to_cpu(args.nargs);
+    nret = be32_to_cpu(args.nret);
+    readstr(be32_to_cpu(args.service), service);
+    servicelen = strlen(service);
+
+#define cmpservice(s, a, r) \
+        _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
+
+    if (cmpservice("finddevice", 1, 1)) {
+        args.args[nargs] = client_finddevice(sm->fdt_blob,
+                                             be32_to_cpu(args.args[0]));
+    } else if (cmpservice("getprop", 4, 1)) {
+        args.args[nargs] = client_getprop(sm->fdt_blob,
+                                          be32_to_cpu(args.args[0]),
+                                          be32_to_cpu(args.args[1]),
+                                          be32_to_cpu(args.args[2]),
+                                          be32_to_cpu(args.args[3]));
+    } else if (cmpservice("getproplen", 2, 1)) {
+        args.args[nargs] = client_getproplen(sm->fdt_blob,
+                                             be32_to_cpu(args.args[0]),
+                                             be32_to_cpu(args.args[1]));
+    } else if (cmpservice("setprop", 4, 1)) {
+        args.args[nargs] = client_setprop(sm,
+                                          be32_to_cpu(args.args[0]),
+                                          be32_to_cpu(args.args[1]),
+                                          be32_to_cpu(args.args[2]),
+                                          be32_to_cpu(args.args[3]));
+    } else if (cmpservice("instance-to-path", 3, 1)) {
+        args.args[nargs] = client_instance_to_path(sm,
+                                                   be32_to_cpu(args.args[0]),
+                                                   be32_to_cpu(args.args[1]),
+                                                   be32_to_cpu(args.args[2]));
+    } else if (cmpservice("package-to-path", 3, 1)) {
+        args.args[nargs] = client_package_to_path(sm->fdt_blob,
+                                                  be32_to_cpu(args.args[0]),
+                                                  be32_to_cpu(args.args[1]),
+                                                  be32_to_cpu(args.args[2]));
+    } else if (cmpservice("write", 3, 1)) {
+        args.args[nargs] = client_write(be32_to_cpu(args.args[0]),
+                                        be32_to_cpu(args.args[1]),
+                                        be32_to_cpu(args.args[2]));
+    } else if (cmpservice("peer", 1, 1)) {
+        args.args[nargs] = client_peer(sm->fdt_blob,
+                                       be32_to_cpu(args.args[0]));
+    } else if (cmpservice("child", 1, 1)) {
+        args.args[nargs] = client_child(sm->fdt_blob,
+                                        be32_to_cpu(args.args[0]));
+    } else if (cmpservice("parent", 1, 1)) {
+        args.args[nargs] = client_parent(sm->fdt_blob,
+                                         be32_to_cpu(args.args[0]));
+    } else if (cmpservice("open", 1, 1)) {
+        args.args[nargs] = client_open(be32_to_cpu(args.args[0]));
+    } else if (cmpservice("call-method", 3, 2)) {
+        uint32_t ret2 = 0;
+
+        args.args[nargs] = client_call_method(sm,
+                                              be32_to_cpu(args.args[0]),
+                                              be32_to_cpu(args.args[1]),
+                                              be32_to_cpu(args.args[2]),
+                                              &ret2);
+        args.args[nargs + 1] = ret2;
+    } else if (cmpservice("claim", 3, 1)) {
+        args.args[nargs] = client_claim(sm,
+                                        be32_to_cpu(args.args[0]),
+                                        be32_to_cpu(args.args[1]),
+                                        be32_to_cpu(args.args[2]));
+    } else if (cmpservice("nextprop", 3, 1)) {
+        args.args[nargs] = client_nextprop(sm->fdt_blob,
+                                           be32_to_cpu(args.args[0]),
+                                           be32_to_cpu(args.args[1]),
+                                           be32_to_cpu(args.args[2]));
+    } else if (cmpservice("quiesce", 0, 0)) {
+        client_quiesce(sm);
+    } else if (cmpservice("exit", 0, 0)) {
+        error_report("Stopped as the VM requested \"exit\"");
+        vm_stop(RUN_STATE_PAUSED);
+    } else {
+        trace_spapr_client_error_unknown_service(service, nargs, nret);
+        args.args[nargs] = -1;
+    }
+
+    for (i = 0; i < nret; ++i) {
+        args.args[nargs + i] = be32_to_cpu(args.args[nargs + i]);
+    }
+    cpu_physical_memory_write(client_args, &args, sizeof(args));
+
+    return H_SUCCESS;
+}
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index f1799b1b707d..76f93ec37959 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1660,15 +1660,12 @@ static bool spapr_hotplugged_dev_before_cas(void)
     return false;
 }
 
-static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
-                                                  SpaprMachineState *spapr,
-                                                  target_ulong opcode,
-                                                  target_ulong *args)
+target_ulong do_client_architecture_support(PowerPCCPU *cpu,
+                                            SpaprMachineState *spapr,
+                                            target_ulong addr,
+                                            target_ulong fdt_buf,
+                                            target_ulong fdt_bufsize)
 {
-    /* Working address in data buffer */
-    target_ulong addr = ppc64_phys_to_real(args[0]);
-    target_ulong fdt_buf = args[1];
-    target_ulong fdt_bufsize = args[2];
     target_ulong ov_table;
     uint32_t cas_pvr;
     SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
@@ -1833,13 +1830,15 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
 
         fdt_bufsize -= sizeof(hdr);
 
-        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
+        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
         _FDT((fdt_pack(fdt)));
 
-        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
-        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
-                                  fdt_totalsize(fdt));
-        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
+        if (spapr->bios_enabled) {
+            cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
+            cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
+                                      fdt_totalsize(fdt));
+            trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
+        }
 
         g_free(spapr->fdt_blob);
         spapr->fdt_size = fdt_totalsize(fdt);
@@ -1854,6 +1853,20 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
     return H_SUCCESS;
 }
 
+static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
+                                                  SpaprMachineState *spapr,
+                                                  target_ulong opcode,
+                                                  target_ulong *args)
+{
+    /* Working address in data buffer */
+    target_ulong addr = ppc64_phys_to_real(args[0]);
+    target_ulong fdt_buf = args[1];
+    target_ulong fdt_bufsize = args[2];
+
+    return do_client_architecture_support(cpu, spapr, addr, fdt_buf,
+                                          fdt_bufsize);
+}
+
 static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
                                               SpaprMachineState *spapr,
                                               target_ulong opcode,
@@ -1998,6 +2011,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                             target_ulong opcode, target_ulong *args)
+{
+    target_ulong client_args = ppc64_phys_to_real(args[0]);
+
+    return spapr_h_client(spapr, client_args);
+}
+
 static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
 static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
 static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
@@ -2121,6 +2142,8 @@ static void hypercall_register_types(void)
 
     spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
 
+    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
+
     /* Virtual Processor Home Node */
     spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
                              h_home_node_associativity);
diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
index 9ea620f23c85..92bce3e44693 100644
--- a/hw/ppc/trace-events
+++ b/hw/ppc/trace-events
@@ -21,6 +21,15 @@ spapr_update_dt(unsigned cb) "New blob %u bytes"
 spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
 spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
 
+# spapr_client.c
+spapr_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
+spapr_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
+spapr_client_error_unknown_method(const char *method) "%s"
+spapr_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
+spapr_client_method(const char *method, uint32_t param1, uint32_t param2, uint32_t ret) "%s(0x%x, 0x%x) => 0x%x"
+spapr_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
+spapr_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
+
 # spapr_hcall_tpm.c
 spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
 spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-05 23:42 [PATCH qemu v2] spapr: Kill SLOF Alexey Kardashevskiy
@ 2020-01-06  4:19 ` David Gibson
  2020-01-06  6:28   ` Alexey Kardashevskiy
                     ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: David Gibson @ 2020-01-06  4:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 50138 bytes --]

On Mon, Jan 06, 2020 at 10:42:42AM +1100, Alexey Kardashevskiy wrote:
> The Petitboot bootloader is way more advanced than SLOF is ever going to
> be as Petitboot comes with the full-featured Linux kernel with all
> the drivers, and initramdisk with quite user friendly interface.
> The problem with ditching SLOF is that an unmodified pseries kernel can
> either start via:
> 1. kexec, this requires presence of RTAS and skips
> ibm,client-architecture-support entirely;
> 2. normal boot, this heavily relies on the OF1275 client interface to
> fetch the device tree and do early setup (claim memory).
> 
> This adds a new bios-less mode to the pseries machine: "bios=on|off".
> When enabled, QEMU does not load SLOF and jumps to the kernel from
> "-kernel".
> 
> The client interface is implemented exactly as RTAS - a 20 bytes blob,
> right next after the RTAS blob. The entry point is passed to the kernel
> via GPR5.
> 
> This implements a handful of client interface methods just to get going.
> In particular, this implements the device tree fetching,
> ibm,client-architecture-support and instantiate-rtas.
> 
> This implements changing FDT properties only for "linux,rtas-base" and
> "linux,rtas-entry", just to get early boot going.
> 
> This assigns "phandles" to device tree nodes as there is no more SLOF
> and OF nodes addresses of which served as phandle values.
> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
> phandles are regenerated at every FDT rebuild.
> 
> When bios=off, this adds "/chosen" every time QEMU builds a tree.
> 
> This implements "claim" which the client (Linux) uses for memory
> allocation; this is also  used by QEMU for claiming kernel/initrd images,
> client interface entry point, RTAS and the initial stack.
> 
> While at this, add a "kernel-addr" machine parameter to allow moving
> the kernel in memory. This is useful for debugging if the kernel is
> loaded at @0, although not necessary.
> 
> This does not implement instances properly but in order to boot a VM,
> we only really need a stdout instance and the "/" instance for
> "call-method", we fake these.

As we've discussed, I really like the idea of this.  It think the
basic approach looks good too.

As you probably realize, there are quite a lot of things to be
polished though.  Comments below.

> 
> The test command line:
> 
> qemu-system-ppc64 \
> -nodefaults \
> -chardev stdio,id=STDIO0,signal=off,mux=on \
> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> -mon id=MON0,chardev=STDIO0,mode=readline \
> -nographic \
> -vga none \
> -kernel vmldbg \
> -machine pseries,bios=off \
> -m 4G \
> -enable-kvm \
> -initrd pb/rootfs.cpio.xz \
> img/u1804-64le.qcow2 \
> -snapshot \
> -smp 8,threads=8 \
> -L /home/aik/t/qemu-ppc64-bios/ \
> -trace events=qemu_trace_events \
> -d guest_errors \
> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh37874 \
> -mon chardev=SOCKET0,mode=control
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v2:
> * fixed claim()
> * added "setprop"
> * cleaner client interface and RTAS blobs management
> * boots to petitboot and further to the target system
> * more trace points
> ---
>  hw/ppc/Makefile.objs   |   1 +
>  include/hw/loader.h    |   1 +
>  include/hw/ppc/spapr.h |  25 ++-
>  hw/ppc/spapr.c         | 231 ++++++++++++++++++--
>  hw/ppc/spapr_client.c  | 464 +++++++++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_hcall.c   |  49 +++--
>  hw/ppc/trace-events    |   9 +
>  7 files changed, 743 insertions(+), 37 deletions(-)
>  create mode 100644 hw/ppc/spapr_client.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 101e9fc59185..ce31c0acd2fb 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
>  obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
> +obj-$(CONFIG_PSERIES) += spapr_client.o

This applies in a bunch of places here.  Just calling things "client"
is clear enough if you're already thinking about Open Firmware.  But
this appears in a bunch of places where you might come across it
without that context, in which case it could be kind of confusing.  So
I'd suggest using "of_client" or "of_client_interface" in most places
you're using "client".

>  obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>  # IBM PowerNV
>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> diff --git a/include/hw/loader.h b/include/hw/loader.h
> index 48a96cd55904..a2f58077a47e 100644
> --- a/include/hw/loader.h
> +++ b/include/hw/loader.h
> @@ -262,6 +262,7 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
>  int rom_add_elf_program(const char *name, GMappedFile *mapped_file, void *data,
>                          size_t datasize, size_t romsize, hwaddr addr,
>                          AddressSpace *as);
> +bool rom_intersect(uint64_t start, uint64_t size);

I don't see this called, so I'm guessing it's a stray chunk leftover.

>  int rom_check_and_register_reset(void);
>  void rom_set_fw(FWCfgState *f);
>  void rom_set_order_override(int order);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 61f005c6f686..876879d12029 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
>      uint8_t caps[SPAPR_CAP_NUM];
>  };
>  
> +typedef struct {
> +    uint64_t start;
> +    uint64_t size;
> +} spapr_of_claimed;

Please use the qemu convention of StudlyCaps for types.

>  /**
>   * SpaprMachineClass:
>   */
> @@ -158,8 +163,13 @@ struct SpaprMachineState {
>      uint32_t fdt_size;
>      uint32_t fdt_initial_size;
>      void *fdt_blob;
> +    uint32_t rtas_base;
>      long kernel_size;
>      bool kernel_le;
> +    uint64_t kernel_addr;

This would intersect with your other patch to allow changing the
kernel load address, yes?

> +    bool bios_enabled;
> +    GArray *claimed; /* array of spapr_of_claimed */
> +    uint64_t claimed_base;

I'd suggest moving all the client interface related fields to a
visually separated chunk in the structure definition.

>      uint32_t initrd_base;
>      long initrd_size;
>      uint64_t rtc_offset; /* Now used only during incoming migration */
> @@ -510,7 +520,8 @@ struct SpaprMachineState {
>  /* Client Architecture support */
>  #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
>  #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
>  
>  /*
>   * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
> @@ -538,6 +549,12 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
>  target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
>                               target_ulong *args);
>  
> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> +                                            SpaprMachineState *spapr,
> +                                            target_ulong addr,
> +                                            target_ulong fdt_buf,
> +                                            target_ulong fdt_bufsize);
> +
>  /* Virtual Processor Area structure constants */
>  #define VPA_MIN_SIZE           640
>  #define VPA_SIZE_OFFSET        0x4
> @@ -769,6 +786,9 @@ struct SpaprEventLogEntry {
>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
>  void spapr_events_init(SpaprMachineState *sm);
>  void spapr_dt_events(SpaprMachineState *sm, void *fdt);
> +uint64_t spapr_do_client_claim(SpaprMachineState *sm, uint64_t virt,
> +                               uint64_t size, uint64_t align);
> +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args);
>  void close_htab_fd(SpaprMachineState *spapr);
>  void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
>  void spapr_free_hpt(SpaprMachineState *spapr);
> @@ -891,4 +911,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
>  #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
>  
>  void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
> +
> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
> +
>  #endif /* HW_SPAPR_H */
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index e62c89b3dd40..1c97534a0aea 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -896,6 +896,51 @@ out:
>      return ret;
>  }
>  
> +/*
> + * Below is a compiled version of RTAS blob and OF client interface entry point.
> + *
> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
> + *
> + *   .globl  _start
> + *   _start:
> + *           mr      4,3
> + *           lis     3,KVMPPC_H_RTAS@h
> + *           ori     3,3,KVMPPC_H_RTAS@l
> + *           sc      1
> + *           blr
> + *           mr      4,3
> + *           lis     3,KVMPPC_H_CLIENT@h
> + *           ori     3,3,KVMPPC_H_CLIENT@l
> + *           sc      1
> + *           blr
> + */
> +static struct {
> +    uint8_t rtas[20], client[20];
> +} QEMU_PACKED rtas_client_blob = {
> +    .rtas = {
> +        0x7c, 0x64, 0x1b, 0x78,
> +        0x3c, 0x60, 0x00, 0x00,
> +        0x60, 0x63, 0xf0, 0x00,
> +        0x44, 0x00, 0x00, 0x22,
> +        0x4e, 0x80, 0x00, 0x20
> +    },
> +    .client = {
> +        0x7c, 0x64, 0x1b, 0x78,
> +        0x3c, 0x60, 0x00, 0x00,
> +        0x60, 0x63, 0xf0, 0x05,
> +        0x44, 0x00, 0x00, 0x22,
> +        0x4e, 0x80, 0x00, 0x20
> +    }
> +};

I'd really prefer to read this in from a file (or files) which we
assemble (as with the existing spapr-rtas.img), rather than having a
magic array in qemu.

> +
> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
> +{
> +    spapr->rtas_base = base;
> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
> +                              sizeof(rtas_client_blob.rtas));
> +}
> +
>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>  {
>      MachineState *ms = MACHINE(spapr);
> @@ -980,6 +1025,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>      _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
>                       lrdr_capacity, sizeof(lrdr_capacity)));
>  
> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-base", spapr->rtas_base));
> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-entry", spapr->rtas_base));

IIUC, we shouldn't need to add these in bios=off mode, yes?  Instead
the client must instantiate-rtas before attempting to use it.

> +    _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
> +                          sizeof(rtas_client_blob.rtas)));
> +    _FDT(fdt_add_mem_rsv(fdt, spapr->rtas_base,
> +                         sizeof(rtas_client_blob.rtas)));

Nor should we need the mem_rsv with bios=off.

> +
>      spapr_dt_rtas_tokens(fdt, rtas);
>  }
>  
> @@ -1057,7 +1109,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>      }
>  
>      if (spapr->kernel_size) {
> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
>                                cpu_to_be64(spapr->kernel_size) };
>  
>          _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
> @@ -1104,6 +1156,9 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>          _FDT(fdt_setprop_cell(fdt, chosen, "linux,pci-probe-only", 0));
>      }
>  
> +    if (!spapr->bios_enabled) {
> +        _FDT(fdt_setprop_cell(fdt, chosen, "cpu", 0));

What is this for?

> +    }
>      spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>  
>      g_free(stdout_path);
> @@ -1133,6 +1188,28 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
>      }
>  }
>  
> +static void spapr_fdt_add_phandles(void *fdt, int parent, uint32_t *phandle)
> +{
> +    int proplen = 0, node;
> +    const void *prop;
> +    char tmp[256] = "BAAD";
> +
> +    fdt_get_path(fdt, parent, tmp, sizeof(tmp));

You don't actually appear to use tmp.  Is this a debugging leftover?

> +    prop = fdt_getprop_namelen(fdt, parent, "phandle", 7, &proplen);
> +    if (!prop) {
> +        _FDT(fdt_setprop_cell(fdt, parent, "phandle", *phandle));
> +        while (1) {
> +            ++*phandle;
> +            if (fdt_node_offset_by_phandle(fdt, *phandle) < 0) {
> +                break;
> +            }
> +        }
> +    }
> +    fdt_for_each_subnode(node, fdt, parent) {
> +        spapr_fdt_add_phandles(fdt, node, phandle);
> +    }

Oof, this is pretty inefficient, since fdt_node_offset_by_phandle()
will have to scan the whole tree, and you do it on every node.

Obviously, a "live" tree structure rather than fdt would be better
suited for this.  I'd like to move to that at some point, but it will
be a while before we're ready.

I guess this is fine for a PoC, but I think we can do a bit better
without too much extra complexity:
  * Rather than scanning for an existing node with the handle on each
    iteration, you can use fdt_find_max_phandle() once beforehand,
    and start your counter from there, being assured that it won't hit
    any existing phandles

  * The phandles are arbitrary, so you don't actually care about the
    tree structure here.  So instead of recursively descending the
    tree, you can just alter each node in the order they appear in the
    blob.  fdt_next_node() will let you do that, simply by ignoring
    the 'depth' parameter.  [In fact the structure ensures that will
    be a depth first traversal of the tree, just like you have here,
    not that it matters].

  * You should check for overflows of your phandle counter, though
    it's fine to treat that as a fatal error.

> +}
> +
>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>  {
>      MachineState *machine = MACHINE(spapr);
> @@ -1245,7 +1322,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>      /* Build memory reserve map */
>      if (reset) {
>          if (spapr->kernel_size) {
> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
> +                                  spapr->kernel_size)));
>          }
>          if (spapr->initrd_size) {
>              _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
> @@ -1268,12 +1346,19 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>          }
>      }
>  
> +    if (!spapr->bios_enabled) {
> +        uint32_t phandle = 1;
> +
> +        spapr_fdt_add_phandles(fdt, fdt_path_offset(fdt, "/"), &phandle);
> +    }
> +
>      return fdt;
>  }
>  
>  static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
>  {
> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
> +    SpaprMachineState *spapr = opaque;
> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
>  }
>  
>  static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
> @@ -1660,6 +1745,60 @@ static void spapr_machine_reset(MachineState *machine)
>       */
>      fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
>  
> +    /* Set up the entry state */
> +    if (!spapr->bios_enabled) {
> +        if (spapr->claimed) {
> +            g_array_unref(spapr->claimed);
> +        }
> +        spapr->claimed = g_array_new(false, false, sizeof(spapr_of_claimed));
> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
> +
> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr, fdt_addr);
> +
> +        if (spapr_do_client_claim(spapr, spapr->kernel_addr,
> +                                  spapr->kernel_size, 0) == -1) {
> +            error_report("Memory for kernel is in use");
> +            exit(1);
> +        }
> +        if (spapr_do_client_claim(spapr, spapr->initrd_base,
> +                                  spapr->initrd_size, 0) == -1) {
> +            error_report("Memory for initramdisk is in use");
> +            exit(1);
> +        }
> +        first_ppc_cpu->env.gpr[1] = spapr_do_client_claim(spapr, 0, 0x40000,
> +                                                          0x10000);
> +        if (first_ppc_cpu->env.gpr[1] == -1) {
> +            error_report("Memory for stack is in use");
> +            exit(1);
> +        }
> +
> +        first_ppc_cpu->env.gpr[5] =
> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
> +                                  sizeof(rtas_client_blob.client));
> +        if (first_ppc_cpu->env.gpr[5] == -1) {
> +            error_report("Memory for OF client is in use");
> +            exit(1);
> +        }
> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
> +                                  rtas_client_blob.client,
> +                                  sizeof(rtas_client_blob.client));
> +
> +        /* Allocate space for RTAS here so spapr_build_fdt() below picks it */
> +        spapr->rtas_base =
> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.rtas),
> +                                  0x100);
> +        if (spapr->rtas_base == -1) {
> +            error_report("Memory for RTAS is in use");
> +            exit(1);
> +        }
> +        cpu_physical_memory_write(spapr->rtas_base,
> +                                  rtas_client_blob.rtas,
> +                                  sizeof(rtas_client_blob.rtas));

The above is complex enough it might be worth having bios and nobios
variants of spapr_cpu_set_entry_state().

> +    } else {
> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
> +    }
> +
>      fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
>  
>      rc = fdt_pack(fdt);
> @@ -1675,10 +1814,14 @@ static void spapr_machine_reset(MachineState *machine)
>      spapr->fdt_initial_size = spapr->fdt_size;
>      spapr->fdt_blob = fdt;
>  
> -    /* Set up the entry state */
> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> -    first_ppc_cpu->env.gpr[5] = 0;
> -
> +    if (!spapr->bios_enabled) {
> +        /*
> +         * Claim FDT space so initramdisk/zImage are not unpacked over it.
> +         * In practice Linux claim additional memory for FDT and fetches it
> +         * using client interface's "getprop" call.
> +         */

IIUC, it shouldn't actually be necessary to put the fdt into guest
memory in nobios mode - it will be retrieved piecemeal via the client
interface instead, won't it?

> +        spapr_do_client_claim(spapr, fdt_addr, fdt_totalsize(fdt), 0);
> +    }
>      spapr->cas_reboot = false;
>  }
>  
> @@ -2897,12 +3040,12 @@ static void spapr_machine_init(MachineState *machine)
>          uint64_t lowaddr = 0;
>  
>          spapr->kernel_size = load_elf(kernel_filename, NULL,
> -                                      translate_kernel_address, NULL,
> +                                      translate_kernel_address, spapr,
>                                        NULL, &lowaddr, NULL, 1,
>                                        PPC_ELF_MACHINE, 0, 0);
>          if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
>              spapr->kernel_size = load_elf(kernel_filename, NULL,
> -                                          translate_kernel_address, NULL, NULL,
> +                                          translate_kernel_address, spapr, NULL,
>                                            &lowaddr, NULL, 0, PPC_ELF_MACHINE,
>                                            0, 0);
>              spapr->kernel_le = spapr->kernel_size > 0;
> @@ -2918,7 +3061,7 @@ static void spapr_machine_init(MachineState *machine)
>              /* Try to locate the initrd in the gap between the kernel
>               * and the firmware. Add a bit of space just in case
>               */
> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
>                                    + 0x1ffff) & ~0xffff;
>              spapr->initrd_size = load_image_targphys(initrd_filename,
>                                                       spapr->initrd_base,
> @@ -2932,20 +3075,22 @@ static void spapr_machine_init(MachineState *machine)
>          }
>      }
>  
> -    if (bios_name == NULL) {
> -        bios_name = FW_FILE_NAME;
> +    if (spapr->bios_enabled) {
> +        if (bios_name == NULL) {
> +            bios_name = FW_FILE_NAME;
> +        }
> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> +        if (!filename) {
> +            error_report("Could not find LPAR firmware '%s'", bios_name);
> +            exit(1);
> +        }
> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> +        if (fw_size <= 0) {
> +            error_report("Could not load LPAR firmware '%s'", filename);
> +            exit(1);
> +        }
> +        g_free(filename);
>      }
> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> -    if (!filename) {
> -        error_report("Could not find LPAR firmware '%s'", bios_name);
> -        exit(1);
> -    }
> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> -    if (fw_size <= 0) {
> -        error_report("Could not load LPAR firmware '%s'", filename);
> -        exit(1);
> -    }
> -    g_free(filename);
>  
>      /* FIXME: Should register things through the MachineState's qdev
>       * interface, this is a legacy from the sPAPREnvironment structure
> @@ -3162,6 +3307,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
>      visit_type_uint32(v, name, (uint32_t *)opaque, errp);
>  }
>  
> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> +}
> +
> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> +}

Might be clearer to split the adjustable kernel addr stuff out into a
separate patch - in fact, didn't you have such a patch before?

> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> +
> +    return spapr->bios_enabled;
> +}
> +
> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> +
> +    spapr->bios_enabled = value;
> +}
> +
>  static char *spapr_get_ic_mode(Object *obj, Error **errp)
>  {
>      SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> @@ -3267,6 +3438,20 @@ static void spapr_instance_init(Object *obj)
>      object_property_add_bool(obj, "vfio-no-msix-emulation",
>                               spapr_get_msix_emulation, NULL, NULL);
>  
> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
> +                        &error_abort);
> +    object_property_set_description(obj, "kernel-addr",
> +                                    stringify(KERNEL_LOAD_ADDR)
> +                                    " for -kernel is the default",
> +                                    NULL);
> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
> +                            spapr_set_bios_enabled, NULL);
> +    object_property_set_description(obj, "bios", "Conrols whether to load bios",
> +                                    NULL);
> +    spapr->bios_enabled = true;
> +
>      /* The machine class defines the default interrupt controller mode */
>      spapr->irq = smc->irq;
>      object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
> diff --git a/hw/ppc/spapr_client.c b/hw/ppc/spapr_client.c
> new file mode 100644
> index 000000000000..a318eaae3c40
> --- /dev/null
> +++ b/hw/ppc/spapr_client.c
> @@ -0,0 +1,464 @@
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qapi/error.h"
> +#include "exec/memory.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/ppc/spapr_vio.h"
> +#include "trace.h"
> +
> +struct prom_args {

StudlyCaps and typedef, please..

> +        __be32 service;
> +        __be32 nargs;
> +        __be32 nret;
> +        __be32 args[10];
> +};
> +
> +#define CLI_PH_MASK     0x0FFFFFFF
> +#define CLI_INST_PREFIX 0x20000000
> +
> +#define readstr(pa, buf) cpu_physical_memory_read((pa), (buf), sizeof(buf))

I'd prefer "readbuf" or something, since this isn't actually looking
at any string structure.

> +#define readuint32(pa) ({ \
> +    uint32_t __tmp__; \
> +    cpu_physical_memory_read((pa), &__tmp__, sizeof(__tmp__)); \
> +    be32_to_cpu(__tmp__); })

You can simplify this with ldl_be_phys().  In fact, then you probably
don't need any wrapper at all.

> +
> +static bool _cmpservice(const char *s, size_t len,
> +                        unsigned nargs, unsigned nret,
> +                        const char *s1, size_t len1,
> +                        unsigned nargscheck, unsigned nretcheck)
> +{
> +    if (strncmp(s, s1, MAX(len, len1))) {
> +        return false;
> +    }
> +    if (nargscheck == 0 && nretcheck == 0) {
> +        return true;
> +    }
> +    if (nargs != nargscheck || nret != nretcheck) {
> +        trace_spapr_client_error_param(s, nargscheck, nretcheck, nargs, nret);
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +static uint32_t client_finddevice(const void *fdt, uint32_t nodeaddr)
> +{
> +    char node[256];
> +    int ret;
> +
> +    readstr(nodeaddr, node);
> +    ret = fdt_path_offset(fdt, node);
> +    if (ret >= 0) {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return (uint32_t) ret;
> +}
> +
> +static uint32_t client_getprop(const void *fdt, uint32_t nodeph, uint32_t pname,
> +            uint32_t valaddr, uint32_t vallen)
> +{
> +    char propname[64];
> +    uint32_t ret = 0;
> +    int proplen = 0;
> +    const void *prop;
> +
> +    readstr(pname, propname);
> +
> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> +                               propname, strlen(propname), &proplen);

You can just use fdt_getprop().  By definition it's equivalent to
fdt_getprop_namelen(.., propname, strlen(propname), ...)

> +    if (prop) {
> +        int cb = MIN(proplen, vallen);
> +
> +        cpu_physical_memory_write(valaddr, prop, cb);
> +        ret = cb;
> +    } else if (strncmp(propname, "stdout", 6) == 0 && vallen == sizeof(ret)) {
> +        ret = cpu_to_be32(1);
> +        cpu_physical_memory_write(valaddr, &ret, MIN(vallen, sizeof(ret)));

I'm guessing the special casing of stdout is to fake the OF instancing
stuff, which I don't really understand.  Some comments to explain
what's going on here would be good.

> +    } else {
> +        ret = -1;
> +    }
> +    trace_spapr_client_getprop(nodeph, propname, ret);
> +
> +    return ret;
> +}
> +
> +static uint32_t client_setprop(SpaprMachineState *sm,
> +                               uint32_t nodeph, uint32_t pname,
> +                               uint32_t valaddr, uint32_t vallen)
> +{
> +    char propname[64];
> +    uint32_t ret = -1;
> +    int proplen = 0;
> +    const void *prop;
> +
> +    readstr(pname, propname);
> +    if (vallen == sizeof(uint32_t) &&
> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
> +
> +        sm->rtas_base = readuint32(valaddr);
> +        prop = fdt_getprop_namelen(sm->fdt_blob,
> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
> +                                                              nodeph),
> +                                   propname, strlen(propname), &proplen);
> +        if (proplen == vallen) {
> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
> +            ret = proplen;
> +        }

Is there a particular reason to restrict this to the rtas properties,
rather than just allowing the guest to fdt_setprop() something
arbitrary?

> +    }
> +    trace_spapr_client_setprop(nodeph, propname, ret);
> +
> +    return ret;
> +}
> +
> +static uint32_t client_getproplen(const void *fdt, uint32_t nodeph,
> +                                  uint32_t pname)
> +{
> +    char propname[64];
> +    uint32_t ret = 0;
> +    int proplen = 0;
> +    const void *prop;
> +
> +    readstr(pname, propname);
> +
> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> +                               propname, strlen(propname), &proplen);
> +    if (prop) {
> +        ret = proplen;
> +    } else if (strncmp(propname, "stdout", 6) == 0) {
> +        ret = 4;
> +    } else {
> +        ret = -1;
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t client_peer(const void *fdt, uint32_t phandle)
> +{
> +    int ret;
> +
> +    if (phandle == 0) {
> +        ret = fdt_path_offset(fdt, "/");
> +    } else {
> +        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> +    }
> +
> +    if (ret < 0) {
> +        ret = 0;
> +    } else {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t client_child(const void *fdt, uint32_t phandle)
> +{
> +    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> +
> +    if (ret < 0) {
> +        ret = 0;
> +    } else {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t client_parent(const void *fdt, uint32_t phandle)
> +{
> +    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> +
> +    if (ret < 0) {
> +        ret = 0;
> +    } else {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t client_open(uint32_t phandle)
> +{
> +    uint32_t ret = (phandle & CLI_PH_MASK) | CLI_INST_PREFIX;
> +
> +    return ret;
> +}
> +
> +static uint32_t client_instance_to_path(SpaprMachineState *sm,
> +                                        uint32_t instance, uint32_t buf,
> +                                        uint32_t len)
> +{
> +    uint32_t ph = instance & CLI_PH_MASK;
> +    char *stdout_path = spapr_vio_stdout_path(sm->vio_bus);
> +    int stdout_ph = fdt_path_offset(sm->fdt_blob, stdout_path);
> +
> +    if (ph == stdout_ph) {
> +        len = MIN(len, strlen(stdout_path));
> +        cpu_physical_memory_write(buf, stdout_path, len);
> +    } else  {
> +        len = 0;
> +    }
> +    g_free(stdout_path);
> +
> +    return len;
> +}
> +
> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
> +                                       uint32_t buf, uint32_t len)
> +{
> +    char tmp[256];

Fixed sized buffers are icky.  You could either dynamically allocate
this based on the size the client gives, or you could use
memory_region_get_ram_ptr() to read the data from the tree directly
into guest memory.

> +    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
> +                          sizeof(tmp))) {
> +        tmp[sizeof(tmp) - 1] = 0;
> +        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
> +    }
> +    return len;
> +}
> +
> +static uint32_t client_write(uint32_t instance, uint32_t buf, uint32_t len)
> +{
> +    char tmp[len + 1]; /* TODO: do a loop as len might be enourmous*/
> +
> +    readstr(buf, tmp);
> +    tmp[len] = 0;
> +    printf("%s", tmp);

I'm guessing this is what's used for console output via OF.  This is
enough for a debugging PoC, but for real we'll need to have this
actually go via the appropriate qemu chardev.

> +
> +    return len;
> +}
> +
> +static bool client_claim_avail(SpaprMachineState *sm, uint64_t virt,
> +                               uint64_t size)
> +{
> +    int i;
> +    spapr_of_claimed *c;
> +
> +    for (i = 0; i < sm->claimed->len; ++i) {
> +        c = &g_array_index(sm->claimed, spapr_of_claimed, i);
> +        if ((c->start <= virt && virt < c->start + c->size) ||
> +            (virt <= c->start && c->start < virt + size)) {
> +            return false;
> +        }
> +    }
> +
> +    return true;
> +}
> +
> +uint64_t spapr_do_client_claim(SpaprMachineState *sm,

SpaprMachineState * are usually called 'spapr', not 'sm'.

> uint64_t virt,
> +                               uint64_t size, uint64_t align)
> +{
> +    uint32_t ret;
> +    spapr_of_claimed newclaim;
> +
> +    if (align == 0) {
> +        if (!client_claim_avail(sm, virt, size)) {
> +            return -1;
> +        }
> +        ret = virt;
> +    } else {
> +        align = pow2ceil(align);
> +        sm->claimed_base = (sm->claimed_base + align - 1) & ~(align - 1);
> +        while (1) {
> +            if (sm->claimed_base >= sm->rma_size) {
> +                perror("Out of memory");

This should be an error_report() rather than a perror().

> +                return -1;
> +            }
> +            if (client_claim_avail(sm, sm->claimed_base, size)) {
> +                break;
> +            }
> +            sm->claimed_base += size;
> +        }
> +        ret = sm->claimed_base;
> +    }
> +
> +    sm->claimed_base = MAX(sm->claimed_base, ret) + size;

I understand the logic with claimed_base.  It looks like you're
implementing a bump allocator, but isn't the the client passing in the
address to claim anyway, so why do you need an allocator?

> +    newclaim.start = virt;
> +    newclaim.size = size;
> +    g_array_append_val(sm->claimed, newclaim);
> +    trace_spapr_client_claim(virt, size, align, ret);
> +
> +    return ret;
> +}
> +
> +static uint32_t client_claim(SpaprMachineState *sm, uint32_t virt,
> +                             uint32_t size, uint32_t align)
> +{
> +    return spapr_do_client_claim(sm, virt, size, align);

Um.. why this wrapper?

> +}
> +
> +static uint32_t client_nextprop(const void *fdt, uint32_t phandle,
> +                                uint32_t prevaddr, uint32_t nameaddr)
> +{
> +    int namelen = 0;
> +    int offset = fdt_node_offset_by_phandle(fdt, phandle);
> +    char prev[256];

Ew, fixed sized buffer.

> +    const struct fdt_property *prop;
> +    const char *tmp;
> +
> +    readstr(prevaddr, prev);
> +    for (offset = fdt_first_property_offset(fdt, offset);
> +         (offset >= 0);
> +         (offset = fdt_next_property_offset(fdt, offset))) {
> +
> +        prop = fdt_get_property_by_offset(fdt, offset, &namelen);

fdt_getprop_by_offset() also returns the property's name without
having to dick around with fdt_get_string() manually.

> +        if (!prop) {
> +            return 0;
> +        }
> +
> +        tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
> +        if (prev[0] == 0 ||

== '\0' please.

> +            strncmp(prev, tmp, MAX(namelen, strlen(prev))) == 0) {

This will break very badly if the client passed a name larger than
sizeof(prev), since readstr() won't \0 terminate in that case IIUC.

> +            if (prev[0]) {
> +                offset = fdt_next_property_offset(fdt, offset);
> +                if (offset < 0) {
> +                    return 0;
> +                }
> +            }
> +            prop = fdt_get_property_by_offset(fdt, offset, &namelen);
> +            if (!prop) {
> +                return 0;
> +            }
> +
> +            tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
> +            cpu_physical_memory_write(nameaddr, tmp, namelen + 1);
> +            return 1;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static uint32_t client_call_method(SpaprMachineState *sm, uint32_t methodaddr,
> +                                   uint32_t param1, uint32_t param2,
> +                                   uint32_t *ret2)
> +{
> +    uint32_t ret = 0;
> +    char method[256];

More fixed sized buffers.

> +    readstr(methodaddr, method);
> +    if (strncmp(method, "ibm,client-architecture-support", 31) == 0) {
> +
> +#define FDT_MAX_SIZE            0x100000
> +        ret = do_client_architecture_support(POWERPC_CPU(first_cpu), sm, param2,
> +                                             0, FDT_MAX_SIZE);
> +        *ret2 = 0;
> +    } else if (strncmp(method, "instantiate-rtas", 16) == 0) {
> +        uint32_t rtasbase = param2;
> +
> +        spapr_instantiate_rtas(sm, rtasbase);
> +        *ret2 = rtasbase;

In both these cases you ignore param1.  I'm guessing it has the handle
of the node on which you're calling the method?  In which case you
really should validate it.

> +    } else {
> +        trace_spapr_client_error_unknown_method(method);
> +        return -1;
> +    }
> +
> +    trace_spapr_client_method(method, param1, param2, ret);
> +
> +    return ret;
> +}
> +
> +static void client_quiesce(SpaprMachineState *sm)
> +{

Possibly want a flush of stdout here.

> +}
> +
> +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args)
> +{
> +    struct prom_args args = { 0 };
> +    char service[64];

More fixed sized buffers.

> +    unsigned nargs, nret;
> +    int i, servicelen;
> +
> +    cpu_physical_memory_read(client_args, &args, sizeof(args));
> +    nargs = be32_to_cpu(args.nargs);
> +    nret = be32_to_cpu(args.nret);
> +    readstr(be32_to_cpu(args.service), service);
> +    servicelen = strlen(service);
> +
> +#define cmpservice(s, a, r) \
> +        _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
> +
> +    if (cmpservice("finddevice", 1, 1)) {

Hrm.  Rather than this giant chain of if(cmp) checks, I'd prefer to
have a table of services as a data structure, which the code would
scan through.  That does mean you'd have to give the functions
implementing each service a common signature, so they'd need to parse
out their own parameters, but I still think it's the better option.
It also makes it closer to how we implement RTAS call dispatch.

> +        args.args[nargs] = client_finddevice(sm->fdt_blob,
> +                                             be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("getprop", 4, 1)) {
> +        args.args[nargs] = client_getprop(sm->fdt_blob,
> +                                          be32_to_cpu(args.args[0]),
> +                                          be32_to_cpu(args.args[1]),
> +                                          be32_to_cpu(args.args[2]),
> +                                          be32_to_cpu(args.args[3]));
> +    } else if (cmpservice("getproplen", 2, 1)) {
> +        args.args[nargs] = client_getproplen(sm->fdt_blob,
> +                                             be32_to_cpu(args.args[0]),
> +                                             be32_to_cpu(args.args[1]));
> +    } else if (cmpservice("setprop", 4, 1)) {
> +        args.args[nargs] = client_setprop(sm,
> +                                          be32_to_cpu(args.args[0]),
> +                                          be32_to_cpu(args.args[1]),
> +                                          be32_to_cpu(args.args[2]),
> +                                          be32_to_cpu(args.args[3]));
> +    } else if (cmpservice("instance-to-path", 3, 1)) {
> +        args.args[nargs] = client_instance_to_path(sm,
> +                                                   be32_to_cpu(args.args[0]),
> +                                                   be32_to_cpu(args.args[1]),
> +                                                   be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("package-to-path", 3, 1)) {
> +        args.args[nargs] = client_package_to_path(sm->fdt_blob,
> +                                                  be32_to_cpu(args.args[0]),
> +                                                  be32_to_cpu(args.args[1]),
> +                                                  be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("write", 3, 1)) {
> +        args.args[nargs] = client_write(be32_to_cpu(args.args[0]),
> +                                        be32_to_cpu(args.args[1]),
> +                                        be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("peer", 1, 1)) {
> +        args.args[nargs] = client_peer(sm->fdt_blob,
> +                                       be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("child", 1, 1)) {
> +        args.args[nargs] = client_child(sm->fdt_blob,
> +                                        be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("parent", 1, 1)) {
> +        args.args[nargs] = client_parent(sm->fdt_blob,
> +                                         be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("open", 1, 1)) {
> +        args.args[nargs] = client_open(be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("call-method", 3, 2)) {
> +        uint32_t ret2 = 0;
> +
> +        args.args[nargs] = client_call_method(sm,
> +                                              be32_to_cpu(args.args[0]),
> +                                              be32_to_cpu(args.args[1]),
> +                                              be32_to_cpu(args.args[2]),
> +                                              &ret2);
> +        args.args[nargs + 1] = ret2;
> +    } else if (cmpservice("claim", 3, 1)) {
> +        args.args[nargs] = client_claim(sm,
> +                                        be32_to_cpu(args.args[0]),
> +                                        be32_to_cpu(args.args[1]),
> +                                        be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("nextprop", 3, 1)) {
> +        args.args[nargs] = client_nextprop(sm->fdt_blob,
> +                                           be32_to_cpu(args.args[0]),
> +                                           be32_to_cpu(args.args[1]),
> +                                           be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("quiesce", 0, 0)) {
> +        client_quiesce(sm);
> +    } else if (cmpservice("exit", 0, 0)) {
> +        error_report("Stopped as the VM requested \"exit\"");
> +        vm_stop(RUN_STATE_PAUSED);
> +    } else {
> +        trace_spapr_client_error_unknown_service(service, nargs, nret);
> +        args.args[nargs] = -1;
> +    }
> +
> +    for (i = 0; i < nret; ++i) {
> +        args.args[nargs + i] = be32_to_cpu(args.args[nargs + i]);
> +    }
> +    cpu_physical_memory_write(client_args, &args, sizeof(args));
> +
> +    return H_SUCCESS;
> +}
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index f1799b1b707d..76f93ec37959 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1660,15 +1660,12 @@ static bool spapr_hotplugged_dev_before_cas(void)
>      return false;
>  }
>  
> -static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> -                                                  SpaprMachineState *spapr,
> -                                                  target_ulong opcode,
> -                                                  target_ulong *args)
> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> +                                            SpaprMachineState *spapr,
> +                                            target_ulong addr,
> +                                            target_ulong fdt_buf,
> +                                            target_ulong fdt_bufsize)
>  {
> -    /* Working address in data buffer */
> -    target_ulong addr = ppc64_phys_to_real(args[0]);
> -    target_ulong fdt_buf = args[1];
> -    target_ulong fdt_bufsize = args[2];
>      target_ulong ov_table;
>      uint32_t cas_pvr;
>      SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
> @@ -1833,13 +1830,15 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>  
>          fdt_bufsize -= sizeof(hdr);
>  
> -        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
> +        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
>          _FDT((fdt_pack(fdt)));
>  
> -        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> -        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
> -                                  fdt_totalsize(fdt));
> -        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
> +        if (spapr->bios_enabled) {
> +            cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> +            cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
> +                                      fdt_totalsize(fdt));
> +            trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
> +        }
>  
>          g_free(spapr->fdt_blob);
>          spapr->fdt_size = fdt_totalsize(fdt);
> @@ -1854,6 +1853,20 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>      return H_SUCCESS;
>  }
>  
> +static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> +                                                  SpaprMachineState *spapr,
> +                                                  target_ulong opcode,
> +                                                  target_ulong *args)
> +{
> +    /* Working address in data buffer */
> +    target_ulong addr = ppc64_phys_to_real(args[0]);
> +    target_ulong fdt_buf = args[1];
> +    target_ulong fdt_bufsize = args[2];
> +
> +    return do_client_architecture_support(cpu, spapr, addr, fdt_buf,
> +                                          fdt_bufsize);
> +}
> +
>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>                                                SpaprMachineState *spapr,
>                                                target_ulong opcode,
> @@ -1998,6 +2011,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                             target_ulong opcode, target_ulong *args)
> +{
> +    target_ulong client_args = ppc64_phys_to_real(args[0]);
> +
> +    return spapr_h_client(spapr, client_args);
> +}
> +
>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
> @@ -2121,6 +2142,8 @@ static void hypercall_register_types(void)
>  
>      spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
>  
> +    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
> +
>      /* Virtual Processor Home Node */
>      spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
>                               h_home_node_associativity);
> diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
> index 9ea620f23c85..92bce3e44693 100644
> --- a/hw/ppc/trace-events
> +++ b/hw/ppc/trace-events
> @@ -21,6 +21,15 @@ spapr_update_dt(unsigned cb) "New blob %u bytes"
>  spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>  spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>  
> +# spapr_client.c
> +spapr_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
> +spapr_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
> +spapr_client_error_unknown_method(const char *method) "%s"
> +spapr_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
> +spapr_client_method(const char *method, uint32_t param1, uint32_t param2, uint32_t ret) "%s(0x%x, 0x%x) => 0x%x"
> +spapr_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> +spapr_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> +
>  # spapr_hcall_tpm.c
>  spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
>  spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06  4:19 ` David Gibson
@ 2020-01-06  6:28   ` Alexey Kardashevskiy
  2020-01-06  8:50     ` David Gibson
  2020-01-06 13:39   ` Alexey Kardashevskiy
  2020-01-07  4:44   ` Alexey Kardashevskiy
  2 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-06  6:28 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 06/01/2020 15:19, David Gibson wrote:
> On Mon, Jan 06, 2020 at 10:42:42AM +1100, Alexey Kardashevskiy wrote:
>> The Petitboot bootloader is way more advanced than SLOF is ever going to
>> be as Petitboot comes with the full-featured Linux kernel with all
>> the drivers, and initramdisk with quite user friendly interface.
>> The problem with ditching SLOF is that an unmodified pseries kernel can
>> either start via:
>> 1. kexec, this requires presence of RTAS and skips
>> ibm,client-architecture-support entirely;
>> 2. normal boot, this heavily relies on the OF1275 client interface to
>> fetch the device tree and do early setup (claim memory).
>>
>> This adds a new bios-less mode to the pseries machine: "bios=on|off".
>> When enabled, QEMU does not load SLOF and jumps to the kernel from
>> "-kernel".
>>
>> The client interface is implemented exactly as RTAS - a 20 bytes blob,
>> right next after the RTAS blob. The entry point is passed to the kernel
>> via GPR5.
>>
>> This implements a handful of client interface methods just to get going.
>> In particular, this implements the device tree fetching,
>> ibm,client-architecture-support and instantiate-rtas.
>>
>> This implements changing FDT properties only for "linux,rtas-base" and
>> "linux,rtas-entry", just to get early boot going.
>>
>> This assigns "phandles" to device tree nodes as there is no more SLOF
>> and OF nodes addresses of which served as phandle values.
>> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
>> phandles are regenerated at every FDT rebuild.
>>
>> When bios=off, this adds "/chosen" every time QEMU builds a tree.
>>
>> This implements "claim" which the client (Linux) uses for memory
>> allocation; this is also  used by QEMU for claiming kernel/initrd images,
>> client interface entry point, RTAS and the initial stack.
>>
>> While at this, add a "kernel-addr" machine parameter to allow moving
>> the kernel in memory. This is useful for debugging if the kernel is
>> loaded at @0, although not necessary.
>>
>> This does not implement instances properly but in order to boot a VM,
>> we only really need a stdout instance and the "/" instance for
>> "call-method", we fake these.
> 
> As we've discussed, I really like the idea of this.  It think the
> basic approach looks good too.
> 
> As you probably realize, there are quite a lot of things to be
> polished though.  Comments below.
> 
>>
>> The test command line:
>>
>> qemu-system-ppc64 \
>> -nodefaults \
>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>> -mon id=MON0,chardev=STDIO0,mode=readline \
>> -nographic \
>> -vga none \
>> -kernel vmldbg \
>> -machine pseries,bios=off \
>> -m 4G \
>> -enable-kvm \
>> -initrd pb/rootfs.cpio.xz \
>> img/u1804-64le.qcow2 \
>> -snapshot \
>> -smp 8,threads=8 \
>> -L /home/aik/t/qemu-ppc64-bios/ \
>> -trace events=qemu_trace_events \
>> -d guest_errors \
>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh37874 \
>> -mon chardev=SOCKET0,mode=control
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v2:
>> * fixed claim()
>> * added "setprop"
>> * cleaner client interface and RTAS blobs management
>> * boots to petitboot and further to the target system
>> * more trace points
>> ---
>>  hw/ppc/Makefile.objs   |   1 +
>>  include/hw/loader.h    |   1 +
>>  include/hw/ppc/spapr.h |  25 ++-
>>  hw/ppc/spapr.c         | 231 ++++++++++++++++++--
>>  hw/ppc/spapr_client.c  | 464 +++++++++++++++++++++++++++++++++++++++++
>>  hw/ppc/spapr_hcall.c   |  49 +++--
>>  hw/ppc/trace-events    |   9 +
>>  7 files changed, 743 insertions(+), 37 deletions(-)
>>  create mode 100644 hw/ppc/spapr_client.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index 101e9fc59185..ce31c0acd2fb 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>>  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>>  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
>>  obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
>> +obj-$(CONFIG_PSERIES) += spapr_client.o
> 
> This applies in a bunch of places here.  Just calling things "client"
> is clear enough if you're already thinking about Open Firmware.  But
> this appears in a bunch of places where you might come across it
> without that context, in which case it could be kind of confusing.  So
> I'd suggest using "of_client" or "of_client_interface" in most places
> you're using "client".
> 
>>  obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>>  # IBM PowerNV
>>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>> diff --git a/include/hw/loader.h b/include/hw/loader.h
>> index 48a96cd55904..a2f58077a47e 100644
>> --- a/include/hw/loader.h
>> +++ b/include/hw/loader.h
>> @@ -262,6 +262,7 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
>>  int rom_add_elf_program(const char *name, GMappedFile *mapped_file, void *data,
>>                          size_t datasize, size_t romsize, hwaddr addr,
>>                          AddressSpace *as);
>> +bool rom_intersect(uint64_t start, uint64_t size);
> 
> I don't see this called, so I'm guessing it's a stray chunk leftover.
> 
>>  int rom_check_and_register_reset(void);
>>  void rom_set_fw(FWCfgState *f);
>>  void rom_set_order_override(int order);
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 61f005c6f686..876879d12029 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
>>      uint8_t caps[SPAPR_CAP_NUM];
>>  };
>>  
>> +typedef struct {
>> +    uint64_t start;
>> +    uint64_t size;
>> +} spapr_of_claimed;
> 
> Please use the qemu convention of StudlyCaps for types.
> 
>>  /**
>>   * SpaprMachineClass:
>>   */
>> @@ -158,8 +163,13 @@ struct SpaprMachineState {
>>      uint32_t fdt_size;
>>      uint32_t fdt_initial_size;
>>      void *fdt_blob;
>> +    uint32_t rtas_base;
>>      long kernel_size;
>>      bool kernel_le;
>> +    uint64_t kernel_addr;
> 
> This would intersect with your other patch to allow changing the
> kernel load address, yes?


This includes kernel_addr, it is a single patch now. More about this below.


> 
>> +    bool bios_enabled;
>> +    GArray *claimed; /* array of spapr_of_claimed */
>> +    uint64_t claimed_base;
> 
> I'd suggest moving all the client interface related fields to a
> visually separated chunk in the structure definition.
> 
>>      uint32_t initrd_base;
>>      long initrd_size;
>>      uint64_t rtc_offset; /* Now used only during incoming migration */
>> @@ -510,7 +520,8 @@ struct SpaprMachineState {
>>  /* Client Architecture support */
>>  #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
>>  #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
>> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
>> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
>> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
>>  
>>  /*
>>   * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
>> @@ -538,6 +549,12 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
>>  target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
>>                               target_ulong *args);
>>  
>> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>> +                                            SpaprMachineState *spapr,
>> +                                            target_ulong addr,
>> +                                            target_ulong fdt_buf,
>> +                                            target_ulong fdt_bufsize);
>> +
>>  /* Virtual Processor Area structure constants */
>>  #define VPA_MIN_SIZE           640
>>  #define VPA_SIZE_OFFSET        0x4
>> @@ -769,6 +786,9 @@ struct SpaprEventLogEntry {
>>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
>>  void spapr_events_init(SpaprMachineState *sm);
>>  void spapr_dt_events(SpaprMachineState *sm, void *fdt);
>> +uint64_t spapr_do_client_claim(SpaprMachineState *sm, uint64_t virt,
>> +                               uint64_t size, uint64_t align);
>> +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args);
>>  void close_htab_fd(SpaprMachineState *spapr);
>>  void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
>>  void spapr_free_hpt(SpaprMachineState *spapr);
>> @@ -891,4 +911,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
>>  #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
>>  
>>  void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
>> +
>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
>> +
>>  #endif /* HW_SPAPR_H */
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index e62c89b3dd40..1c97534a0aea 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -896,6 +896,51 @@ out:
>>      return ret;
>>  }
>>  
>> +/*
>> + * Below is a compiled version of RTAS blob and OF client interface entry point.
>> + *
>> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
>> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
>> + *
>> + *   .globl  _start
>> + *   _start:
>> + *           mr      4,3
>> + *           lis     3,KVMPPC_H_RTAS@h
>> + *           ori     3,3,KVMPPC_H_RTAS@l
>> + *           sc      1
>> + *           blr
>> + *           mr      4,3
>> + *           lis     3,KVMPPC_H_CLIENT@h
>> + *           ori     3,3,KVMPPC_H_CLIENT@l
>> + *           sc      1
>> + *           blr
>> + */
>> +static struct {
>> +    uint8_t rtas[20], client[20];
>> +} QEMU_PACKED rtas_client_blob = {
>> +    .rtas = {
>> +        0x7c, 0x64, 0x1b, 0x78,
>> +        0x3c, 0x60, 0x00, 0x00,
>> +        0x60, 0x63, 0xf0, 0x00,
>> +        0x44, 0x00, 0x00, 0x22,
>> +        0x4e, 0x80, 0x00, 0x20
>> +    },
>> +    .client = {
>> +        0x7c, 0x64, 0x1b, 0x78,
>> +        0x3c, 0x60, 0x00, 0x00,
>> +        0x60, 0x63, 0xf0, 0x05,
>> +        0x44, 0x00, 0x00, 0x22,
>> +        0x4e, 0x80, 0x00, 0x20
>> +    }
>> +};
> 
> I'd really prefer to read this in from a file (or files) which we
> assemble (as with the existing spapr-rtas.img), rather than having a
> magic array in qemu.

Two considerations here:
1. This blob is not going to change (at least often)
2. A new file which needs to be packaged/signed/copied.



>> +
>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
>> +{
>> +    spapr->rtas_base = base;
>> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
>> +                              sizeof(rtas_client_blob.rtas));
>> +}
>> +
>>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>  {
>>      MachineState *ms = MACHINE(spapr);
>> @@ -980,6 +1025,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>      _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
>>                       lrdr_capacity, sizeof(lrdr_capacity)));
>>  
>> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-base", spapr->rtas_base));
>> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-entry", spapr->rtas_base));
> 
> IIUC, we shouldn't need to add these in bios=off mode, yes?  Instead
> the client must instantiate-rtas before attempting to use it.


Well. Here is what happens:

1. spapr machine reset creates FDT, lets say without rtas-entry
2. linux calls ibm,client-interface-support
3. the QEMU handler rebuilds the FDT
4. linux calls "instantiate-rtas"
5. linux stores "linux,rtas-base" and "linux,rtas-entry" in the FDT
using "setprop" client call
6. linux fetches the tree by traversing it and packs it into FDT blob itself
7. linux calls quiesce
8. linux uses "linux,rtas-entry" from FDT to call RTAS.

Because in 5) I only allow in-place property update and only for
"linux,rtas-base" and "linux,rtas-entry", these do not make to the tree
and there is no "linux,rtas-entry" in the FDT.

There are choices:

1. I can extend "setprop" (tricky as FDT is flattened by then?)
2. simply store rtas base in spapr->rtas_base and do complete FDT rebuild
3. (currently) I basically have 2 placeholders for these
"linux,rtas-base" and "linux,rtas-entry".


Or we can avoid flattening FDT at all for bios=off (as nobody wants
flattened device tree, my "fetch-fdt" never made it to upstream Linux),
then 1) is easy (if it is a problem to begin with).



>> +    _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
>> +                          sizeof(rtas_client_blob.rtas)));
>> +    _FDT(fdt_add_mem_rsv(fdt, spapr->rtas_base,
>> +                         sizeof(rtas_client_blob.rtas)));
> 
> Nor should we need the mem_rsv with bios=off.

True. If not the "setprop" limitation from above, I would only need
"rtas-size" here.


> 
>> +
>>      spapr_dt_rtas_tokens(fdt, rtas);
>>  }
>>  
>> @@ -1057,7 +1109,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>>      }
>>  
>>      if (spapr->kernel_size) {
>> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
>> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
>>                                cpu_to_be64(spapr->kernel_size) };
>>  
>>          _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
>> @@ -1104,6 +1156,9 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>>          _FDT(fdt_setprop_cell(fdt, chosen, "linux,pci-probe-only", 0));
>>      }
>>  
>> +    if (!spapr->bios_enabled) {
>> +        _FDT(fdt_setprop_cell(fdt, chosen, "cpu", 0));
> 
> What is this for?

Looks like a leftover, I did this for prom_find_boot_cpu() but after
double checking and trying it looks like it can proceed without this chunk.


> 
>> +    }
>>      spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>>  
>>      g_free(stdout_path);
>> @@ -1133,6 +1188,28 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
>>      }
>>  }
>>  
>> +static void spapr_fdt_add_phandles(void *fdt, int parent, uint32_t *phandle)
>> +{
>> +    int proplen = 0, node;
>> +    const void *prop;
>> +    char tmp[256] = "BAAD";
>> +
>> +    fdt_get_path(fdt, parent, tmp, sizeof(tmp));
> 
> You don't actually appear to use tmp.  Is this a debugging leftover?

Yes it is.


> 
>> +    prop = fdt_getprop_namelen(fdt, parent, "phandle", 7, &proplen);
>> +    if (!prop) {
>> +        _FDT(fdt_setprop_cell(fdt, parent, "phandle", *phandle));
>> +        while (1) {
>> +            ++*phandle;
>> +            if (fdt_node_offset_by_phandle(fdt, *phandle) < 0) {
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    fdt_for_each_subnode(node, fdt, parent) {
>> +        spapr_fdt_add_phandles(fdt, node, phandle);
>> +    }
> 
> Oof, this is pretty inefficient, since fdt_node_offset_by_phandle()
> will have to scan the whole tree, and you do it on every node.


Inefficient indeed but for a VM with ~30 devices it is not that bad :)


> Obviously, a "live" tree structure rather than fdt would be better
> suited for this.  I'd like to move to that at some point, but it will
> be a while before we're ready.

I could have a cache/hashmap of phandles...


> I guess this is fine for a PoC, but I think we can do a bit better
> without too much extra complexity:
>   * Rather than scanning for an existing node with the handle on each
>     iteration, you can use fdt_find_max_phandle() once beforehand,
>     and start your counter from there, being assured that it won't hit
>     any existing phandles

I'd like to avoid that as NPU/GPU phandles are quite huge and we will
have huge gaps between phandles numbers which is not so much of a big
deal but annoying to debug.


>   * The phandles are arbitrary, so you don't actually care about the
>     tree structure here.  So instead of recursively descending the
>     tree, you can just alter each node in the order they appear in the
>     blob.  fdt_next_node() will let you do that, simply by ignoring
>     the 'depth' parameter.  [In fact the structure ensures that will
>     be a depth first traversal of the tree, just like you have here,
>     not that it matters].

... but this one is better.

Unlike many others, fdt_next_node() does not have a comment in
dtc/libfdt/libfdt.h so I was not sure about what *depth is for and I
basically ignored fdt_next_node().


>   * You should check for overflows of your phandle counter, though
>     it's fine to treat that as a fatal error.

Ok.


>> +}
>> +
>>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>  {
>>      MachineState *machine = MACHINE(spapr);
>> @@ -1245,7 +1322,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>      /* Build memory reserve map */
>>      if (reset) {
>>          if (spapr->kernel_size) {
>> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
>> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
>> +                                  spapr->kernel_size)));
>>          }
>>          if (spapr->initrd_size) {
>>              _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
>> @@ -1268,12 +1346,19 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>          }
>>      }
>>  
>> +    if (!spapr->bios_enabled) {
>> +        uint32_t phandle = 1;
>> +
>> +        spapr_fdt_add_phandles(fdt, fdt_path_offset(fdt, "/"), &phandle);
>> +    }
>> +
>>      return fdt;
>>  }
>>  
>>  static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
>>  {
>> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
>> +    SpaprMachineState *spapr = opaque;
>> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
>>  }
>>  
>>  static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
>> @@ -1660,6 +1745,60 @@ static void spapr_machine_reset(MachineState *machine)
>>       */
>>      fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
>>  
>> +    /* Set up the entry state */
>> +    if (!spapr->bios_enabled) {
>> +        if (spapr->claimed) {
>> +            g_array_unref(spapr->claimed);
>> +        }
>> +        spapr->claimed = g_array_new(false, false, sizeof(spapr_of_claimed));
>> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
>> +
>> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr, fdt_addr);
>> +
>> +        if (spapr_do_client_claim(spapr, spapr->kernel_addr,
>> +                                  spapr->kernel_size, 0) == -1) {
>> +            error_report("Memory for kernel is in use");
>> +            exit(1);
>> +        }
>> +        if (spapr_do_client_claim(spapr, spapr->initrd_base,
>> +                                  spapr->initrd_size, 0) == -1) {
>> +            error_report("Memory for initramdisk is in use");
>> +            exit(1);
>> +        }
>> +        first_ppc_cpu->env.gpr[1] = spapr_do_client_claim(spapr, 0, 0x40000,
>> +                                                          0x10000);
>> +        if (first_ppc_cpu->env.gpr[1] == -1) {
>> +            error_report("Memory for stack is in use");
>> +            exit(1);
>> +        }
>> +
>> +        first_ppc_cpu->env.gpr[5] =
>> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
>> +                                  sizeof(rtas_client_blob.client));
>> +        if (first_ppc_cpu->env.gpr[5] == -1) {
>> +            error_report("Memory for OF client is in use");
>> +            exit(1);
>> +        }
>> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
>> +                                  rtas_client_blob.client,
>> +                                  sizeof(rtas_client_blob.client));
>> +
>> +        /* Allocate space for RTAS here so spapr_build_fdt() below picks it */
>> +        spapr->rtas_base =
>> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.rtas),
>> +                                  0x100);
>> +        if (spapr->rtas_base == -1) {
>> +            error_report("Memory for RTAS is in use");
>> +            exit(1);
>> +        }
>> +        cpu_physical_memory_write(spapr->rtas_base,
>> +                                  rtas_client_blob.rtas,
>> +                                  sizeof(rtas_client_blob.rtas));
> 
> The above is complex enough it might be worth having bios and nobios
> variants of spapr_cpu_set_entry_state().


Probably yes, when it settles. I'd rather prefer having kernel/initrd
claiming code where we load them but these are Roms and when we "load"
them, they are not loaded, they are loaded from a machine reset handler but:

may be we could add a hook to Roms to be called every time an image is
actually written to the guest memory and

this is when I would "claim" these chunks and

this would make the chunk above shorter and we won't need a separate
spapr_cpu_set_entry_state().


>> +    } else {
>> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
>> +    }
>> +
>>      fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
>>  
>>      rc = fdt_pack(fdt);
>> @@ -1675,10 +1814,14 @@ static void spapr_machine_reset(MachineState *machine)
>>      spapr->fdt_initial_size = spapr->fdt_size;
>>      spapr->fdt_blob = fdt;
>>  
>> -    /* Set up the entry state */
>> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>> -    first_ppc_cpu->env.gpr[5] = 0;
>> -
>> +    if (!spapr->bios_enabled) {
>> +        /*
>> +         * Claim FDT space so initramdisk/zImage are not unpacked over it.
>> +         * In practice Linux claim additional memory for FDT and fetches it
>> +         * using client interface's "getprop" call.
>> +         */
> 
> IIUC, it shouldn't actually be necessary to put the fdt into guest
> memory in nobios mode - it will be retrieved piecemeal via the client
> interface instead, won't it?

Huh. I was under the impression that we pass the FDT to a VM via GPR3
and it is safer to avoid "claim" from claiming this memory (even though
Linux is not using it).

Now I realize that this GPR3 is QEMU<->SLOF convention (or some OFW
binding - what is it exactly?) and Linux rather expects initramdisk
start/size in GPR3/4 and this did not hit me just because GPR4 is always
0 so Linux falls back to the device tree for initramdisk location.




> 
>> +        spapr_do_client_claim(spapr, fdt_addr, fdt_totalsize(fdt), 0);
>> +    }
>>      spapr->cas_reboot = false;
>>  }
>>  
>> @@ -2897,12 +3040,12 @@ static void spapr_machine_init(MachineState *machine)
>>          uint64_t lowaddr = 0;
>>  
>>          spapr->kernel_size = load_elf(kernel_filename, NULL,
>> -                                      translate_kernel_address, NULL,
>> +                                      translate_kernel_address, spapr,
>>                                        NULL, &lowaddr, NULL, 1,
>>                                        PPC_ELF_MACHINE, 0, 0);
>>          if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
>>              spapr->kernel_size = load_elf(kernel_filename, NULL,
>> -                                          translate_kernel_address, NULL, NULL,
>> +                                          translate_kernel_address, spapr, NULL,
>>                                            &lowaddr, NULL, 0, PPC_ELF_MACHINE,
>>                                            0, 0);
>>              spapr->kernel_le = spapr->kernel_size > 0;
>> @@ -2918,7 +3061,7 @@ static void spapr_machine_init(MachineState *machine)
>>              /* Try to locate the initrd in the gap between the kernel
>>               * and the firmware. Add a bit of space just in case
>>               */
>> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
>> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
>>                                    + 0x1ffff) & ~0xffff;
>>              spapr->initrd_size = load_image_targphys(initrd_filename,
>>                                                       spapr->initrd_base,
>> @@ -2932,20 +3075,22 @@ static void spapr_machine_init(MachineState *machine)
>>          }
>>      }
>>  
>> -    if (bios_name == NULL) {
>> -        bios_name = FW_FILE_NAME;
>> +    if (spapr->bios_enabled) {
>> +        if (bios_name == NULL) {
>> +            bios_name = FW_FILE_NAME;
>> +        }
>> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>> +        if (!filename) {
>> +            error_report("Could not find LPAR firmware '%s'", bios_name);
>> +            exit(1);
>> +        }
>> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>> +        if (fw_size <= 0) {
>> +            error_report("Could not load LPAR firmware '%s'", filename);
>> +            exit(1);
>> +        }
>> +        g_free(filename);
>>      }
>> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>> -    if (!filename) {
>> -        error_report("Could not find LPAR firmware '%s'", bios_name);
>> -        exit(1);
>> -    }
>> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>> -    if (fw_size <= 0) {
>> -        error_report("Could not load LPAR firmware '%s'", filename);
>> -        exit(1);
>> -    }
>> -    g_free(filename);
>>  
>>      /* FIXME: Should register things through the MachineState's qdev
>>       * interface, this is a legacy from the sPAPREnvironment structure
>> @@ -3162,6 +3307,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
>>      visit_type_uint32(v, name, (uint32_t *)opaque, errp);
>>  }
>>  
>> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
>> +                                  void *opaque, Error **errp)
>> +{
>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>> +}
>> +
>> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
>> +                                  void *opaque, Error **errp)
>> +{
>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>> +}
> 
> Might be clearer to split the adjustable kernel addr stuff out into a
> separate patch - in fact, didn't you have such a patch before?


I did but it was not useful on its own and the whole "bios=off" feature
does not need to be "series" at this stage. Also having it separate
(before or after "kill slof") means changing same lines in 2 patches of
the same patchset which is never really good.


>> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
>> +{
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>> +
>> +    return spapr->bios_enabled;
>> +}
>> +
>> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
>> +{
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>> +
>> +    spapr->bios_enabled = value;
>> +}
>> +
>>  static char *spapr_get_ic_mode(Object *obj, Error **errp)
>>  {
>>      SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>> @@ -3267,6 +3438,20 @@ static void spapr_instance_init(Object *obj)
>>      object_property_add_bool(obj, "vfio-no-msix-emulation",
>>                               spapr_get_msix_emulation, NULL, NULL);
>>  
>> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
>> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
>> +                        &error_abort);
>> +    object_property_set_description(obj, "kernel-addr",
>> +                                    stringify(KERNEL_LOAD_ADDR)
>> +                                    " for -kernel is the default",
>> +                                    NULL);
>> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
>> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
>> +                            spapr_set_bios_enabled, NULL);
>> +    object_property_set_description(obj, "bios", "Conrols whether to load bios",
>> +                                    NULL);
>> +    spapr->bios_enabled = true;
>> +
>>      /* The machine class defines the default interrupt controller mode */
>>      spapr->irq = smc->irq;
>>      object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
>> diff --git a/hw/ppc/spapr_client.c b/hw/ppc/spapr_client.c
>> new file mode 100644
>> index 000000000000..a318eaae3c40
>> --- /dev/null
>> +++ b/hw/ppc/spapr_client.c
>> @@ -0,0 +1,464 @@
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +#include "qapi/error.h"
>> +#include "exec/memory.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/ppc/spapr_vio.h"
>> +#include "trace.h"
>> +
>> +struct prom_args {
> 
> StudlyCaps and typedef, please..
> 
>> +        __be32 service;
>> +        __be32 nargs;
>> +        __be32 nret;
>> +        __be32 args[10];


btw we do not generally have __be32 (and x86 build fails), I am
replacing these with uint32_t.



>> +};
>> +
>> +#define CLI_PH_MASK     0x0FFFFFFF
>> +#define CLI_INST_PREFIX 0x20000000
>> +
>> +#define readstr(pa, buf) cpu_physical_memory_read((pa), (buf), sizeof(buf))
> 
> I'd prefer "readbuf" or something, since this isn't actually looking
> at any string structure.
> 
>> +#define readuint32(pa) ({ \
>> +    uint32_t __tmp__; \
>> +    cpu_physical_memory_read((pa), &__tmp__, sizeof(__tmp__)); \
>> +    be32_to_cpu(__tmp__); })
> 
> You can simplify this with ldl_be_phys().  In fact, then you probably
> don't need any wrapper at all.

Ah right. There are too many of these, was quicker to type this :)


>> +
>> +static bool _cmpservice(const char *s, size_t len,
>> +                        unsigned nargs, unsigned nret,
>> +                        const char *s1, size_t len1,
>> +                        unsigned nargscheck, unsigned nretcheck)
>> +{
>> +    if (strncmp(s, s1, MAX(len, len1))) {
>> +        return false;
>> +    }
>> +    if (nargscheck == 0 && nretcheck == 0) {
>> +        return true;
>> +    }
>> +    if (nargs != nargscheck || nret != nretcheck) {
>> +        trace_spapr_client_error_param(s, nargscheck, nretcheck, nargs, nret);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static uint32_t client_finddevice(const void *fdt, uint32_t nodeaddr)
>> +{
>> +    char node[256];
>> +    int ret;
>> +
>> +    readstr(nodeaddr, node);
>> +    ret = fdt_path_offset(fdt, node);
>> +    if (ret >= 0) {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return (uint32_t) ret;
>> +}
>> +
>> +static uint32_t client_getprop(const void *fdt, uint32_t nodeph, uint32_t pname,
>> +            uint32_t valaddr, uint32_t vallen)
>> +{
>> +    char propname[64];
>> +    uint32_t ret = 0;
>> +    int proplen = 0;
>> +    const void *prop;
>> +
>> +    readstr(pname, propname);
>> +
>> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
>> +                               propname, strlen(propname), &proplen);
> 
> You can just use fdt_getprop().  By definition it's equivalent to
> fdt_getprop_namelen(.., propname, strlen(propname), ...)
> 
>> +    if (prop) {
>> +        int cb = MIN(proplen, vallen);
>> +
>> +        cpu_physical_memory_write(valaddr, prop, cb);
>> +        ret = cb;
>> +    } else if (strncmp(propname, "stdout", 6) == 0 && vallen == sizeof(ret)) {
>> +        ret = cpu_to_be32(1);
>> +        cpu_physical_memory_write(valaddr, &ret, MIN(vallen, sizeof(ret)));
> 
> I'm guessing the special casing of stdout is to fake the OF instancing
> stuff, which I don't really understand.  Some comments to explain
> what's going on here would be good.


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/prom_init.c#n2219

This thing wants ihandle which is stored as /chosen/stdout and which we
do not have as there are no instances.

If this property is not there, prom_panic() happens.

I could implement some primitive intances, I'll look into this.

> 
>> +    } else {
>> +        ret = -1;
>> +    }
>> +    trace_spapr_client_getprop(nodeph, propname, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_setprop(SpaprMachineState *sm,
>> +                               uint32_t nodeph, uint32_t pname,
>> +                               uint32_t valaddr, uint32_t vallen)
>> +{
>> +    char propname[64];
>> +    uint32_t ret = -1;
>> +    int proplen = 0;
>> +    const void *prop;
>> +
>> +    readstr(pname, propname);
>> +    if (vallen == sizeof(uint32_t) &&
>> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
>> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
>> +
>> +        sm->rtas_base = readuint32(valaddr);
>> +        prop = fdt_getprop_namelen(sm->fdt_blob,
>> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
>> +                                                              nodeph),
>> +                                   propname, strlen(propname), &proplen);
>> +        if (proplen == vallen) {
>> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
>> +            ret = proplen;
>> +        }
> 
> Is there a particular reason to restrict this to the rtas properties,
> rather than just allowing the guest to fdt_setprop() something
> arbitrary?

The FDT is flatten and I am not quite sure if libfdt can handle updating
properties if the length has changed.

Also, more importantly, potentially property changes like this may have
to be reflected in the QEMU device tree so I allowed only the properties
which I know how to deal with.


> 
>> +    }
>> +    trace_spapr_client_setprop(nodeph, propname, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_getproplen(const void *fdt, uint32_t nodeph,
>> +                                  uint32_t pname)
>> +{
>> +    char propname[64];
>> +    uint32_t ret = 0;
>> +    int proplen = 0;
>> +    const void *prop;
>> +
>> +    readstr(pname, propname);
>> +
>> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
>> +                               propname, strlen(propname), &proplen);
>> +    if (prop) {
>> +        ret = proplen;
>> +    } else if (strncmp(propname, "stdout", 6) == 0) {
>> +        ret = 4;
>> +    } else {
>> +        ret = -1;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_peer(const void *fdt, uint32_t phandle)
>> +{
>> +    int ret;
>> +
>> +    if (phandle == 0) {
>> +        ret = fdt_path_offset(fdt, "/");
>> +    } else {
>> +        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>> +    }
>> +
>> +    if (ret < 0) {
>> +        ret = 0;
>> +    } else {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_child(const void *fdt, uint32_t phandle)
>> +{
>> +    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>> +
>> +    if (ret < 0) {
>> +        ret = 0;
>> +    } else {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_parent(const void *fdt, uint32_t phandle)
>> +{
>> +    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>> +
>> +    if (ret < 0) {
>> +        ret = 0;
>> +    } else {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_open(uint32_t phandle)
>> +{
>> +    uint32_t ret = (phandle & CLI_PH_MASK) | CLI_INST_PREFIX;
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_instance_to_path(SpaprMachineState *sm,
>> +                                        uint32_t instance, uint32_t buf,
>> +                                        uint32_t len)
>> +{
>> +    uint32_t ph = instance & CLI_PH_MASK;
>> +    char *stdout_path = spapr_vio_stdout_path(sm->vio_bus);
>> +    int stdout_ph = fdt_path_offset(sm->fdt_blob, stdout_path);
>> +
>> +    if (ph == stdout_ph) {
>> +        len = MIN(len, strlen(stdout_path));
>> +        cpu_physical_memory_write(buf, stdout_path, len);
>> +    } else  {
>> +        len = 0;
>> +    }
>> +    g_free(stdout_path);
>> +
>> +    return len;
>> +}
>> +
>> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
>> +                                       uint32_t buf, uint32_t len)
>> +{
>> +    char tmp[256];
> 
> Fixed sized buffers are icky.  You could either dynamically allocate
> this based on the size the client gives, or you could use
> memory_region_get_ram_ptr() to read the data from the tree directly
> into guest memory.

True, will fix.


>> +    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
>> +                          sizeof(tmp))) {
>> +        tmp[sizeof(tmp) - 1] = 0;
>> +        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
>> +    }
>> +    return len;
>> +}
>> +
>> +static uint32_t client_write(uint32_t instance, uint32_t buf, uint32_t len)
>> +{
>> +    char tmp[len + 1]; /* TODO: do a loop as len might be enourmous*/
>> +
>> +    readstr(buf, tmp);
>> +    tmp[len] = 0;
>> +    printf("%s", tmp);
> 
> I'm guessing this is what's used for console output via OF.  This is
> enough for a debugging PoC, but for real we'll need to have this
> actually go via the appropriate qemu chardev.

Yes, this is for PoC. instance need to be tracked back to the actual
device and I just do not have instances now.


>> +
>> +    return len;
>> +}
>> +
>> +static bool client_claim_avail(SpaprMachineState *sm, uint64_t virt,
>> +                               uint64_t size)
>> +{
>> +    int i;
>> +    spapr_of_claimed *c;
>> +
>> +    for (i = 0; i < sm->claimed->len; ++i) {
>> +        c = &g_array_index(sm->claimed, spapr_of_claimed, i);
>> +        if ((c->start <= virt && virt < c->start + c->size) ||
>> +            (virt <= c->start && c->start < virt + size)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +uint64_t spapr_do_client_claim(SpaprMachineState *sm,
> 
> SpaprMachineState * are usually called 'spapr', not 'sm'.

Ok I'll fix but include/hw/ppc/spapr.h disagrees with you, this was the
place where I added the very first function prototype which started all
this :)


> 
>> uint64_t virt,
>> +                               uint64_t size, uint64_t align)
>> +{
>> +    uint32_t ret;
>> +    spapr_of_claimed newclaim;
>> +
>> +    if (align == 0) {
>> +        if (!client_claim_avail(sm, virt, size)) {
>> +            return -1;
>> +        }
>> +        ret = virt;
>> +    } else {
>> +        align = pow2ceil(align);
>> +        sm->claimed_base = (sm->claimed_base + align - 1) & ~(align - 1);
>> +        while (1) {
>> +            if (sm->claimed_base >= sm->rma_size) {
>> +                perror("Out of memory");
> 
> This should be an error_report() rather than a perror().
> 
>> +                return -1;
>> +            }
>> +            if (client_claim_avail(sm, sm->claimed_base, size)) {
>> +                break;
>> +            }
>> +            sm->claimed_base += size;
>> +        }
>> +        ret = sm->claimed_base;
>> +    }
>> +
>> +    sm->claimed_base = MAX(sm->claimed_base, ret) + size;
> 
> I understand the logic with claimed_base.  It looks like you're
> implementing a bump allocator, but isn't the the client passing in the
> address to claim anyway, so why do you need an allocator?


The client passes 3 numbers - address, size, align, the address is only
used when align==0. So it is either allocating or claiming. I was not
quite sure if address==0 mode is (not) used, especially for zImage.

Having said that, I noticed that Linux always calls "claim" with
non-zero address and increments it until it succeeded so I can ignore
allocating case when called by Linux. I still need some
allocator/claimer for myself though.


> 
>> +    newclaim.start = virt;
>> +    newclaim.size = size;
>> +    g_array_append_val(sm->claimed, newclaim);
>> +    trace_spapr_client_claim(virt, size, align, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t client_claim(SpaprMachineState *sm, uint32_t virt,
>> +                             uint32_t size, uint32_t align)
>> +{
>> +    return spapr_do_client_claim(sm, virt, size, align);
> 
> Um.. why this wrapper?


It grew to this. And I may add separate claim() and alloc() and only
call claim() from here.

> 
>> +}
>> +
>> +static uint32_t client_nextprop(const void *fdt, uint32_t phandle,
>> +                                uint32_t prevaddr, uint32_t nameaddr)
>> +{
>> +    int namelen = 0;
>> +    int offset = fdt_node_offset_by_phandle(fdt, phandle);
>> +    char prev[256];
> 
> Ew, fixed sized buffer.
> 
>> +    const struct fdt_property *prop;
>> +    const char *tmp;
>> +
>> +    readstr(prevaddr, prev);
>> +    for (offset = fdt_first_property_offset(fdt, offset);
>> +         (offset >= 0);
>> +         (offset = fdt_next_property_offset(fdt, offset))) {
>> +
>> +        prop = fdt_get_property_by_offset(fdt, offset, &namelen);
> 
> fdt_getprop_by_offset() also returns the property's name without
> having to dick around with fdt_get_string() manually.


Ah you're right.


> 
>> +        if (!prop) {
>> +            return 0;
>> +        }
>> +
>> +        tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
>> +        if (prev[0] == 0 ||
> 
> == '\0' please.
> 
>> +            strncmp(prev, tmp, MAX(namelen, strlen(prev))) == 0) {
> 
> This will break very badly if the client passed a name larger than
> sizeof(prev), since readstr() won't \0 terminate in that case IIUC.
> 
>> +            if (prev[0]) {
>> +                offset = fdt_next_property_offset(fdt, offset);
>> +                if (offset < 0) {
>> +                    return 0;
>> +                }
>> +            }
>> +            prop = fdt_get_property_by_offset(fdt, offset, &namelen);
>> +            if (!prop) {
>> +                return 0;
>> +            }
>> +
>> +            tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
>> +            cpu_physical_memory_write(nameaddr, tmp, namelen + 1);
>> +            return 1;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static uint32_t client_call_method(SpaprMachineState *sm, uint32_t methodaddr,
>> +                                   uint32_t param1, uint32_t param2,
>> +                                   uint32_t *ret2)
>> +{
>> +    uint32_t ret = 0;
>> +    char method[256];
> 
> More fixed sized buffers.
> 
>> +    readstr(methodaddr, method);
>> +    if (strncmp(method, "ibm,client-architecture-support", 31) == 0) {
>> +
>> +#define FDT_MAX_SIZE            0x100000
>> +        ret = do_client_architecture_support(POWERPC_CPU(first_cpu), sm, param2,
>> +                                             0, FDT_MAX_SIZE);
>> +        *ret2 = 0;
>> +    } else if (strncmp(method, "instantiate-rtas", 16) == 0) {
>> +        uint32_t rtasbase = param2;
>> +
>> +        spapr_instantiate_rtas(sm, rtasbase);
>> +        *ret2 = rtasbase;
> 
> In both these cases you ignore param1.  I'm guessing it has the handle
> of the node on which you're calling the method?  In which case you
> really should validate it.

It is an ihandle which I do not do now.


> 
>> +    } else {
>> +        trace_spapr_client_error_unknown_method(method);
>> +        return -1;
>> +    }
>> +
>> +    trace_spapr_client_method(method, param1, param2, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static void client_quiesce(SpaprMachineState *sm)
>> +{
> 
> Possibly want a flush of stdout here.
> 
>> +}
>> +
>> +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args)
>> +{
>> +    struct prom_args args = { 0 };
>> +    char service[64];
> 
> More fixed sized buffers.
> 
>> +    unsigned nargs, nret;
>> +    int i, servicelen;
>> +
>> +    cpu_physical_memory_read(client_args, &args, sizeof(args));
>> +    nargs = be32_to_cpu(args.nargs);
>> +    nret = be32_to_cpu(args.nret);
>> +    readstr(be32_to_cpu(args.service), service);
>> +    servicelen = strlen(service);
>> +
>> +#define cmpservice(s, a, r) \
>> +        _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
>> +
>> +    if (cmpservice("finddevice", 1, 1)) {
> 
> Hrm.  Rather than this giant chain of if(cmp) checks, I'd prefer to
> have a table of services as a data structure, which the code would
> scan through.  That does mean you'd have to give the functions
> implementing each service a common signature, so they'd need to parse
> out their own parameters, but I still think it's the better option.
> It also makes it closer to how we implement RTAS call dispatch.


I seriously doubt it will make the code simpler/shorter/easier; and it
will move certain checks from compile time to run time. Dispatch tables
make sense for "call-method" but for the others not so much. Thanks,


>> +        args.args[nargs] = client_finddevice(sm->fdt_blob,
>> +                                             be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("getprop", 4, 1)) {
>> +        args.args[nargs] = client_getprop(sm->fdt_blob,
>> +                                          be32_to_cpu(args.args[0]),
>> +                                          be32_to_cpu(args.args[1]),
>> +                                          be32_to_cpu(args.args[2]),
>> +                                          be32_to_cpu(args.args[3]));
>> +    } else if (cmpservice("getproplen", 2, 1)) {
>> +        args.args[nargs] = client_getproplen(sm->fdt_blob,
>> +                                             be32_to_cpu(args.args[0]),
>> +                                             be32_to_cpu(args.args[1]));
>> +    } else if (cmpservice("setprop", 4, 1)) {
>> +        args.args[nargs] = client_setprop(sm,
>> +                                          be32_to_cpu(args.args[0]),
>> +                                          be32_to_cpu(args.args[1]),
>> +                                          be32_to_cpu(args.args[2]),
>> +                                          be32_to_cpu(args.args[3]));
>> +    } else if (cmpservice("instance-to-path", 3, 1)) {
>> +        args.args[nargs] = client_instance_to_path(sm,
>> +                                                   be32_to_cpu(args.args[0]),
>> +                                                   be32_to_cpu(args.args[1]),
>> +                                                   be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("package-to-path", 3, 1)) {
>> +        args.args[nargs] = client_package_to_path(sm->fdt_blob,
>> +                                                  be32_to_cpu(args.args[0]),
>> +                                                  be32_to_cpu(args.args[1]),
>> +                                                  be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("write", 3, 1)) {
>> +        args.args[nargs] = client_write(be32_to_cpu(args.args[0]),
>> +                                        be32_to_cpu(args.args[1]),
>> +                                        be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("peer", 1, 1)) {
>> +        args.args[nargs] = client_peer(sm->fdt_blob,
>> +                                       be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("child", 1, 1)) {
>> +        args.args[nargs] = client_child(sm->fdt_blob,
>> +                                        be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("parent", 1, 1)) {
>> +        args.args[nargs] = client_parent(sm->fdt_blob,
>> +                                         be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("open", 1, 1)) {
>> +        args.args[nargs] = client_open(be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("call-method", 3, 2)) {
>> +        uint32_t ret2 = 0;
>> +
>> +        args.args[nargs] = client_call_method(sm,
>> +                                              be32_to_cpu(args.args[0]),
>> +                                              be32_to_cpu(args.args[1]),
>> +                                              be32_to_cpu(args.args[2]),
>> +                                              &ret2);
>> +        args.args[nargs + 1] = ret2;
>> +    } else if (cmpservice("claim", 3, 1)) {
>> +        args.args[nargs] = client_claim(sm,
>> +                                        be32_to_cpu(args.args[0]),
>> +                                        be32_to_cpu(args.args[1]),
>> +                                        be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("nextprop", 3, 1)) {
>> +        args.args[nargs] = client_nextprop(sm->fdt_blob,
>> +                                           be32_to_cpu(args.args[0]),
>> +                                           be32_to_cpu(args.args[1]),
>> +                                           be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("quiesce", 0, 0)) {
>> +        client_quiesce(sm);
>> +    } else if (cmpservice("exit", 0, 0)) {
>> +        error_report("Stopped as the VM requested \"exit\"");
>> +        vm_stop(RUN_STATE_PAUSED);
>> +    } else {
>> +        trace_spapr_client_error_unknown_service(service, nargs, nret);
>> +        args.args[nargs] = -1;
>> +    }
>> +
>> +    for (i = 0; i < nret; ++i) {
>> +        args.args[nargs + i] = be32_to_cpu(args.args[nargs + i]);
>> +    }
>> +    cpu_physical_memory_write(client_args, &args, sizeof(args));
>> +
>> +    return H_SUCCESS;
>> +}
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index f1799b1b707d..76f93ec37959 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -1660,15 +1660,12 @@ static bool spapr_hotplugged_dev_before_cas(void)
>>      return false;
>>  }
>>  
>> -static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>> -                                                  SpaprMachineState *spapr,
>> -                                                  target_ulong opcode,
>> -                                                  target_ulong *args)
>> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>> +                                            SpaprMachineState *spapr,
>> +                                            target_ulong addr,
>> +                                            target_ulong fdt_buf,
>> +                                            target_ulong fdt_bufsize)
>>  {
>> -    /* Working address in data buffer */
>> -    target_ulong addr = ppc64_phys_to_real(args[0]);
>> -    target_ulong fdt_buf = args[1];
>> -    target_ulong fdt_bufsize = args[2];
>>      target_ulong ov_table;
>>      uint32_t cas_pvr;
>>      SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
>> @@ -1833,13 +1830,15 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>  
>>          fdt_bufsize -= sizeof(hdr);
>>  
>> -        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
>> +        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
>>          _FDT((fdt_pack(fdt)));
>>  
>> -        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
>> -        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
>> -                                  fdt_totalsize(fdt));
>> -        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
>> +        if (spapr->bios_enabled) {
>> +            cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
>> +            cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
>> +                                      fdt_totalsize(fdt));
>> +            trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
>> +        }
>>  
>>          g_free(spapr->fdt_blob);
>>          spapr->fdt_size = fdt_totalsize(fdt);
>> @@ -1854,6 +1853,20 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>      return H_SUCCESS;
>>  }
>>  
>> +static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>> +                                                  SpaprMachineState *spapr,
>> +                                                  target_ulong opcode,
>> +                                                  target_ulong *args)
>> +{
>> +    /* Working address in data buffer */
>> +    target_ulong addr = ppc64_phys_to_real(args[0]);
>> +    target_ulong fdt_buf = args[1];
>> +    target_ulong fdt_bufsize = args[2];
>> +
>> +    return do_client_architecture_support(cpu, spapr, addr, fdt_buf,
>> +                                          fdt_bufsize);
>> +}
>> +
>>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>>                                                SpaprMachineState *spapr,
>>                                                target_ulong opcode,
>> @@ -1998,6 +2011,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>      return H_SUCCESS;
>>  }
>>  
>> +static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
>> +                             target_ulong opcode, target_ulong *args)
>> +{
>> +    target_ulong client_args = ppc64_phys_to_real(args[0]);
>> +
>> +    return spapr_h_client(spapr, client_args);
>> +}
>> +
>>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
>>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
>>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
>> @@ -2121,6 +2142,8 @@ static void hypercall_register_types(void)
>>  
>>      spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
>>  
>> +    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
>> +
>>      /* Virtual Processor Home Node */
>>      spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
>>                               h_home_node_associativity);
>> diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
>> index 9ea620f23c85..92bce3e44693 100644
>> --- a/hw/ppc/trace-events
>> +++ b/hw/ppc/trace-events
>> @@ -21,6 +21,15 @@ spapr_update_dt(unsigned cb) "New blob %u bytes"
>>  spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>>  spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>>  
>> +# spapr_client.c
>> +spapr_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
>> +spapr_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
>> +spapr_client_error_unknown_method(const char *method) "%s"
>> +spapr_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
>> +spapr_client_method(const char *method, uint32_t param1, uint32_t param2, uint32_t ret) "%s(0x%x, 0x%x) => 0x%x"
>> +spapr_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
>> +spapr_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
>> +
>>  # spapr_hcall_tpm.c
>>  spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
>>  spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64
> 

-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06  6:28   ` Alexey Kardashevskiy
@ 2020-01-06  8:50     ` David Gibson
  2020-01-06 12:34       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2020-01-06  8:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 63549 bytes --]

On Mon, Jan 06, 2020 at 05:28:55PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 06/01/2020 15:19, David Gibson wrote:
> > On Mon, Jan 06, 2020 at 10:42:42AM +1100, Alexey Kardashevskiy wrote:
> >> The Petitboot bootloader is way more advanced than SLOF is ever going to
> >> be as Petitboot comes with the full-featured Linux kernel with all
> >> the drivers, and initramdisk with quite user friendly interface.
> >> The problem with ditching SLOF is that an unmodified pseries kernel can
> >> either start via:
> >> 1. kexec, this requires presence of RTAS and skips
> >> ibm,client-architecture-support entirely;
> >> 2. normal boot, this heavily relies on the OF1275 client interface to
> >> fetch the device tree and do early setup (claim memory).
> >>
> >> This adds a new bios-less mode to the pseries machine: "bios=on|off".
> >> When enabled, QEMU does not load SLOF and jumps to the kernel from
> >> "-kernel".
> >>
> >> The client interface is implemented exactly as RTAS - a 20 bytes blob,
> >> right next after the RTAS blob. The entry point is passed to the kernel
> >> via GPR5.
> >>
> >> This implements a handful of client interface methods just to get going.
> >> In particular, this implements the device tree fetching,
> >> ibm,client-architecture-support and instantiate-rtas.
> >>
> >> This implements changing FDT properties only for "linux,rtas-base" and
> >> "linux,rtas-entry", just to get early boot going.
> >>
> >> This assigns "phandles" to device tree nodes as there is no more SLOF
> >> and OF nodes addresses of which served as phandle values.
> >> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
> >> phandles are regenerated at every FDT rebuild.
> >>
> >> When bios=off, this adds "/chosen" every time QEMU builds a tree.
> >>
> >> This implements "claim" which the client (Linux) uses for memory
> >> allocation; this is also  used by QEMU for claiming kernel/initrd images,
> >> client interface entry point, RTAS and the initial stack.
> >>
> >> While at this, add a "kernel-addr" machine parameter to allow moving
> >> the kernel in memory. This is useful for debugging if the kernel is
> >> loaded at @0, although not necessary.
> >>
> >> This does not implement instances properly but in order to boot a VM,
> >> we only really need a stdout instance and the "/" instance for
> >> "call-method", we fake these.
> > 
> > As we've discussed, I really like the idea of this.  It think the
> > basic approach looks good too.
> > 
> > As you probably realize, there are quite a lot of things to be
> > polished though.  Comments below.
> > 
> >>
> >> The test command line:
> >>
> >> qemu-system-ppc64 \
> >> -nodefaults \
> >> -chardev stdio,id=STDIO0,signal=off,mux=on \
> >> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> >> -mon id=MON0,chardev=STDIO0,mode=readline \
> >> -nographic \
> >> -vga none \
> >> -kernel vmldbg \
> >> -machine pseries,bios=off \
> >> -m 4G \
> >> -enable-kvm \
> >> -initrd pb/rootfs.cpio.xz \
> >> img/u1804-64le.qcow2 \
> >> -snapshot \
> >> -smp 8,threads=8 \
> >> -L /home/aik/t/qemu-ppc64-bios/ \
> >> -trace events=qemu_trace_events \
> >> -d guest_errors \
> >> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh37874 \
> >> -mon chardev=SOCKET0,mode=control
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v2:
> >> * fixed claim()
> >> * added "setprop"
> >> * cleaner client interface and RTAS blobs management
> >> * boots to petitboot and further to the target system
> >> * more trace points
> >> ---
> >>  hw/ppc/Makefile.objs   |   1 +
> >>  include/hw/loader.h    |   1 +
> >>  include/hw/ppc/spapr.h |  25 ++-
> >>  hw/ppc/spapr.c         | 231 ++++++++++++++++++--
> >>  hw/ppc/spapr_client.c  | 464 +++++++++++++++++++++++++++++++++++++++++
> >>  hw/ppc/spapr_hcall.c   |  49 +++--
> >>  hw/ppc/trace-events    |   9 +
> >>  7 files changed, 743 insertions(+), 37 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_client.c
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index 101e9fc59185..ce31c0acd2fb 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
> >>  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
> >>  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
> >>  obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
> >> +obj-$(CONFIG_PSERIES) += spapr_client.o
> > 
> > This applies in a bunch of places here.  Just calling things "client"
> > is clear enough if you're already thinking about Open Firmware.  But
> > this appears in a bunch of places where you might come across it
> > without that context, in which case it could be kind of confusing.  So
> > I'd suggest using "of_client" or "of_client_interface" in most places
> > you're using "client".
> > 
> >>  obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
> >>  # IBM PowerNV
> >>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> >> diff --git a/include/hw/loader.h b/include/hw/loader.h
> >> index 48a96cd55904..a2f58077a47e 100644
> >> --- a/include/hw/loader.h
> >> +++ b/include/hw/loader.h
> >> @@ -262,6 +262,7 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
> >>  int rom_add_elf_program(const char *name, GMappedFile *mapped_file, void *data,
> >>                          size_t datasize, size_t romsize, hwaddr addr,
> >>                          AddressSpace *as);
> >> +bool rom_intersect(uint64_t start, uint64_t size);
> > 
> > I don't see this called, so I'm guessing it's a stray chunk leftover.
> > 
> >>  int rom_check_and_register_reset(void);
> >>  void rom_set_fw(FWCfgState *f);
> >>  void rom_set_order_override(int order);
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 61f005c6f686..876879d12029 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
> >>      uint8_t caps[SPAPR_CAP_NUM];
> >>  };
> >>  
> >> +typedef struct {
> >> +    uint64_t start;
> >> +    uint64_t size;
> >> +} spapr_of_claimed;
> > 
> > Please use the qemu convention of StudlyCaps for types.
> > 
> >>  /**
> >>   * SpaprMachineClass:
> >>   */
> >> @@ -158,8 +163,13 @@ struct SpaprMachineState {
> >>      uint32_t fdt_size;
> >>      uint32_t fdt_initial_size;
> >>      void *fdt_blob;
> >> +    uint32_t rtas_base;
> >>      long kernel_size;
> >>      bool kernel_le;
> >> +    uint64_t kernel_addr;
> > 
> > This would intersect with your other patch to allow changing the
> > kernel load address, yes?
> 
> 
> This includes kernel_addr, it is a single patch now. More about this below.

Hm, ok.

> >> +    bool bios_enabled;
> >> +    GArray *claimed; /* array of spapr_of_claimed */
> >> +    uint64_t claimed_base;
> > 
> > I'd suggest moving all the client interface related fields to a
> > visually separated chunk in the structure definition.
> > 
> >>      uint32_t initrd_base;
> >>      long initrd_size;
> >>      uint64_t rtc_offset; /* Now used only during incoming migration */
> >> @@ -510,7 +520,8 @@ struct SpaprMachineState {
> >>  /* Client Architecture support */
> >>  #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
> >>  #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
> >> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
> >> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
> >> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
> >>  
> >>  /*
> >>   * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
> >> @@ -538,6 +549,12 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
> >>  target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
> >>                               target_ulong *args);
> >>  
> >> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> >> +                                            SpaprMachineState *spapr,
> >> +                                            target_ulong addr,
> >> +                                            target_ulong fdt_buf,
> >> +                                            target_ulong fdt_bufsize);
> >> +
> >>  /* Virtual Processor Area structure constants */
> >>  #define VPA_MIN_SIZE           640
> >>  #define VPA_SIZE_OFFSET        0x4
> >> @@ -769,6 +786,9 @@ struct SpaprEventLogEntry {
> >>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
> >>  void spapr_events_init(SpaprMachineState *sm);
> >>  void spapr_dt_events(SpaprMachineState *sm, void *fdt);
> >> +uint64_t spapr_do_client_claim(SpaprMachineState *sm, uint64_t virt,
> >> +                               uint64_t size, uint64_t align);
> >> +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args);
> >>  void close_htab_fd(SpaprMachineState *spapr);
> >>  void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
> >>  void spapr_free_hpt(SpaprMachineState *spapr);
> >> @@ -891,4 +911,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
> >>  #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
> >>  
> >>  void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
> >> +
> >> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
> >> +
> >>  #endif /* HW_SPAPR_H */
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index e62c89b3dd40..1c97534a0aea 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -896,6 +896,51 @@ out:
> >>      return ret;
> >>  }
> >>  
> >> +/*
> >> + * Below is a compiled version of RTAS blob and OF client interface entry point.
> >> + *
> >> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
> >> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
> >> + *
> >> + *   .globl  _start
> >> + *   _start:
> >> + *           mr      4,3
> >> + *           lis     3,KVMPPC_H_RTAS@h
> >> + *           ori     3,3,KVMPPC_H_RTAS@l
> >> + *           sc      1
> >> + *           blr
> >> + *           mr      4,3
> >> + *           lis     3,KVMPPC_H_CLIENT@h
> >> + *           ori     3,3,KVMPPC_H_CLIENT@l
> >> + *           sc      1
> >> + *           blr
> >> + */
> >> +static struct {
> >> +    uint8_t rtas[20], client[20];
> >> +} QEMU_PACKED rtas_client_blob = {
> >> +    .rtas = {
> >> +        0x7c, 0x64, 0x1b, 0x78,
> >> +        0x3c, 0x60, 0x00, 0x00,
> >> +        0x60, 0x63, 0xf0, 0x00,
> >> +        0x44, 0x00, 0x00, 0x22,
> >> +        0x4e, 0x80, 0x00, 0x20
> >> +    },
> >> +    .client = {
> >> +        0x7c, 0x64, 0x1b, 0x78,
> >> +        0x3c, 0x60, 0x00, 0x00,
> >> +        0x60, 0x63, 0xf0, 0x05,
> >> +        0x44, 0x00, 0x00, 0x22,
> >> +        0x4e, 0x80, 0x00, 0x20
> >> +    }
> >> +};
> > 
> > I'd really prefer to read this in from a file (or files) which we
> > assemble (as with the existing spapr-rtas.img), rather than having a
> > magic array in qemu.
> 
> Two considerations here:
> 1. This blob is not going to change (at least often)

True, but I'd still prefer to build it from a .S rather than have the
binary written out.  We already had this until recently (when we moved
the RTAS image into SLOF), and it's really not too hard to manage.

> 2. A new file which needs to be packaged/signed/copied.

Eh, again, we had this for RTAS until very recently and it really
wasn't that much hassle.

> >> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
> >> +{
> >> +    spapr->rtas_base = base;
> >> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
> >> +                              sizeof(rtas_client_blob.rtas));
> >> +}
> >> +
> >>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> >>  {
> >>      MachineState *ms = MACHINE(spapr);
> >> @@ -980,6 +1025,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> >>      _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
> >>                       lrdr_capacity, sizeof(lrdr_capacity)));
> >>  
> >> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-base", spapr->rtas_base));
> >> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-entry", spapr->rtas_base));
> > 
> > IIUC, we shouldn't need to add these in bios=off mode, yes?  Instead
> > the client must instantiate-rtas before attempting to use it.
> 
> 
> Well. Here is what happens:
> 
> 1. spapr machine reset creates FDT, lets say without rtas-entry
> 2. linux calls ibm,client-interface-support

I assume you mean ibm,client-architecture-support?

> 3. the QEMU handler rebuilds the FDT

Oh.. not directly relevant here, but that might be an issue.  Do we
need to ensure that phandles are stable across the CAS call?

> 4. linux calls "instantiate-rtas"
> 5. linux stores "linux,rtas-base" and "linux,rtas-entry" in the FDT
> using "setprop" client call
> 6. linux fetches the tree by traversing it and packs it into FDT blob itself
> 7. linux calls quiesce
> 8. linux uses "linux,rtas-entry" from FDT to call RTAS.
> 
> Because in 5) I only allow in-place property update and only for
> "linux,rtas-base" and "linux,rtas-entry", these do not make to the tree
> and there is no "linux,rtas-entry" in the FDT.
> 
> There are choices:
> 
> 1. I can extend "setprop" (tricky as FDT is flattened by then?)

I actually don't think this should be too hard - though maybe I've
missed something, so I'd say this is my frontrunning option.

> 2. simply store rtas base in spapr->rtas_base and do complete FDT rebuild

I don't think that's viable - phandles definitely shouldn't change
across an instantiate-rtas call.

> 3. (currently) I basically have 2 placeholders for these
> "linux,rtas-base" and "linux,rtas-entry".

Hm, ok.

> Or we can avoid flattening FDT at all for bios=off (as nobody wants
> flattened device tree, my "fetch-fdt" never made it to upstream Linux),
> then 1) is easy (if it is a problem to begin with).

Right, but we need a live DT library to make it happen, which means
someone finding time to write it.

> >> +    _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
> >> +                          sizeof(rtas_client_blob.rtas)));
> >> +    _FDT(fdt_add_mem_rsv(fdt, spapr->rtas_base,
> >> +                         sizeof(rtas_client_blob.rtas)));
> > 
> > Nor should we need the mem_rsv with bios=off.
> 
> True. If not the "setprop" limitation from above, I would only need
> "rtas-size" here.
> 
> 
> > 
> >> +
> >>      spapr_dt_rtas_tokens(fdt, rtas);
> >>  }
> >>  
> >> @@ -1057,7 +1109,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
> >>      }
> >>  
> >>      if (spapr->kernel_size) {
> >> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
> >> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
> >>                                cpu_to_be64(spapr->kernel_size) };
> >>  
> >>          _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
> >> @@ -1104,6 +1156,9 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
> >>          _FDT(fdt_setprop_cell(fdt, chosen, "linux,pci-probe-only", 0));
> >>      }
> >>  
> >> +    if (!spapr->bios_enabled) {
> >> +        _FDT(fdt_setprop_cell(fdt, chosen, "cpu", 0));
> > 
> > What is this for?
> 
> Looks like a leftover, I did this for prom_find_boot_cpu() but after
> double checking and trying it looks like it can proceed without this chunk.

Ok.

> > 
> >> +    }
> >>      spapr_dt_ov5_platform_support(spapr, fdt, chosen);
> >>  
> >>      g_free(stdout_path);
> >> @@ -1133,6 +1188,28 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
> >>      }
> >>  }
> >>  
> >> +static void spapr_fdt_add_phandles(void *fdt, int parent, uint32_t *phandle)
> >> +{
> >> +    int proplen = 0, node;
> >> +    const void *prop;
> >> +    char tmp[256] = "BAAD";
> >> +
> >> +    fdt_get_path(fdt, parent, tmp, sizeof(tmp));
> > 
> > You don't actually appear to use tmp.  Is this a debugging leftover?
> 
> Yes it is.
> 
> 
> > 
> >> +    prop = fdt_getprop_namelen(fdt, parent, "phandle", 7, &proplen);
> >> +    if (!prop) {
> >> +        _FDT(fdt_setprop_cell(fdt, parent, "phandle", *phandle));
> >> +        while (1) {
> >> +            ++*phandle;
> >> +            if (fdt_node_offset_by_phandle(fdt, *phandle) < 0) {
> >> +                break;
> >> +            }
> >> +        }
> >> +    }
> >> +    fdt_for_each_subnode(node, fdt, parent) {
> >> +        spapr_fdt_add_phandles(fdt, node, phandle);
> >> +    }
> > 
> > Oof, this is pretty inefficient, since fdt_node_offset_by_phandle()
> > will have to scan the whole tree, and you do it on every node.
> 
> 
> Inefficient indeed but for a VM with ~30 devices it is not that bad :)
> 
> 
> > Obviously, a "live" tree structure rather than fdt would be better
> > suited for this.  I'd like to move to that at some point, but it will
> > be a while before we're ready.
> 
> I could have a cache/hashmap of phandles...

We could, but I doubt it's worth the trouble.  Keeping the offsets in
there up to date against setprops or other changes could actually a
fair bit of hassle, I suspect.

> > I guess this is fine for a PoC, but I think we can do a bit better
> > without too much extra complexity:
> >   * Rather than scanning for an existing node with the handle on each
> >     iteration, you can use fdt_find_max_phandle() once beforehand,
> >     and start your counter from there, being assured that it won't hit
> >     any existing phandles
> 
> I'd like to avoid that as NPU/GPU phandles are quite huge and we will
> have huge gaps between phandles numbers which is not so much of a big
> deal but annoying to debug.

Hrm.  So huge we'll actually be short of space above them?

> >   * The phandles are arbitrary, so you don't actually care about the
> >     tree structure here.  So instead of recursively descending the
> >     tree, you can just alter each node in the order they appear in the
> >     blob.  fdt_next_node() will let you do that, simply by ignoring
> >     the 'depth' parameter.  [In fact the structure ensures that will
> >     be a depth first traversal of the tree, just like you have here,
> >     not that it matters].
> 
> ... but this one is better.
> 
> Unlike many others, fdt_next_node() does not have a comment in
> dtc/libfdt/libfdt.h so I was not sure about what *depth is for and I
> basically ignored fdt_next_node().

Yeah, it's kind of subtle - mostly intended as an internal primitive -
but it suits your needs here.  It moves to the next node in the
structure block - regardless of whether it's a sibling, child or
sibling-of-ancestor.  *depth, if a pointer is supplied, is updated to
reflect which case it is.

> >   * You should check for overflows of your phandle counter, though
> >     it's fine to treat that as a fatal error.
> 
> Ok.
> 
> 
> >> +}
> >> +
> >>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> >>  {
> >>      MachineState *machine = MACHINE(spapr);
> >> @@ -1245,7 +1322,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> >>      /* Build memory reserve map */
> >>      if (reset) {
> >>          if (spapr->kernel_size) {
> >> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
> >> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
> >> +                                  spapr->kernel_size)));
> >>          }
> >>          if (spapr->initrd_size) {
> >>              _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
> >> @@ -1268,12 +1346,19 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> >>          }
> >>      }
> >>  
> >> +    if (!spapr->bios_enabled) {
> >> +        uint32_t phandle = 1;
> >> +
> >> +        spapr_fdt_add_phandles(fdt, fdt_path_offset(fdt, "/"), &phandle);
> >> +    }
> >> +
> >>      return fdt;
> >>  }
> >>  
> >>  static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
> >>  {
> >> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
> >> +    SpaprMachineState *spapr = opaque;
> >> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
> >>  }
> >>  
> >>  static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
> >> @@ -1660,6 +1745,60 @@ static void spapr_machine_reset(MachineState *machine)
> >>       */
> >>      fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
> >>  
> >> +    /* Set up the entry state */
> >> +    if (!spapr->bios_enabled) {
> >> +        if (spapr->claimed) {
> >> +            g_array_unref(spapr->claimed);
> >> +        }
> >> +        spapr->claimed = g_array_new(false, false, sizeof(spapr_of_claimed));
> >> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
> >> +
> >> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr, fdt_addr);
> >> +
> >> +        if (spapr_do_client_claim(spapr, spapr->kernel_addr,
> >> +                                  spapr->kernel_size, 0) == -1) {
> >> +            error_report("Memory for kernel is in use");
> >> +            exit(1);
> >> +        }
> >> +        if (spapr_do_client_claim(spapr, spapr->initrd_base,
> >> +                                  spapr->initrd_size, 0) == -1) {
> >> +            error_report("Memory for initramdisk is in use");
> >> +            exit(1);
> >> +        }
> >> +        first_ppc_cpu->env.gpr[1] = spapr_do_client_claim(spapr, 0, 0x40000,
> >> +                                                          0x10000);
> >> +        if (first_ppc_cpu->env.gpr[1] == -1) {
> >> +            error_report("Memory for stack is in use");
> >> +            exit(1);
> >> +        }
> >> +
> >> +        first_ppc_cpu->env.gpr[5] =
> >> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
> >> +                                  sizeof(rtas_client_blob.client));
> >> +        if (first_ppc_cpu->env.gpr[5] == -1) {
> >> +            error_report("Memory for OF client is in use");
> >> +            exit(1);
> >> +        }
> >> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
> >> +                                  rtas_client_blob.client,
> >> +                                  sizeof(rtas_client_blob.client));
> >> +
> >> +        /* Allocate space for RTAS here so spapr_build_fdt() below picks it */
> >> +        spapr->rtas_base =
> >> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.rtas),
> >> +                                  0x100);
> >> +        if (spapr->rtas_base == -1) {
> >> +            error_report("Memory for RTAS is in use");
> >> +            exit(1);
> >> +        }
> >> +        cpu_physical_memory_write(spapr->rtas_base,
> >> +                                  rtas_client_blob.rtas,
> >> +                                  sizeof(rtas_client_blob.rtas));
> > 
> > The above is complex enough it might be worth having bios and nobios
> > variants of spapr_cpu_set_entry_state().
> 
> Probably yes, when it settles. I'd rather prefer having kernel/initrd
> claiming code where we load them but these are Roms and when we "load"
> them, they are not loaded, they are loaded from a machine reset handler but:
> 
> may be we could add a hook to Roms to be called every time an image is
> actually written to the guest memory and
> 
> this is when I would "claim" these chunks and
> 
> this would make the chunk above shorter and we won't need a separate
> spapr_cpu_set_entry_state().

Hm, ok.  Well let's see how it turns out.

Fwiw, I have considered replacing those load_targphys() things with
explicit write in at reset time.  load_targphys() is, IIUC, mostly
used for loading things like ROMs which exist in physical memory but
don't go into regular RAM.  So if it turns out not to really match our
needs here, that's ok.

> >> +    } else {
> >> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> >> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
> >> +    }
> >> +
> >>      fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
> >>  
> >>      rc = fdt_pack(fdt);
> >> @@ -1675,10 +1814,14 @@ static void spapr_machine_reset(MachineState *machine)
> >>      spapr->fdt_initial_size = spapr->fdt_size;
> >>      spapr->fdt_blob = fdt;
> >>  
> >> -    /* Set up the entry state */
> >> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> >> -    first_ppc_cpu->env.gpr[5] = 0;
> >> -
> >> +    if (!spapr->bios_enabled) {
> >> +        /*
> >> +         * Claim FDT space so initramdisk/zImage are not unpacked over it.
> >> +         * In practice Linux claim additional memory for FDT and fetches it
> >> +         * using client interface's "getprop" call.
> >> +         */
> > 
> > IIUC, it shouldn't actually be necessary to put the fdt into guest
> > memory in nobios mode - it will be retrieved piecemeal via the client
> > interface instead, won't it?
> 
> Huh. I was under the impression that we pass the FDT to a VM via GPR3
> and it is safer to avoid "claim" from claiming this memory (even though
> Linux is not using it).

Well, depends what you mean.  I mean that's the "fdt boot" convention,
which you've used previously for the nobios boot with an aware guest
image.  We use a similar convention for firing up SLOF itself.  But if
we're starting with a (simulated) OF entry condition, then as long as
we have a working client interface, we don't need an fdt blob anywhere
guest visible.

> Now I realize that this GPR3 is QEMU<->SLOF convention (or some OFW
> binding - what is it exactly?) and Linux rather expects initramdisk
> start/size in GPR3/4 and this did not hit me just because GPR4 is always
> 0 so Linux falls back to the device tree for initramdisk location.
> 
> 
> 
> 
> > 
> >> +        spapr_do_client_claim(spapr, fdt_addr, fdt_totalsize(fdt), 0);
> >> +    }
> >>      spapr->cas_reboot = false;
> >>  }
> >>  
> >> @@ -2897,12 +3040,12 @@ static void spapr_machine_init(MachineState *machine)
> >>          uint64_t lowaddr = 0;
> >>  
> >>          spapr->kernel_size = load_elf(kernel_filename, NULL,
> >> -                                      translate_kernel_address, NULL,
> >> +                                      translate_kernel_address, spapr,
> >>                                        NULL, &lowaddr, NULL, 1,
> >>                                        PPC_ELF_MACHINE, 0, 0);
> >>          if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
> >>              spapr->kernel_size = load_elf(kernel_filename, NULL,
> >> -                                          translate_kernel_address, NULL, NULL,
> >> +                                          translate_kernel_address, spapr, NULL,
> >>                                            &lowaddr, NULL, 0, PPC_ELF_MACHINE,
> >>                                            0, 0);
> >>              spapr->kernel_le = spapr->kernel_size > 0;
> >> @@ -2918,7 +3061,7 @@ static void spapr_machine_init(MachineState *machine)
> >>              /* Try to locate the initrd in the gap between the kernel
> >>               * and the firmware. Add a bit of space just in case
> >>               */
> >> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
> >> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
> >>                                    + 0x1ffff) & ~0xffff;
> >>              spapr->initrd_size = load_image_targphys(initrd_filename,
> >>                                                       spapr->initrd_base,
> >> @@ -2932,20 +3075,22 @@ static void spapr_machine_init(MachineState *machine)
> >>          }
> >>      }
> >>  
> >> -    if (bios_name == NULL) {
> >> -        bios_name = FW_FILE_NAME;
> >> +    if (spapr->bios_enabled) {
> >> +        if (bios_name == NULL) {
> >> +            bios_name = FW_FILE_NAME;
> >> +        }
> >> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> >> +        if (!filename) {
> >> +            error_report("Could not find LPAR firmware '%s'", bios_name);
> >> +            exit(1);
> >> +        }
> >> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> >> +        if (fw_size <= 0) {
> >> +            error_report("Could not load LPAR firmware '%s'", filename);
> >> +            exit(1);
> >> +        }
> >> +        g_free(filename);
> >>      }
> >> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> >> -    if (!filename) {
> >> -        error_report("Could not find LPAR firmware '%s'", bios_name);
> >> -        exit(1);
> >> -    }
> >> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> >> -    if (fw_size <= 0) {
> >> -        error_report("Could not load LPAR firmware '%s'", filename);
> >> -        exit(1);
> >> -    }
> >> -    g_free(filename);
> >>  
> >>      /* FIXME: Should register things through the MachineState's qdev
> >>       * interface, this is a legacy from the sPAPREnvironment structure
> >> @@ -3162,6 +3307,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
> >>      visit_type_uint32(v, name, (uint32_t *)opaque, errp);
> >>  }
> >>  
> >> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
> >> +                                  void *opaque, Error **errp)
> >> +{
> >> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> >> +}
> >> +
> >> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
> >> +                                  void *opaque, Error **errp)
> >> +{
> >> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> >> +}
> > 
> > Might be clearer to split the adjustable kernel addr stuff out into a
> > separate patch - in fact, didn't you have such a patch before?
> 
> 
> I did but it was not useful on its own and the whole "bios=off" feature
> does not need to be "series" at this stage. Also having it separate
> (before or after "kill slof") means changing same lines in 2 patches of
> the same patchset which is never really good.

Hm, ok.  What's the reason you require the changeable load location
for nobios?

> >> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
> >> +{
> >> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> >> +
> >> +    return spapr->bios_enabled;
> >> +}
> >> +
> >> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
> >> +{
> >> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> >> +
> >> +    spapr->bios_enabled = value;
> >> +}
> >> +
> >>  static char *spapr_get_ic_mode(Object *obj, Error **errp)
> >>  {
> >>      SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> >> @@ -3267,6 +3438,20 @@ static void spapr_instance_init(Object *obj)
> >>      object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>                               spapr_get_msix_emulation, NULL, NULL);
> >>  
> >> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
> >> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
> >> +                        &error_abort);
> >> +    object_property_set_description(obj, "kernel-addr",
> >> +                                    stringify(KERNEL_LOAD_ADDR)
> >> +                                    " for -kernel is the default",
> >> +                                    NULL);
> >> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
> >> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
> >> +                            spapr_set_bios_enabled, NULL);
> >> +    object_property_set_description(obj, "bios", "Conrols whether to load bios",
> >> +                                    NULL);
> >> +    spapr->bios_enabled = true;
> >> +
> >>      /* The machine class defines the default interrupt controller mode */
> >>      spapr->irq = smc->irq;
> >>      object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
> >> diff --git a/hw/ppc/spapr_client.c b/hw/ppc/spapr_client.c
> >> new file mode 100644
> >> index 000000000000..a318eaae3c40
> >> --- /dev/null
> >> +++ b/hw/ppc/spapr_client.c
> >> @@ -0,0 +1,464 @@
> >> +#include "qemu/osdep.h"
> >> +#include "qemu-common.h"
> >> +#include "qapi/error.h"
> >> +#include "exec/memory.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/ppc/spapr_vio.h"
> >> +#include "trace.h"
> >> +
> >> +struct prom_args {
> > 
> > StudlyCaps and typedef, please..
> > 
> >> +        __be32 service;
> >> +        __be32 nargs;
> >> +        __be32 nret;
> >> +        __be32 args[10];
> 
> 
> btw we do not generally have __be32 (and x86 build fails), I am
> replacing these with uint32_t.

Ok, good.

> >> +};
> >> +
> >> +#define CLI_PH_MASK     0x0FFFFFFF
> >> +#define CLI_INST_PREFIX 0x20000000
> >> +
> >> +#define readstr(pa, buf) cpu_physical_memory_read((pa), (buf), sizeof(buf))
> > 
> > I'd prefer "readbuf" or something, since this isn't actually looking
> > at any string structure.
> > 
> >> +#define readuint32(pa) ({ \
> >> +    uint32_t __tmp__; \
> >> +    cpu_physical_memory_read((pa), &__tmp__, sizeof(__tmp__)); \
> >> +    be32_to_cpu(__tmp__); })
> > 
> > You can simplify this with ldl_be_phys().  In fact, then you probably
> > don't need any wrapper at all.
> 
> Ah right. There are too many of these, was quicker to type this :)
> 
> 
> >> +
> >> +static bool _cmpservice(const char *s, size_t len,
> >> +                        unsigned nargs, unsigned nret,
> >> +                        const char *s1, size_t len1,
> >> +                        unsigned nargscheck, unsigned nretcheck)
> >> +{
> >> +    if (strncmp(s, s1, MAX(len, len1))) {
> >> +        return false;
> >> +    }
> >> +    if (nargscheck == 0 && nretcheck == 0) {
> >> +        return true;
> >> +    }
> >> +    if (nargs != nargscheck || nret != nretcheck) {
> >> +        trace_spapr_client_error_param(s, nargscheck, nretcheck, nargs, nret);
> >> +        return false;
> >> +    }
> >> +
> >> +    return true;
> >> +}
> >> +
> >> +static uint32_t client_finddevice(const void *fdt, uint32_t nodeaddr)
> >> +{
> >> +    char node[256];
> >> +    int ret;
> >> +
> >> +    readstr(nodeaddr, node);
> >> +    ret = fdt_path_offset(fdt, node);
> >> +    if (ret >= 0) {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return (uint32_t) ret;
> >> +}
> >> +
> >> +static uint32_t client_getprop(const void *fdt, uint32_t nodeph, uint32_t pname,
> >> +            uint32_t valaddr, uint32_t vallen)
> >> +{
> >> +    char propname[64];
> >> +    uint32_t ret = 0;
> >> +    int proplen = 0;
> >> +    const void *prop;
> >> +
> >> +    readstr(pname, propname);
> >> +
> >> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> >> +                               propname, strlen(propname), &proplen);
> > 
> > You can just use fdt_getprop().  By definition it's equivalent to
> > fdt_getprop_namelen(.., propname, strlen(propname), ...)
> > 
> >> +    if (prop) {
> >> +        int cb = MIN(proplen, vallen);
> >> +
> >> +        cpu_physical_memory_write(valaddr, prop, cb);
> >> +        ret = cb;
> >> +    } else if (strncmp(propname, "stdout", 6) == 0 && vallen == sizeof(ret)) {
> >> +        ret = cpu_to_be32(1);
> >> +        cpu_physical_memory_write(valaddr, &ret, MIN(vallen, sizeof(ret)));
> > 
> > I'm guessing the special casing of stdout is to fake the OF instancing
> > stuff, which I don't really understand.  Some comments to explain
> > what's going on here would be good.
> 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/prom_init.c#n2219
> 
> This thing wants ihandle which is stored as /chosen/stdout and which we
> do not have as there are no instances.
> 
> If this property is not there, prom_panic() happens.
> 
> I could implement some primitive intances, I'll look into this.

Ok, that's roughly as I guessed.  This seems like a reasonable way of
handling this for now without having to implement a notion of
instances - but it would be good to have some comment saying what's
going on here.

We might need to implement at least basic awareness of instances
if/when we need to be a bit more flexible about what input/output
devices we use for the console.

> >> +    } else {
> >> +        ret = -1;
> >> +    }
> >> +    trace_spapr_client_getprop(nodeph, propname, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_setprop(SpaprMachineState *sm,
> >> +                               uint32_t nodeph, uint32_t pname,
> >> +                               uint32_t valaddr, uint32_t vallen)
> >> +{
> >> +    char propname[64];
> >> +    uint32_t ret = -1;
> >> +    int proplen = 0;
> >> +    const void *prop;
> >> +
> >> +    readstr(pname, propname);
> >> +    if (vallen == sizeof(uint32_t) &&
> >> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
> >> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
> >> +
> >> +        sm->rtas_base = readuint32(valaddr);
> >> +        prop = fdt_getprop_namelen(sm->fdt_blob,
> >> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
> >> +                                                              nodeph),
> >> +                                   propname, strlen(propname), &proplen);
> >> +        if (proplen == vallen) {
> >> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
> >> +            ret = proplen;
> >> +        }
> > 
> > Is there a particular reason to restrict this to the rtas properties,
> > rather than just allowing the guest to fdt_setprop() something
> > arbitrary?
> 
> The FDT is flatten and I am not quite sure if libfdt can handle updating
> properties if the length has changed.

fdt_setprop() will handle updating properties with changed length (in
fact there's a special fdt_setprop_inplace() optimized for the case
where you don't need that).  It's not particularly efficient, but it
should work fine for the cases we have here.  In fact, I think you're
already relying on this for the code that adds the phandles to
everything.

One complication is that it can return FDT_ERR_NOSPACE if there isn't
enough buffer for the increased thing.  We could either trap that,
resize and retry, or we could leave a bunch of extra space.  The
latter would be basically equivalent to not doing fdt_pack() on the
blob in the nobios case.

> Also, more importantly, potentially property changes like this may have
> to be reflected in the QEMU device tree so I allowed only the properties
> which I know how to deal with.

That's a reasonable concern, but the nice thing about not having SLOF
is that there's only one copy of the device tree - the blob in qemu.
So a setprop() on that is automatically a setprop() everywhere (this
is another reason not to write the fdt into guest memory in the nobios
case - it will become stale as soon as the client changes anything).

> >> +    }
> >> +    trace_spapr_client_setprop(nodeph, propname, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_getproplen(const void *fdt, uint32_t nodeph,
> >> +                                  uint32_t pname)
> >> +{
> >> +    char propname[64];
> >> +    uint32_t ret = 0;
> >> +    int proplen = 0;
> >> +    const void *prop;
> >> +
> >> +    readstr(pname, propname);
> >> +
> >> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> >> +                               propname, strlen(propname), &proplen);
> >> +    if (prop) {
> >> +        ret = proplen;
> >> +    } else if (strncmp(propname, "stdout", 6) == 0) {
> >> +        ret = 4;
> >> +    } else {
> >> +        ret = -1;
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_peer(const void *fdt, uint32_t phandle)
> >> +{
> >> +    int ret;
> >> +
> >> +    if (phandle == 0) {
> >> +        ret = fdt_path_offset(fdt, "/");
> >> +    } else {
> >> +        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> >> +    }
> >> +
> >> +    if (ret < 0) {
> >> +        ret = 0;
> >> +    } else {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_child(const void *fdt, uint32_t phandle)
> >> +{
> >> +    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> >> +
> >> +    if (ret < 0) {
> >> +        ret = 0;
> >> +    } else {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_parent(const void *fdt, uint32_t phandle)
> >> +{
> >> +    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> >> +
> >> +    if (ret < 0) {
> >> +        ret = 0;
> >> +    } else {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_open(uint32_t phandle)
> >> +{
> >> +    uint32_t ret = (phandle & CLI_PH_MASK) | CLI_INST_PREFIX;
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_instance_to_path(SpaprMachineState *sm,
> >> +                                        uint32_t instance, uint32_t buf,
> >> +                                        uint32_t len)
> >> +{
> >> +    uint32_t ph = instance & CLI_PH_MASK;
> >> +    char *stdout_path = spapr_vio_stdout_path(sm->vio_bus);
> >> +    int stdout_ph = fdt_path_offset(sm->fdt_blob, stdout_path);
> >> +
> >> +    if (ph == stdout_ph) {
> >> +        len = MIN(len, strlen(stdout_path));
> >> +        cpu_physical_memory_write(buf, stdout_path, len);
> >> +    } else  {
> >> +        len = 0;
> >> +    }
> >> +    g_free(stdout_path);
> >> +
> >> +    return len;
> >> +}
> >> +
> >> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
> >> +                                       uint32_t buf, uint32_t len)
> >> +{
> >> +    char tmp[256];
> > 
> > Fixed sized buffers are icky.  You could either dynamically allocate
> > this based on the size the client gives, or you could use
> > memory_region_get_ram_ptr() to read the data from the tree directly
> > into guest memory.
> 
> True, will fix.
> 
> 
> >> +    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
> >> +                          sizeof(tmp))) {
> >> +        tmp[sizeof(tmp) - 1] = 0;
> >> +        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
> >> +    }
> >> +    return len;
> >> +}
> >> +
> >> +static uint32_t client_write(uint32_t instance, uint32_t buf, uint32_t len)
> >> +{
> >> +    char tmp[len + 1]; /* TODO: do a loop as len might be enourmous*/
> >> +
> >> +    readstr(buf, tmp);
> >> +    tmp[len] = 0;
> >> +    printf("%s", tmp);
> > 
> > I'm guessing this is what's used for console output via OF.  This is
> > enough for a debugging PoC, but for real we'll need to have this
> > actually go via the appropriate qemu chardev.
> 
> Yes, this is for PoC. instance need to be tracked back to the actual
> device and I just do not have instances now.
> 
> 
> >> +
> >> +    return len;
> >> +}
> >> +
> >> +static bool client_claim_avail(SpaprMachineState *sm, uint64_t virt,
> >> +                               uint64_t size)
> >> +{
> >> +    int i;
> >> +    spapr_of_claimed *c;
> >> +
> >> +    for (i = 0; i < sm->claimed->len; ++i) {
> >> +        c = &g_array_index(sm->claimed, spapr_of_claimed, i);
> >> +        if ((c->start <= virt && virt < c->start + c->size) ||
> >> +            (virt <= c->start && c->start < virt + size)) {
> >> +            return false;
> >> +        }
> >> +    }
> >> +
> >> +    return true;
> >> +}
> >> +
> >> +uint64_t spapr_do_client_claim(SpaprMachineState *sm,
> > 
> > SpaprMachineState * are usually called 'spapr', not 'sm'.
> 
> Ok I'll fix but include/hw/ppc/spapr.h disagrees with you, this was the
> place where I added the very first function prototype which started all
> this :)

Oops, I'll try to fix that.

> >> uint64_t virt,
> >> +                               uint64_t size, uint64_t align)
> >> +{
> >> +    uint32_t ret;
> >> +    spapr_of_claimed newclaim;
> >> +
> >> +    if (align == 0) {
> >> +        if (!client_claim_avail(sm, virt, size)) {
> >> +            return -1;
> >> +        }
> >> +        ret = virt;
> >> +    } else {
> >> +        align = pow2ceil(align);
> >> +        sm->claimed_base = (sm->claimed_base + align - 1) & ~(align - 1);
> >> +        while (1) {
> >> +            if (sm->claimed_base >= sm->rma_size) {
> >> +                perror("Out of memory");
> > 
> > This should be an error_report() rather than a perror().
> > 
> >> +                return -1;
> >> +            }
> >> +            if (client_claim_avail(sm, sm->claimed_base, size)) {
> >> +                break;
> >> +            }
> >> +            sm->claimed_base += size;
> >> +        }
> >> +        ret = sm->claimed_base;
> >> +    }
> >> +
> >> +    sm->claimed_base = MAX(sm->claimed_base, ret) + size;
> > 
> > I understand the logic with claimed_base.  It looks like you're
> > implementing a bump allocator, but isn't the the client passing in the
> > address to claim anyway, so why do you need an allocator?
> 
> 
> The client passes 3 numbers - address, size, align, the address is only
> used when align==0. So it is either allocating or claiming. I was not
> quite sure if address==0 mode is (not) used, especially for zImage.

Ah, ok.  Some comments to clarify that would be nice.

> Having said that, I noticed that Linux always calls "claim" with
> non-zero address and increments it until it succeeded so I can ignore
> allocating case when called by Linux. I still need some
> allocator/claimer for myself though.
> 
> 
> > 
> >> +    newclaim.start = virt;
> >> +    newclaim.size = size;
> >> +    g_array_append_val(sm->claimed, newclaim);
> >> +    trace_spapr_client_claim(virt, size, align, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t client_claim(SpaprMachineState *sm, uint32_t virt,
> >> +                             uint32_t size, uint32_t align)
> >> +{
> >> +    return spapr_do_client_claim(sm, virt, size, align);
> > 
> > Um.. why this wrapper?
> 
> 
> It grew to this. And I may add separate claim() and alloc() and only
> call claim() from here.
> 
> > 
> >> +}
> >> +
> >> +static uint32_t client_nextprop(const void *fdt, uint32_t phandle,
> >> +                                uint32_t prevaddr, uint32_t nameaddr)
> >> +{
> >> +    int namelen = 0;
> >> +    int offset = fdt_node_offset_by_phandle(fdt, phandle);
> >> +    char prev[256];
> > 
> > Ew, fixed sized buffer.
> > 
> >> +    const struct fdt_property *prop;
> >> +    const char *tmp;
> >> +
> >> +    readstr(prevaddr, prev);
> >> +    for (offset = fdt_first_property_offset(fdt, offset);
> >> +         (offset >= 0);
> >> +         (offset = fdt_next_property_offset(fdt, offset))) {
> >> +
> >> +        prop = fdt_get_property_by_offset(fdt, offset, &namelen);
> > 
> > fdt_getprop_by_offset() also returns the property's name without
> > having to dick around with fdt_get_string() manually.
> 
> 
> Ah you're right.
> 
> 
> > 
> >> +        if (!prop) {
> >> +            return 0;
> >> +        }
> >> +
> >> +        tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
> >> +        if (prev[0] == 0 ||
> > 
> > == '\0' please.
> > 
> >> +            strncmp(prev, tmp, MAX(namelen, strlen(prev))) == 0) {
> > 
> > This will break very badly if the client passed a name larger than
> > sizeof(prev), since readstr() won't \0 terminate in that case IIUC.
> > 
> >> +            if (prev[0]) {
> >> +                offset = fdt_next_property_offset(fdt, offset);
> >> +                if (offset < 0) {
> >> +                    return 0;
> >> +                }
> >> +            }
> >> +            prop = fdt_get_property_by_offset(fdt, offset, &namelen);
> >> +            if (!prop) {
> >> +                return 0;
> >> +            }
> >> +
> >> +            tmp = fdt_get_string(fdt, fdt32_ld(&prop->nameoff), &namelen);
> >> +            cpu_physical_memory_write(nameaddr, tmp, namelen + 1);
> >> +            return 1;
> >> +        }
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static uint32_t client_call_method(SpaprMachineState *sm, uint32_t methodaddr,
> >> +                                   uint32_t param1, uint32_t param2,
> >> +                                   uint32_t *ret2)
> >> +{
> >> +    uint32_t ret = 0;
> >> +    char method[256];
> > 
> > More fixed sized buffers.
> > 
> >> +    readstr(methodaddr, method);
> >> +    if (strncmp(method, "ibm,client-architecture-support", 31) == 0) {
> >> +
> >> +#define FDT_MAX_SIZE            0x100000
> >> +        ret = do_client_architecture_support(POWERPC_CPU(first_cpu), sm, param2,
> >> +                                             0, FDT_MAX_SIZE);
> >> +        *ret2 = 0;
> >> +    } else if (strncmp(method, "instantiate-rtas", 16) == 0) {
> >> +        uint32_t rtasbase = param2;
> >> +
> >> +        spapr_instantiate_rtas(sm, rtasbase);
> >> +        *ret2 = rtasbase;
> > 
> > In both these cases you ignore param1.  I'm guessing it has the handle
> > of the node on which you're calling the method?  In which case you
> > really should validate it.
> 
> It is an ihandle which I do not do now.

Ah... ok.

> >> +    } else {
> >> +        trace_spapr_client_error_unknown_method(method);
> >> +        return -1;
> >> +    }
> >> +
> >> +    trace_spapr_client_method(method, param1, param2, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static void client_quiesce(SpaprMachineState *sm)
> >> +{
> > 
> > Possibly want a flush of stdout here.
> > 
> >> +}
> >> +
> >> +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args)
> >> +{
> >> +    struct prom_args args = { 0 };
> >> +    char service[64];
> > 
> > More fixed sized buffers.
> > 
> >> +    unsigned nargs, nret;
> >> +    int i, servicelen;
> >> +
> >> +    cpu_physical_memory_read(client_args, &args, sizeof(args));
> >> +    nargs = be32_to_cpu(args.nargs);
> >> +    nret = be32_to_cpu(args.nret);
> >> +    readstr(be32_to_cpu(args.service), service);
> >> +    servicelen = strlen(service);
> >> +
> >> +#define cmpservice(s, a, r) \
> >> +        _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
> >> +
> >> +    if (cmpservice("finddevice", 1, 1)) {
> > 
> > Hrm.  Rather than this giant chain of if(cmp) checks, I'd prefer to
> > have a table of services as a data structure, which the code would
> > scan through.  That does mean you'd have to give the functions
> > implementing each service a common signature, so they'd need to parse
> > out their own parameters, but I still think it's the better option.
> > It also makes it closer to how we implement RTAS call dispatch.
> 
> 
> I seriously doubt it will make the code simpler/shorter/easier; and it
> will move certain checks from compile time to run time. Dispatch tables
> make sense for "call-method" but for the others not so much. Thanks,
> 
> 
> >> +        args.args[nargs] = client_finddevice(sm->fdt_blob,
> >> +                                             be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("getprop", 4, 1)) {
> >> +        args.args[nargs] = client_getprop(sm->fdt_blob,
> >> +                                          be32_to_cpu(args.args[0]),
> >> +                                          be32_to_cpu(args.args[1]),
> >> +                                          be32_to_cpu(args.args[2]),
> >> +                                          be32_to_cpu(args.args[3]));
> >> +    } else if (cmpservice("getproplen", 2, 1)) {
> >> +        args.args[nargs] = client_getproplen(sm->fdt_blob,
> >> +                                             be32_to_cpu(args.args[0]),
> >> +                                             be32_to_cpu(args.args[1]));
> >> +    } else if (cmpservice("setprop", 4, 1)) {
> >> +        args.args[nargs] = client_setprop(sm,
> >> +                                          be32_to_cpu(args.args[0]),
> >> +                                          be32_to_cpu(args.args[1]),
> >> +                                          be32_to_cpu(args.args[2]),
> >> +                                          be32_to_cpu(args.args[3]));
> >> +    } else if (cmpservice("instance-to-path", 3, 1)) {
> >> +        args.args[nargs] = client_instance_to_path(sm,
> >> +                                                   be32_to_cpu(args.args[0]),
> >> +                                                   be32_to_cpu(args.args[1]),
> >> +                                                   be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("package-to-path", 3, 1)) {
> >> +        args.args[nargs] = client_package_to_path(sm->fdt_blob,
> >> +                                                  be32_to_cpu(args.args[0]),
> >> +                                                  be32_to_cpu(args.args[1]),
> >> +                                                  be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("write", 3, 1)) {
> >> +        args.args[nargs] = client_write(be32_to_cpu(args.args[0]),
> >> +                                        be32_to_cpu(args.args[1]),
> >> +                                        be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("peer", 1, 1)) {
> >> +        args.args[nargs] = client_peer(sm->fdt_blob,
> >> +                                       be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("child", 1, 1)) {
> >> +        args.args[nargs] = client_child(sm->fdt_blob,
> >> +                                        be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("parent", 1, 1)) {
> >> +        args.args[nargs] = client_parent(sm->fdt_blob,
> >> +                                         be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("open", 1, 1)) {
> >> +        args.args[nargs] = client_open(be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("call-method", 3, 2)) {
> >> +        uint32_t ret2 = 0;
> >> +
> >> +        args.args[nargs] = client_call_method(sm,
> >> +                                              be32_to_cpu(args.args[0]),
> >> +                                              be32_to_cpu(args.args[1]),
> >> +                                              be32_to_cpu(args.args[2]),
> >> +                                              &ret2);
> >> +        args.args[nargs + 1] = ret2;
> >> +    } else if (cmpservice("claim", 3, 1)) {
> >> +        args.args[nargs] = client_claim(sm,
> >> +                                        be32_to_cpu(args.args[0]),
> >> +                                        be32_to_cpu(args.args[1]),
> >> +                                        be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("nextprop", 3, 1)) {
> >> +        args.args[nargs] = client_nextprop(sm->fdt_blob,
> >> +                                           be32_to_cpu(args.args[0]),
> >> +                                           be32_to_cpu(args.args[1]),
> >> +                                           be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("quiesce", 0, 0)) {
> >> +        client_quiesce(sm);
> >> +    } else if (cmpservice("exit", 0, 0)) {
> >> +        error_report("Stopped as the VM requested \"exit\"");
> >> +        vm_stop(RUN_STATE_PAUSED);
> >> +    } else {
> >> +        trace_spapr_client_error_unknown_service(service, nargs, nret);
> >> +        args.args[nargs] = -1;
> >> +    }
> >> +
> >> +    for (i = 0; i < nret; ++i) {
> >> +        args.args[nargs + i] = be32_to_cpu(args.args[nargs + i]);
> >> +    }
> >> +    cpu_physical_memory_write(client_args, &args, sizeof(args));
> >> +
> >> +    return H_SUCCESS;
> >> +}
> >> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> >> index f1799b1b707d..76f93ec37959 100644
> >> --- a/hw/ppc/spapr_hcall.c
> >> +++ b/hw/ppc/spapr_hcall.c
> >> @@ -1660,15 +1660,12 @@ static bool spapr_hotplugged_dev_before_cas(void)
> >>      return false;
> >>  }
> >>  
> >> -static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >> -                                                  SpaprMachineState *spapr,
> >> -                                                  target_ulong opcode,
> >> -                                                  target_ulong *args)
> >> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> >> +                                            SpaprMachineState *spapr,
> >> +                                            target_ulong addr,
> >> +                                            target_ulong fdt_buf,
> >> +                                            target_ulong fdt_bufsize)
> >>  {
> >> -    /* Working address in data buffer */
> >> -    target_ulong addr = ppc64_phys_to_real(args[0]);
> >> -    target_ulong fdt_buf = args[1];
> >> -    target_ulong fdt_bufsize = args[2];
> >>      target_ulong ov_table;
> >>      uint32_t cas_pvr;
> >>      SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
> >> @@ -1833,13 +1830,15 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >>  
> >>          fdt_bufsize -= sizeof(hdr);
> >>  
> >> -        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
> >> +        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
> >>          _FDT((fdt_pack(fdt)));
> >>  
> >> -        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> >> -        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
> >> -                                  fdt_totalsize(fdt));
> >> -        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
> >> +        if (spapr->bios_enabled) {
> >> +            cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> >> +            cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
> >> +                                      fdt_totalsize(fdt));
> >> +            trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
> >> +        }
> >>  
> >>          g_free(spapr->fdt_blob);
> >>          spapr->fdt_size = fdt_totalsize(fdt);
> >> @@ -1854,6 +1853,20 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >>      return H_SUCCESS;
> >>  }
> >>  
> >> +static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >> +                                                  SpaprMachineState *spapr,
> >> +                                                  target_ulong opcode,
> >> +                                                  target_ulong *args)
> >> +{
> >> +    /* Working address in data buffer */
> >> +    target_ulong addr = ppc64_phys_to_real(args[0]);
> >> +    target_ulong fdt_buf = args[1];
> >> +    target_ulong fdt_bufsize = args[2];
> >> +
> >> +    return do_client_architecture_support(cpu, spapr, addr, fdt_buf,
> >> +                                          fdt_bufsize);
> >> +}
> >> +
> >>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
> >>                                                SpaprMachineState *spapr,
> >>                                                target_ulong opcode,
> >> @@ -1998,6 +2011,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
> >>      return H_SUCCESS;
> >>  }
> >>  
> >> +static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
> >> +                             target_ulong opcode, target_ulong *args)
> >> +{
> >> +    target_ulong client_args = ppc64_phys_to_real(args[0]);
> >> +
> >> +    return spapr_h_client(spapr, client_args);
> >> +}
> >> +
> >>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
> >>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
> >>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
> >> @@ -2121,6 +2142,8 @@ static void hypercall_register_types(void)
> >>  
> >>      spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
> >>  
> >> +    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
> >> +
> >>      /* Virtual Processor Home Node */
> >>      spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
> >>                               h_home_node_associativity);
> >> diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
> >> index 9ea620f23c85..92bce3e44693 100644
> >> --- a/hw/ppc/trace-events
> >> +++ b/hw/ppc/trace-events
> >> @@ -21,6 +21,15 @@ spapr_update_dt(unsigned cb) "New blob %u bytes"
> >>  spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
> >>  spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
> >>  
> >> +# spapr_client.c
> >> +spapr_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
> >> +spapr_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
> >> +spapr_client_error_unknown_method(const char *method) "%s"
> >> +spapr_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
> >> +spapr_client_method(const char *method, uint32_t param1, uint32_t param2, uint32_t ret) "%s(0x%x, 0x%x) => 0x%x"
> >> +spapr_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> >> +spapr_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> >> +
> >>  # spapr_hcall_tpm.c
> >>  spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
> >>  spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06  8:50     ` David Gibson
@ 2020-01-06 12:34       ` Alexey Kardashevskiy
  2020-01-06 17:09         ` Cédric Le Goater
  2020-01-07  5:26         ` David Gibson
  0 siblings, 2 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-06 12:34 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 06/01/2020 19:50, David Gibson wrote:
> On Mon, Jan 06, 2020 at 05:28:55PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 06/01/2020 15:19, David Gibson wrote:
>>> On Mon, Jan 06, 2020 at 10:42:42AM +1100, Alexey Kardashevskiy wrote:
>>>> The Petitboot bootloader is way more advanced than SLOF is ever going to
>>>> be as Petitboot comes with the full-featured Linux kernel with all
>>>> the drivers, and initramdisk with quite user friendly interface.
>>>> The problem with ditching SLOF is that an unmodified pseries kernel can
>>>> either start via:
>>>> 1. kexec, this requires presence of RTAS and skips
>>>> ibm,client-architecture-support entirely;
>>>> 2. normal boot, this heavily relies on the OF1275 client interface to
>>>> fetch the device tree and do early setup (claim memory).
>>>>
>>>> This adds a new bios-less mode to the pseries machine: "bios=on|off".
>>>> When enabled, QEMU does not load SLOF and jumps to the kernel from
>>>> "-kernel".
>>>>
>>>> The client interface is implemented exactly as RTAS - a 20 bytes blob,
>>>> right next after the RTAS blob. The entry point is passed to the kernel
>>>> via GPR5.
>>>>
>>>> This implements a handful of client interface methods just to get going.
>>>> In particular, this implements the device tree fetching,
>>>> ibm,client-architecture-support and instantiate-rtas.
>>>>
>>>> This implements changing FDT properties only for "linux,rtas-base" and
>>>> "linux,rtas-entry", just to get early boot going.
>>>>
>>>> This assigns "phandles" to device tree nodes as there is no more SLOF
>>>> and OF nodes addresses of which served as phandle values.
>>>> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
>>>> phandles are regenerated at every FDT rebuild.
>>>>
>>>> When bios=off, this adds "/chosen" every time QEMU builds a tree.
>>>>
>>>> This implements "claim" which the client (Linux) uses for memory
>>>> allocation; this is also  used by QEMU for claiming kernel/initrd images,
>>>> client interface entry point, RTAS and the initial stack.
>>>>
>>>> While at this, add a "kernel-addr" machine parameter to allow moving
>>>> the kernel in memory. This is useful for debugging if the kernel is
>>>> loaded at @0, although not necessary.
>>>>
>>>> This does not implement instances properly but in order to boot a VM,
>>>> we only really need a stdout instance and the "/" instance for
>>>> "call-method", we fake these.
>>>
>>> As we've discussed, I really like the idea of this.  It think the
>>> basic approach looks good too.
>>>
>>> As you probably realize, there are quite a lot of things to be
>>> polished though.  Comments below.
>>>
>>>>
>>>> The test command line:
>>>>
>>>> qemu-system-ppc64 \
>>>> -nodefaults \
>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>> -nographic \
>>>> -vga none \
>>>> -kernel vmldbg \
>>>> -machine pseries,bios=off \
>>>> -m 4G \
>>>> -enable-kvm \
>>>> -initrd pb/rootfs.cpio.xz \
>>>> img/u1804-64le.qcow2 \
>>>> -snapshot \
>>>> -smp 8,threads=8 \
>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>> -trace events=qemu_trace_events \
>>>> -d guest_errors \
>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh37874 \
>>>> -mon chardev=SOCKET0,mode=control
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v2:
>>>> * fixed claim()
>>>> * added "setprop"
>>>> * cleaner client interface and RTAS blobs management
>>>> * boots to petitboot and further to the target system
>>>> * more trace points
>>>> ---
>>>>   hw/ppc/Makefile.objs   |   1 +
>>>>   include/hw/loader.h    |   1 +
>>>>   include/hw/ppc/spapr.h |  25 ++-
>>>>   hw/ppc/spapr.c         | 231 ++++++++++++++++++--
>>>>   hw/ppc/spapr_client.c  | 464 +++++++++++++++++++++++++++++++++++++++++
>>>>   hw/ppc/spapr_hcall.c   |  49 +++--
>>>>   hw/ppc/trace-events    |   9 +
>>>>   7 files changed, 743 insertions(+), 37 deletions(-)
>>>>   create mode 100644 hw/ppc/spapr_client.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index 101e9fc59185..ce31c0acd2fb 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>>>>   obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>>>>   obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
>>>>   obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
>>>> +obj-$(CONFIG_PSERIES) += spapr_client.o
>>>
>>> This applies in a bunch of places here.  Just calling things "client"
>>> is clear enough if you're already thinking about Open Firmware.  But
>>> this appears in a bunch of places where you might come across it
>>> without that context, in which case it could be kind of confusing.  So
>>> I'd suggest using "of_client" or "of_client_interface" in most places
>>> you're using "client".
>>>
>>>>   obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>>>>   # IBM PowerNV
>>>>   obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>>>> diff --git a/include/hw/loader.h b/include/hw/loader.h
>>>> index 48a96cd55904..a2f58077a47e 100644
>>>> --- a/include/hw/loader.h
>>>> +++ b/include/hw/loader.h
>>>> @@ -262,6 +262,7 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
>>>>   int rom_add_elf_program(const char *name, GMappedFile *mapped_file, void *data,
>>>>                           size_t datasize, size_t romsize, hwaddr addr,
>>>>                           AddressSpace *as);
>>>> +bool rom_intersect(uint64_t start, uint64_t size);
>>>
>>> I don't see this called, so I'm guessing it's a stray chunk leftover.
>>>
>>>>   int rom_check_and_register_reset(void);
>>>>   void rom_set_fw(FWCfgState *f);
>>>>   void rom_set_order_override(int order);
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index 61f005c6f686..876879d12029 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
>>>>       uint8_t caps[SPAPR_CAP_NUM];
>>>>   };
>>>>   
>>>> +typedef struct {
>>>> +    uint64_t start;
>>>> +    uint64_t size;
>>>> +} spapr_of_claimed;
>>>
>>> Please use the qemu convention of StudlyCaps for types.
>>>
>>>>   /**
>>>>    * SpaprMachineClass:
>>>>    */
>>>> @@ -158,8 +163,13 @@ struct SpaprMachineState {
>>>>       uint32_t fdt_size;
>>>>       uint32_t fdt_initial_size;
>>>>       void *fdt_blob;
>>>> +    uint32_t rtas_base;
>>>>       long kernel_size;
>>>>       bool kernel_le;
>>>> +    uint64_t kernel_addr;
>>>
>>> This would intersect with your other patch to allow changing the
>>> kernel load address, yes?
>>
>>
>> This includes kernel_addr, it is a single patch now. More about this below.
> 
> Hm, ok.
> 
>>>> +    bool bios_enabled;
>>>> +    GArray *claimed; /* array of spapr_of_claimed */
>>>> +    uint64_t claimed_base;
>>>
>>> I'd suggest moving all the client interface related fields to a
>>> visually separated chunk in the structure definition.
>>>
>>>>       uint32_t initrd_base;
>>>>       long initrd_size;
>>>>       uint64_t rtc_offset; /* Now used only during incoming migration */
>>>> @@ -510,7 +520,8 @@ struct SpaprMachineState {
>>>>   /* Client Architecture support */
>>>>   #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
>>>>   #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
>>>> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
>>>> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
>>>> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
>>>>   
>>>>   /*
>>>>    * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
>>>> @@ -538,6 +549,12 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
>>>>   target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
>>>>                                target_ulong *args);
>>>>   
>>>> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>>>> +                                            SpaprMachineState *spapr,
>>>> +                                            target_ulong addr,
>>>> +                                            target_ulong fdt_buf,
>>>> +                                            target_ulong fdt_bufsize);
>>>> +
>>>>   /* Virtual Processor Area structure constants */
>>>>   #define VPA_MIN_SIZE           640
>>>>   #define VPA_SIZE_OFFSET        0x4
>>>> @@ -769,6 +786,9 @@ struct SpaprEventLogEntry {
>>>>   void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
>>>>   void spapr_events_init(SpaprMachineState *sm);
>>>>   void spapr_dt_events(SpaprMachineState *sm, void *fdt);
>>>> +uint64_t spapr_do_client_claim(SpaprMachineState *sm, uint64_t virt,
>>>> +                               uint64_t size, uint64_t align);
>>>> +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args);
>>>>   void close_htab_fd(SpaprMachineState *spapr);
>>>>   void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
>>>>   void spapr_free_hpt(SpaprMachineState *spapr);
>>>> @@ -891,4 +911,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
>>>>   #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
>>>>   
>>>>   void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
>>>> +
>>>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
>>>> +
>>>>   #endif /* HW_SPAPR_H */
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index e62c89b3dd40..1c97534a0aea 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -896,6 +896,51 @@ out:
>>>>       return ret;
>>>>   }
>>>>   
>>>> +/*
>>>> + * Below is a compiled version of RTAS blob and OF client interface entry point.
>>>> + *
>>>> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
>>>> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
>>>> + *
>>>> + *   .globl  _start
>>>> + *   _start:
>>>> + *           mr      4,3
>>>> + *           lis     3,KVMPPC_H_RTAS@h
>>>> + *           ori     3,3,KVMPPC_H_RTAS@l
>>>> + *           sc      1
>>>> + *           blr
>>>> + *           mr      4,3
>>>> + *           lis     3,KVMPPC_H_CLIENT@h
>>>> + *           ori     3,3,KVMPPC_H_CLIENT@l
>>>> + *           sc      1
>>>> + *           blr
>>>> + */
>>>> +static struct {
>>>> +    uint8_t rtas[20], client[20];
>>>> +} QEMU_PACKED rtas_client_blob = {
>>>> +    .rtas = {
>>>> +        0x7c, 0x64, 0x1b, 0x78,
>>>> +        0x3c, 0x60, 0x00, 0x00,
>>>> +        0x60, 0x63, 0xf0, 0x00,
>>>> +        0x44, 0x00, 0x00, 0x22,
>>>> +        0x4e, 0x80, 0x00, 0x20
>>>> +    },
>>>> +    .client = {
>>>> +        0x7c, 0x64, 0x1b, 0x78,
>>>> +        0x3c, 0x60, 0x00, 0x00,
>>>> +        0x60, 0x63, 0xf0, 0x05,
>>>> +        0x44, 0x00, 0x00, 0x22,
>>>> +        0x4e, 0x80, 0x00, 0x20
>>>> +    }
>>>> +};
>>>
>>> I'd really prefer to read this in from a file (or files) which we
>>> assemble (as with the existing spapr-rtas.img), rather than having a
>>> magic array in qemu.
>>
>> Two considerations here:
>> 1. This blob is not going to change (at least often)
> 
> True, but I'd still prefer to build it from a .S rather than have the
> binary written out.  We already had this until recently (when we moved
> the RTAS image into SLOF), and it's really not too hard to manage.


But what is exactly the benefit? It is not going to change. I can look into making it a .S, compile it to a binary an 
include this binary into qemu-system-ppc64 as a byte array, just need to refresh my gcc/ldd memories but a separate file 
is too much for this imho.


>> 2. A new file which needs to be packaged/signed/copied.
> 
> Eh, again, we had this for RTAS until very recently and it really
> wasn't that much hassle.


I keep secure VMs in mind. And SLOF is a separate package, why was not RTAS?


>>>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
>>>> +{
>>>> +    spapr->rtas_base = base;
>>>> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
>>>> +                              sizeof(rtas_client_blob.rtas));
>>>> +}
>>>> +
>>>>   static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>>>   {
>>>>       MachineState *ms = MACHINE(spapr);
>>>> @@ -980,6 +1025,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>>>       _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
>>>>                        lrdr_capacity, sizeof(lrdr_capacity)));
>>>>   
>>>> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-base", spapr->rtas_base));
>>>> +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-entry", spapr->rtas_base));
>>>
>>> IIUC, we shouldn't need to add these in bios=off mode, yes?  Instead
>>> the client must instantiate-rtas before attempting to use it.
>>
>>
>> Well. Here is what happens:
>>
>> 1. spapr machine reset creates FDT, lets say without rtas-entry
>> 2. linux calls ibm,client-interface-support
> 
> I assume you mean ibm,client-architecture-support?

Yes.

> 
>> 3. the QEMU handler rebuilds the FDT
> 
> Oh.. not directly relevant here, but that might be an issue.  Do we
> need to ensure that phandles are stable across the CAS call?


We already have predefined values for phandles we care about.


>> 4. linux calls "instantiate-rtas"
>> 5. linux stores "linux,rtas-base" and "linux,rtas-entry" in the FDT
>> using "setprop" client call
>> 6. linux fetches the tree by traversing it and packs it into FDT blob itself
>> 7. linux calls quiesce
>> 8. linux uses "linux,rtas-entry" from FDT to call RTAS.
>>
>> Because in 5) I only allow in-place property update and only for
>> "linux,rtas-base" and "linux,rtas-entry", these do not make to the tree
>> and there is no "linux,rtas-entry" in the FDT.
>>
>> There are choices:
>>
>> 1. I can extend "setprop" (tricky as FDT is flattened by then?)
> 
> I actually don't think this should be too hard - though maybe I've
> missed something, so I'd say this is my frontrunning option.

fdt_splice_() fails because of lack of space in the FDT (you pointed to it below). Not calling fdt_pack() fixed it as it 
just keeps FDT blob rather big. Not sure it is worth further optimizations at this point.


>> 2. simply store rtas base in spapr->rtas_base and do complete FDT rebuild
> 
> I don't think that's viable - phandles definitely shouldn't change
> across an instantiate-rtas call.
> 
>> 3. (currently) I basically have 2 placeholders for these
>> "linux,rtas-base" and "linux,rtas-entry".
> 
> Hm, ok.
> 
>> Or we can avoid flattening FDT at all for bios=off (as nobody wants
>> flattened device tree, my "fetch-fdt" never made it to upstream Linux),
>> then 1) is easy (if it is a problem to begin with).
> 
> Right, but we need a live DT library to make it happen, which means
> someone finding time to write it.


I dropped fdt_pack() and change properties via fdt_setprop() - works just fine.



>>>> +    _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
>>>> +                          sizeof(rtas_client_blob.rtas)));
>>>> +    _FDT(fdt_add_mem_rsv(fdt, spapr->rtas_base,
>>>> +                         sizeof(rtas_client_blob.rtas)));
>>>
>>> Nor should we need the mem_rsv with bios=off.
>>
>> True. If not the "setprop" limitation from above, I would only need
>> "rtas-size" here.
>>
>>
>>>
>>>> +
>>>>       spapr_dt_rtas_tokens(fdt, rtas);
>>>>   }
>>>>   
>>>> @@ -1057,7 +1109,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>>>>       }
>>>>   
>>>>       if (spapr->kernel_size) {
>>>> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
>>>> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
>>>>                                 cpu_to_be64(spapr->kernel_size) };
>>>>   
>>>>           _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
>>>> @@ -1104,6 +1156,9 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>>>>           _FDT(fdt_setprop_cell(fdt, chosen, "linux,pci-probe-only", 0));
>>>>       }
>>>>   
>>>> +    if (!spapr->bios_enabled) {
>>>> +        _FDT(fdt_setprop_cell(fdt, chosen, "cpu", 0));
>>>
>>> What is this for?
>>
>> Looks like a leftover, I did this for prom_find_boot_cpu() but after
>> double checking and trying it looks like it can proceed without this chunk.
> 
> Ok.
> 
>>>
>>>> +    }
>>>>       spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>>>>   
>>>>       g_free(stdout_path);
>>>> @@ -1133,6 +1188,28 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
>>>>       }
>>>>   }
>>>>   
>>>> +static void spapr_fdt_add_phandles(void *fdt, int parent, uint32_t *phandle)
>>>> +{
>>>> +    int proplen = 0, node;
>>>> +    const void *prop;
>>>> +    char tmp[256] = "BAAD";
>>>> +
>>>> +    fdt_get_path(fdt, parent, tmp, sizeof(tmp));
>>>
>>> You don't actually appear to use tmp.  Is this a debugging leftover?
>>
>> Yes it is.
>>
>>
>>>
>>>> +    prop = fdt_getprop_namelen(fdt, parent, "phandle", 7, &proplen);
>>>> +    if (!prop) {
>>>> +        _FDT(fdt_setprop_cell(fdt, parent, "phandle", *phandle));
>>>> +        while (1) {
>>>> +            ++*phandle;
>>>> +            if (fdt_node_offset_by_phandle(fdt, *phandle) < 0) {
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    fdt_for_each_subnode(node, fdt, parent) {
>>>> +        spapr_fdt_add_phandles(fdt, node, phandle);
>>>> +    }
>>>
>>> Oof, this is pretty inefficient, since fdt_node_offset_by_phandle()
>>> will have to scan the whole tree, and you do it on every node.
>>
>>
>> Inefficient indeed but for a VM with ~30 devices it is not that bad :)
>>
>>
>>> Obviously, a "live" tree structure rather than fdt would be better
>>> suited for this.  I'd like to move to that at some point, but it will
>>> be a while before we're ready.
>>
>> I could have a cache/hashmap of phandles...
> 
> We could, but I doubt it's worth the trouble.  Keeping the offsets in
> there up to date against setprops or other changes could actually a
> fair bit of hassle, I suspect.
> 
>>> I guess this is fine for a PoC, but I think we can do a bit better
>>> without too much extra complexity:
>>>    * Rather than scanning for an existing node with the handle on each
>>>      iteration, you can use fdt_find_max_phandle() once beforehand,
>>>      and start your counter from there, being assured that it won't hit
>>>      any existing phandles
>>
>> I'd like to avoid that as NPU/GPU phandles are quite huge and we will
>> have huge gaps between phandles numbers which is not so much of a big
>> deal but annoying to debug.
> 
> Hrm.  So huge we'll actually be short of space above them?


No, but "randomness" is annoying.


>>>    * The phandles are arbitrary, so you don't actually care about the
>>>      tree structure here.  So instead of recursively descending the
>>>      tree, you can just alter each node in the order they appear in the
>>>      blob.  fdt_next_node() will let you do that, simply by ignoring
>>>      the 'depth' parameter.  [In fact the structure ensures that will
>>>      be a depth first traversal of the tree, just like you have here,
>>>      not that it matters].
>>
>> ... but this one is better.
>>
>> Unlike many others, fdt_next_node() does not have a comment in
>> dtc/libfdt/libfdt.h so I was not sure about what *depth is for and I
>> basically ignored fdt_next_node().
> 
> Yeah, it's kind of subtle - mostly intended as an internal primitive -
> but it suits your needs here.  It moves to the next node in the
> structure block - regardless of whether it's a sibling, child or
> sibling-of-ancestor.  *depth, if a pointer is supplied, is updated to
> reflect which case it is.
> 
>>>    * You should check for overflows of your phandle counter, though
>>>      it's fine to treat that as a fatal error.
>>
>> Ok.
>>
>>
>>>> +}
>>>> +
>>>>   void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>>>   {
>>>>       MachineState *machine = MACHINE(spapr);
>>>> @@ -1245,7 +1322,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>>>       /* Build memory reserve map */
>>>>       if (reset) {
>>>>           if (spapr->kernel_size) {
>>>> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
>>>> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
>>>> +                                  spapr->kernel_size)));
>>>>           }
>>>>           if (spapr->initrd_size) {
>>>>               _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
>>>> @@ -1268,12 +1346,19 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>>>           }
>>>>       }
>>>>   
>>>> +    if (!spapr->bios_enabled) {
>>>> +        uint32_t phandle = 1;
>>>> +
>>>> +        spapr_fdt_add_phandles(fdt, fdt_path_offset(fdt, "/"), &phandle);
>>>> +    }
>>>> +
>>>>       return fdt;
>>>>   }
>>>>   
>>>>   static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
>>>>   {
>>>> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
>>>> +    SpaprMachineState *spapr = opaque;
>>>> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
>>>>   }
>>>>   
>>>>   static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
>>>> @@ -1660,6 +1745,60 @@ static void spapr_machine_reset(MachineState *machine)
>>>>        */
>>>>       fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
>>>>   
>>>> +    /* Set up the entry state */
>>>> +    if (!spapr->bios_enabled) {
>>>> +        if (spapr->claimed) {
>>>> +            g_array_unref(spapr->claimed);
>>>> +        }
>>>> +        spapr->claimed = g_array_new(false, false, sizeof(spapr_of_claimed));
>>>> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
>>>> +
>>>> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr, fdt_addr);
>>>> +
>>>> +        if (spapr_do_client_claim(spapr, spapr->kernel_addr,
>>>> +                                  spapr->kernel_size, 0) == -1) {
>>>> +            error_report("Memory for kernel is in use");
>>>> +            exit(1);
>>>> +        }
>>>> +        if (spapr_do_client_claim(spapr, spapr->initrd_base,
>>>> +                                  spapr->initrd_size, 0) == -1) {
>>>> +            error_report("Memory for initramdisk is in use");
>>>> +            exit(1);
>>>> +        }
>>>> +        first_ppc_cpu->env.gpr[1] = spapr_do_client_claim(spapr, 0, 0x40000,
>>>> +                                                          0x10000);
>>>> +        if (first_ppc_cpu->env.gpr[1] == -1) {
>>>> +            error_report("Memory for stack is in use");
>>>> +            exit(1);
>>>> +        }
>>>> +
>>>> +        first_ppc_cpu->env.gpr[5] =
>>>> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
>>>> +                                  sizeof(rtas_client_blob.client));
>>>> +        if (first_ppc_cpu->env.gpr[5] == -1) {
>>>> +            error_report("Memory for OF client is in use");
>>>> +            exit(1);
>>>> +        }
>>>> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
>>>> +                                  rtas_client_blob.client,
>>>> +                                  sizeof(rtas_client_blob.client));
>>>> +
>>>> +        /* Allocate space for RTAS here so spapr_build_fdt() below picks it */
>>>> +        spapr->rtas_base =
>>>> +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.rtas),
>>>> +                                  0x100);
>>>> +        if (spapr->rtas_base == -1) {
>>>> +            error_report("Memory for RTAS is in use");
>>>> +            exit(1);
>>>> +        }
>>>> +        cpu_physical_memory_write(spapr->rtas_base,
>>>> +                                  rtas_client_blob.rtas,
>>>> +                                  sizeof(rtas_client_blob.rtas));
>>>
>>> The above is complex enough it might be worth having bios and nobios
>>> variants of spapr_cpu_set_entry_state().
>>
>> Probably yes, when it settles. I'd rather prefer having kernel/initrd
>> claiming code where we load them but these are Roms and when we "load"
>> them, they are not loaded, they are loaded from a machine reset handler but:
>>
>> may be we could add a hook to Roms to be called every time an image is
>> actually written to the guest memory and
>>
>> this is when I would "claim" these chunks and
>>
>> this would make the chunk above shorter and we won't need a separate
>> spapr_cpu_set_entry_state().
> 
> Hm, ok.  Well let's see how it turns out.
> 
> Fwiw, I have considered replacing those load_targphys() things with
> explicit write in at reset time.  load_targphys() is, IIUC, mostly
> used for loading things like ROMs which exist in physical memory but
> don't go into regular RAM.  So if it turns out not to really match our
> needs here, that's ok.


Agree, may be later. I want to keep this a single patch.



>>>> +    } else {
>>>> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>>>> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
>>>> +    }
>>>> +
>>>>       fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
>>>>   
>>>>       rc = fdt_pack(fdt);
>>>> @@ -1675,10 +1814,14 @@ static void spapr_machine_reset(MachineState *machine)
>>>>       spapr->fdt_initial_size = spapr->fdt_size;
>>>>       spapr->fdt_blob = fdt;
>>>>   
>>>> -    /* Set up the entry state */
>>>> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>>>> -    first_ppc_cpu->env.gpr[5] = 0;
>>>> -
>>>> +    if (!spapr->bios_enabled) {
>>>> +        /*
>>>> +         * Claim FDT space so initramdisk/zImage are not unpacked over it.
>>>> +         * In practice Linux claim additional memory for FDT and fetches it
>>>> +         * using client interface's "getprop" call.
>>>> +         */
>>>
>>> IIUC, it shouldn't actually be necessary to put the fdt into guest
>>> memory in nobios mode - it will be retrieved piecemeal via the client
>>> interface instead, won't it?
>>
>> Huh. I was under the impression that we pass the FDT to a VM via GPR3
>> and it is safer to avoid "claim" from claiming this memory (even though
>> Linux is not using it).
> 
> Well, depends what you mean.  I mean that's the "fdt boot" convention,
> which you've used previously for the nobios boot with an aware guest
> image.  We use a similar convention for firing up SLOF itself.  But if
> we're starting with a (simulated) OF entry condition, then as long as
> we have a working client interface, we don't need an fdt blob anywhere
> guest visible.
> 
>> Now I realize that this GPR3 is QEMU<->SLOF convention (or some OFW
>> binding - what is it exactly?) and Linux rather expects initramdisk
>> start/size in GPR3/4 and this did not hit me just because GPR4 is always
>> 0 so Linux falls back to the device tree for initramdisk location.
>>
>>
>>
>>
>>>
>>>> +        spapr_do_client_claim(spapr, fdt_addr, fdt_totalsize(fdt), 0);
>>>> +    }
>>>>       spapr->cas_reboot = false;
>>>>   }
>>>>   
>>>> @@ -2897,12 +3040,12 @@ static void spapr_machine_init(MachineState *machine)
>>>>           uint64_t lowaddr = 0;
>>>>   
>>>>           spapr->kernel_size = load_elf(kernel_filename, NULL,
>>>> -                                      translate_kernel_address, NULL,
>>>> +                                      translate_kernel_address, spapr,
>>>>                                         NULL, &lowaddr, NULL, 1,
>>>>                                         PPC_ELF_MACHINE, 0, 0);
>>>>           if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
>>>>               spapr->kernel_size = load_elf(kernel_filename, NULL,
>>>> -                                          translate_kernel_address, NULL, NULL,
>>>> +                                          translate_kernel_address, spapr, NULL,
>>>>                                             &lowaddr, NULL, 0, PPC_ELF_MACHINE,
>>>>                                             0, 0);
>>>>               spapr->kernel_le = spapr->kernel_size > 0;
>>>> @@ -2918,7 +3061,7 @@ static void spapr_machine_init(MachineState *machine)
>>>>               /* Try to locate the initrd in the gap between the kernel
>>>>                * and the firmware. Add a bit of space just in case
>>>>                */
>>>> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
>>>> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
>>>>                                     + 0x1ffff) & ~0xffff;
>>>>               spapr->initrd_size = load_image_targphys(initrd_filename,
>>>>                                                        spapr->initrd_base,
>>>> @@ -2932,20 +3075,22 @@ static void spapr_machine_init(MachineState *machine)
>>>>           }
>>>>       }
>>>>   
>>>> -    if (bios_name == NULL) {
>>>> -        bios_name = FW_FILE_NAME;
>>>> +    if (spapr->bios_enabled) {
>>>> +        if (bios_name == NULL) {
>>>> +            bios_name = FW_FILE_NAME;
>>>> +        }
>>>> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>>>> +        if (!filename) {
>>>> +            error_report("Could not find LPAR firmware '%s'", bios_name);
>>>> +            exit(1);
>>>> +        }
>>>> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>>>> +        if (fw_size <= 0) {
>>>> +            error_report("Could not load LPAR firmware '%s'", filename);
>>>> +            exit(1);
>>>> +        }
>>>> +        g_free(filename);
>>>>       }
>>>> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>>>> -    if (!filename) {
>>>> -        error_report("Could not find LPAR firmware '%s'", bios_name);
>>>> -        exit(1);
>>>> -    }
>>>> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>>>> -    if (fw_size <= 0) {
>>>> -        error_report("Could not load LPAR firmware '%s'", filename);
>>>> -        exit(1);
>>>> -    }
>>>> -    g_free(filename);
>>>>   
>>>>       /* FIXME: Should register things through the MachineState's qdev
>>>>        * interface, this is a legacy from the sPAPREnvironment structure
>>>> @@ -3162,6 +3307,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
>>>>       visit_type_uint32(v, name, (uint32_t *)opaque, errp);
>>>>   }
>>>>   
>>>> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
>>>> +                                  void *opaque, Error **errp)
>>>> +{
>>>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>>>> +}
>>>> +
>>>> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
>>>> +                                  void *opaque, Error **errp)
>>>> +{
>>>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>>>> +}
>>>
>>> Might be clearer to split the adjustable kernel addr stuff out into a
>>> separate patch - in fact, didn't you have such a patch before?
>>
>>
>> I did but it was not useful on its own and the whole "bios=off" feature
>> does not need to be "series" at this stage. Also having it separate
>> (before or after "kill slof") means changing same lines in 2 patches of
>> the same patchset which is never really good.
> 
> Hm, ok.  What's the reason you require the changeable load location
> for nobios?


I do not require it. It is purely for debugging. You can basically set address to 0, run qemu with -s and do step 
debugging with debug symbols from vmlinux (well, keep switching endianness), otherwise you would need to calculate 
offset (0x400000 or whatever it is) for every breakpoint you set. Once I passed the prom_init() debugging stage (which 
did not work because I did not provide stack), I have not used it for last couple of days.

So I could ditch it now but we could use it to reduce (small but still) memory fragmentation by placing vmlinux at 0. Dunno.


>>>> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
>>>> +{
>>>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>>>> +
>>>> +    return spapr->bios_enabled;
>>>> +}
>>>> +
>>>> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
>>>> +{
>>>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>>>> +
>>>> +    spapr->bios_enabled = value;
>>>> +}
>>>> +
>>>>   static char *spapr_get_ic_mode(Object *obj, Error **errp)
>>>>   {
>>>>       SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>>>> @@ -3267,6 +3438,20 @@ static void spapr_instance_init(Object *obj)
>>>>       object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>                                spapr_get_msix_emulation, NULL, NULL);
>>>>   
>>>> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
>>>> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
>>>> +                        &error_abort);
>>>> +    object_property_set_description(obj, "kernel-addr",
>>>> +                                    stringify(KERNEL_LOAD_ADDR)
>>>> +                                    " for -kernel is the default",
>>>> +                                    NULL);
>>>> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
>>>> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
>>>> +                            spapr_set_bios_enabled, NULL);
>>>> +    object_property_set_description(obj, "bios", "Conrols whether to load bios",
>>>> +                                    NULL);
>>>> +    spapr->bios_enabled = true;
>>>> +
>>>>       /* The machine class defines the default interrupt controller mode */
>>>>       spapr->irq = smc->irq;
>>>>       object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
>>>> diff --git a/hw/ppc/spapr_client.c b/hw/ppc/spapr_client.c
>>>> new file mode 100644
>>>> index 000000000000..a318eaae3c40
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_client.c
>>>> @@ -0,0 +1,464 @@
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu-common.h"
>>>> +#include "qapi/error.h"
>>>> +#include "exec/memory.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/ppc/spapr_vio.h"
>>>> +#include "trace.h"
>>>> +
>>>> +struct prom_args {
>>>
>>> StudlyCaps and typedef, please..
>>>
>>>> +        __be32 service;
>>>> +        __be32 nargs;
>>>> +        __be32 nret;
>>>> +        __be32 args[10];
>>
>>
>> btw we do not generally have __be32 (and x86 build fails), I am
>> replacing these with uint32_t.
> 
> Ok, good.
> 
>>>> +};
>>>> +
>>>> +#define CLI_PH_MASK     0x0FFFFFFF
>>>> +#define CLI_INST_PREFIX 0x20000000
>>>> +
>>>> +#define readstr(pa, buf) cpu_physical_memory_read((pa), (buf), sizeof(buf))
>>>
>>> I'd prefer "readbuf" or something, since this isn't actually looking
>>> at any string structure.
>>>
>>>> +#define readuint32(pa) ({ \
>>>> +    uint32_t __tmp__; \
>>>> +    cpu_physical_memory_read((pa), &__tmp__, sizeof(__tmp__)); \
>>>> +    be32_to_cpu(__tmp__); })
>>>
>>> You can simplify this with ldl_be_phys().  In fact, then you probably
>>> don't need any wrapper at all.
>>
>> Ah right. There are too many of these, was quicker to type this :)
>>
>>
>>>> +
>>>> +static bool _cmpservice(const char *s, size_t len,
>>>> +                        unsigned nargs, unsigned nret,
>>>> +                        const char *s1, size_t len1,
>>>> +                        unsigned nargscheck, unsigned nretcheck)
>>>> +{
>>>> +    if (strncmp(s, s1, MAX(len, len1))) {
>>>> +        return false;
>>>> +    }
>>>> +    if (nargscheck == 0 && nretcheck == 0) {
>>>> +        return true;
>>>> +    }
>>>> +    if (nargs != nargscheck || nret != nretcheck) {
>>>> +        trace_spapr_client_error_param(s, nargscheck, nretcheck, nargs, nret);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static uint32_t client_finddevice(const void *fdt, uint32_t nodeaddr)
>>>> +{
>>>> +    char node[256];
>>>> +    int ret;
>>>> +
>>>> +    readstr(nodeaddr, node);
>>>> +    ret = fdt_path_offset(fdt, node);
>>>> +    if (ret >= 0) {
>>>> +        ret = fdt_get_phandle(fdt, ret);
>>>> +    }
>>>> +
>>>> +    return (uint32_t) ret;
>>>> +}
>>>> +
>>>> +static uint32_t client_getprop(const void *fdt, uint32_t nodeph, uint32_t pname,
>>>> +            uint32_t valaddr, uint32_t vallen)
>>>> +{
>>>> +    char propname[64];
>>>> +    uint32_t ret = 0;
>>>> +    int proplen = 0;
>>>> +    const void *prop;
>>>> +
>>>> +    readstr(pname, propname);
>>>> +
>>>> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
>>>> +                               propname, strlen(propname), &proplen);
>>>
>>> You can just use fdt_getprop().  By definition it's equivalent to
>>> fdt_getprop_namelen(.., propname, strlen(propname), ...)
>>>
>>>> +    if (prop) {
>>>> +        int cb = MIN(proplen, vallen);
>>>> +
>>>> +        cpu_physical_memory_write(valaddr, prop, cb);
>>>> +        ret = cb;
>>>> +    } else if (strncmp(propname, "stdout", 6) == 0 && vallen == sizeof(ret)) {
>>>> +        ret = cpu_to_be32(1);
>>>> +        cpu_physical_memory_write(valaddr, &ret, MIN(vallen, sizeof(ret)));
>>>
>>> I'm guessing the special casing of stdout is to fake the OF instancing
>>> stuff, which I don't really understand.  Some comments to explain
>>> what's going on here would be good.
>>
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/prom_init.c#n2219
>>
>> This thing wants ihandle which is stored as /chosen/stdout and which we
>> do not have as there are no instances.
>>
>> If this property is not there, prom_panic() happens.
>>
>> I could implement some primitive intances, I'll look into this.
> 
> Ok, that's roughly as I guessed.  This seems like a reasonable way of
> handling this for now without having to implement a notion of
> instances - but it would be good to have some comment saying what's
> going on here.
> 
> We might need to implement at least basic awareness of instances
> if/when we need to be a bit more flexible about what input/output
> devices we use for the console.
> 
>>>> +    } else {
>>>> +        ret = -1;
>>>> +    }
>>>> +    trace_spapr_client_getprop(nodeph, propname, ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint32_t client_setprop(SpaprMachineState *sm,
>>>> +                               uint32_t nodeph, uint32_t pname,
>>>> +                               uint32_t valaddr, uint32_t vallen)
>>>> +{
>>>> +    char propname[64];
>>>> +    uint32_t ret = -1;
>>>> +    int proplen = 0;
>>>> +    const void *prop;
>>>> +
>>>> +    readstr(pname, propname);
>>>> +    if (vallen == sizeof(uint32_t) &&
>>>> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
>>>> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
>>>> +
>>>> +        sm->rtas_base = readuint32(valaddr);
>>>> +        prop = fdt_getprop_namelen(sm->fdt_blob,
>>>> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
>>>> +                                                              nodeph),
>>>> +                                   propname, strlen(propname), &proplen);
>>>> +        if (proplen == vallen) {
>>>> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
>>>> +            ret = proplen;
>>>> +        }
>>>
>>> Is there a particular reason to restrict this to the rtas properties,
>>> rather than just allowing the guest to fdt_setprop() something
>>> arbitrary?
>>
>> The FDT is flatten and I am not quite sure if libfdt can handle updating
>> properties if the length has changed.
> 
> fdt_setprop() will handle updating properties with changed length (in
> fact there's a special fdt_setprop_inplace() optimized for the case
> where you don't need that).  It's not particularly efficient, but it
> should work fine for the cases we have here.  In fact, I think you're
> already relying on this for the code that adds the phandles to
> everything.

Well, I used to add phandles before calling fdt_pack() so it is not exactly the same.


> One complication is that it can return FDT_ERR_NOSPACE if there isn't
> enough buffer for the increased thing.  We could either trap that,
> resize and retry, or we could leave a bunch of extra space.  The
> latter would be basically equivalent to not doing fdt_pack() on the
> blob in the nobios case.


This is what I ended up doing.


>> Also, more importantly, potentially property changes like this may have
>> to be reflected in the QEMU device tree so I allowed only the properties
>> which I know how to deal with.
> 
> That's a reasonable concern, but the nice thing about not having SLOF
> is that there's only one copy of the device tree - the blob in qemu.
> So a setprop() on that is automatically a setprop() everywhere (this
> is another reason not to write the fdt into guest memory in the nobios
> case - it will become stale as soon as the client changes anything).


True to a degree. It is "setprop" to the current fdt blob which we do not analyze after we build the fdt. We either need 
to do parse the tree before we rebuild it as CAS so we do not lose the updates or do selective changes to the QEMUs 
objects from the "setprop" handler (this is what I do now). Thanks,


-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06  4:19 ` David Gibson
  2020-01-06  6:28   ` Alexey Kardashevskiy
@ 2020-01-06 13:39   ` Alexey Kardashevskiy
  2020-01-06 23:56     ` David Gibson
  2020-01-07  4:44   ` Alexey Kardashevskiy
  2 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-06 13:39 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 06/01/2020 15:19, David Gibson wrote:

>> +    const struct fdt_property *prop;
>> +    const char *tmp;
>> +
>> +    readstr(prevaddr, prev);
>> +    for (offset = fdt_first_property_offset(fdt, offset);
>> +         (offset >= 0);
>> +         (offset = fdt_next_property_offset(fdt, offset))) {
>> +
>> +        prop = fdt_get_property_by_offset(fdt, offset, &namelen);
> 
> fdt_getprop_by_offset() also returns the property's name without
> having to dick around with fdt_get_string() manually.


btw I looked quickly:

const void *fdt_getprop_by_offset(const void *fdt, int offset,
                   const char **namep, int *lenp)
{
     const struct fdt_property *prop;

     prop = fdt_get_property_by_offset_(fdt, offset, lenp);
     if (!prop)
         return NULL;
     if (namep) {
         const char *name;
         int namelen;
         name = fdt_get_string(fdt, fdt32_ld(&prop->nameoff),
                       &namelen);
         if (!name) {
             if (lenp)
                 *lenp = namelen;
             return NULL;
         }
         *namep = name;
     }

     /* Handle realignment */
     if (fdt_version(fdt) < 0x10 && (offset + sizeof(*prop)) % 8 &&
         fdt32_ld(&prop->len) >= 8)
         return prop->data + 4;
     return prop->data;
}


and incorrectly assumed *lenp is the name length because of "*lenp = namelen". Huh :)


-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06 12:34       ` Alexey Kardashevskiy
@ 2020-01-06 17:09         ` Cédric Le Goater
  2020-01-06 17:25           ` Peter Maydell
  2020-01-07  5:26         ` David Gibson
  1 sibling, 1 reply; 22+ messages in thread
From: Cédric Le Goater @ 2020-01-06 17:09 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson; +Cc: qemu-ppc, qemu-devel

>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>> index e62c89b3dd40..1c97534a0aea 100644
>>>>> --- a/hw/ppc/spapr.c
>>>>> +++ b/hw/ppc/spapr.c
>>>>> @@ -896,6 +896,51 @@ out:
>>>>>       return ret;
>>>>>   }
>>>>>   +/*
>>>>> + * Below is a compiled version of RTAS blob and OF client interface entry point.
>>>>> + *
>>>>> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
>>>>> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
>>>>> + *
>>>>> + *   .globl  _start
>>>>> + *   _start:
>>>>> + *           mr      4,3
>>>>> + *           lis     3,KVMPPC_H_RTAS@h
>>>>> + *           ori     3,3,KVMPPC_H_RTAS@l
>>>>> + *           sc      1
>>>>> + *           blr
>>>>> + *           mr      4,3
>>>>> + *           lis     3,KVMPPC_H_CLIENT@h
>>>>> + *           ori     3,3,KVMPPC_H_CLIENT@l
>>>>> + *           sc      1
>>>>> + *           blr
>>>>> + */
>>>>> +static struct {
>>>>> +    uint8_t rtas[20], client[20];
>>>>> +} QEMU_PACKED rtas_client_blob = {
>>>>> +    .rtas = {
>>>>> +        0x7c, 0x64, 0x1b, 0x78,
>>>>> +        0x3c, 0x60, 0x00, 0x00,
>>>>> +        0x60, 0x63, 0xf0, 0x00,
>>>>> +        0x44, 0x00, 0x00, 0x22,
>>>>> +        0x4e, 0x80, 0x00, 0x20
>>>>> +    },
>>>>> +    .client = {
>>>>> +        0x7c, 0x64, 0x1b, 0x78,
>>>>> +        0x3c, 0x60, 0x00, 0x00,
>>>>> +        0x60, 0x63, 0xf0, 0x05,
>>>>> +        0x44, 0x00, 0x00, 0x22,
>>>>> +        0x4e, 0x80, 0x00, 0x20
>>>>> +    }
>>>>> +};
>>>>
>>>> I'd really prefer to read this in from a file (or files) which we
>>>> assemble (as with the existing spapr-rtas.img), rather than having a
>>>> magic array in qemu.
>>>
>>> Two considerations here:
>>> 1. This blob is not going to change (at least often)
>>
>> True, but I'd still prefer to build it from a .S rather than have the
>> binary written out.  We already had this until recently (when we moved
>> the RTAS image into SLOF), and it's really not too hard to manage.
> 
> 
> But what is exactly the benefit? It is not going to change. I can look into making it a .S, compile it to a binary an include this binary into qemu-system-ppc64 as a byte array, just need to refresh my gcc/ldd memories but a separate file is too much for this imho.

ARM bootloaders are also embedded in QEMU's code. See hw/arm/boot.c.
You could improve a bit the definition though.

C.

> 
>>> 2. A new file which needs to be packaged/signed/copied.
>>
>> Eh, again, we had this for RTAS until very recently and it really
>> wasn't that much hassle.
> 
> 
> I keep secure VMs in mind. And SLOF is a separate package, why was not RTAS?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06 17:09         ` Cédric Le Goater
@ 2020-01-06 17:25           ` Peter Maydell
  2020-01-06 18:56             ` Philippe Mathieu-Daudé
  2020-01-07  1:44             ` Alexey Kardashevskiy
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Maydell @ 2020-01-06 17:25 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alexey Kardashevskiy, qemu-ppc, QEMU Developers, David Gibson

On Mon, 6 Jan 2020 at 17:09, Cédric Le Goater <clg@kaod.org> wrote:
> ARM bootloaders are also embedded in QEMU's code. See hw/arm/boot.c.
> You could improve a bit the definition though.

Given an opportunity to restart from scratch I don't know
that I'd do things the way hw/arm/boot.c does. The initial
idea was really simple and straightforward: 3 or 4 insns
which just set some registers and jumped to the kernel.
Fast-forward a decade or two, and the complexity has
significantly increased as we added extra tweaks to deal
with SMP systems, the GIC interrupt controller, boards
which need to do some extra odd stuff, CPUs which start
in Secure state, 64-bit CPUs, mangling the DTB, booting
multiple flavours of image file format, implementing
various 'firmware' functionality and APIs, and so on and on.
Insisting from the start that QEMU emulates what bare metal
hardware does, and doesn't get into the business of faking
up the behaviour of firmware would have been a neater
separation of concerns in the long run.

To the narrower concern, yeah, on the arm side we
just embed hand-assembled hex values in the C file;
this is mostly to avoid needing a cross-compiler setup at
QEMU build time, but it also lets us hand-patch the
binary blob at runtime to fill in addresses and so on.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06 17:25           ` Peter Maydell
@ 2020-01-06 18:56             ` Philippe Mathieu-Daudé
  2020-01-07  1:44             ` Alexey Kardashevskiy
  1 sibling, 0 replies; 22+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-01-06 18:56 UTC (permalink / raw)
  To: Peter Maydell, Cédric Le Goater, Alexey Kardashevskiy
  Cc: Alex Bennée, qemu-ppc, QEMU Developers, David Gibson

On 1/6/20 6:25 PM, Peter Maydell wrote:
> On Mon, 6 Jan 2020 at 17:09, Cédric Le Goater <clg@kaod.org> wrote:
>> ARM bootloaders are also embedded in QEMU's code. See hw/arm/boot.c.
>> You could improve a bit the definition though.
> 
> Given an opportunity to restart from scratch I don't know
> that I'd do things the way hw/arm/boot.c does. The initial
> idea was really simple and straightforward: 3 or 4 insns
> which just set some registers and jumped to the kernel.
> Fast-forward a decade or two, and the complexity has
> significantly increased as we added extra tweaks to deal
> with SMP systems, the GIC interrupt controller, boards
> which need to do some extra odd stuff, CPUs which start
> in Secure state, 64-bit CPUs, mangling the DTB, booting
> multiple flavours of image file format, implementing
> various 'firmware' functionality and APIs, and so on and on.

:S

> Insisting from the start that QEMU emulates what bare metal
> hardware does, and doesn't get into the business of faking
> up the behaviour of firmware would have been a neater
> separation of concerns in the long run.
> 
> To the narrower concern, yeah, on the arm side we
> just embed hand-assembled hex values in the C file;
> this is mostly to avoid needing a cross-compiler setup at
> QEMU build time, but it also lets us hand-patch the
> binary blob at runtime to fill in addresses and so on.

This was also before we provide a handy docker image with cross compiler 
for about all architectures supported.

Maybe the cross-build-guest-tests target from tests/tcg/Makefile.qemu 
can be reused to build spapr-rtas.S?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06 13:39   ` Alexey Kardashevskiy
@ 2020-01-06 23:56     ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2020-01-06 23:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]

On Tue, Jan 07, 2020 at 12:39:42AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 06/01/2020 15:19, David Gibson wrote:
> 
> > > +    const struct fdt_property *prop;
> > > +    const char *tmp;
> > > +
> > > +    readstr(prevaddr, prev);
> > > +    for (offset = fdt_first_property_offset(fdt, offset);
> > > +         (offset >= 0);
> > > +         (offset = fdt_next_property_offset(fdt, offset))) {
> > > +
> > > +        prop = fdt_get_property_by_offset(fdt, offset, &namelen);
> > 
> > fdt_getprop_by_offset() also returns the property's name without
> > having to dick around with fdt_get_string() manually.
> 
> 
> btw I looked quickly:
> 
> const void *fdt_getprop_by_offset(const void *fdt, int offset,
>                   const char **namep, int *lenp)
> {
>     const struct fdt_property *prop;
> 
>     prop = fdt_get_property_by_offset_(fdt, offset, lenp);
>     if (!prop)
>         return NULL;
>     if (namep) {
>         const char *name;
>         int namelen;
>         name = fdt_get_string(fdt, fdt32_ld(&prop->nameoff),
>                       &namelen);
>         if (!name) {
>             if (lenp)
>                 *lenp = namelen;
>             return NULL;
>         }
>         *namep = name;
>     }
> 
>     /* Handle realignment */
>     if (fdt_version(fdt) < 0x10 && (offset + sizeof(*prop)) % 8 &&
>         fdt32_ld(&prop->len) >= 8)
>         return prop->data + 4;
>     return prop->data;
> }
> 
> 
> and incorrectly assumed *lenp is the name length because of "*lenp =
> namelen". Huh :)

Ah, yeah, that is a bit confusing, but I don't see a way to improve
it.  That's the error case, in which case both *lenp and namelen are
carrying an error code, not their usual value.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06 17:25           ` Peter Maydell
  2020-01-06 18:56             ` Philippe Mathieu-Daudé
@ 2020-01-07  1:44             ` Alexey Kardashevskiy
  1 sibling, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-07  1:44 UTC (permalink / raw)
  To: Peter Maydell, Cédric Le Goater
  Cc: qemu-ppc, QEMU Developers, David Gibson



On 07/01/2020 04:25, Peter Maydell wrote:
> On Mon, 6 Jan 2020 at 17:09, Cédric Le Goater <clg@kaod.org> wrote:
>> ARM bootloaders are also embedded in QEMU's code. See hw/arm/boot.c.
>> You could improve a bit the definition though.
> 
> Given an opportunity to restart from scratch I don't know
> that I'd do things the way hw/arm/boot.c does. The initial
> idea was really simple and straightforward: 3 or 4 insns
> which just set some registers and jumped to the kernel.

I believe this is the case here - the previous blob like this (RTAS) has
never changed. In fact it was compiled so rarely that when I did compile
it, I found that it does not compile big-endian as it must :) Having a
blob like this solves this too.


-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06  4:19 ` David Gibson
  2020-01-06  6:28   ` Alexey Kardashevskiy
  2020-01-06 13:39   ` Alexey Kardashevskiy
@ 2020-01-07  4:44   ` Alexey Kardashevskiy
  2020-01-07  5:54     ` David Gibson
  2 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-07  4:44 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 06/01/2020 15:19, David Gibson wrote:
>> +
>> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
>> +                                       uint32_t buf, uint32_t len)
>> +{
>> +    char tmp[256];
> 
> Fixed sized buffers are icky.  You could either dynamically allocate
> this based on the size the client gives, or you could use
> memory_region_get_ram_ptr() to read the data from the tree directly
> into guest memory.

@len comes from the guest, I am really not comfortable with allocating
whatever (broken) guest requested. And if I limit @len by 1024 or
similar, then a fixed size buffer will do too, no?

btw how exactly can I use memory_region_get_ram_ptr()?
get_system_memory() returns a root MR which is not RAM, RAM is a
"spapr.ram" sub-MR.



-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-06 12:34       ` Alexey Kardashevskiy
  2020-01-06 17:09         ` Cédric Le Goater
@ 2020-01-07  5:26         ` David Gibson
  2020-01-08  4:07           ` Alexey Kardashevskiy
  1 sibling, 1 reply; 22+ messages in thread
From: David Gibson @ 2020-01-07  5:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 47388 bytes --]

On Mon, Jan 06, 2020 at 11:34:13PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 06/01/2020 19:50, David Gibson wrote:
> > On Mon, Jan 06, 2020 at 05:28:55PM +1100, Alexey Kardashevskiy wrote:
> > > 
> > > 
> > > On 06/01/2020 15:19, David Gibson wrote:
> > > > On Mon, Jan 06, 2020 at 10:42:42AM +1100, Alexey Kardashevskiy wrote:
> > > > > The Petitboot bootloader is way more advanced than SLOF is ever going to
> > > > > be as Petitboot comes with the full-featured Linux kernel with all
> > > > > the drivers, and initramdisk with quite user friendly interface.
> > > > > The problem with ditching SLOF is that an unmodified pseries kernel can
> > > > > either start via:
> > > > > 1. kexec, this requires presence of RTAS and skips
> > > > > ibm,client-architecture-support entirely;
> > > > > 2. normal boot, this heavily relies on the OF1275 client interface to
> > > > > fetch the device tree and do early setup (claim memory).
> > > > > 
> > > > > This adds a new bios-less mode to the pseries machine: "bios=on|off".
> > > > > When enabled, QEMU does not load SLOF and jumps to the kernel from
> > > > > "-kernel".
> > > > > 
> > > > > The client interface is implemented exactly as RTAS - a 20 bytes blob,
> > > > > right next after the RTAS blob. The entry point is passed to the kernel
> > > > > via GPR5.
> > > > > 
> > > > > This implements a handful of client interface methods just to get going.
> > > > > In particular, this implements the device tree fetching,
> > > > > ibm,client-architecture-support and instantiate-rtas.
> > > > > 
> > > > > This implements changing FDT properties only for "linux,rtas-base" and
> > > > > "linux,rtas-entry", just to get early boot going.
> > > > > 
> > > > > This assigns "phandles" to device tree nodes as there is no more SLOF
> > > > > and OF nodes addresses of which served as phandle values.
> > > > > This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
> > > > > phandles are regenerated at every FDT rebuild.
> > > > > 
> > > > > When bios=off, this adds "/chosen" every time QEMU builds a tree.
> > > > > 
> > > > > This implements "claim" which the client (Linux) uses for memory
> > > > > allocation; this is also  used by QEMU for claiming kernel/initrd images,
> > > > > client interface entry point, RTAS and the initial stack.
> > > > > 
> > > > > While at this, add a "kernel-addr" machine parameter to allow moving
> > > > > the kernel in memory. This is useful for debugging if the kernel is
> > > > > loaded at @0, although not necessary.
> > > > > 
> > > > > This does not implement instances properly but in order to boot a VM,
> > > > > we only really need a stdout instance and the "/" instance for
> > > > > "call-method", we fake these.
> > > > 
> > > > As we've discussed, I really like the idea of this.  It think the
> > > > basic approach looks good too.
> > > > 
> > > > As you probably realize, there are quite a lot of things to be
> > > > polished though.  Comments below.
> > > > 
> > > > > 
> > > > > The test command line:
> > > > > 
> > > > > qemu-system-ppc64 \
> > > > > -nodefaults \
> > > > > -chardev stdio,id=STDIO0,signal=off,mux=on \
> > > > > -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> > > > > -mon id=MON0,chardev=STDIO0,mode=readline \
> > > > > -nographic \
> > > > > -vga none \
> > > > > -kernel vmldbg \
> > > > > -machine pseries,bios=off \
> > > > > -m 4G \
> > > > > -enable-kvm \
> > > > > -initrd pb/rootfs.cpio.xz \
> > > > > img/u1804-64le.qcow2 \
> > > > > -snapshot \
> > > > > -smp 8,threads=8 \
> > > > > -L /home/aik/t/qemu-ppc64-bios/ \
> > > > > -trace events=qemu_trace_events \
> > > > > -d guest_errors \
> > > > > -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh37874 \
> > > > > -mon chardev=SOCKET0,mode=control
> > > > > 
> > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > ---
> > > > > Changes:
> > > > > v2:
> > > > > * fixed claim()
> > > > > * added "setprop"
> > > > > * cleaner client interface and RTAS blobs management
> > > > > * boots to petitboot and further to the target system
> > > > > * more trace points
> > > > > ---
> > > > >   hw/ppc/Makefile.objs   |   1 +
> > > > >   include/hw/loader.h    |   1 +
> > > > >   include/hw/ppc/spapr.h |  25 ++-
> > > > >   hw/ppc/spapr.c         | 231 ++++++++++++++++++--
> > > > >   hw/ppc/spapr_client.c  | 464 +++++++++++++++++++++++++++++++++++++++++
> > > > >   hw/ppc/spapr_hcall.c   |  49 +++--
> > > > >   hw/ppc/trace-events    |   9 +
> > > > >   7 files changed, 743 insertions(+), 37 deletions(-)
> > > > >   create mode 100644 hw/ppc/spapr_client.c
> > > > > 
> > > > > diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> > > > > index 101e9fc59185..ce31c0acd2fb 100644
> > > > > --- a/hw/ppc/Makefile.objs
> > > > > +++ b/hw/ppc/Makefile.objs
> > > > > @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
> > > > >   obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
> > > > >   obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
> > > > >   obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
> > > > > +obj-$(CONFIG_PSERIES) += spapr_client.o
> > > > 
> > > > This applies in a bunch of places here.  Just calling things "client"
> > > > is clear enough if you're already thinking about Open Firmware.  But
> > > > this appears in a bunch of places where you might come across it
> > > > without that context, in which case it could be kind of confusing.  So
> > > > I'd suggest using "of_client" or "of_client_interface" in most places
> > > > you're using "client".
> > > > 
> > > > >   obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
> > > > >   # IBM PowerNV
> > > > >   obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> > > > > diff --git a/include/hw/loader.h b/include/hw/loader.h
> > > > > index 48a96cd55904..a2f58077a47e 100644
> > > > > --- a/include/hw/loader.h
> > > > > +++ b/include/hw/loader.h
> > > > > @@ -262,6 +262,7 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
> > > > >   int rom_add_elf_program(const char *name, GMappedFile *mapped_file, void *data,
> > > > >                           size_t datasize, size_t romsize, hwaddr addr,
> > > > >                           AddressSpace *as);
> > > > > +bool rom_intersect(uint64_t start, uint64_t size);
> > > > 
> > > > I don't see this called, so I'm guessing it's a stray chunk leftover.
> > > > 
> > > > >   int rom_check_and_register_reset(void);
> > > > >   void rom_set_fw(FWCfgState *f);
> > > > >   void rom_set_order_override(int order);
> > > > > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > > > > index 61f005c6f686..876879d12029 100644
> > > > > --- a/include/hw/ppc/spapr.h
> > > > > +++ b/include/hw/ppc/spapr.h
> > > > > @@ -105,6 +105,11 @@ struct SpaprCapabilities {
> > > > >       uint8_t caps[SPAPR_CAP_NUM];
> > > > >   };
> > > > > +typedef struct {
> > > > > +    uint64_t start;
> > > > > +    uint64_t size;
> > > > > +} spapr_of_claimed;
> > > > 
> > > > Please use the qemu convention of StudlyCaps for types.
> > > > 
> > > > >   /**
> > > > >    * SpaprMachineClass:
> > > > >    */
> > > > > @@ -158,8 +163,13 @@ struct SpaprMachineState {
> > > > >       uint32_t fdt_size;
> > > > >       uint32_t fdt_initial_size;
> > > > >       void *fdt_blob;
> > > > > +    uint32_t rtas_base;
> > > > >       long kernel_size;
> > > > >       bool kernel_le;
> > > > > +    uint64_t kernel_addr;
> > > > 
> > > > This would intersect with your other patch to allow changing the
> > > > kernel load address, yes?
> > > 
> > > 
> > > This includes kernel_addr, it is a single patch now. More about this below.
> > 
> > Hm, ok.
> > 
> > > > > +    bool bios_enabled;
> > > > > +    GArray *claimed; /* array of spapr_of_claimed */
> > > > > +    uint64_t claimed_base;
> > > > 
> > > > I'd suggest moving all the client interface related fields to a
> > > > visually separated chunk in the structure definition.
> > > > 
> > > > >       uint32_t initrd_base;
> > > > >       long initrd_size;
> > > > >       uint64_t rtc_offset; /* Now used only during incoming migration */
> > > > > @@ -510,7 +520,8 @@ struct SpaprMachineState {
> > > > >   /* Client Architecture support */
> > > > >   #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
> > > > >   #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
> > > > > -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
> > > > > +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
> > > > > +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
> > > > >   /*
> > > > >    * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
> > > > > @@ -538,6 +549,12 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
> > > > >   target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
> > > > >                                target_ulong *args);
> > > > > +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> > > > > +                                            SpaprMachineState *spapr,
> > > > > +                                            target_ulong addr,
> > > > > +                                            target_ulong fdt_buf,
> > > > > +                                            target_ulong fdt_bufsize);
> > > > > +
> > > > >   /* Virtual Processor Area structure constants */
> > > > >   #define VPA_MIN_SIZE           640
> > > > >   #define VPA_SIZE_OFFSET        0x4
> > > > > @@ -769,6 +786,9 @@ struct SpaprEventLogEntry {
> > > > >   void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
> > > > >   void spapr_events_init(SpaprMachineState *sm);
> > > > >   void spapr_dt_events(SpaprMachineState *sm, void *fdt);
> > > > > +uint64_t spapr_do_client_claim(SpaprMachineState *sm, uint64_t virt,
> > > > > +                               uint64_t size, uint64_t align);
> > > > > +int spapr_h_client(SpaprMachineState *sm, target_ulong client_args);
> > > > >   void close_htab_fd(SpaprMachineState *spapr);
> > > > >   void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
> > > > >   void spapr_free_hpt(SpaprMachineState *spapr);
> > > > > @@ -891,4 +911,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
> > > > >   #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
> > > > >   void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
> > > > > +
> > > > > +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
> > > > > +
> > > > >   #endif /* HW_SPAPR_H */
> > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > > > index e62c89b3dd40..1c97534a0aea 100644
> > > > > --- a/hw/ppc/spapr.c
> > > > > +++ b/hw/ppc/spapr.c
> > > > > @@ -896,6 +896,51 @@ out:
> > > > >       return ret;
> > > > >   }
> > > > > +/*
> > > > > + * Below is a compiled version of RTAS blob and OF client interface entry point.
> > > > > + *
> > > > > + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
> > > > > + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
> > > > > + *
> > > > > + *   .globl  _start
> > > > > + *   _start:
> > > > > + *           mr      4,3
> > > > > + *           lis     3,KVMPPC_H_RTAS@h
> > > > > + *           ori     3,3,KVMPPC_H_RTAS@l
> > > > > + *           sc      1
> > > > > + *           blr
> > > > > + *           mr      4,3
> > > > > + *           lis     3,KVMPPC_H_CLIENT@h
> > > > > + *           ori     3,3,KVMPPC_H_CLIENT@l
> > > > > + *           sc      1
> > > > > + *           blr
> > > > > + */
> > > > > +static struct {
> > > > > +    uint8_t rtas[20], client[20];
> > > > > +} QEMU_PACKED rtas_client_blob = {
> > > > > +    .rtas = {
> > > > > +        0x7c, 0x64, 0x1b, 0x78,
> > > > > +        0x3c, 0x60, 0x00, 0x00,
> > > > > +        0x60, 0x63, 0xf0, 0x00,
> > > > > +        0x44, 0x00, 0x00, 0x22,
> > > > > +        0x4e, 0x80, 0x00, 0x20
> > > > > +    },
> > > > > +    .client = {
> > > > > +        0x7c, 0x64, 0x1b, 0x78,
> > > > > +        0x3c, 0x60, 0x00, 0x00,
> > > > > +        0x60, 0x63, 0xf0, 0x05,
> > > > > +        0x44, 0x00, 0x00, 0x22,
> > > > > +        0x4e, 0x80, 0x00, 0x20
> > > > > +    }
> > > > > +};
> > > > 
> > > > I'd really prefer to read this in from a file (or files) which we
> > > > assemble (as with the existing spapr-rtas.img), rather than having a
> > > > magic array in qemu.
> > > 
> > > Two considerations here:
> > > 1. This blob is not going to change (at least often)
> > 
> > True, but I'd still prefer to build it from a .S rather than have the
> > binary written out.  We already had this until recently (when we moved
> > the RTAS image into SLOF), and it's really not too hard to manage.
> 
> 
> But what is exactly the benefit? It is not going to change. I can look into
> making it a .S, compile it to a binary an include this binary into
> qemu-system-ppc64 as a byte array, just need to refresh my gcc/ldd memories
> but a separate file is too much for this imho.

Hmm, you have a point.  I'll rethink this based on what you and Peter
have said.

> > > 2. A new file which needs to be packaged/signed/copied.
> > 
> > Eh, again, we had this for RTAS until very recently and it really
> > wasn't that much hassle.
> 
> I keep secure VMs in mind. And SLOF is a separate package, why was
> not RTAS?

I'm not sure why secure VMs are relevant here.  RTAS was a built as a
separate image, but not a separate source package, simply because it
is tiny and even more closely coupled to qemu.

> > > > > +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
> > > > > +{
> > > > > +    spapr->rtas_base = base;
> > > > > +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
> > > > > +                              sizeof(rtas_client_blob.rtas));
> > > > > +}
> > > > > +
> > > > >   static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > > > >   {
> > > > >       MachineState *ms = MACHINE(spapr);
> > > > > @@ -980,6 +1025,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > > > >       _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
> > > > >                        lrdr_capacity, sizeof(lrdr_capacity)));
> > > > > +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-base", spapr->rtas_base));
> > > > > +    _FDT(fdt_setprop_cell(fdt, rtas, "linux,rtas-entry", spapr->rtas_base));
> > > > 
> > > > IIUC, we shouldn't need to add these in bios=off mode, yes?  Instead
> > > > the client must instantiate-rtas before attempting to use it.
> > > 
> > > 
> > > Well. Here is what happens:
> > > 
> > > 1. spapr machine reset creates FDT, lets say without rtas-entry
> > > 2. linux calls ibm,client-interface-support
> > 
> > I assume you mean ibm,client-architecture-support?
> 
> Yes.
> 
> > 
> > > 3. the QEMU handler rebuilds the FDT
> > 
> > Oh.. not directly relevant here, but that might be an issue.  Do we
> > need to ensure that phandles are stable across the CAS call?
> 
> We already have predefined values for phandles we care about.

Right that we care about on the qemu side.  But my concern is that a
client could poke around the device tree, retrieving some phandles,
then call CAS, then attempt to examine those nodes again using the
phandles it retrieved beforehand.

I think we're going to have to not support that, because keeping the
phandles consistent across CAS would be really fiddly in most cases
and truly impossible in some cases, I'm pretty sure.

> > > 4. linux calls "instantiate-rtas"
> > > 5. linux stores "linux,rtas-base" and "linux,rtas-entry" in the FDT
> > > using "setprop" client call
> > > 6. linux fetches the tree by traversing it and packs it into FDT blob itself
> > > 7. linux calls quiesce
> > > 8. linux uses "linux,rtas-entry" from FDT to call RTAS.
> > > 
> > > Because in 5) I only allow in-place property update and only for
> > > "linux,rtas-base" and "linux,rtas-entry", these do not make to the tree
> > > and there is no "linux,rtas-entry" in the FDT.
> > > 
> > > There are choices:
> > > 
> > > 1. I can extend "setprop" (tricky as FDT is flattened by then?)
> > 
> > I actually don't think this should be too hard - though maybe I've
> > missed something, so I'd say this is my frontrunning option.
> 
> fdt_splice_() fails because of lack of space in the FDT (you pointed to it
> below). Not calling fdt_pack() fixed it as it just keeps FDT blob rather
> big. Not sure it is worth further optimizations at this point.

Probably not.  Maybe we could fdt_pack() at quiesce time, at which
point we shouldn't have any further dt changes.

> > > 2. simply store rtas base in spapr->rtas_base and do complete FDT rebuild
> > 
> > I don't think that's viable - phandles definitely shouldn't change
> > across an instantiate-rtas call.
> > 
> > > 3. (currently) I basically have 2 placeholders for these
> > > "linux,rtas-base" and "linux,rtas-entry".
> > 
> > Hm, ok.
> > 
> > > Or we can avoid flattening FDT at all for bios=off (as nobody wants
> > > flattened device tree, my "fetch-fdt" never made it to upstream Linux),
> > > then 1) is easy (if it is a problem to begin with).
> > 
> > Right, but we need a live DT library to make it happen, which means
> > someone finding time to write it.
> 
> I dropped fdt_pack() and change properties via fdt_setprop() - works
> just fine.

Great.

> > > > > +    _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
> > > > > +                          sizeof(rtas_client_blob.rtas)));
> > > > > +    _FDT(fdt_add_mem_rsv(fdt, spapr->rtas_base,
> > > > > +                         sizeof(rtas_client_blob.rtas)));
> > > > 
> > > > Nor should we need the mem_rsv with bios=off.
> > > 
> > > True. If not the "setprop" limitation from above, I would only need
> > > "rtas-size" here.
> > > 
> > > 
> > > > 
> > > > > +
> > > > >       spapr_dt_rtas_tokens(fdt, rtas);
> > > > >   }
> > > > > @@ -1057,7 +1109,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
> > > > >       }
> > > > >       if (spapr->kernel_size) {
> > > > > -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
> > > > > +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
> > > > >                                 cpu_to_be64(spapr->kernel_size) };
> > > > >           _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
> > > > > @@ -1104,6 +1156,9 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
> > > > >           _FDT(fdt_setprop_cell(fdt, chosen, "linux,pci-probe-only", 0));
> > > > >       }
> > > > > +    if (!spapr->bios_enabled) {
> > > > > +        _FDT(fdt_setprop_cell(fdt, chosen, "cpu", 0));
> > > > 
> > > > What is this for?
> > > 
> > > Looks like a leftover, I did this for prom_find_boot_cpu() but after
> > > double checking and trying it looks like it can proceed without this chunk.
> > 
> > Ok.
> > 
> > > > 
> > > > > +    }
> > > > >       spapr_dt_ov5_platform_support(spapr, fdt, chosen);
> > > > >       g_free(stdout_path);
> > > > > @@ -1133,6 +1188,28 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
> > > > >       }
> > > > >   }
> > > > > +static void spapr_fdt_add_phandles(void *fdt, int parent, uint32_t *phandle)
> > > > > +{
> > > > > +    int proplen = 0, node;
> > > > > +    const void *prop;
> > > > > +    char tmp[256] = "BAAD";
> > > > > +
> > > > > +    fdt_get_path(fdt, parent, tmp, sizeof(tmp));
> > > > 
> > > > You don't actually appear to use tmp.  Is this a debugging leftover?
> > > 
> > > Yes it is.
> > > 
> > > 
> > > > 
> > > > > +    prop = fdt_getprop_namelen(fdt, parent, "phandle", 7, &proplen);
> > > > > +    if (!prop) {
> > > > > +        _FDT(fdt_setprop_cell(fdt, parent, "phandle", *phandle));
> > > > > +        while (1) {
> > > > > +            ++*phandle;
> > > > > +            if (fdt_node_offset_by_phandle(fdt, *phandle) < 0) {
> > > > > +                break;
> > > > > +            }
> > > > > +        }
> > > > > +    }
> > > > > +    fdt_for_each_subnode(node, fdt, parent) {
> > > > > +        spapr_fdt_add_phandles(fdt, node, phandle);
> > > > > +    }
> > > > 
> > > > Oof, this is pretty inefficient, since fdt_node_offset_by_phandle()
> > > > will have to scan the whole tree, and you do it on every node.
> > > 
> > > 
> > > Inefficient indeed but for a VM with ~30 devices it is not that bad :)
> > > 
> > > 
> > > > Obviously, a "live" tree structure rather than fdt would be better
> > > > suited for this.  I'd like to move to that at some point, but it will
> > > > be a while before we're ready.
> > > 
> > > I could have a cache/hashmap of phandles...
> > 
> > We could, but I doubt it's worth the trouble.  Keeping the offsets in
> > there up to date against setprops or other changes could actually a
> > fair bit of hassle, I suspect.
> > 
> > > > I guess this is fine for a PoC, but I think we can do a bit better
> > > > without too much extra complexity:
> > > >    * Rather than scanning for an existing node with the handle on each
> > > >      iteration, you can use fdt_find_max_phandle() once beforehand,
> > > >      and start your counter from there, being assured that it won't hit
> > > >      any existing phandles
> > > 
> > > I'd like to avoid that as NPU/GPU phandles are quite huge and we will
> > > have huge gaps between phandles numbers which is not so much of a big
> > > deal but annoying to debug.
> > 
> > Hrm.  So huge we'll actually be short of space above them?
> 
> 
> No, but "randomness" is annoying.
> 
> 
> > > >    * The phandles are arbitrary, so you don't actually care about the
> > > >      tree structure here.  So instead of recursively descending the
> > > >      tree, you can just alter each node in the order they appear in the
> > > >      blob.  fdt_next_node() will let you do that, simply by ignoring
> > > >      the 'depth' parameter.  [In fact the structure ensures that will
> > > >      be a depth first traversal of the tree, just like you have here,
> > > >      not that it matters].
> > > 
> > > ... but this one is better.
> > > 
> > > Unlike many others, fdt_next_node() does not have a comment in
> > > dtc/libfdt/libfdt.h so I was not sure about what *depth is for and I
> > > basically ignored fdt_next_node().
> > 
> > Yeah, it's kind of subtle - mostly intended as an internal primitive -
> > but it suits your needs here.  It moves to the next node in the
> > structure block - regardless of whether it's a sibling, child or
> > sibling-of-ancestor.  *depth, if a pointer is supplied, is updated to
> > reflect which case it is.
> > 
> > > >    * You should check for overflows of your phandle counter, though
> > > >      it's fine to treat that as a fatal error.
> > > 
> > > Ok.
> > > 
> > > 
> > > > > +}
> > > > > +
> > > > >   void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> > > > >   {
> > > > >       MachineState *machine = MACHINE(spapr);
> > > > > @@ -1245,7 +1322,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> > > > >       /* Build memory reserve map */
> > > > >       if (reset) {
> > > > >           if (spapr->kernel_size) {
> > > > > -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
> > > > > +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
> > > > > +                                  spapr->kernel_size)));
> > > > >           }
> > > > >           if (spapr->initrd_size) {
> > > > >               _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
> > > > > @@ -1268,12 +1346,19 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> > > > >           }
> > > > >       }
> > > > > +    if (!spapr->bios_enabled) {
> > > > > +        uint32_t phandle = 1;
> > > > > +
> > > > > +        spapr_fdt_add_phandles(fdt, fdt_path_offset(fdt, "/"), &phandle);
> > > > > +    }
> > > > > +
> > > > >       return fdt;
> > > > >   }
> > > > >   static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
> > > > >   {
> > > > > -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
> > > > > +    SpaprMachineState *spapr = opaque;
> > > > > +    return (addr & 0x0fffffff) + spapr->kernel_addr;
> > > > >   }
> > > > >   static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
> > > > > @@ -1660,6 +1745,60 @@ static void spapr_machine_reset(MachineState *machine)
> > > > >        */
> > > > >       fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
> > > > > +    /* Set up the entry state */
> > > > > +    if (!spapr->bios_enabled) {
> > > > > +        if (spapr->claimed) {
> > > > > +            g_array_unref(spapr->claimed);
> > > > > +        }
> > > > > +        spapr->claimed = g_array_new(false, false, sizeof(spapr_of_claimed));
> > > > > +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
> > > > > +
> > > > > +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr, fdt_addr);
> > > > > +
> > > > > +        if (spapr_do_client_claim(spapr, spapr->kernel_addr,
> > > > > +                                  spapr->kernel_size, 0) == -1) {
> > > > > +            error_report("Memory for kernel is in use");
> > > > > +            exit(1);
> > > > > +        }
> > > > > +        if (spapr_do_client_claim(spapr, spapr->initrd_base,
> > > > > +                                  spapr->initrd_size, 0) == -1) {
> > > > > +            error_report("Memory for initramdisk is in use");
> > > > > +            exit(1);
> > > > > +        }
> > > > > +        first_ppc_cpu->env.gpr[1] = spapr_do_client_claim(spapr, 0, 0x40000,
> > > > > +                                                          0x10000);
> > > > > +        if (first_ppc_cpu->env.gpr[1] == -1) {
> > > > > +            error_report("Memory for stack is in use");
> > > > > +            exit(1);
> > > > > +        }
> > > > > +
> > > > > +        first_ppc_cpu->env.gpr[5] =
> > > > > +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
> > > > > +                                  sizeof(rtas_client_blob.client));
> > > > > +        if (first_ppc_cpu->env.gpr[5] == -1) {
> > > > > +            error_report("Memory for OF client is in use");
> > > > > +            exit(1);
> > > > > +        }
> > > > > +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
> > > > > +                                  rtas_client_blob.client,
> > > > > +                                  sizeof(rtas_client_blob.client));
> > > > > +
> > > > > +        /* Allocate space for RTAS here so spapr_build_fdt() below picks it */
> > > > > +        spapr->rtas_base =
> > > > > +            spapr_do_client_claim(spapr, 0, sizeof(rtas_client_blob.rtas),
> > > > > +                                  0x100);
> > > > > +        if (spapr->rtas_base == -1) {
> > > > > +            error_report("Memory for RTAS is in use");
> > > > > +            exit(1);
> > > > > +        }
> > > > > +        cpu_physical_memory_write(spapr->rtas_base,
> > > > > +                                  rtas_client_blob.rtas,
> > > > > +                                  sizeof(rtas_client_blob.rtas));
> > > > 
> > > > The above is complex enough it might be worth having bios and nobios
> > > > variants of spapr_cpu_set_entry_state().
> > > 
> > > Probably yes, when it settles. I'd rather prefer having kernel/initrd
> > > claiming code where we load them but these are Roms and when we "load"
> > > them, they are not loaded, they are loaded from a machine reset handler but:
> > > 
> > > may be we could add a hook to Roms to be called every time an image is
> > > actually written to the guest memory and
> > > 
> > > this is when I would "claim" these chunks and
> > > 
> > > this would make the chunk above shorter and we won't need a separate
> > > spapr_cpu_set_entry_state().
> > 
> > Hm, ok.  Well let's see how it turns out.
> > 
> > Fwiw, I have considered replacing those load_targphys() things with
> > explicit write in at reset time.  load_targphys() is, IIUC, mostly
> > used for loading things like ROMs which exist in physical memory but
> > don't go into regular RAM.  So if it turns out not to really match our
> > needs here, that's ok.
> 
> 
> Agree, may be later. I want to keep this a single patch.
> 
> 
> 
> > > > > +    } else {
> > > > > +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> > > > > +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
> > > > > +    }
> > > > > +
> > > > >       fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
> > > > >       rc = fdt_pack(fdt);
> > > > > @@ -1675,10 +1814,14 @@ static void spapr_machine_reset(MachineState *machine)
> > > > >       spapr->fdt_initial_size = spapr->fdt_size;
> > > > >       spapr->fdt_blob = fdt;
> > > > > -    /* Set up the entry state */
> > > > > -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> > > > > -    first_ppc_cpu->env.gpr[5] = 0;
> > > > > -
> > > > > +    if (!spapr->bios_enabled) {
> > > > > +        /*
> > > > > +         * Claim FDT space so initramdisk/zImage are not unpacked over it.
> > > > > +         * In practice Linux claim additional memory for FDT and fetches it
> > > > > +         * using client interface's "getprop" call.
> > > > > +         */
> > > > 
> > > > IIUC, it shouldn't actually be necessary to put the fdt into guest
> > > > memory in nobios mode - it will be retrieved piecemeal via the client
> > > > interface instead, won't it?
> > > 
> > > Huh. I was under the impression that we pass the FDT to a VM via GPR3
> > > and it is safer to avoid "claim" from claiming this memory (even though
> > > Linux is not using it).
> > 
> > Well, depends what you mean.  I mean that's the "fdt boot" convention,
> > which you've used previously for the nobios boot with an aware guest
> > image.  We use a similar convention for firing up SLOF itself.  But if
> > we're starting with a (simulated) OF entry condition, then as long as
> > we have a working client interface, we don't need an fdt blob anywhere
> > guest visible.
> > 
> > > Now I realize that this GPR3 is QEMU<->SLOF convention (or some OFW
> > > binding - what is it exactly?) and Linux rather expects initramdisk
> > > start/size in GPR3/4 and this did not hit me just because GPR4 is always
> > > 0 so Linux falls back to the device tree for initramdisk location.
> > > 
> > > 
> > > 
> > > 
> > > > 
> > > > > +        spapr_do_client_claim(spapr, fdt_addr, fdt_totalsize(fdt), 0);
> > > > > +    }
> > > > >       spapr->cas_reboot = false;
> > > > >   }
> > > > > @@ -2897,12 +3040,12 @@ static void spapr_machine_init(MachineState *machine)
> > > > >           uint64_t lowaddr = 0;
> > > > >           spapr->kernel_size = load_elf(kernel_filename, NULL,
> > > > > -                                      translate_kernel_address, NULL,
> > > > > +                                      translate_kernel_address, spapr,
> > > > >                                         NULL, &lowaddr, NULL, 1,
> > > > >                                         PPC_ELF_MACHINE, 0, 0);
> > > > >           if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
> > > > >               spapr->kernel_size = load_elf(kernel_filename, NULL,
> > > > > -                                          translate_kernel_address, NULL, NULL,
> > > > > +                                          translate_kernel_address, spapr, NULL,
> > > > >                                             &lowaddr, NULL, 0, PPC_ELF_MACHINE,
> > > > >                                             0, 0);
> > > > >               spapr->kernel_le = spapr->kernel_size > 0;
> > > > > @@ -2918,7 +3061,7 @@ static void spapr_machine_init(MachineState *machine)
> > > > >               /* Try to locate the initrd in the gap between the kernel
> > > > >                * and the firmware. Add a bit of space just in case
> > > > >                */
> > > > > -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
> > > > > +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
> > > > >                                     + 0x1ffff) & ~0xffff;
> > > > >               spapr->initrd_size = load_image_targphys(initrd_filename,
> > > > >                                                        spapr->initrd_base,
> > > > > @@ -2932,20 +3075,22 @@ static void spapr_machine_init(MachineState *machine)
> > > > >           }
> > > > >       }
> > > > > -    if (bios_name == NULL) {
> > > > > -        bios_name = FW_FILE_NAME;
> > > > > +    if (spapr->bios_enabled) {
> > > > > +        if (bios_name == NULL) {
> > > > > +            bios_name = FW_FILE_NAME;
> > > > > +        }
> > > > > +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> > > > > +        if (!filename) {
> > > > > +            error_report("Could not find LPAR firmware '%s'", bios_name);
> > > > > +            exit(1);
> > > > > +        }
> > > > > +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> > > > > +        if (fw_size <= 0) {
> > > > > +            error_report("Could not load LPAR firmware '%s'", filename);
> > > > > +            exit(1);
> > > > > +        }
> > > > > +        g_free(filename);
> > > > >       }
> > > > > -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> > > > > -    if (!filename) {
> > > > > -        error_report("Could not find LPAR firmware '%s'", bios_name);
> > > > > -        exit(1);
> > > > > -    }
> > > > > -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> > > > > -    if (fw_size <= 0) {
> > > > > -        error_report("Could not load LPAR firmware '%s'", filename);
> > > > > -        exit(1);
> > > > > -    }
> > > > > -    g_free(filename);
> > > > >       /* FIXME: Should register things through the MachineState's qdev
> > > > >        * interface, this is a legacy from the sPAPREnvironment structure
> > > > > @@ -3162,6 +3307,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
> > > > >       visit_type_uint32(v, name, (uint32_t *)opaque, errp);
> > > > >   }
> > > > > +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
> > > > > +                                  void *opaque, Error **errp)
> > > > > +{
> > > > > +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> > > > > +}
> > > > > +
> > > > > +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
> > > > > +                                  void *opaque, Error **errp)
> > > > > +{
> > > > > +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> > > > > +}
> > > > 
> > > > Might be clearer to split the adjustable kernel addr stuff out into a
> > > > separate patch - in fact, didn't you have such a patch before?
> > > 
> > > 
> > > I did but it was not useful on its own and the whole "bios=off" feature
> > > does not need to be "series" at this stage. Also having it separate
> > > (before or after "kill slof") means changing same lines in 2 patches of
> > > the same patchset which is never really good.
> > 
> > Hm, ok.  What's the reason you require the changeable load location
> > for nobios?
> 
> 
> I do not require it. It is purely for debugging. You can basically set
> address to 0, run qemu with -s and do step debugging with debug symbols from
> vmlinux (well, keep switching endianness), otherwise you would need to
> calculate offset (0x400000 or whatever it is) for every breakpoint you set.
> Once I passed the prom_init() debugging stage (which did not work because I
> did not provide stack), I have not used it for last couple of days.
> 
> So I could ditch it now but we could use it to reduce (small but still) memory fragmentation by placing vmlinux at 0. Dunno.

Ok.

> > > > > +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
> > > > > +{
> > > > > +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> > > > > +
> > > > > +    return spapr->bios_enabled;
> > > > > +}
> > > > > +
> > > > > +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
> > > > > +{
> > > > > +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> > > > > +
> > > > > +    spapr->bios_enabled = value;
> > > > > +}
> > > > > +
> > > > >   static char *spapr_get_ic_mode(Object *obj, Error **errp)
> > > > >   {
> > > > >       SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> > > > > @@ -3267,6 +3438,20 @@ static void spapr_instance_init(Object *obj)
> > > > >       object_property_add_bool(obj, "vfio-no-msix-emulation",
> > > > >                                spapr_get_msix_emulation, NULL, NULL);
> > > > > +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
> > > > > +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
> > > > > +                        &error_abort);
> > > > > +    object_property_set_description(obj, "kernel-addr",
> > > > > +                                    stringify(KERNEL_LOAD_ADDR)
> > > > > +                                    " for -kernel is the default",
> > > > > +                                    NULL);
> > > > > +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
> > > > > +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
> > > > > +                            spapr_set_bios_enabled, NULL);
> > > > > +    object_property_set_description(obj, "bios", "Conrols whether to load bios",
> > > > > +                                    NULL);
> > > > > +    spapr->bios_enabled = true;
> > > > > +
> > > > >       /* The machine class defines the default interrupt controller mode */
> > > > >       spapr->irq = smc->irq;
> > > > >       object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
> > > > > diff --git a/hw/ppc/spapr_client.c b/hw/ppc/spapr_client.c
> > > > > new file mode 100644
> > > > > index 000000000000..a318eaae3c40
> > > > > --- /dev/null
> > > > > +++ b/hw/ppc/spapr_client.c
> > > > > @@ -0,0 +1,464 @@
> > > > > +#include "qemu/osdep.h"
> > > > > +#include "qemu-common.h"
> > > > > +#include "qapi/error.h"
> > > > > +#include "exec/memory.h"
> > > > > +#include "hw/ppc/spapr.h"
> > > > > +#include "hw/ppc/spapr_vio.h"
> > > > > +#include "trace.h"
> > > > > +
> > > > > +struct prom_args {
> > > > 
> > > > StudlyCaps and typedef, please..
> > > > 
> > > > > +        __be32 service;
> > > > > +        __be32 nargs;
> > > > > +        __be32 nret;
> > > > > +        __be32 args[10];
> > > 
> > > 
> > > btw we do not generally have __be32 (and x86 build fails), I am
> > > replacing these with uint32_t.
> > 
> > Ok, good.
> > 
> > > > > +};
> > > > > +
> > > > > +#define CLI_PH_MASK     0x0FFFFFFF
> > > > > +#define CLI_INST_PREFIX 0x20000000
> > > > > +
> > > > > +#define readstr(pa, buf) cpu_physical_memory_read((pa), (buf), sizeof(buf))
> > > > 
> > > > I'd prefer "readbuf" or something, since this isn't actually looking
> > > > at any string structure.
> > > > 
> > > > > +#define readuint32(pa) ({ \
> > > > > +    uint32_t __tmp__; \
> > > > > +    cpu_physical_memory_read((pa), &__tmp__, sizeof(__tmp__)); \
> > > > > +    be32_to_cpu(__tmp__); })
> > > > 
> > > > You can simplify this with ldl_be_phys().  In fact, then you probably
> > > > don't need any wrapper at all.
> > > 
> > > Ah right. There are too many of these, was quicker to type this :)
> > > 
> > > 
> > > > > +
> > > > > +static bool _cmpservice(const char *s, size_t len,
> > > > > +                        unsigned nargs, unsigned nret,
> > > > > +                        const char *s1, size_t len1,
> > > > > +                        unsigned nargscheck, unsigned nretcheck)
> > > > > +{
> > > > > +    if (strncmp(s, s1, MAX(len, len1))) {
> > > > > +        return false;
> > > > > +    }
> > > > > +    if (nargscheck == 0 && nretcheck == 0) {
> > > > > +        return true;
> > > > > +    }
> > > > > +    if (nargs != nargscheck || nret != nretcheck) {
> > > > > +        trace_spapr_client_error_param(s, nargscheck, nretcheck, nargs, nret);
> > > > > +        return false;
> > > > > +    }
> > > > > +
> > > > > +    return true;
> > > > > +}
> > > > > +
> > > > > +static uint32_t client_finddevice(const void *fdt, uint32_t nodeaddr)
> > > > > +{
> > > > > +    char node[256];
> > > > > +    int ret;
> > > > > +
> > > > > +    readstr(nodeaddr, node);
> > > > > +    ret = fdt_path_offset(fdt, node);
> > > > > +    if (ret >= 0) {
> > > > > +        ret = fdt_get_phandle(fdt, ret);
> > > > > +    }
> > > > > +
> > > > > +    return (uint32_t) ret;
> > > > > +}
> > > > > +
> > > > > +static uint32_t client_getprop(const void *fdt, uint32_t nodeph, uint32_t pname,
> > > > > +            uint32_t valaddr, uint32_t vallen)
> > > > > +{
> > > > > +    char propname[64];
> > > > > +    uint32_t ret = 0;
> > > > > +    int proplen = 0;
> > > > > +    const void *prop;
> > > > > +
> > > > > +    readstr(pname, propname);
> > > > > +
> > > > > +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> > > > > +                               propname, strlen(propname), &proplen);
> > > > 
> > > > You can just use fdt_getprop().  By definition it's equivalent to
> > > > fdt_getprop_namelen(.., propname, strlen(propname), ...)
> > > > 
> > > > > +    if (prop) {
> > > > > +        int cb = MIN(proplen, vallen);
> > > > > +
> > > > > +        cpu_physical_memory_write(valaddr, prop, cb);
> > > > > +        ret = cb;
> > > > > +    } else if (strncmp(propname, "stdout", 6) == 0 && vallen == sizeof(ret)) {
> > > > > +        ret = cpu_to_be32(1);
> > > > > +        cpu_physical_memory_write(valaddr, &ret, MIN(vallen, sizeof(ret)));
> > > > 
> > > > I'm guessing the special casing of stdout is to fake the OF instancing
> > > > stuff, which I don't really understand.  Some comments to explain
> > > > what's going on here would be good.
> > > 
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/prom_init.c#n2219
> > > 
> > > This thing wants ihandle which is stored as /chosen/stdout and which we
> > > do not have as there are no instances.
> > > 
> > > If this property is not there, prom_panic() happens.
> > > 
> > > I could implement some primitive intances, I'll look into this.
> > 
> > Ok, that's roughly as I guessed.  This seems like a reasonable way of
> > handling this for now without having to implement a notion of
> > instances - but it would be good to have some comment saying what's
> > going on here.
> > 
> > We might need to implement at least basic awareness of instances
> > if/when we need to be a bit more flexible about what input/output
> > devices we use for the console.
> > 
> > > > > +    } else {
> > > > > +        ret = -1;
> > > > > +    }
> > > > > +    trace_spapr_client_getprop(nodeph, propname, ret);
> > > > > +
> > > > > +    return ret;
> > > > > +}
> > > > > +
> > > > > +static uint32_t client_setprop(SpaprMachineState *sm,
> > > > > +                               uint32_t nodeph, uint32_t pname,
> > > > > +                               uint32_t valaddr, uint32_t vallen)
> > > > > +{
> > > > > +    char propname[64];
> > > > > +    uint32_t ret = -1;
> > > > > +    int proplen = 0;
> > > > > +    const void *prop;
> > > > > +
> > > > > +    readstr(pname, propname);
> > > > > +    if (vallen == sizeof(uint32_t) &&
> > > > > +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
> > > > > +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
> > > > > +
> > > > > +        sm->rtas_base = readuint32(valaddr);
> > > > > +        prop = fdt_getprop_namelen(sm->fdt_blob,
> > > > > +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
> > > > > +                                                              nodeph),
> > > > > +                                   propname, strlen(propname), &proplen);
> > > > > +        if (proplen == vallen) {
> > > > > +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
> > > > > +            ret = proplen;
> > > > > +        }
> > > > 
> > > > Is there a particular reason to restrict this to the rtas properties,
> > > > rather than just allowing the guest to fdt_setprop() something
> > > > arbitrary?
> > > 
> > > The FDT is flatten and I am not quite sure if libfdt can handle updating
> > > properties if the length has changed.
> > 
> > fdt_setprop() will handle updating properties with changed length (in
> > fact there's a special fdt_setprop_inplace() optimized for the case
> > where you don't need that).  It's not particularly efficient, but it
> > should work fine for the cases we have here.  In fact, I think you're
> > already relying on this for the code that adds the phandles to
> > everything.
> 
> Well, I used to add phandles before calling fdt_pack() so it is not exactly the same.

Ah, right, that's why adding the phandles worked.

> > One complication is that it can return FDT_ERR_NOSPACE if there isn't
> > enough buffer for the increased thing.  We could either trap that,
> > resize and retry, or we could leave a bunch of extra space.  The
> > latter would be basically equivalent to not doing fdt_pack() on the
> > blob in the nobios case.
> 
> 
> This is what I ended up doing.
> 
> 
> > > Also, more importantly, potentially property changes like this may have
> > > to be reflected in the QEMU device tree so I allowed only the properties
> > > which I know how to deal with.
> > 
> > That's a reasonable concern, but the nice thing about not having SLOF
> > is that there's only one copy of the device tree - the blob in qemu.
> > So a setprop() on that is automatically a setprop() everywhere (this
> > is another reason not to write the fdt into guest memory in the nobios
> > case - it will become stale as soon as the client changes anything).
> 
> 
> True to a degree. It is "setprop" to the current fdt blob which we do not
> analyze after we build the fdt. We either need to do parse the tree before
> we rebuild it as CAS so we do not lose the updates or do selective changes
> to the QEMUs objects from the "setprop" handler (this is what I do
> now).

Hrm.. do those setprops happen before CAS?  I would have thought we'd
call CAS before instantiating RTAS.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-07  4:44   ` Alexey Kardashevskiy
@ 2020-01-07  5:54     ` David Gibson
  2020-01-08  4:20       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2020-01-07  5:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1457 bytes --]

On Tue, Jan 07, 2020 at 03:44:35PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 06/01/2020 15:19, David Gibson wrote:
> >> +
> >> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
> >> +                                       uint32_t buf, uint32_t len)
> >> +{
> >> +    char tmp[256];
> > 
> > Fixed sized buffers are icky.  You could either dynamically allocate
> > this based on the size the client gives, or you could use
> > memory_region_get_ram_ptr() to read the data from the tree directly
> > into guest memory.
> 
> @len comes from the guest, I am really not comfortable with allocating
> whatever (broken) guest requested. And if I limit @len by 1024 or
> similar, then a fixed size buffer will do too, no?

I see your point.  Does this call have a way to report failure?  In
that case you could outright fail the call if it requests too long a
length.

> btw how exactly can I use memory_region_get_ram_ptr()?
> get_system_memory() returns a root MR which is not RAM, RAM is a
> "spapr.ram" sub-MR.

Right, but you know that RAM is always at offset 0 within that root
MR.  That said, it doesn't look like it's that easy to bounds check
that pointer, so maybe that's not a good idea after all.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-07  5:26         ` David Gibson
@ 2020-01-08  4:07           ` Alexey Kardashevskiy
  2020-01-09  4:07             ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-08  4:07 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 07/01/2020 16:26, David Gibson wrote:

>>>>>> +static uint32_t client_setprop(SpaprMachineState *sm,
>>>>>> +                               uint32_t nodeph, uint32_t pname,
>>>>>> +                               uint32_t valaddr, uint32_t vallen)
>>>>>> +{
>>>>>> +    char propname[64];
>>>>>> +    uint32_t ret = -1;
>>>>>> +    int proplen = 0;
>>>>>> +    const void *prop;
>>>>>> +
>>>>>> +    readstr(pname, propname);
>>>>>> +    if (vallen == sizeof(uint32_t) &&
>>>>>> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
>>>>>> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
>>>>>> +
>>>>>> +        sm->rtas_base = readuint32(valaddr);
>>>>>> +        prop = fdt_getprop_namelen(sm->fdt_blob,
>>>>>> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
>>>>>> +                                                              nodeph),
>>>>>> +                                   propname, strlen(propname), &proplen);
>>>>>> +        if (proplen == vallen) {
>>>>>> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
>>>>>> +            ret = proplen;
>>>>>> +        }
>>>>>
>>>>> Is there a particular reason to restrict this to the rtas properties,
>>>>> rather than just allowing the guest to fdt_setprop() something
>>>>> arbitrary?
>>>>
>>>> The FDT is flatten and I am not quite sure if libfdt can handle updating
>>>> properties if the length has changed.
>>>
>>> fdt_setprop() will handle updating properties with changed length (in
>>> fact there's a special fdt_setprop_inplace() optimized for the case
>>> where you don't need that).  It's not particularly efficient, but it
>>> should work fine for the cases we have here.  In fact, I think you're
>>> already relying on this for the code that adds the phandles to
>>> everything.
>>
>> Well, I used to add phandles before calling fdt_pack() so it is not exactly the same.
> 
> Ah, right, that's why adding the phandles worked.
> 
>>> One complication is that it can return FDT_ERR_NOSPACE if there isn't
>>> enough buffer for the increased thing.  We could either trap that,
>>> resize and retry, or we could leave a bunch of extra space.  The
>>> latter would be basically equivalent to not doing fdt_pack() on the
>>> blob in the nobios case.
>>
>>
>> This is what I ended up doing.
>>
>>
>>>> Also, more importantly, potentially property changes like this may have
>>>> to be reflected in the QEMU device tree so I allowed only the properties
>>>> which I know how to deal with.
>>>
>>> That's a reasonable concern, but the nice thing about not having SLOF
>>> is that there's only one copy of the device tree - the blob in qemu.
>>> So a setprop() on that is automatically a setprop() everywhere (this
>>> is another reason not to write the fdt into guest memory in the nobios
>>> case - it will become stale as soon as the client changes anything).
>>
>>
>> True to a degree. It is "setprop" to the current fdt blob which we do not
>> analyze after we build the fdt. We either need to do parse the tree before
>> we rebuild it as CAS so we do not lose the updates or do selective changes
>> to the QEMUs objects from the "setprop" handler (this is what I do
>> now).
> 
> Hrm.. do those setprops happen before CAS?

Yes, vmlinux/zimage call "setprop" for "linux,initrd-start",
"linux,initrd-end", "bootargs", "linux,stdout-path"; vmlinux sets
properties if linux,initrd-* came from r3/r4 and zImage sets properties
no matter how it discovered them - from r3/r4 or the device tree.

btw we write them as "cells" (==4bytes long) in qemu but vminux changes
them to 8 bytes and zImage keeps it 4 bytes. Not a problem but an
interesting fact, this is why I had to allow extending the properties in
"setprop" :)


>  I would have thought we'd
> call CAS before instantiating RTAS.

This is correct but I do not think the order is mandatory.


-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-07  5:54     ` David Gibson
@ 2020-01-08  4:20       ` Alexey Kardashevskiy
  2020-01-08  5:53         ` Alexey Kardashevskiy
  2020-01-09  4:18         ` David Gibson
  0 siblings, 2 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-08  4:20 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 07/01/2020 16:54, David Gibson wrote:
> On Tue, Jan 07, 2020 at 03:44:35PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 06/01/2020 15:19, David Gibson wrote:
>>>> +
>>>> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
>>>> +                                       uint32_t buf, uint32_t len)
>>>> +{
>>>> +    char tmp[256];
>>>
>>> Fixed sized buffers are icky.  You could either dynamically allocate
>>> this based on the size the client gives, or you could use
>>> memory_region_get_ram_ptr() to read the data from the tree directly
>>> into guest memory.
>>
>> @len comes from the guest, I am really not comfortable with allocating
>> whatever (broken) guest requested. And if I limit @len by 1024 or
>> similar, then a fixed size buffer will do too, no?
> 
> I see your point.  Does this call have a way to report failure?  In
> that case you could outright fail the call if it requests too long a
> length.

It returns length which can be 0 to signal an error.

but with this particular method the bigger problem is that I cannot know
in advance the actual path length from fdt_get_path(). I could double
the size until fdt_get_path() succeeded, just seems overkill here.

Property names seem to be limited by 32:

OF1275:
===
nextprop
IN:phandle, [string] previous, [address] buf
OUT:  flag

Copies the name of the property following previous in the property list
of the device node identified by phandle into buf, as a null-terminated
string. Buf is the address of a 32-byte region of memory. If previous is
zero or a pointer to a null string, copies the name of the device node’s
first property.
===


>> btw how exactly can I use memory_region_get_ram_ptr()?
>> get_system_memory() returns a root MR which is not RAM, RAM is a
>> "spapr.ram" sub-MR.
> 
> Right, but you know that RAM is always at offset 0 within that root
> MR. 

Well, it could potentially be more than just one level down in the MR
tree, for example we could add NUMA MRs and place actual RAM MR under these.


> That said, it doesn't look like it's that easy to bounds check
> that pointer, so maybe that's not a good idea after all.

ok.


-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-08  4:20       ` Alexey Kardashevskiy
@ 2020-01-08  5:53         ` Alexey Kardashevskiy
  2020-01-09  4:25           ` David Gibson
  2020-01-09  4:18         ` David Gibson
  1 sibling, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-08  5:53 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 08/01/2020 15:20, Alexey Kardashevskiy wrote:
> 
> 
> On 07/01/2020 16:54, David Gibson wrote:
>> On Tue, Jan 07, 2020 at 03:44:35PM +1100, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 06/01/2020 15:19, David Gibson wrote:
>>>>> +
>>>>> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
>>>>> +                                       uint32_t buf, uint32_t len)
>>>>> +{
>>>>> +    char tmp[256];
>>>>
>>>> Fixed sized buffers are icky.  You could either dynamically allocate
>>>> this based on the size the client gives, or you could use
>>>> memory_region_get_ram_ptr() to read the data from the tree directly
>>>> into guest memory.
>>>
>>> @len comes from the guest, I am really not comfortable with allocating
>>> whatever (broken) guest requested. And if I limit @len by 1024 or
>>> similar, then a fixed size buffer will do too, no?
>>
>> I see your point.  Does this call have a way to report failure?  In
>> that case you could outright fail the call if it requests too long a
>> length.
> 
> It returns length which can be 0 to signal an error.
> 
> but with this particular method the bigger problem is that I cannot know
> in advance the actual path length from fdt_get_path(). I could double
> the size until fdt_get_path() succeeded, just seems overkill here.
> 
> Property names seem to be limited by 32:


>>> len("ibm,query-interrupt-source-number")
33

Awesome. Oh well :(


> 
> OF1275:
> ===
> nextprop
> IN:phandle, [string] previous, [address] buf
> OUT:  flag
> 
> Copies the name of the property following previous in the property list
> of the device node identified by phandle into buf, as a null-terminated
> string. Buf is the address of a 32-byte region of memory. If previous is
> zero or a pointer to a null string, copies the name of the device node’s
> first property.
> ===
> 
> 
>>> btw how exactly can I use memory_region_get_ram_ptr()?
>>> get_system_memory() returns a root MR which is not RAM, RAM is a
>>> "spapr.ram" sub-MR.
>>
>> Right, but you know that RAM is always at offset 0 within that root
>> MR. 
> 
> Well, it could potentially be more than just one level down in the MR
> tree, for example we could add NUMA MRs and place actual RAM MR under these.
> 
> 
>> That said, it doesn't look like it's that easy to bounds check
>> that pointer, so maybe that's not a good idea after all.
> 
> ok.
> 
> 

-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-08  4:07           ` Alexey Kardashevskiy
@ 2020-01-09  4:07             ` David Gibson
  2020-01-09  6:31               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2020-01-09  4:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4836 bytes --]

On Wed, Jan 08, 2020 at 03:07:41PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 07/01/2020 16:26, David Gibson wrote:
> 
> >>>>>> +static uint32_t client_setprop(SpaprMachineState *sm,
> >>>>>> +                               uint32_t nodeph, uint32_t pname,
> >>>>>> +                               uint32_t valaddr, uint32_t vallen)
> >>>>>> +{
> >>>>>> +    char propname[64];
> >>>>>> +    uint32_t ret = -1;
> >>>>>> +    int proplen = 0;
> >>>>>> +    const void *prop;
> >>>>>> +
> >>>>>> +    readstr(pname, propname);
> >>>>>> +    if (vallen == sizeof(uint32_t) &&
> >>>>>> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
> >>>>>> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
> >>>>>> +
> >>>>>> +        sm->rtas_base = readuint32(valaddr);
> >>>>>> +        prop = fdt_getprop_namelen(sm->fdt_blob,
> >>>>>> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
> >>>>>> +                                                              nodeph),
> >>>>>> +                                   propname, strlen(propname), &proplen);
> >>>>>> +        if (proplen == vallen) {
> >>>>>> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
> >>>>>> +            ret = proplen;
> >>>>>> +        }
> >>>>>
> >>>>> Is there a particular reason to restrict this to the rtas properties,
> >>>>> rather than just allowing the guest to fdt_setprop() something
> >>>>> arbitrary?
> >>>>
> >>>> The FDT is flatten and I am not quite sure if libfdt can handle updating
> >>>> properties if the length has changed.
> >>>
> >>> fdt_setprop() will handle updating properties with changed length (in
> >>> fact there's a special fdt_setprop_inplace() optimized for the case
> >>> where you don't need that).  It's not particularly efficient, but it
> >>> should work fine for the cases we have here.  In fact, I think you're
> >>> already relying on this for the code that adds the phandles to
> >>> everything.
> >>
> >> Well, I used to add phandles before calling fdt_pack() so it is not exactly the same.
> > 
> > Ah, right, that's why adding the phandles worked.
> > 
> >>> One complication is that it can return FDT_ERR_NOSPACE if there isn't
> >>> enough buffer for the increased thing.  We could either trap that,
> >>> resize and retry, or we could leave a bunch of extra space.  The
> >>> latter would be basically equivalent to not doing fdt_pack() on the
> >>> blob in the nobios case.
> >>
> >>
> >> This is what I ended up doing.
> >>
> >>
> >>>> Also, more importantly, potentially property changes like this may have
> >>>> to be reflected in the QEMU device tree so I allowed only the properties
> >>>> which I know how to deal with.
> >>>
> >>> That's a reasonable concern, but the nice thing about not having SLOF
> >>> is that there's only one copy of the device tree - the blob in qemu.
> >>> So a setprop() on that is automatically a setprop() everywhere (this
> >>> is another reason not to write the fdt into guest memory in the nobios
> >>> case - it will become stale as soon as the client changes anything).
> >>
> >>
> >> True to a degree. It is "setprop" to the current fdt blob which we do not
> >> analyze after we build the fdt. We either need to do parse the tree before
> >> we rebuild it as CAS so we do not lose the updates or do selective changes
> >> to the QEMUs objects from the "setprop" handler (this is what I do
> >> now).
> > 
> > Hrm.. do those setprops happen before CAS?
> 
> Yes, vmlinux/zimage call "setprop" for "linux,initrd-start",
> "linux,initrd-end", "bootargs", "linux,stdout-path"; vmlinux sets
> properties if linux,initrd-* came from r3/r4 and zImage sets properties
> no matter how it discovered them - from r3/r4 or the device tree.

Ok, and those setprops happen before CAS?

In a sense this is kind of a fundamental problem with rebuilding the
whole DT at CAS time.  Except that strictly speaking it's a problem
even without that: we just get away with it by accident because CAS
isn't likely to change the same things that guest setprops do.

It's still basically unsynchronized mutations by two parties to a
shared data structure.

> btw we write them as "cells" (==4bytes long) in qemu but vminux changes
> them to 8 bytes and zImage keeps it 4 bytes. Not a problem but an
> interesting fact, this is why I had to allow extending the properties in
> "setprop" :)
> 
> 
> >  I would have thought we'd
> > call CAS before instantiating RTAS.
> 
> This is correct but I do not think the order is mandatory.

Hm, right.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-08  4:20       ` Alexey Kardashevskiy
  2020-01-08  5:53         ` Alexey Kardashevskiy
@ 2020-01-09  4:18         ` David Gibson
  1 sibling, 0 replies; 22+ messages in thread
From: David Gibson @ 2020-01-09  4:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3654 bytes --]

On Wed, Jan 08, 2020 at 03:20:22PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 07/01/2020 16:54, David Gibson wrote:
> > On Tue, Jan 07, 2020 at 03:44:35PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 06/01/2020 15:19, David Gibson wrote:
> >>>> +
> >>>> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
> >>>> +                                       uint32_t buf, uint32_t len)
> >>>> +{
> >>>> +    char tmp[256];
> >>>
> >>> Fixed sized buffers are icky.  You could either dynamically allocate
> >>> this based on the size the client gives, or you could use
> >>> memory_region_get_ram_ptr() to read the data from the tree directly
> >>> into guest memory.
> >>
> >> @len comes from the guest, I am really not comfortable with allocating
> >> whatever (broken) guest requested. And if I limit @len by 1024 or
> >> similar, then a fixed size buffer will do too, no?
> > 
> > I see your point.  Does this call have a way to report failure?  In
> > that case you could outright fail the call if it requests too long a
> > length.
> 
> It returns length which can be 0 to signal an error.
> 
> but with this particular method the bigger problem is that I cannot know
> in advance the actual path length from fdt_get_path(). I could double
> the size until fdt_get_path() succeeded, just seems overkill here.

fdt_get_path() will return -FDT_ERR_NOSPACE if the path doesn't fit in
the provided buffer.  I think that's enough to fail in the relevant
cases.  You could then use a buffer of size min(client provided size,
fixed max size).

I believe I've thought of trying to implement something that returns
what the path length would be without constructing it, but it turns
out to be essentially impossible given the fdt format and the fact
that we don't allocate inside libfdt (basically we have to use the
partially constructed path buf as state information to know how to
proceed with our scan).

> Property names seem to be limited by 32:
> 
> OF1275:
> ===
> nextprop
> IN:phandle, [string] previous, [address] buf
> OUT:  flag
> 
> Copies the name of the property following previous in the property list
> of the device node identified by phandle into buf, as a null-terminated
> string. Buf is the address of a 32-byte region of memory. If previous is
> zero or a pointer to a null string, copies the name of the device node’s
> first property.
> ===

Yeah... IEEE1275 says that, but I don't think most OF implementations
enforce it.  I'm also pretty sure that limit is broken by device trees
in the wild (I remember investigating this when implementing dtc &
libfdt).

> >> btw how exactly can I use memory_region_get_ram_ptr()?
> >> get_system_memory() returns a root MR which is not RAM, RAM is a
> >> "spapr.ram" sub-MR.
> > 
> > Right, but you know that RAM is always at offset 0 within that root
> > MR. 
> 
> Well, it could potentially be more than just one level down in the MR
> tree, for example we could add NUMA MRs and place actual RAM MR
> under these.

Oh.. yeah, sorry, I think you'd need address_space_translate() or
something like it to locate the right MR first.  Which come to think
of it, validates / truncates the length to, so..

> > That said, it doesn't look like it's that easy to bounds check
> > that pointer, so maybe that's not a good idea after all.
> 
> ok.

.. I think that would work after all.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-08  5:53         ` Alexey Kardashevskiy
@ 2020-01-09  4:25           ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2020-01-09  4:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1880 bytes --]

On Wed, Jan 08, 2020 at 04:53:06PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 08/01/2020 15:20, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 07/01/2020 16:54, David Gibson wrote:
> >> On Tue, Jan 07, 2020 at 03:44:35PM +1100, Alexey Kardashevskiy wrote:
> >>>
> >>>
> >>> On 06/01/2020 15:19, David Gibson wrote:
> >>>>> +
> >>>>> +static uint32_t client_package_to_path(const void *fdt, uint32_t phandle,
> >>>>> +                                       uint32_t buf, uint32_t len)
> >>>>> +{
> >>>>> +    char tmp[256];
> >>>>
> >>>> Fixed sized buffers are icky.  You could either dynamically allocate
> >>>> this based on the size the client gives, or you could use
> >>>> memory_region_get_ram_ptr() to read the data from the tree directly
> >>>> into guest memory.
> >>>
> >>> @len comes from the guest, I am really not comfortable with allocating
> >>> whatever (broken) guest requested. And if I limit @len by 1024 or
> >>> similar, then a fixed size buffer will do too, no?
> >>
> >> I see your point.  Does this call have a way to report failure?  In
> >> that case you could outright fail the call if it requests too long a
> >> length.
> > 
> > It returns length which can be 0 to signal an error.
> > 
> > but with this particular method the bigger problem is that I cannot know
> > in advance the actual path length from fdt_get_path(). I could double
> > the size until fdt_get_path() succeeded, just seems overkill here.
> > 
> > Property names seem to be limited by 32:
> 
> 
> >>> len("ibm,query-interrupt-source-number")
> 33
> 
> Awesome. Oh well :(

Yeah, as I suspected.  Also 'ibm,associativity-reference-points'.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-09  4:07             ` David Gibson
@ 2020-01-09  6:31               ` Alexey Kardashevskiy
  2020-01-13  3:32                 ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2020-01-09  6:31 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 09/01/2020 15:07, David Gibson wrote:
> On Wed, Jan 08, 2020 at 03:07:41PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 07/01/2020 16:26, David Gibson wrote:
>>
>>>>>>>> +static uint32_t client_setprop(SpaprMachineState *sm,
>>>>>>>> +                               uint32_t nodeph, uint32_t pname,
>>>>>>>> +                               uint32_t valaddr, uint32_t vallen)
>>>>>>>> +{
>>>>>>>> +    char propname[64];
>>>>>>>> +    uint32_t ret = -1;
>>>>>>>> +    int proplen = 0;
>>>>>>>> +    const void *prop;
>>>>>>>> +
>>>>>>>> +    readstr(pname, propname);
>>>>>>>> +    if (vallen == sizeof(uint32_t) &&
>>>>>>>> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
>>>>>>>> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
>>>>>>>> +
>>>>>>>> +        sm->rtas_base = readuint32(valaddr);
>>>>>>>> +        prop = fdt_getprop_namelen(sm->fdt_blob,
>>>>>>>> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
>>>>>>>> +                                                              nodeph),
>>>>>>>> +                                   propname, strlen(propname), &proplen);
>>>>>>>> +        if (proplen == vallen) {
>>>>>>>> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
>>>>>>>> +            ret = proplen;
>>>>>>>> +        }
>>>>>>>
>>>>>>> Is there a particular reason to restrict this to the rtas properties,
>>>>>>> rather than just allowing the guest to fdt_setprop() something
>>>>>>> arbitrary?
>>>>>>
>>>>>> The FDT is flatten and I am not quite sure if libfdt can handle updating
>>>>>> properties if the length has changed.
>>>>>
>>>>> fdt_setprop() will handle updating properties with changed length (in
>>>>> fact there's a special fdt_setprop_inplace() optimized for the case
>>>>> where you don't need that).  It's not particularly efficient, but it
>>>>> should work fine for the cases we have here.  In fact, I think you're
>>>>> already relying on this for the code that adds the phandles to
>>>>> everything.
>>>>
>>>> Well, I used to add phandles before calling fdt_pack() so it is not exactly the same.
>>>
>>> Ah, right, that's why adding the phandles worked.
>>>
>>>>> One complication is that it can return FDT_ERR_NOSPACE if there isn't
>>>>> enough buffer for the increased thing.  We could either trap that,
>>>>> resize and retry, or we could leave a bunch of extra space.  The
>>>>> latter would be basically equivalent to not doing fdt_pack() on the
>>>>> blob in the nobios case.
>>>>
>>>>
>>>> This is what I ended up doing.
>>>>
>>>>
>>>>>> Also, more importantly, potentially property changes like this may have
>>>>>> to be reflected in the QEMU device tree so I allowed only the properties
>>>>>> which I know how to deal with.
>>>>>
>>>>> That's a reasonable concern, but the nice thing about not having SLOF
>>>>> is that there's only one copy of the device tree - the blob in qemu.
>>>>> So a setprop() on that is automatically a setprop() everywhere (this
>>>>> is another reason not to write the fdt into guest memory in the nobios
>>>>> case - it will become stale as soon as the client changes anything).
>>>>
>>>>
>>>> True to a degree. It is "setprop" to the current fdt blob which we do not
>>>> analyze after we build the fdt. We either need to do parse the tree before
>>>> we rebuild it as CAS so we do not lose the updates or do selective changes
>>>> to the QEMUs objects from the "setprop" handler (this is what I do
>>>> now).
>>>
>>> Hrm.. do those setprops happen before CAS?
>>
>> Yes, vmlinux/zimage call "setprop" for "linux,initrd-start",
>> "linux,initrd-end", "bootargs", "linux,stdout-path"; vmlinux sets
>> properties if linux,initrd-* came from r3/r4 and zImage sets properties
>> no matter how it discovered them - from r3/r4 or the device tree.
> 
> Ok, and those setprops happen before CAS?

Yes.

> In a sense this is kind of a fundamental problem with rebuilding the
> whole DT at CAS time.  Except that strictly speaking it's a problem
> even without that: we just get away with it by accident because CAS
> isn't likely to change the same things that guest setprops do.

> It's still basically unsynchronized mutations by two parties to a
> shared data structure.

True... We may end up not having FDT at all and reuse QOM objects for
that, can even use hashes of QObject pointers as phandles :)



>> btw we write them as "cells" (==4bytes long) in qemu but vminux changes
>> them to 8 bytes and zImage keeps it 4 bytes. Not a problem but an
>> interesting fact, this is why I had to allow extending the properties in
>> "setprop" :)
>>
>>
>>>  I would have thought we'd
>>> call CAS before instantiating RTAS.
>>
>> This is correct but I do not think the order is mandatory.
> 
> Hm, right.
> 

-- 
Alexey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH qemu v2] spapr: Kill SLOF
  2020-01-09  6:31               ` Alexey Kardashevskiy
@ 2020-01-13  3:32                 ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2020-01-13  3:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5112 bytes --]

On Thu, Jan 09, 2020 at 05:31:24PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 09/01/2020 15:07, David Gibson wrote:
> > On Wed, Jan 08, 2020 at 03:07:41PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 07/01/2020 16:26, David Gibson wrote:
> >>
> >>>>>>>> +static uint32_t client_setprop(SpaprMachineState *sm,
> >>>>>>>> +                               uint32_t nodeph, uint32_t pname,
> >>>>>>>> +                               uint32_t valaddr, uint32_t vallen)
> >>>>>>>> +{
> >>>>>>>> +    char propname[64];
> >>>>>>>> +    uint32_t ret = -1;
> >>>>>>>> +    int proplen = 0;
> >>>>>>>> +    const void *prop;
> >>>>>>>> +
> >>>>>>>> +    readstr(pname, propname);
> >>>>>>>> +    if (vallen == sizeof(uint32_t) &&
> >>>>>>>> +        ((strncmp(propname, "linux,rtas-base", 15) == 0) ||
> >>>>>>>> +         (strncmp(propname, "linux,rtas-entry", 16) == 0))) {
> >>>>>>>> +
> >>>>>>>> +        sm->rtas_base = readuint32(valaddr);
> >>>>>>>> +        prop = fdt_getprop_namelen(sm->fdt_blob,
> >>>>>>>> +                                   fdt_node_offset_by_phandle(sm->fdt_blob,
> >>>>>>>> +                                                              nodeph),
> >>>>>>>> +                                   propname, strlen(propname), &proplen);
> >>>>>>>> +        if (proplen == vallen) {
> >>>>>>>> +            *(uint32_t *) prop = cpu_to_be32(sm->rtas_base);
> >>>>>>>> +            ret = proplen;
> >>>>>>>> +        }
> >>>>>>>
> >>>>>>> Is there a particular reason to restrict this to the rtas properties,
> >>>>>>> rather than just allowing the guest to fdt_setprop() something
> >>>>>>> arbitrary?
> >>>>>>
> >>>>>> The FDT is flatten and I am not quite sure if libfdt can handle updating
> >>>>>> properties if the length has changed.
> >>>>>
> >>>>> fdt_setprop() will handle updating properties with changed length (in
> >>>>> fact there's a special fdt_setprop_inplace() optimized for the case
> >>>>> where you don't need that).  It's not particularly efficient, but it
> >>>>> should work fine for the cases we have here.  In fact, I think you're
> >>>>> already relying on this for the code that adds the phandles to
> >>>>> everything.
> >>>>
> >>>> Well, I used to add phandles before calling fdt_pack() so it is not exactly the same.
> >>>
> >>> Ah, right, that's why adding the phandles worked.
> >>>
> >>>>> One complication is that it can return FDT_ERR_NOSPACE if there isn't
> >>>>> enough buffer for the increased thing.  We could either trap that,
> >>>>> resize and retry, or we could leave a bunch of extra space.  The
> >>>>> latter would be basically equivalent to not doing fdt_pack() on the
> >>>>> blob in the nobios case.
> >>>>
> >>>>
> >>>> This is what I ended up doing.
> >>>>
> >>>>
> >>>>>> Also, more importantly, potentially property changes like this may have
> >>>>>> to be reflected in the QEMU device tree so I allowed only the properties
> >>>>>> which I know how to deal with.
> >>>>>
> >>>>> That's a reasonable concern, but the nice thing about not having SLOF
> >>>>> is that there's only one copy of the device tree - the blob in qemu.
> >>>>> So a setprop() on that is automatically a setprop() everywhere (this
> >>>>> is another reason not to write the fdt into guest memory in the nobios
> >>>>> case - it will become stale as soon as the client changes anything).
> >>>>
> >>>>
> >>>> True to a degree. It is "setprop" to the current fdt blob which we do not
> >>>> analyze after we build the fdt. We either need to do parse the tree before
> >>>> we rebuild it as CAS so we do not lose the updates or do selective changes
> >>>> to the QEMUs objects from the "setprop" handler (this is what I do
> >>>> now).
> >>>
> >>> Hrm.. do those setprops happen before CAS?
> >>
> >> Yes, vmlinux/zimage call "setprop" for "linux,initrd-start",
> >> "linux,initrd-end", "bootargs", "linux,stdout-path"; vmlinux sets
> >> properties if linux,initrd-* came from r3/r4 and zImage sets properties
> >> no matter how it discovered them - from r3/r4 or the device tree.
> > 
> > Ok, and those setprops happen before CAS?
> 
> Yes.
> 
> > In a sense this is kind of a fundamental problem with rebuilding the
> > whole DT at CAS time.  Except that strictly speaking it's a problem
> > even without that: we just get away with it by accident because CAS
> > isn't likely to change the same things that guest setprops do.
> 
> > It's still basically unsynchronized mutations by two parties to a
> > shared data structure.
> 
> True... We may end up not having FDT at all and reuse QOM objects for
> that, can even use hashes of QObject pointers as phandles :)

Hm, interesting idea.  I suspect the QOM hierarchy won't be quite
similar enough to the fdt hierarchy (or at least not guaranteed to be)
to make it work, but it's worth thinking about at least.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-01-13  3:47 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-05 23:42 [PATCH qemu v2] spapr: Kill SLOF Alexey Kardashevskiy
2020-01-06  4:19 ` David Gibson
2020-01-06  6:28   ` Alexey Kardashevskiy
2020-01-06  8:50     ` David Gibson
2020-01-06 12:34       ` Alexey Kardashevskiy
2020-01-06 17:09         ` Cédric Le Goater
2020-01-06 17:25           ` Peter Maydell
2020-01-06 18:56             ` Philippe Mathieu-Daudé
2020-01-07  1:44             ` Alexey Kardashevskiy
2020-01-07  5:26         ` David Gibson
2020-01-08  4:07           ` Alexey Kardashevskiy
2020-01-09  4:07             ` David Gibson
2020-01-09  6:31               ` Alexey Kardashevskiy
2020-01-13  3:32                 ` David Gibson
2020-01-06 13:39   ` Alexey Kardashevskiy
2020-01-06 23:56     ` David Gibson
2020-01-07  4:44   ` Alexey Kardashevskiy
2020-01-07  5:54     ` David Gibson
2020-01-08  4:20       ` Alexey Kardashevskiy
2020-01-08  5:53         ` Alexey Kardashevskiy
2020-01-09  4:25           ` David Gibson
2020-01-09  4:18         ` David Gibson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.