qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] ppc: record-replay fixes and enablement
@ 2023-07-26 18:35 Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 1/7] target/ppc: Fix CPU reservation migration for record-replay Nicholas Piggin
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

Here is a series that gets ppc pseries and powernv machines into
better shape for record-replay, maybe for 8.2. It's likely got a
few deficiencies but it does run test cases and helped find bugs
in migration already. It requires previous decrementer fixes to
work well.

I think we can get away without patch 1 for 8.1, because we
inadvertently fixed regular (non-rr) migration of reservation
with  commit 392d328abe753. But opinions welcome.

For record/replay and avocado test reviewers, I would mainly
be interested in opinions about patch 6. I tried not to affect
existing archs much.

Thanks,
Nick

Nicholas Piggin (7):
  target/ppc: Fix CPU reservation migration for record-replay
  target/ppc: Fix timebase reset with record-replay
  spapr: Fix machine reset deadlock from replay-record
  spapr: Fix record-replay machine reset consuming too many events
  tests/avocado: boot ppc64 pseries replay-record test to Linux VFS
    mount
  tests/avocado: reverse-debugging cope with re-executing breakpoints
  tests/avocado: ppc64 reverse debugging tests for pseries and powernv

 hw/ppc/ppc.c                       | 11 ++++--
 hw/ppc/spapr.c                     | 32 +++++++++++++++---
 include/hw/ppc/spapr.h             |  2 ++
 target/ppc/compat.c                | 19 +++++++++++
 target/ppc/cpu.h                   |  3 ++
 target/ppc/machine.c               | 26 ++++++++++++--
 target/ppc/translate.c             |  4 +++
 tests/avocado/replay_kernel.py     |  3 +-
 tests/avocado/reverse_debugging.py | 54 +++++++++++++++++++++++++++---
 9 files changed, 139 insertions(+), 15 deletions(-)

-- 
2.40.1



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/7] target/ppc: Fix CPU reservation migration for record-replay
  2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
@ 2023-07-26 18:35 ` Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 2/7] target/ppc: Fix timebase reset with record-replay Nicholas Piggin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Pavel Dovgalyuk

ppc only migrates reserve_addr, so the destination machine can get a
valid reservation with an incorrect reservation value of 0. Prior to
commit 392d328abe753 ("target/ppc: Ensure stcx size matches larx"),
this could permit a stcx. to incorrectly succeed. That commit
inadvertently fixed that bug because the target machine starts with an
impossible reservation size of 0, so any stcx. will fail.

This behaviour is permitted by the ISA because reservation loss may
have implementation-dependent cause. What's more, with KVM machines it
is impossible save or reasonably restore reservation state. However if
the vmstate is being used for record-replay, the reservation must be
saved and restored exactly in order for execution from snapshot to
match the record.

This patch deprecates the existing incomplete reserve_addr vmstate,
and adds a new vmstate subsection with complete reservation state.
The new vmstate is needed only when record-replay mode is active.

Acked-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 target/ppc/cpu.h       |  2 ++
 target/ppc/machine.c   | 26 ++++++++++++++++++++++++--
 target/ppc/translate.c |  4 ++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 25fac9577a..27532d8f81 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1121,7 +1121,9 @@ struct CPUArchState {
     target_ulong reserve_addr;   /* Reservation address */
     target_ulong reserve_length; /* Reservation larx op size (bytes) */
     target_ulong reserve_val;    /* Reservation value */
+#if defined(TARGET_PPC64)
     target_ulong reserve_val2;
+#endif
 
     /* These are used in supervisor mode only */
     target_ulong msr;      /* machine state register */
diff --git a/target/ppc/machine.c b/target/ppc/machine.c
index ebb37e07d0..9f956b972c 100644
--- a/target/ppc/machine.c
+++ b/target/ppc/machine.c
@@ -10,6 +10,7 @@
 #include "qemu/main-loop.h"
 #include "kvm_ppc.h"
 #include "power8-pmu.h"
+#include "sysemu/replay.h"
 
 static void post_load_update_msr(CPUPPCState *env)
 {
@@ -685,6 +686,27 @@ static const VMStateDescription vmstate_compat = {
     }
 };
 
+static bool reservation_needed(void *opaque)
+{
+    return (replay_mode != REPLAY_MODE_NONE);
+}
+
+static const VMStateDescription vmstate_reservation = {
+    .name = "cpu/reservation",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = reservation_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINTTL(env.reserve_addr, PowerPCCPU),
+        VMSTATE_UINTTL(env.reserve_length, PowerPCCPU),
+        VMSTATE_UINTTL(env.reserve_val, PowerPCCPU),
+#if defined(TARGET_PPC64)
+        VMSTATE_UINTTL(env.reserve_val2, PowerPCCPU),
+#endif
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 const VMStateDescription vmstate_ppc_cpu = {
     .name = "cpu",
     .version_id = 5,
@@ -706,8 +728,7 @@ const VMStateDescription vmstate_ppc_cpu = {
         VMSTATE_UINTTL_ARRAY(env.spr, PowerPCCPU, 1024),
         VMSTATE_UINT64(env.spe_acc, PowerPCCPU),
 
-        /* Reservation */
-        VMSTATE_UINTTL(env.reserve_addr, PowerPCCPU),
+        VMSTATE_UNUSED(sizeof(target_ulong)), /* was env.reserve_addr */
 
         /* Supervisor mode architected state */
         VMSTATE_UINTTL(env.msr, PowerPCCPU),
@@ -736,6 +757,7 @@ const VMStateDescription vmstate_ppc_cpu = {
         &vmstate_tlbemb,
         &vmstate_tlbmas,
         &vmstate_compat,
+        &vmstate_reservation,
         NULL
     }
 };
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index e6a0709066..b88c00b4b0 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -77,7 +77,9 @@ static TCGv cpu_xer, cpu_so, cpu_ov, cpu_ca, cpu_ov32, cpu_ca32;
 static TCGv cpu_reserve;
 static TCGv cpu_reserve_length;
 static TCGv cpu_reserve_val;
+#if defined(TARGET_PPC64)
 static TCGv cpu_reserve_val2;
+#endif
 static TCGv cpu_fpscr;
 static TCGv_i32 cpu_access_type;
 
@@ -151,9 +153,11 @@ void ppc_translate_init(void)
     cpu_reserve_val = tcg_global_mem_new(cpu_env,
                                          offsetof(CPUPPCState, reserve_val),
                                          "reserve_val");
+#if defined(TARGET_PPC64)
     cpu_reserve_val2 = tcg_global_mem_new(cpu_env,
                                           offsetof(CPUPPCState, reserve_val2),
                                           "reserve_val2");
+#endif
 
     cpu_fpscr = tcg_global_mem_new(cpu_env,
                                    offsetof(CPUPPCState, fpscr), "fpscr");
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/7] target/ppc: Fix timebase reset with record-replay
  2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 1/7] target/ppc: Fix CPU reservation migration for record-replay Nicholas Piggin
@ 2023-07-26 18:35 ` Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 3/7] spapr: Fix machine reset deadlock from replay-record Nicholas Piggin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Pavel Dovgalyuk

Timebase save uses a random number for a legacy vmstate field, which
makes rr snapshot loading unbalanced. The easiest way to deal with this
is just to skip the rng if record-replay is active.

Reviewed-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 hw/ppc/ppc.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index cd1993e9c1..2476c4c4d3 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -32,6 +32,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/error-report.h"
 #include "sysemu/kvm.h"
+#include "sysemu/replay.h"
 #include "sysemu/runstate.h"
 #include "kvm_ppc.h"
 #include "migration/vmstate.h"
@@ -937,8 +938,14 @@ static void timebase_save(PPCTimebase *tb)
         return;
     }
 
-    /* not used anymore, we keep it for compatibility */
-    tb->time_of_the_day_ns = qemu_clock_get_ns(QEMU_CLOCK_HOST);
+    if (replay_mode == REPLAY_MODE_NONE) {
+        /* not used anymore, we keep it for compatibility */
+        tb->time_of_the_day_ns = qemu_clock_get_ns(QEMU_CLOCK_HOST);
+    } else {
+        /* simpler for record-replay to avoid this event, compat not needed */
+        tb->time_of_the_day_ns = 0;
+    }
+
     /*
      * tb_offset is only expected to be changed by QEMU so
      * there is no need to update it from KVM here
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/7] spapr: Fix machine reset deadlock from replay-record
  2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 1/7] target/ppc: Fix CPU reservation migration for record-replay Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 2/7] target/ppc: Fix timebase reset with record-replay Nicholas Piggin
@ 2023-07-26 18:35 ` Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events Nicholas Piggin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Pavel Dovgalyuk

When the machine is reset to load a new snapshot while being debugged
with replay-record, it is done from another thread, so the CPU does
not run the register setting operations. Set CPU registers directly in
machine reset.

Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 hw/ppc/spapr.c         | 20 ++++++++++++++++++--
 include/hw/ppc/spapr.h |  1 +
 target/ppc/compat.c    | 19 +++++++++++++++++++
 target/ppc/cpu.h       |  1 +
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 1c8b8d57a7..7d84244f03 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1322,6 +1322,22 @@ void spapr_set_all_lpcrs(target_ulong value, target_ulong mask)
     }
 }
 
+/* May be used when the machine is not running */
+void spapr_init_all_lpcrs(target_ulong value, target_ulong mask)
+{
+    CPUState *cs;
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+        CPUPPCState *env = &cpu->env;
+        target_ulong lpcr;
+
+        lpcr = env->spr[SPR_LPCR];
+        lpcr &= ~(LPCR_HR | LPCR_UPRT);
+        ppc_store_lpcr(cpu, lpcr);
+    }
+}
+
+
 static bool spapr_get_pate(PPCVirtualHypervisor *vhyp, PowerPCCPU *cpu,
                            target_ulong lpid, ppc_v3_pate_t *entry)
 {
@@ -1583,7 +1599,7 @@ int spapr_reallocate_hpt(SpaprMachineState *spapr, int shift, Error **errp)
     }
     /* We're setting up a hash table, so that means we're not radix */
     spapr->patb_entry = 0;
-    spapr_set_all_lpcrs(0, LPCR_HR | LPCR_UPRT);
+    spapr_init_all_lpcrs(0, LPCR_HR | LPCR_UPRT);
     return 0;
 }
 
@@ -1661,7 +1677,7 @@ static void spapr_machine_reset(MachineState *machine, ShutdownCause reason)
     spapr_ovec_cleanup(spapr->ov5_cas);
     spapr->ov5_cas = spapr_ovec_new();
 
-    ppc_set_compat_all(spapr->max_compat_pvr, &error_fatal);
+    ppc_init_compat_all(spapr->max_compat_pvr, &error_fatal);
 
     /*
      * This is fixing some of the default configuration of the XIVE
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 538b2dfb89..f47e8419a5 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -1012,6 +1012,7 @@ bool spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
 #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
 
 void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
+void spapr_init_all_lpcrs(target_ulong value, target_ulong mask);
 hwaddr spapr_get_rtas_addr(void);
 bool spapr_memory_hot_unplug_supported(SpaprMachineState *spapr);
 
diff --git a/target/ppc/compat.c b/target/ppc/compat.c
index 7949a24f5a..ebef2cccec 100644
--- a/target/ppc/compat.c
+++ b/target/ppc/compat.c
@@ -229,6 +229,25 @@ int ppc_set_compat_all(uint32_t compat_pvr, Error **errp)
     return 0;
 }
 
+/* To be used when the machine is not running */
+int ppc_init_compat_all(uint32_t compat_pvr, Error **errp)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+        int ret;
+
+        ret = ppc_set_compat(cpu, compat_pvr, errp);
+
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 int ppc_compat_max_vthreads(PowerPCCPU *cpu)
 {
     const CompatInfo *compat = compat_by_pvr(cpu->compat_pvr);
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 27532d8f81..40be0c362a 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1497,6 +1497,7 @@ int ppc_set_compat(PowerPCCPU *cpu, uint32_t compat_pvr, Error **errp);
 
 #if !defined(CONFIG_USER_ONLY)
 int ppc_set_compat_all(uint32_t compat_pvr, Error **errp);
+int ppc_init_compat_all(uint32_t compat_pvr, Error **errp);
 #endif
 int ppc_compat_max_vthreads(PowerPCCPU *cpu);
 void ppc_compat_add_property(Object *obj, const char *name,
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events
  2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
                   ` (2 preceding siblings ...)
  2023-07-26 18:35 ` [PATCH 3/7] spapr: Fix machine reset deadlock from replay-record Nicholas Piggin
@ 2023-07-26 18:35 ` Nicholas Piggin
  2023-07-31 11:40   ` Pavel Dovgalyuk
  2023-08-04  8:50   ` Pavel Dovgalyuk
  2023-07-26 18:35 ` [PATCH 5/7] tests/avocado: boot ppc64 pseries replay-record test to Linux VFS mount Nicholas Piggin
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Pavel Dovgalyuk

spapr_machine_reset gets a random number to populate the device-tree
rng seed with. When loading a snapshot for record-replay, the machine
is reset again, and that tries to consume the random event record
again, crashing due to inconsistent record

Fix this by saving the seed to populate the device tree with, and
skipping the rng on snapshot load.

Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 hw/ppc/spapr.c         | 12 +++++++++---
 include/hw/ppc/spapr.h |  1 +
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7d84244f03..ecfbdb0030 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1022,7 +1022,6 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
 {
     MachineState *machine = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
-    uint8_t rng_seed[32];
     int chosen;
 
     _FDT(chosen = fdt_add_subnode(fdt, 0, "chosen"));
@@ -1100,8 +1099,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
         spapr_dt_ov5_platform_support(spapr, fdt, chosen);
     }
 
-    qemu_guest_getrandom_nofail(rng_seed, sizeof(rng_seed));
-    _FDT(fdt_setprop(fdt, chosen, "rng-seed", rng_seed, sizeof(rng_seed)));
+    _FDT(fdt_setprop(fdt, chosen, "rng-seed", spapr->fdt_rng_seed, 32));
 
     _FDT(spapr_dt_ovec(fdt, chosen, spapr->ov5_cas, "ibm,architecture-vec-5"));
 }
@@ -1654,6 +1652,14 @@ static void spapr_machine_reset(MachineState *machine, ShutdownCause reason)
     void *fdt;
     int rc;
 
+    if (reason != SHUTDOWN_CAUSE_SNAPSHOT_LOAD) {
+        /*
+         * Record-replay snapshot load must not consume random, this was
+         * already replayed from initial machine reset.
+         */
+        qemu_guest_getrandom_nofail(spapr->fdt_rng_seed, 32);
+    }
+
     pef_kvm_reset(machine->cgs, &error_fatal);
     spapr_caps_apply(spapr);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index f47e8419a5..f4bd204d86 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -204,6 +204,7 @@ struct SpaprMachineState {
     uint32_t fdt_size;
     uint32_t fdt_initial_size;
     void *fdt_blob;
+    uint8_t fdt_rng_seed[32];
     long kernel_size;
     bool kernel_le;
     uint64_t kernel_addr;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/7] tests/avocado: boot ppc64 pseries replay-record test to Linux VFS mount
  2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
                   ` (3 preceding siblings ...)
  2023-07-26 18:35 ` [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events Nicholas Piggin
@ 2023-07-26 18:35 ` Nicholas Piggin
  2023-07-31 11:41   ` Pavel Dovgalyuk
  2023-07-26 18:35 ` [PATCH 6/7] tests/avocado: reverse-debugging cope with re-executing breakpoints Nicholas Piggin
  2023-07-26 18:35 ` [PATCH 7/7] tests/avocado: ppc64 reverse debugging tests for pseries and powernv Nicholas Piggin
  6 siblings, 1 reply; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Pavel Dovgalyuk

This the ppc64 record-replay test is able to replay the full kernel boot
so try enabling it.

Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 tests/avocado/replay_kernel.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tests/avocado/replay_kernel.py b/tests/avocado/replay_kernel.py
index 79c607b0e7..a18610542e 100644
--- a/tests/avocado/replay_kernel.py
+++ b/tests/avocado/replay_kernel.py
@@ -255,8 +255,7 @@ def test_ppc64_pseries(self):
         kernel_path = self.fetch_asset(kernel_url, asset_hash=kernel_hash)
 
         kernel_command_line = self.KERNEL_COMMON_COMMAND_LINE + 'console=hvc0'
-        # icount is not good enough for PPC64 for complete boot yet
-        console_pattern = 'Kernel command line: %s' % kernel_command_line
+        console_pattern = 'VFS: Cannot open root device'
         self.run_rr(kernel_path, kernel_command_line, console_pattern)
 
     def test_ppc64_powernv(self):
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/7] tests/avocado: reverse-debugging cope with re-executing breakpoints
  2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
                   ` (4 preceding siblings ...)
  2023-07-26 18:35 ` [PATCH 5/7] tests/avocado: boot ppc64 pseries replay-record test to Linux VFS mount Nicholas Piggin
@ 2023-07-26 18:35 ` Nicholas Piggin
  2023-07-31 12:08   ` Pavel Dovgalyuk
  2023-07-26 18:35 ` [PATCH 7/7] tests/avocado: ppc64 reverse debugging tests for pseries and powernv Nicholas Piggin
  6 siblings, 1 reply; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Pavel Dovgalyuk

The reverse-debugging test creates a trace, then replays it and:

1. Steps the first 10 instructions and records their addresses.
2. Steps backward and verifies their addresses match.
3. Runs to (near) the end of the trace.
4. Sets breakpoints on the first 10 instructions.
5. Continues backward and verifies execution stops at the last
   breakpoint.

Step 5 breaks if any of the other 9 breakpoints are re-executed in the
trace after the 10th instruction is run, because those will be
unexpectedly hit when reverse continuing. This situation does arise
with the ppc pseries machine, the SLOF bios branches to its own entry
point.

Permit this breakpoint re-execution by switching steps 4 and 5, so that
the trace will be run to the end *or* the next breakpoint hit.
Reversing from there to the 10th intsruction will not hit another
breakpoint, by definition.

Another step is added between steps 2 and 3, which steps forward over
the first 10 instructions and verifies their addresses, to support this.

Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 tests/avocado/reverse_debugging.py | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
index 680c314cfc..7d1a478df1 100644
--- a/tests/avocado/reverse_debugging.py
+++ b/tests/avocado/reverse_debugging.py
@@ -150,16 +150,33 @@ def reverse_debugging(self, shift=7, args=None):
             self.check_pc(g, addr)
             logger.info('found position %x' % addr)
 
-        logger.info('seeking to the end (icount %s)' % (last_icount - 1))
-        vm.qmp('replay-break', icount=last_icount - 1)
-        # continue - will return after pausing
-        g.cmd(b'c', b'T02thread:01;')
+        # visit the recorded instruction in forward order
+        logger.info('stepping forward')
+        for addr in steps:
+            self.check_pc(g, addr)
+            self.gdb_step(g)
+            logger.info('found position %x' % addr)
 
+        # set breakpoints for the instructions just stepped over
         logger.info('setting breakpoints')
         for addr in steps:
             # hardware breakpoint at addr with len=1
             g.cmd(b'Z1,%x,1' % addr, b'OK')
 
+        # this may hit a breakpoint if first instructions are executed
+        # again
+        logger.info('continuing execution')
+        vm.qmp('replay-break', icount=last_icount - 1)
+        # continue - will return after pausing
+        # This could stop at the end and get a T02 return, or by
+        # re-executing one of the breakpoints and get a T05 return.
+        g.cmd(b'c')
+        if self.vm_get_icount(vm) == last_icount - 1:
+            logger.info('reached the end (icount %s)' % (last_icount - 1))
+        else:
+            logger.info('hit a breakpoint again at %x (icount %s)' %
+                        (self.get_pc(g), self.vm_get_icount(vm)))
+
         logger.info('running reverse continue to reach %x' % steps[-1])
         # reverse continue - will return after stopping at the breakpoint
         g.cmd(b'bc', b'T05thread:01;')
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 7/7] tests/avocado: ppc64 reverse debugging tests for pseries and powernv
  2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
                   ` (5 preceding siblings ...)
  2023-07-26 18:35 ` [PATCH 6/7] tests/avocado: reverse-debugging cope with re-executing breakpoints Nicholas Piggin
@ 2023-07-26 18:35 ` Nicholas Piggin
  2023-07-31 12:09   ` Pavel Dovgalyuk
  6 siblings, 1 reply; 17+ messages in thread
From: Nicholas Piggin @ 2023-07-26 18:35 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Nicholas Piggin, Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Pavel Dovgalyuk

These machines run reverse-debugging well enough to pass basic tests.
Wire them up.

Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 tests/avocado/reverse_debugging.py | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
index 7d1a478df1..fc47874eda 100644
--- a/tests/avocado/reverse_debugging.py
+++ b/tests/avocado/reverse_debugging.py
@@ -233,3 +233,32 @@ def test_aarch64_virt(self):
 
         self.reverse_debugging(
             args=('-kernel', kernel_path))
+
+class ReverseDebugging_ppc64(ReverseDebugging):
+    """
+    :avocado: tags=accel:tcg
+    """
+
+    REG_PC = 0x40
+
+    # unidentified gitlab timeout problem
+    @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
+    def test_ppc64_pseries(self):
+        """
+        :avocado: tags=arch:ppc64
+        :avocado: tags=machine:pseries
+        """
+        # SLOF branches back to its entry point, which causes this test
+        # to take the 'hit a breakpoint again' path. That's not a problem,
+        # just slightly different than the other machines.
+        self.endian_is_le = False
+        self.reverse_debugging()
+
+    @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
+    def test_ppc64_powernv(self):
+        """
+        :avocado: tags=arch:ppc64
+        :avocado: tags=machine:powernv
+        """
+        self.endian_is_le = False
+        self.reverse_debugging()
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events
  2023-07-26 18:35 ` [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events Nicholas Piggin
@ 2023-07-31 11:40   ` Pavel Dovgalyuk
  2023-08-04  8:50   ` Pavel Dovgalyuk
  1 sibling, 0 replies; 17+ messages in thread
From: Pavel Dovgalyuk @ 2023-07-31 11:40 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

Acked-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>

On 26.07.2023 21:35, Nicholas Piggin wrote:
> spapr_machine_reset gets a random number to populate the device-tree
> rng seed with. When loading a snapshot for record-replay, the machine
> is reset again, and that tries to consume the random event record
> again, crashing due to inconsistent record
> 
> Fix this by saving the seed to populate the device tree with, and
> skipping the rng on snapshot load.
> 
> Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   hw/ppc/spapr.c         | 12 +++++++++---
>   include/hw/ppc/spapr.h |  1 +
>   2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 7d84244f03..ecfbdb0030 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1022,7 +1022,6 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
>   {
>       MachineState *machine = MACHINE(spapr);
>       SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> -    uint8_t rng_seed[32];
>       int chosen;
>   
>       _FDT(chosen = fdt_add_subnode(fdt, 0, "chosen"));
> @@ -1100,8 +1099,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
>           spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>       }
>   
> -    qemu_guest_getrandom_nofail(rng_seed, sizeof(rng_seed));
> -    _FDT(fdt_setprop(fdt, chosen, "rng-seed", rng_seed, sizeof(rng_seed)));
> +    _FDT(fdt_setprop(fdt, chosen, "rng-seed", spapr->fdt_rng_seed, 32));
>   
>       _FDT(spapr_dt_ovec(fdt, chosen, spapr->ov5_cas, "ibm,architecture-vec-5"));
>   }
> @@ -1654,6 +1652,14 @@ static void spapr_machine_reset(MachineState *machine, ShutdownCause reason)
>       void *fdt;
>       int rc;
>   
> +    if (reason != SHUTDOWN_CAUSE_SNAPSHOT_LOAD) {
> +        /*
> +         * Record-replay snapshot load must not consume random, this was
> +         * already replayed from initial machine reset.
> +         */
> +        qemu_guest_getrandom_nofail(spapr->fdt_rng_seed, 32);
> +    }
> +
>       pef_kvm_reset(machine->cgs, &error_fatal);
>       spapr_caps_apply(spapr);
>   
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index f47e8419a5..f4bd204d86 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -204,6 +204,7 @@ struct SpaprMachineState {
>       uint32_t fdt_size;
>       uint32_t fdt_initial_size;
>       void *fdt_blob;
> +    uint8_t fdt_rng_seed[32];
>       long kernel_size;
>       bool kernel_le;
>       uint64_t kernel_addr;



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/7] tests/avocado: boot ppc64 pseries replay-record test to Linux VFS mount
  2023-07-26 18:35 ` [PATCH 5/7] tests/avocado: boot ppc64 pseries replay-record test to Linux VFS mount Nicholas Piggin
@ 2023-07-31 11:41   ` Pavel Dovgalyuk
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Dovgalyuk @ 2023-07-31 11:41 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

Acked-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>

On 26.07.2023 21:35, Nicholas Piggin wrote:
> This the ppc64 record-replay test is able to replay the full kernel boot
> so try enabling it.
> 
> Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   tests/avocado/replay_kernel.py | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/tests/avocado/replay_kernel.py b/tests/avocado/replay_kernel.py
> index 79c607b0e7..a18610542e 100644
> --- a/tests/avocado/replay_kernel.py
> +++ b/tests/avocado/replay_kernel.py
> @@ -255,8 +255,7 @@ def test_ppc64_pseries(self):
>           kernel_path = self.fetch_asset(kernel_url, asset_hash=kernel_hash)
>   
>           kernel_command_line = self.KERNEL_COMMON_COMMAND_LINE + 'console=hvc0'
> -        # icount is not good enough for PPC64 for complete boot yet
> -        console_pattern = 'Kernel command line: %s' % kernel_command_line
> +        console_pattern = 'VFS: Cannot open root device'
>           self.run_rr(kernel_path, kernel_command_line, console_pattern)
>   
>       def test_ppc64_powernv(self):



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/7] tests/avocado: reverse-debugging cope with re-executing breakpoints
  2023-07-26 18:35 ` [PATCH 6/7] tests/avocado: reverse-debugging cope with re-executing breakpoints Nicholas Piggin
@ 2023-07-31 12:08   ` Pavel Dovgalyuk
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Dovgalyuk @ 2023-07-31 12:08 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

Reviewed-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>

On 26.07.2023 21:35, Nicholas Piggin wrote:
> The reverse-debugging test creates a trace, then replays it and:
> 
> 1. Steps the first 10 instructions and records their addresses.
> 2. Steps backward and verifies their addresses match.
> 3. Runs to (near) the end of the trace.
> 4. Sets breakpoints on the first 10 instructions.
> 5. Continues backward and verifies execution stops at the last
>     breakpoint.
> 
> Step 5 breaks if any of the other 9 breakpoints are re-executed in the
> trace after the 10th instruction is run, because those will be
> unexpectedly hit when reverse continuing. This situation does arise
> with the ppc pseries machine, the SLOF bios branches to its own entry
> point.
> 
> Permit this breakpoint re-execution by switching steps 4 and 5, so that
> the trace will be run to the end *or* the next breakpoint hit.
> Reversing from there to the 10th intsruction will not hit another
> breakpoint, by definition.
> 
> Another step is added between steps 2 and 3, which steps forward over
> the first 10 instructions and verifies their addresses, to support this.
> 
> Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   tests/avocado/reverse_debugging.py | 25 +++++++++++++++++++++----
>   1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
> index 680c314cfc..7d1a478df1 100644
> --- a/tests/avocado/reverse_debugging.py
> +++ b/tests/avocado/reverse_debugging.py
> @@ -150,16 +150,33 @@ def reverse_debugging(self, shift=7, args=None):
>               self.check_pc(g, addr)
>               logger.info('found position %x' % addr)
>   
> -        logger.info('seeking to the end (icount %s)' % (last_icount - 1))
> -        vm.qmp('replay-break', icount=last_icount - 1)
> -        # continue - will return after pausing
> -        g.cmd(b'c', b'T02thread:01;')
> +        # visit the recorded instruction in forward order
> +        logger.info('stepping forward')
> +        for addr in steps:
> +            self.check_pc(g, addr)
> +            self.gdb_step(g)
> +            logger.info('found position %x' % addr)
>   
> +        # set breakpoints for the instructions just stepped over
>           logger.info('setting breakpoints')
>           for addr in steps:
>               # hardware breakpoint at addr with len=1
>               g.cmd(b'Z1,%x,1' % addr, b'OK')
>   
> +        # this may hit a breakpoint if first instructions are executed
> +        # again
> +        logger.info('continuing execution')
> +        vm.qmp('replay-break', icount=last_icount - 1)
> +        # continue - will return after pausing
> +        # This could stop at the end and get a T02 return, or by
> +        # re-executing one of the breakpoints and get a T05 return.
> +        g.cmd(b'c')
> +        if self.vm_get_icount(vm) == last_icount - 1:
> +            logger.info('reached the end (icount %s)' % (last_icount - 1))
> +        else:
> +            logger.info('hit a breakpoint again at %x (icount %s)' %
> +                        (self.get_pc(g), self.vm_get_icount(vm)))
> +
>           logger.info('running reverse continue to reach %x' % steps[-1])
>           # reverse continue - will return after stopping at the breakpoint
>           g.cmd(b'bc', b'T05thread:01;')



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 7/7] tests/avocado: ppc64 reverse debugging tests for pseries and powernv
  2023-07-26 18:35 ` [PATCH 7/7] tests/avocado: ppc64 reverse debugging tests for pseries and powernv Nicholas Piggin
@ 2023-07-31 12:09   ` Pavel Dovgalyuk
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Dovgalyuk @ 2023-07-31 12:09 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

Reviewed-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>

On 26.07.2023 21:35, Nicholas Piggin wrote:
> These machines run reverse-debugging well enough to pass basic tests.
> Wire them up.
> 
> Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   tests/avocado/reverse_debugging.py | 29 +++++++++++++++++++++++++++++
>   1 file changed, 29 insertions(+)
> 
> diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
> index 7d1a478df1..fc47874eda 100644
> --- a/tests/avocado/reverse_debugging.py
> +++ b/tests/avocado/reverse_debugging.py
> @@ -233,3 +233,32 @@ def test_aarch64_virt(self):
>   
>           self.reverse_debugging(
>               args=('-kernel', kernel_path))
> +
> +class ReverseDebugging_ppc64(ReverseDebugging):
> +    """
> +    :avocado: tags=accel:tcg
> +    """
> +
> +    REG_PC = 0x40
> +
> +    # unidentified gitlab timeout problem
> +    @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
> +    def test_ppc64_pseries(self):
> +        """
> +        :avocado: tags=arch:ppc64
> +        :avocado: tags=machine:pseries
> +        """
> +        # SLOF branches back to its entry point, which causes this test
> +        # to take the 'hit a breakpoint again' path. That's not a problem,
> +        # just slightly different than the other machines.
> +        self.endian_is_le = False
> +        self.reverse_debugging()
> +
> +    @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
> +    def test_ppc64_powernv(self):
> +        """
> +        :avocado: tags=arch:ppc64
> +        :avocado: tags=machine:powernv
> +        """
> +        self.endian_is_le = False
> +        self.reverse_debugging()



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events
  2023-07-26 18:35 ` [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events Nicholas Piggin
  2023-07-31 11:40   ` Pavel Dovgalyuk
@ 2023-08-04  8:50   ` Pavel Dovgalyuk
  2023-08-06 11:46     ` Nicholas Piggin
  1 sibling, 1 reply; 17+ messages in thread
From: Pavel Dovgalyuk @ 2023-08-04  8:50 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

BTW, there is a function qemu_register_reset_nosnapshotload that can be 
used in similar cases.
Can you just use it without changing the code of the reset handler?

On 26.07.2023 21:35, Nicholas Piggin wrote:
> spapr_machine_reset gets a random number to populate the device-tree
> rng seed with. When loading a snapshot for record-replay, the machine
> is reset again, and that tries to consume the random event record
> again, crashing due to inconsistent record
> 
> Fix this by saving the seed to populate the device tree with, and
> skipping the rng on snapshot load.
> 
> Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   hw/ppc/spapr.c         | 12 +++++++++---
>   include/hw/ppc/spapr.h |  1 +
>   2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 7d84244f03..ecfbdb0030 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1022,7 +1022,6 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
>   {
>       MachineState *machine = MACHINE(spapr);
>       SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> -    uint8_t rng_seed[32];
>       int chosen;
>   
>       _FDT(chosen = fdt_add_subnode(fdt, 0, "chosen"));
> @@ -1100,8 +1099,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
>           spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>       }
>   
> -    qemu_guest_getrandom_nofail(rng_seed, sizeof(rng_seed));
> -    _FDT(fdt_setprop(fdt, chosen, "rng-seed", rng_seed, sizeof(rng_seed)));
> +    _FDT(fdt_setprop(fdt, chosen, "rng-seed", spapr->fdt_rng_seed, 32));
>   
>       _FDT(spapr_dt_ovec(fdt, chosen, spapr->ov5_cas, "ibm,architecture-vec-5"));
>   }
> @@ -1654,6 +1652,14 @@ static void spapr_machine_reset(MachineState *machine, ShutdownCause reason)
>       void *fdt;
>       int rc;
>   
> +    if (reason != SHUTDOWN_CAUSE_SNAPSHOT_LOAD) {
> +        /*
> +         * Record-replay snapshot load must not consume random, this was
> +         * already replayed from initial machine reset.
> +         */
> +        qemu_guest_getrandom_nofail(spapr->fdt_rng_seed, 32);
> +    }
> +
>       pef_kvm_reset(machine->cgs, &error_fatal);
>       spapr_caps_apply(spapr);
>   
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index f47e8419a5..f4bd204d86 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -204,6 +204,7 @@ struct SpaprMachineState {
>       uint32_t fdt_size;
>       uint32_t fdt_initial_size;
>       void *fdt_blob;
> +    uint8_t fdt_rng_seed[32];
>       long kernel_size;
>       bool kernel_le;
>       uint64_t kernel_addr;



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events
  2023-08-04  8:50   ` Pavel Dovgalyuk
@ 2023-08-06 11:46     ` Nicholas Piggin
  2023-08-08  3:09       ` Nicholas Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas Piggin @ 2023-08-06 11:46 UTC (permalink / raw)
  To: Pavel Dovgalyuk, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

On Fri Aug 4, 2023 at 6:50 PM AEST, Pavel Dovgalyuk wrote:
> BTW, there is a function qemu_register_reset_nosnapshotload that can be 
> used in similar cases.
> Can you just use it without changing the code of the reset handler?

I didn't know that, thanks for pointing it out. I'll take a closer look
at it before reposting.

Thanks,
Nick

>
> On 26.07.2023 21:35, Nicholas Piggin wrote:
> > spapr_machine_reset gets a random number to populate the device-tree
> > rng seed with. When loading a snapshot for record-replay, the machine
> > is reset again, and that tries to consume the random event record
> > again, crashing due to inconsistent record
> > 
> > Fix this by saving the seed to populate the device tree with, and
> > skipping the rng on snapshot load.
> > 
> > Cc: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >   hw/ppc/spapr.c         | 12 +++++++++---
> >   include/hw/ppc/spapr.h |  1 +
> >   2 files changed, 10 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 7d84244f03..ecfbdb0030 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -1022,7 +1022,6 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
> >   {
> >       MachineState *machine = MACHINE(spapr);
> >       SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> > -    uint8_t rng_seed[32];
> >       int chosen;
> >   
> >       _FDT(chosen = fdt_add_subnode(fdt, 0, "chosen"));
> > @@ -1100,8 +1099,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt, bool reset)
> >           spapr_dt_ov5_platform_support(spapr, fdt, chosen);
> >       }
> >   
> > -    qemu_guest_getrandom_nofail(rng_seed, sizeof(rng_seed));
> > -    _FDT(fdt_setprop(fdt, chosen, "rng-seed", rng_seed, sizeof(rng_seed)));
> > +    _FDT(fdt_setprop(fdt, chosen, "rng-seed", spapr->fdt_rng_seed, 32));
> >   
> >       _FDT(spapr_dt_ovec(fdt, chosen, spapr->ov5_cas, "ibm,architecture-vec-5"));
> >   }
> > @@ -1654,6 +1652,14 @@ static void spapr_machine_reset(MachineState *machine, ShutdownCause reason)
> >       void *fdt;
> >       int rc;
> >   
> > +    if (reason != SHUTDOWN_CAUSE_SNAPSHOT_LOAD) {
> > +        /*
> > +         * Record-replay snapshot load must not consume random, this was
> > +         * already replayed from initial machine reset.
> > +         */
> > +        qemu_guest_getrandom_nofail(spapr->fdt_rng_seed, 32);
> > +    }
> > +
> >       pef_kvm_reset(machine->cgs, &error_fatal);
> >       spapr_caps_apply(spapr);
> >   
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index f47e8419a5..f4bd204d86 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -204,6 +204,7 @@ struct SpaprMachineState {
> >       uint32_t fdt_size;
> >       uint32_t fdt_initial_size;
> >       void *fdt_blob;
> > +    uint8_t fdt_rng_seed[32];
> >       long kernel_size;
> >       bool kernel_le;
> >       uint64_t kernel_addr;



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events
  2023-08-06 11:46     ` Nicholas Piggin
@ 2023-08-08  3:09       ` Nicholas Piggin
  2023-08-08  3:52         ` Pavel Dovgalyuk
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas Piggin @ 2023-08-08  3:09 UTC (permalink / raw)
  To: Nicholas Piggin, Pavel Dovgalyuk, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

On Sun Aug 6, 2023 at 9:46 PM AEST, Nicholas Piggin wrote:
> On Fri Aug 4, 2023 at 6:50 PM AEST, Pavel Dovgalyuk wrote:
> > BTW, there is a function qemu_register_reset_nosnapshotload that can be 
> > used in similar cases.
> > Can you just use it without changing the code of the reset handler?
>
> I didn't know that, thanks for pointing it out. I'll take a closer look
> at it before reposting.

Seems a bit tricky because the device tree has to be rebuilt at reset
time (including snapshot load), but it uses the random number. So
having a second nosnapshotload reset function might not be called in
the correct order, I think?  For now I will keep it as is.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events
  2023-08-08  3:09       ` Nicholas Piggin
@ 2023-08-08  3:52         ` Pavel Dovgalyuk
  2023-08-09  9:25           ` Nicholas Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Dovgalyuk @ 2023-08-08  3:52 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

On 08.08.2023 06:09, Nicholas Piggin wrote:
> On Sun Aug 6, 2023 at 9:46 PM AEST, Nicholas Piggin wrote:
>> On Fri Aug 4, 2023 at 6:50 PM AEST, Pavel Dovgalyuk wrote:
>>> BTW, there is a function qemu_register_reset_nosnapshotload that can be
>>> used in similar cases.
>>> Can you just use it without changing the code of the reset handler?
>>
>> I didn't know that, thanks for pointing it out. I'll take a closer look
>> at it before reposting.
> 
> Seems a bit tricky because the device tree has to be rebuilt at reset
> time (including snapshot load), but it uses the random number. So

It seems strange to me, that loading the existing configuration has to 
randomize the device tree.

> having a second nosnapshotload reset function might not be called in
> the correct order, I think?  For now I will keep it as is.

Ok, let's wait for other reviewers.


Pavel Dovgalyuk



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events
  2023-08-08  3:52         ` Pavel Dovgalyuk
@ 2023-08-09  9:25           ` Nicholas Piggin
  0 siblings, 0 replies; 17+ messages in thread
From: Nicholas Piggin @ 2023-08-09  9:25 UTC (permalink / raw)
  To: Pavel Dovgalyuk, Daniel Henrique Barboza
  Cc: Cédric Le Goater, David Gibson, Greg Kurz,
	Harsh Prateek Bora, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel

On Tue Aug 8, 2023 at 1:52 PM AEST, Pavel Dovgalyuk wrote:
> On 08.08.2023 06:09, Nicholas Piggin wrote:
> > On Sun Aug 6, 2023 at 9:46 PM AEST, Nicholas Piggin wrote:
> >> On Fri Aug 4, 2023 at 6:50 PM AEST, Pavel Dovgalyuk wrote:
> >>> BTW, there is a function qemu_register_reset_nosnapshotload that can be
> >>> used in similar cases.
> >>> Can you just use it without changing the code of the reset handler?
> >>
> >> I didn't know that, thanks for pointing it out. I'll take a closer look
> >> at it before reposting.
> > 
> > Seems a bit tricky because the device tree has to be rebuilt at reset
> > time (including snapshot load), but it uses the random number. So
>
> It seems strange to me, that loading the existing configuration has to 
> randomize the device tree.

Building the device tree requires a random number for one of the
properties.

Other architectures that don't have this "cas" firmware call that
changes the device tree and so requires it is rebuilt at machine
reset time just build the device tree once at machine creation time
I believe.

So spapr is already weird in that way. We could go the way that
other archs have and just save that random number once at
creation and then reuse it for each reset. I thought that was not
so good because for a normal reset I think it is better to get a
new random number each time, no?

So I think it's natural enough to take a new random number for a
regular reset, but keep the existing one for a snapshot reset. I
could be misunderstanding something though.

Thanks,
Nick

>
> > having a second nosnapshotload reset function might not be called in
> > the correct order, I think?  For now I will keep it as is.
>
> Ok, let's wait for other reviewers.
>
>
> Pavel Dovgalyuk



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-08-09  9:26 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-26 18:35 [PATCH 0/7] ppc: record-replay fixes and enablement Nicholas Piggin
2023-07-26 18:35 ` [PATCH 1/7] target/ppc: Fix CPU reservation migration for record-replay Nicholas Piggin
2023-07-26 18:35 ` [PATCH 2/7] target/ppc: Fix timebase reset with record-replay Nicholas Piggin
2023-07-26 18:35 ` [PATCH 3/7] spapr: Fix machine reset deadlock from replay-record Nicholas Piggin
2023-07-26 18:35 ` [PATCH 4/7] spapr: Fix record-replay machine reset consuming too many events Nicholas Piggin
2023-07-31 11:40   ` Pavel Dovgalyuk
2023-08-04  8:50   ` Pavel Dovgalyuk
2023-08-06 11:46     ` Nicholas Piggin
2023-08-08  3:09       ` Nicholas Piggin
2023-08-08  3:52         ` Pavel Dovgalyuk
2023-08-09  9:25           ` Nicholas Piggin
2023-07-26 18:35 ` [PATCH 5/7] tests/avocado: boot ppc64 pseries replay-record test to Linux VFS mount Nicholas Piggin
2023-07-31 11:41   ` Pavel Dovgalyuk
2023-07-26 18:35 ` [PATCH 6/7] tests/avocado: reverse-debugging cope with re-executing breakpoints Nicholas Piggin
2023-07-31 12:08   ` Pavel Dovgalyuk
2023-07-26 18:35 ` [PATCH 7/7] tests/avocado: ppc64 reverse debugging tests for pseries and powernv Nicholas Piggin
2023-07-31 12:09   ` Pavel Dovgalyuk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).