[Qemu-devel] [PATCH v3 0/4] tcg: support heterogenous CPU clusters

* [Qemu-devel] [PATCH v3 0/4] tcg: support heterogenous CPU clusters
@ 2019-01-21 15:22 Peter Maydell
  2019-01-21 15:22 ` [Qemu-devel] [PATCH v3 1/4] hw/arm/xlx-zynqmp: Realize cluster after putting RPUs in it Peter Maydell
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Peter Maydell @ 2019-01-21 15:22 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: Richard Henderson, Peter Crosthwaite, Paolo Bonzini,
	Alistair Francis, Edgar E. Iglesias, Eduardo Habkost,
	Marcel Apfelbaum, Emilio G . Cota, Aleksandar Markovic

In the process of debugging my board code that uses this
feature, I realized I needed to support the CPU object
not being a direct child of the cluster object, which
also lead me to the object_child_foreach_recursive()
function which is a neater way to iterate children anyway.
I also figured out we could add an assert which catches
most "parenting CPU vs cluster realize ordering" bugs.

Only patch 2 has changed here, others are already reviewed.

Changes v2->v3:

 * allow CPU objects to be indirect children of the cluster;
   this is useful for ARMv7M, where the CPU object is a child
   of the armv7m container and the board code that sets up
   the cluster object only has the armv7m container object:
   this is done by using object_child_foreach_recursive()
   rather than an open-coded child iteration
 * add an assertion that the cluster has at least one CPU,
   which catches the easiest-to-make errors when creating
   and populating the cluster

Changes v1->v2:
 * just fixing a bug in patch 3 that meant we were
   accidentally never reusing TBs. Patch 3 is the only
   one that needs review.

Original cover letter included below for context.

thanks
-- PMM

Currently, TCG implicitly assumes that all CPUs are alike,
because we have a single cache of generated TBs and we
don't account for which CPU generated the code or is looking
for the TB when adding or searching for generated TBs.
This can go wrong in two situations:
 (1) two CPUs have different physical address spaces
     (eg CPU 1 has one lot of RAM/ROM, and CPU 2 has
     different RAM/ROM): the physical address alone is
     then not sufficient to distinguish what code to run
 (2) two CPUs have different features (eg FPU vs no FPU):
     since our TCG frontends bake assumptions into the
     generated code about the presence/absence of features,
     if a CPU with FPU picks up a TB for one generated
     without an FPU it will behave wrongly

This is unfortunate, because we already have one board in
the tree which has a heterogenous setup: the xlnx-zcu102.
This board has 4xCortex-A53 and 2xCortex-R5F. Although
most "normal" guest code doesn't run into this, it is
possible to demonstrate the bug with it. There's a test case
at http://people.linaro.org/~peter.maydell/xlnx-het.tgz
which arranges to run a fragment of code in AArch32 which
should behave differently on the two CPUs, but does not
(either the A53 gets the behaviour the R5 should have, or
vice-versa).

This patchset adds the "cluster ID" to the set of things
we include in the TCG TB hash, which means that the two
sets of CPUs in this board can no longer accidentally
get TBs generated by the other cluster. It fixes the
bug demonstrated by the test case.

Design notes:

 * Adding cluster ID to the hash is RTH's suggestion. It might
   also be possible to make each cluster have a TCGContext
   code generation context, and have the CPUs use the right
   one for the cluster, but that would be a lot more code
   rework compared to this approach.
 * I put the cluster number in the existing cflags, since
   we have plenty of spare bits there. If in future we
   need either more than 256 clusters (unlikely) or want
   to reclaim bits in the cflags field for some other
   purpose we can always extend our xxhash from 28 to 32 bytes.
   (I actually wrote the patch to do that before realising
   I didn't need it...)
 * The cluster object now fills in the CPU object's
   cluster_index field when the cluster object is realized.
   This imposes an ordering constraint that all CPUs must
   be added to a cluster before realizing the cluster. That's
   not a big deal, except that unfortunately QOM provides
   no way that I could find to enforce this. If anybody
   has a suggestion for a better approach that would be great.
   Patch 1 therefore makes sure our only current user of the
   cluster feature obeys the ordering constraint.
 * Patch 4 is a tidyup to the gdbstub code which is possible
   now that the CPUState struct has the cluster ID in it.

I believe that this fix is basically all we need to support
heterogenous setups (assuming of course that you just mean
"within an architecture family" like Arm, rather than
systems with say both an Arm and a Microblaze core in them).
The other things I have considered are:

 * code that calls cpu_physical_memory_read() and friends,
   uses address_space_memory, or otherwise assumes there's
   only a single view of "CPU memory": I sent some patches
   out before Christmas that fixed these in generic code
   like the monitor and disassembler. There are also some
   uses in target-arch or device-specific code, which I
   think we can in practice ignore unless/until we come to
   implement a board that is heterogenous and also uses those
   devices or target architectures.
 * handling of TB invalidation: I think this should Just Work,
   because tb_invalidate_phys_addr() takes an AddressSpace+hwaddr,
   which it converts into a ramaddr for (the badly misnamed)
   tb_invalidate_phys_page_range(). So the TB caching all works
   on ramaddrs, and if CPU 1 writes to some ram X that's mapped
   at physaddr A for it but at physaddr B for CPU 2, we will
   still correctly invalidate any TBs that CPU 2 had for code
   that from its point of view lives at physaddr B.
   Note that translate-all.c has a single l1_map[] structure
   which will have PageDesc entries for pages for all CPUs in
   the system. I believe this is OK because the only thing
   we use them for is TB invalidation.
 * dirty memory tracking: like TB invalidation, this is OK
   because it all works either on MemoryRegions or on ramaddrs,
   not on physaddrs.

Have I missed anything that we also need to fix ?

thanks
-- PMM

Peter Maydell (4):
  hw/arm/xlx-zynqmp: Realize cluster after putting RPUs in it
  qom/cpu: Add cluster_index to CPUState
  accel/tcg: Add cluster number to TCG TB hash
  gdbstub: Simplify gdb_get_cpu_pid() to use cpu->cluster_index

 include/exec/exec-all.h   |  4 +++-
 include/hw/cpu/cluster.h  | 24 ++++++++++++++++++++
 include/qom/cpu.h         |  7 ++++++
 accel/tcg/cpu-exec.c      |  3 +++
 accel/tcg/translate-all.c |  3 +++
 gdbstub.c                 | 48 ++++-----------------------------------
 hw/arm/xlnx-zynqmp.c      |  4 ++--
 hw/cpu/cluster.c          | 46 +++++++++++++++++++++++++++++++++++++
 qom/cpu.c                 |  1 +
 9 files changed, 94 insertions(+), 46 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 12+ messages in thread