All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] Reducing NEED_CPU_H usage
@ 2022-12-28 16:16 Alessandro Di Federico via
  2022-12-29 19:13 ` Taylor Simpson
  2023-01-10 19:56 ` Richard Henderson
  0 siblings, 2 replies; 7+ messages in thread
From: Alessandro Di Federico via @ 2022-12-28 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Taylor Simpson, Philippe Mathieu-Daudé,
	Richard Henderson, Alex Bennée

Hello everyone, this is a proposal in the libtcg direction, i.e.,
enabling building a single QEMU executable featuring multiple *targets*
(i.e., TCG frontends).
This follows what we discussed in recent KVM calls and originally at
the KVM Forum Emulation BoF.

Note that our commitment to this project is possible thanks to our
collaboration with the Qualcomm Innovation Center!

# Grand goal

The following is the core point where translation of input code to host
code takes place:

    TranslationBlock *tb_gen_code(CPUState *cpu,
                                  target_ulong pc, target_ulong cs_base,
                                  uint32_t flags, int cflags)
    {
        // ...

        gen_intermediate_code(cpu, tb, max_insns, pc, host_pc);
        
        // ...
    
        gen_code_size = tcg_gen_code(tcg_ctx, tb, pc);
    
        // ...
    }

    https://github.com/qemu/qemu/blob/4e4fa6c/accel/tcg/translate-all.c#L816

We want:

* `gen_intermediate_code` to be target-specific (provided by a shared
  library)
* `tb_gen_code` and `tcg_gen_code` to be target-agnostic

Result: the frontend just emits instructions into the
`TranslationBlock`, not much more.

But first we need to make sure that QEMU is neatly divided into
components that can make compile-time assumptions about the target and
components that don't, which is what this proposal is about.

This will be a large chunk of work affecting many parts of the code and
requiring some infrastructural changes.

# Strategy

The strategy to make this happen is primarily to have an out-of-tree
branch where the development of libtcg takes place. The main goal is to
be able to limit usage of `NEED_CPU_H` and some other macros only to
certain places (mainly `target/*` and `{bsd,linux}-user/*`).

Initially we will focus on getting a `--target-list=x86_64-linux-user`
configuration to build with reduced usage of `NEED_CPU_H`. This is just
for simplicity, we intend to tackle all other configurations
step-by-step.

At first, there will be lots of hacks to make it build and QEMU won't
be operational, but building it will enable us to identify all the core
issues.

Once all the core issues have been identified, we intend to prepare
fixes for each one of them and frequently send them upstream in a form
palatable for master. For instance, we expect to have a patch to move
from `TARGET_TB_PCREL` to `TranslationBlock.pc_rel`.

This way, we hope to be able to tackle the project in small bites and,
at a certain point, send a patch upstream where `NEED_CPU_H` is
actually disabled in many many places. Given the amount of work, doing
everything in a single shot does not seem feasible or beneficial.

# Core issues identifed so far

In the following we report some of the core issues we identified, why
they are problematic and what we intend to do to tackle them. Feedback
on each and every point is very welcome! :)

## `NEED_CPU_H`

`NEED_CPU_H` is a macro to decide whether a certain translation unit
needs `target/$ARCH/cpu.h` or not. In several headers, this macro is
used to guard target-specific code.

Problem: we need to build translation units out of the `target/`
and `*-user/` directories *without* this macro being defined.
This also means that many files will be compiled only once and not once
per target!

Solution: Fix all the issues caused by undefining it (main ones
are reported below). Sometimes a solution is to take headers that are
currently target-specific headers and introduce `#ifdef NEED_CPU_H` in
certain parts so that they become target-agnostic.

At a certain point it might make sense to count how many lines are
guarded by such macro and consider splitting headers into
target-agnostic and target-specific ones.

Note: currently, we plan to get rid of all usages of `TARGET_*`
macros in code that we want to become target agnostic but *not* of
`CONFIG_USER_ONLY` and `CONFIG_SOFTMMU`. At least initially, our goal
is just to decouple translation units out of `target/`/`*-user/` from
target-specific assumptions.

## `target_ulong`

`target_ulong` is `uint32_t` in 32-bit targets and `uint64_t` in 64-bit
targets.

Problem: This is used in many many places to represent addresses in
code that could become target-independent.

Proposed solution: we can convert it to:

    typedef uint64_t target_address;

The problem with this is that, if arithmetic operations are performed
on it, we might get undesired results:

    // Was: char load_data(target_ulong address)
    char load_data(target_address address) {
      char *base_address = get_base_address();
      // On a 32-bits target this would overflow, it doesn't with
      // uint64_t
      target_address real_address = address + 1;
      return *(base_address + real_address);
    }

    load_data(UINT32_MAX);

This is might be a source of subtle bugs.
We're using a rough `clang-query` invocation to see who performs
arithmetic on `target_ulong`:

    $ cd build
    $ cat compile_commands.json \
        | jq -r '.[] | .["file"]' \
        | grep -v 'target/' \
        | grep -v 'linux-user/' \
        | xargs -P$(nproc) -n1 \
          clang-query \
            --extra-arg='-Wno-error' \
            --extra-arg='-w' \
            -p . \
            -c 'match binaryOperator(anyOf(hasOperatorName("-"),
                                           hasOperatorName("+")),
                                     hasDescendant(expr(hasType(typedefDecl(anyOf(hasName("target_ulong"),
                                                                                  hasName("abi_ulong"))))))).bind("x")' \
        | grep -v '0 matches.' \
        | grep 'binds here' \
        | sort -u \
        | wc -l
    34

Not too bad.

An alternative would be to do some "data hiding" and have:

    struct target_address {
      uint64_t address;
    };

And then have `increment_address(target_address *, CPUState *)` in
order to perform safe arithmetic depending on the current target.
However this seems a bit excessive, we could rely on simply letting the
developer mask addresses in cases of arithmetic in target-independent
code.

## `abi_ulong`

Similar to `target_ulong`, but with alignment info.

The role of `abi_ulong` is not entirely clear to me. Some feedback on
this would be welcome.

## `TCGv`

`TCGv` is a macro for `TCGv_i32` for 32-bit targets and `TCGv_i64`
for 64-bit targets.

Problem: it makes `tcg-op.h` and, more importantly, `tcg-op.c`
target-specific.

Solution: transform current functions using them into target-specific
wrappers that dispatch to target-agnostic functions that accept
`TCGv_dyn` instead of `TCGv`:

    typedef struct {
        union {
            TCGv_i32 i32;
            TCGv_i64 i64;
        };
        bool is_64;
    } TCGv_dyn;

In the semi-last KVM call, it has been suggested that instead of having
a discriminated `union` we could stuff the address size in `MemOp`,
which is a common argument to functions taking a `TCGv`. However, that
would either require one of the following:

* still have `TCG_dyn` as a union, use `MemOp` instead of the `is_64`
  field; however I think the discriminated union is more suitable and
  it'd be a bit of an abuse to use `MemOp` since currently it describe
  *how* to perform the load but it says nothing about the size of the
  address;

* make `TCGv` become opaque (`typedef struct TCGv_d *TCGv`) and put a
  cast in `tcg-op.h` code, we'd lose type safety here;

* use `_Generic` but that, in turn, would require either moving all
  functions taking a `TCGv` to the header and turn them into macros
  *or* turn `tcg-op.c` into a template file where each function taking
  a `TCGv` is instantiated twice with different suffixes depending on
  the type of the argument;

The changeset for the `TCGv_dyn` solution, on the other hand, is rather
small:

    https://github.com/revng/qemu-hexagon/commit/f4098d84c241b86418e401396851d12f2fdbcf7a
    https://github.com/revng/qemu-hexagon/commit/251ef7631640e645a03b0ffa72fa96f3eb686129

## `TARGET_` macros

These are macros that provide target-specific information.

Problem: they need to be abandoned in translation units that need to
become target agnostic.

Solution: promote them to fields of a `struct`.
Current ideas:

    TARGET_TB_PCREL -> TranslationBlock.pc_rel
    TARGET_PAGE_BITS -> TranslationBlock.page_bits
    TARGET_PAGE_MASK -> TranslationBlock.page_mask
    TARGET_PAGE_ALIGN -> CPUArchState.page_align
                      -> DisasContextBase.page_align
    TARGET_LONG_BITS -> TCGContext.long_bits 
    TARGET_PAGE_SIZE -> ???
    TCG_OVERSIZED_GUEST -> ???
    TARGET_FMT_lx -> ???
    CPU_RESOLVING_TYPE -> ???

Sometimes, this is trivial, other times this means adding arguments to
functions that previously could rely on direct macro use.

## `CPUState`

`CPUState` is a target-agnostic `struct` representing common information
of `ArchCPU`.

Problem: given an (opaque) pointer to `CPU${Arch}State`,
target-agnostic code often wants to reach `CPUState`, but this requires
knowledge of the layout of `ArchCPU`, which is target-dependent.

Solution: have a target-specific function to obtain the pointer to
`CPUState` given `CPU${Arch}State`. Where this function would go, and
how it would be retrieved, needs more consideration.
I imagine a table of target-specific function pointers.

## `CPUNegativeOffsetState`

`CPUNegativeOffsetState` is a `struct` placed in `ArchCPU` before
`CPU${Arch}State`.

    struct ArchCPU {
        /*< private >*/
        CPUState parent_obj;
        /*< public >*/
    
        CPUNegativeOffsetState neg;
        CPUAlphaState env;
    
        /* This alarm doesn't exist in real hardware; we wish it did.
    */ QEMUTimer *alarm_timer;
    };

Problem: used in several parts of the code that need to become
target-agnostic.

Solution: make it target-agnostic and push it into `CPUState`, which is
the place for target-agnostic stuff.

I'm not sure why this isn't already the case. I'm probably missing
something here, so feedback is welcome.

# What's left out?

Quite a few minor things but.. you tell me!
There's probably *a lot* more stuff when we dig into system mode
emulation, but I'd postpone investigating those issues until we have
user mode in a decent shape.

Apart from this, let me know if there's something else we're missing.

# Current status

We currently have a branch where we can build (but not link nor run) a
`x86_64-linux-user` configuration where `NEED_CPU_H` is defined only
for translation units in `target/` and `linux-user/`.

It's the result of a (painful) journey. You can try it here:

    git clone https://github.com/revng/qemu-hexagon.git
    cd qemu-hexagon
    git checkout feature/multi-tcg
    mkdir build
    cd build
    ../configure --target-list=x86_64-linux-user
    ninja -k0 | grep FAILED
    # It should build but fail linking qemu-x86_64

This process required commenting a lot of code, it's not supposed to
work at all and it's quite ugly. However, it has been very useful since
it enabled us to use the compiler to discover what is problematic.

Sorry for the wall of text.

Enjoy the last few days of 2022!

-- 
Alessandro Di Federico
rev.ng Labs


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [RFC] Reducing NEED_CPU_H usage
  2022-12-28 16:16 [RFC] Reducing NEED_CPU_H usage Alessandro Di Federico via
@ 2022-12-29 19:13 ` Taylor Simpson
  2023-01-10 19:56 ` Richard Henderson
  1 sibling, 0 replies; 7+ messages in thread
From: Taylor Simpson @ 2022-12-29 19:13 UTC (permalink / raw)
  To: Alessandro Di Federico, qemu-devel
  Cc: Philippe Mathieu-Daudé,
	Richard Henderson, Alex Bennée, Brian Cain, Mark Burton



> -----Original Message-----
> From: Alessandro Di Federico <ale@rev.ng>
> Sent: Wednesday, December 28, 2022 10:16 AM
> To: qemu-devel@nongnu.org
> Cc: Taylor Simpson <tsimpson@quicinc.com>; Philippe Mathieu-Daudé
> <philmd@linaro.org>; Richard Henderson <richard.henderson@linaro.org>;
> Alex Bennée <alex.bennee@linaro.org>
> Subject: [RFC] Reducing NEED_CPU_H usage
> 
> Hello everyone, this is a proposal in the libtcg direction, i.e., enabling building
> a single QEMU executable featuring multiple *targets* (i.e., TCG frontends).
> This follows what we discussed in recent KVM calls and originally at the KVM
> Forum Emulation BoF.
> 
> Note that our commitment to this project is possible thanks to our
> collaboration with the Qualcomm Innovation Center!
> 

Thanks for sharing.  My $.02 is to change the subject line to reflect the total initiative, not just NEED_CPU_H.  That would make more folks notice it in their inbox.

> # What's left out?
> 
> Quite a few minor things but.. you tell me!
> There's probably *a lot* more stuff when we dig into system mode
> emulation, but I'd postpone investigating those issues until we have user
> mode in a decent shape.

One thing I can think of is the TARGET_<arch> macros .  Some examples are in fpu/softfloat-specialize.c.inc and linux-user/user-internals.h.

Thanks,
Taylor



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Reducing NEED_CPU_H usage
  2022-12-28 16:16 [RFC] Reducing NEED_CPU_H usage Alessandro Di Federico via
  2022-12-29 19:13 ` Taylor Simpson
@ 2023-01-10 19:56 ` Richard Henderson
  2023-01-12 15:28   ` Alessandro Di Federico via
  1 sibling, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2023-01-10 19:56 UTC (permalink / raw)
  To: Alessandro Di Federico, qemu-devel
  Cc: Taylor Simpson, Philippe Mathieu-Daudé, Alex Bennée

On 12/28/22 08:16, Alessandro Di Federico wrote:
> ## `target_ulong`
> 
> `target_ulong` is `uint32_t` in 32-bit targets and `uint64_t` in 64-bit
> targets.
> 
> Problem: This is used in many many places to represent addresses in
> code that could become target-independent.
> 
> Proposed solution: we can convert it to:
> 
>      typedef uint64_t target_address;

We have other typedefs that are better for this, e.g. vaddr.

However, at some point we do want to keep some target addresses in the proper size.  For 
instance within the softmmu tlb, where CPUTLBEntry is either 16 or 32 bytes, depending.

(On the other hand, if we drop support for 32-bit hosts, as we keep threatening to do, 
then CPUTLB is always 32 bytes, and we might as well use vaddr there too.  But not until 
32-bit hosts are gone.)

> 
> The problem with this is that, if arithmetic operations are performed
> on it, we might get undesired results:
> 
>      // Was: char load_data(target_ulong address)
>      char load_data(target_address address) {
>        char *base_address = get_base_address();
>        // On a 32-bits target this would overflow, it doesn't with
>        // uint64_t
>        target_address real_address = address + 1;
>        return *(base_address + real_address);
>      }

Doesn't, or shouldn't matter, because we should never do anything like this in generic 
code.  Note that

     vaddr ptr = ...;
     cpu_ldl_le_data(env, ptr + offset)

does not have the problem you describe, because any overflow is truncated within the load 
function.


> ## `abi_ulong`
> 
> Similar to `target_ulong`, but with alignment info.

Pardon?  There's no alignment info in abi_ulong.

The difference is that 'target_ulong' is the size of the target register, and 'abi_ulong' 
is the 'unsigned long' in the target's C ABI.  Consider e.g. x32 (x86_64 with ilp32 abi), 
for which target_ulong is 64-bit but abi_ulong is 32-bit.

This only applies to user-only, and should not matter for this project.

There *is* an 'abi_ptr' type, which is shared between softmmu and user-only, which might 
be able to be replaced by 'vaddr'.  Or 'typedef vaddr abi_ptr' in softmmu mode.  I haven't 
done a survey on that to be certain.

> ## `TCGv`
> 
> `TCGv` is a macro for `TCGv_i32` for 32-bit targets and `TCGv_i64`
> for 64-bit targets.

The idea is that this macro should only be visible to target-specific code, and the macro 
provides the swizzling/encoding to the concrete type functions.

> Problem: it makes `tcg-op.h` 

This is fine.

> and, more importantly, `tcg-op.c`

This one requires some work within tcg/ to handle two target address sizes simultaneously. 
  It should not be technically difficult to solve, but it does involve adding a few TCG 
opcodes and adjusting all tcg backends.


> Solution: transform current functions using them into target-specific
> wrappers that dispatch to target-agnostic functions that accept
> `TCGv_dyn` instead of `TCGv`:
> 
>      typedef struct {
>          union {
>              TCGv_i32 i32;
>              TCGv_i64 i64;
>          };
>          bool is_64;
>      } TCGv_dyn;

This forgets that both TCGv_i32 and TCGv_i64 are represented by TCGTemp, which contains 
'TCGType type' to discriminate.  This is not exposed to target/, but it's there.

Anyway, there's no need for this.

> ## `TARGET_` macros
> 
> These are macros that provide target-specific information.
> 
> Problem: they need to be abandoned in translation units that need to
> become target agnostic.
> 
> Solution: promote them to fields of a `struct`.
> Current ideas:
> 
>      TARGET_TB_PCREL -> TranslationBlock.pc_rel

I'd been thinking a bit on the cpu, but a CF_* bit works well.
It gets initialized for each TB from CPUState.tcg_cflags.
TBD where we'd initialize the new bit for each cpu...


>      TARGET_PAGE_BITS -> TranslationBlock.page_bits
>      TARGET_PAGE_MASK -> TranslationBlock.page_mask

You need to look at how TARGET_PAGE_BITS_VARY works.  The memory subsystem needs rewriting 
if we were to support multiple page sizes.  What we can support now is one single global 
page size, selected at startup.


>      TARGET_PAGE_ALIGN -> CPUArchState.page_align
>                        -> DisasContextBase.page_align

This remains a trivial macro based on the variable TARGET_PAGE_MASK.


>      TARGET_LONG_BITS -> TCGContext.long_bits

Yes.

I've been considering how to generalize this to arbitrary address widths, in order to 
better support ARM top-byte-ignore and RISC-V J extension (pointer masking).  But in the 
short term I'm happy with this number being exactly 64 or 32.

>      TARGET_PAGE_SIZE -> ???

Remains a trivial macro based on the variable TARGET_PAGE_MASK.

>      TCG_OVERSIZED_GUEST -> ???

Goes away if we drop support for 32-bit hosts, or restrict 32-bit hosts to 32-bit guests. 
I have no other good ideas.


>      TARGET_FMT_lx -> ???

VADDR_PRIx, mostly.  May need resolving on a case-by-case basis.

>      CPU_RESOLVING_TYPE -> ???

Would need to be part of the per-target shared library interface.

> ## `CPUState`
> 
> `CPUState` is a target-agnostic `struct` representing common information
> of `ArchCPU`.
> 
> Problem: given an (opaque) pointer to `CPU${Arch}State`,
> target-agnostic code often wants to reach `CPUState`, but this requires
> knowledge of the layout of `ArchCPU`, which is target-dependent.
> 
> Solution: have a target-specific function to obtain the pointer to
> `CPUState` given `CPU${Arch}State`. Where this function would go, and
> how it would be retrieved, needs more consideration.
> I imagine a table of target-specific function pointers.

For the most part, generic code should be converted to use CPUState as much as possible, 
and only target-specific code should deal with CPUArchState.

However, generic code and the tcg backend need to be able to find CPUNegativeOffsetState 
from CPUArchState.  This is (and must be) a per-target constant; we cannot allow full 
flexibility of function pointers.

We currently have CPUState.env_ptr; we can add 'CPUNegativeOffsetState *neg_ptr' to match, 
then the per-target constant is the difference between the two pointers.

> ## `CPUNegativeOffsetState`
> 
> `CPUNegativeOffsetState` is a `struct` placed in `ArchCPU` before
> `CPU${Arch}State`.
> 
>      struct ArchCPU {
>          /*< private >*/
>          CPUState parent_obj;
>          /*< public >*/
>      
>          CPUNegativeOffsetState neg;
>          CPUAlphaState env;
>      
>          /* This alarm doesn't exist in real hardware; we wish it did.
>      */ QEMUTimer *alarm_timer;
>      };
> 
> Problem: used in several parts of the code that need to become
> target-agnostic.
> 
> Solution: make it target-agnostic and push it into `CPUState`, which is
> the place for target-agnostic stuff.
> 
> I'm not sure why this isn't already the case. I'm probably missing
> something here, so feedback is welcome.

For code generation quality, we need small negative displacements from env to find 
icount_decr (used at the start of each TB), and CPUTLBDescFast[n].  Host specific 
addressing modes (arm32, arm64, riscv) require the maximum of these displacements to be >= 
-(1 << 11).

Thus CPUNegativeOffsetState, and the documented (but not enforced) requirement that 'neg' 
immediately precede 'env'.  The alignment requirements of env mean that there may be some 
small amount of padding between the two.

Before CPUNegativeOffsetState, we had all of those variables within CPUState, and even 
comments that they should remain at the end of the struct.  But those comments were 
ignored and one day icount_decr was too far away from env for easy addressing on arm host. 
  Luckily there was an assert that fired, but it didn't get caught right away.

As for NB_MMU_MODES, which CPUTLB depends on, we could fix this at 16.  This is larger 
than all current targets (arm, 12; ppc 10), and is also the maximum currently supported by 
the softmmu tlb api (uint16_t idxmap).

Once upon a time it was quite expensive to have many mmu modes, as we directly allocated 
their storage within CPUState.  But now we dynamically resize softmmu tlbs, so the 
overhead of an unused mmu index is fairly low (sizeof(CPUTLBDesc) + sizeof(CPUTLBDescFast)).

> # Current status
> 
> We currently have a branch where we can build (but not link nor run) a
> `x86_64-linux-user` configuration where `NEED_CPU_H` is defined only
> for translation units in `target/` and `linux-user/`.

This effort should be exclusive to system mode -- don't consider *-user at all.
Each linux-user target bakes in not just target architecture parameters, which are 
complicated enough, but also C ABI, and kernel API.  Moreover, the most common use case 
for linux-user is a statically linked binary that operates within a chroot.

All you need for *-user is to make sure you don't accidentally break them while doing 
system mode cleanup.


r~


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Reducing NEED_CPU_H usage
  2023-01-10 19:56 ` Richard Henderson
@ 2023-01-12 15:28   ` Alessandro Di Federico via
  2023-01-12 22:40     ` Richard Henderson
  2023-01-15  1:52     ` Richard Henderson
  0 siblings, 2 replies; 7+ messages in thread
From: Alessandro Di Federico via @ 2023-01-12 15:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, Taylor Simpson, Alex Bennée

On Tue, 10 Jan 2023 11:56:50 -0800
Richard Henderson <richard.henderson@linaro.org> wrote:

> However, at some point we do want to keep some target addresses in
> the proper size.  For instance within the softmmu tlb, where
> CPUTLBEntry is either 16 or 32 bytes, depending.

So that would be an optimization if `HOST_LONG_BITS == 32 &&
TARGET_LONG_BITS == 32`, correct?

I've heard about dropping 32 bit hosts multiple times here and there.
Maybe we could start with dropping oversized guests, which AFAIU are
the real offenders for most (all?) of these situations.

> > ## `abi_ulong`
> > 
> > Similar to `target_ulong`, but with alignment info.  
> 
> Pardon?  There's no alignment info in abi_ulong.

From include/exec/user/abitypes.h:

    typedef uint32_t abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT)));
    typedef target_ulong abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT)));

I thought that was the difference. Thanks for the clarification.

> This one requires some work within tcg/ to handle two target address
> sizes simultaneously. It should not be technically difficult to
> solve, but it does involve adding a few TCG opcodes and adjusting all
> tcg backends.

I'm a bit confused by this, do backends for some reason have
expectations about the address size?
Wouldn't this be enough?

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 019fab00ccb..175162b8fef 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -29,0 +29,0 @@
@@ -2827,17 +2843,17 @@ static inline MemOp tcg_canonicalize_memop(MemOp op, bool is64, bool st)
     return op;
 }
 
-static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
+static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv_dyn addr,
                          MemOp memop, TCGArg idx)
 {
     MemOpIdx oi = make_memop_idx(memop, idx);
-#if TARGET_LONG_BITS == 32
-    tcg_gen_op3i_i32(opc, val, addr, oi);
-#else
-    if (TCG_TARGET_REG_BITS == 32) {
-        tcg_gen_op4i_i32(opc, val, TCGV_LOW(addr), TCGV_HIGH(addr), oi);
-    } else {
-        tcg_gen_op3(opc, tcgv_i32_arg(val), tcgv_i64_arg(addr), oi);
-    }
-#endif
+    if (addr.size == 32) {
+        tcg_gen_op3i_i32(opc, val, addr.i32, oi);
+    } else {
+        if (TCG_TARGET_REG_BITS == 32) {
+            tcg_gen_op4i_i32(opc, val, TCGV_LOW(addr.i64), TCGV_HIGH(addr.i64), oi);
+        } else {
+            tcg_gen_op3(opc, tcgv_i32_arg(val), tcgv_i64_arg(addr.i64), oi);
+        }
+    }
 }


> This forgets that both TCGv_i32 and TCGv_i64 are represented by
> TCGTemp,

    https://i.imgflip.com/777wax.jpg

> which contains 'TCGType type' to discriminate.  This is not
> exposed to target/, but it's there.
> 
> Anyway, there's no need for this.

So, if I got it right, we could just make TCGv become a new opaque
type, propagate it down until the spot where we actually need to know
its size and then just have some `TCGTemp *tcgv_temp(TCGv v)` function
to inspect the actual size?

Makes sense!

> Before CPUNegativeOffsetState, we had all of those variables within
> CPUState, and even comments that they should remain at the end of the
> struct.  But those comments were ignored and one day icount_decr was
> too far away from env for easy addressing on arm host. Luckily there
> was an assert that fired, but it didn't get caught right away.

How comes it wasn't caught immediately?
We could have something like:

    QEMU_BUILD_BUG_MSG(&ArchCPU.env - &ArchCPU.neg.tlb
                       < DESIRED_THRESHOLD)

> > # Current status
> > 
> > We currently have a branch where we can build (but not link nor
> > run) a `x86_64-linux-user` configuration where `NEED_CPU_H` is
> > defined only for translation units in `target/` and `linux-user/`.  
> 
> This effort should be exclusive to system mode -- don't consider
> *-user at all. Each linux-user target bakes in not just target
> architecture parameters, which are complicated enough, but also C
> ABI, and kernel API.  Moreover, the most common use case for
> linux-user is a statically linked binary that operates within a
> chroot.

Our current goal is to get the following compilation unit to build
without NEED_CPU_H:

    trace/control-target.c
    gdbstub/gdbstub.c
    cpu.c
    disas.c
    page-vary.c
    tcg/optimize.c
    tcg/region.c
    tcg/tcg.c
    tcg/tcg-common.c
    tcg/tcg-op.c
    tcg/tcg-op-gvec.c
    tcg/tcg-op-vec.c
    fpu/softfloat.c
    accel/accel-common.c
    accel/tcg/tcg-all.c
    accel/tcg/cpu-exec-common.c
    accel/tcg/cpu-exec.c
    accel/tcg/tb-maint.c
    accel/tcg/tcg-runtime-gvec.c
    accel/tcg/tcg-runtime.c
    accel/tcg/translate-all.c
    accel/tcg/translator.c
    accel/tcg/user-exec.c
    accel/tcg/user-exec-stub.c
    accel/tcg/plugin-gen.c
    plugins/loader.c
    plugins/core.c
    plugins/api.c

They are subset of `arch_srcs` from `meson.build`.

Making them target agnostic for *-user too should be easy and could save
some build time.
But yeah, we'll now focus on system-mode.

We'll now try to sort out things from the easiest to the most complex
and start send out patches.

Thanks a lot for your valuable insights!

-- 
Alessandro Di Federico
rev.ng Labs


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC] Reducing NEED_CPU_H usage
  2023-01-12 15:28   ` Alessandro Di Federico via
@ 2023-01-12 22:40     ` Richard Henderson
  2023-01-15  1:52     ` Richard Henderson
  1 sibling, 0 replies; 7+ messages in thread
From: Richard Henderson @ 2023-01-12 22:40 UTC (permalink / raw)
  To: Alessandro Di Federico
  Cc: qemu-devel, Taylor Simpson, Alex Bennée,
	Philippe Mathieu-Daudé

On 1/12/23 07:28, Alessandro Di Federico wrote:
> On Tue, 10 Jan 2023 11:56:50 -0800
> Richard Henderson <richard.henderson@linaro.org> wrote:
> 
>> However, at some point we do want to keep some target addresses in
>> the proper size.  For instance within the softmmu tlb, where
>> CPUTLBEntry is either 16 or 32 bytes, depending.
> 
> So that would be an optimization if `HOST_LONG_BITS == 32 &&
> TARGET_LONG_BITS == 32`, correct?

At present, not an 'optimization' but 'natural fallout of type sizes'.
But, yeah.

>>> ## `abi_ulong`
>>>
>>> Similar to `target_ulong`, but with alignment info.
>>
>> Pardon?  There's no alignment info in abi_ulong.
> 
>  From include/exec/user/abitypes.h:
> 
>      typedef uint32_t abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT)));
>      typedef target_ulong abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT)));
> 
> I thought that was the difference. Thanks for the clarification.

Ah, I see what you mean.  I *believe* that use of target_ulong could actually be uint64_t, 
since all TARGET_LONG_BITS == 32 ought to have TARGET_ABI32 set too (which brings us to 
that first definition with uint32_t.)

There is a target alignment difference, for the benefit of e.g. m68k, which has 
sizeof(long) == 4 and __alignof(long) == 2, which may differ by the host.

In any case, all of "exec/abitypes.h" is (or should be) user-only related, so out of scope 
for this project.

>> This one requires some work within tcg/ to handle two target address
>> sizes simultaneously. It should not be technically difficult to
>> solve, but it does involve adding a few TCG opcodes and adjusting all
>> tcg backends.
> 
> I'm a bit confused by this, do backends for some reason have
> expectations about the address size?
> Wouldn't this be enough?

Yes, this affects the tcg backend as well:

static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
{
...
     datalo = *args++;
     datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0);
     addrlo = *args++;
     addrhi = (TARGET_LONG_BITS > TCG_TARGET_REG_BITS ? *args++ : 0);
     oi = *args++;

and further code generation changes especially for 64-bit guest pointer comparisons on 
32-bit hosts.  (There's that nasty combination again.)

There's also code generation changes for 32-bit guest pointer comparisons on 64-bit hosts, 
e.g.

static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
                                     int mem_index, MemOp opc,
                                     tcg_insn_unit **label_ptr, int which)
{
...
     TCGType tlbtype = TCG_TYPE_I32;
     int trexw = 0, hrexw = 0, tlbrexw = 0;
...
     if (TCG_TARGET_REG_BITS == 64) {
         if (TARGET_LONG_BITS == 64) {
             ttype = TCG_TYPE_I64;
             trexw = P_REXW;
         }
         if (TCG_TYPE_PTR == TCG_TYPE_I64) {
             hrexw = P_REXW;
             if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) {
                 tlbtype = TCG_TYPE_I64;
                 tlbrexw = P_REXW;
             }
         }
     }
...
     /* cmp 0(r0), r1 */
     tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which);

which generates either 'cmpl' or 'cmpq' depending on the guest address size.

>> Anyway, there's no need for this.
> 
> So, if I got it right, we could just make TCGv become a new opaque
> type, propagate it down until the spot where we actually need to know
> its size and then just have some `TCGTemp *tcgv_temp(TCGv v)` function
> to inspect the actual size?

No, leave TCGv as a macro, but we need changes to "tcg/tcg-op*.h", so that all of the 
above tcg backend stuff *also* gets disconnected from TARGET_LONG_BITS.

> Makes sense!
> 
>> Before CPUNegativeOffsetState, we had all of those variables within
>> CPUState, and even comments that they should remain at the end of the
>> struct.  But those comments were ignored and one day icount_decr was
>> too far away from env for easy addressing on arm host. Luckily there
>> was an assert that fired, but it didn't get caught right away.
> 
> How comes it wasn't caught immediately?
> We could have something like:
> 
>      QEMU_BUILD_BUG_MSG(&ArchCPU.env - &ArchCPU.neg.tlb
>                         < DESIRED_THRESHOLD)

We probably should.

But it didn't affect x86, and cross-build testing wasn't sufficient, so we didn't find out 
later when someone did runtime testing on the affected hosts.

> Our current goal is to get the following compilation unit to build
> without NEED_CPU_H:
> 
>      trace/control-target.c
>      gdbstub/gdbstub.c
>      cpu.c
>      disas.c
>      page-vary.c
>      tcg/optimize.c
>      tcg/region.c
>      tcg/tcg.c
>      tcg/tcg-common.c
>      tcg/tcg-op.c
>      tcg/tcg-op-gvec.c
>      tcg/tcg-op-vec.c
>      fpu/softfloat.c
>      accel/accel-common.c
>      accel/tcg/tcg-all.c
>      accel/tcg/cpu-exec-common.c
>      accel/tcg/cpu-exec.c
>      accel/tcg/tb-maint.c
>      accel/tcg/tcg-runtime-gvec.c
>      accel/tcg/tcg-runtime.c
>      accel/tcg/translate-all.c
>      accel/tcg/translator.c
>      accel/tcg/user-exec.c
>      accel/tcg/user-exec-stub.c
>      accel/tcg/plugin-gen.c
>      plugins/loader.c
>      plugins/core.c
>      plugins/api.c
> 
> They are subset of `arch_srcs` from `meson.build`.

Reasonable.  I think accel/tcg/user-exec.c is the only one that isn't also required for 
system mode, and completing accel/tcg/ would be less confusing than not.


r~


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Reducing NEED_CPU_H usage
  2023-01-12 15:28   ` Alessandro Di Federico via
  2023-01-12 22:40     ` Richard Henderson
@ 2023-01-15  1:52     ` Richard Henderson
  2023-03-02 17:44       ` Taylor Simpson
  1 sibling, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2023-01-15  1:52 UTC (permalink / raw)
  To: Alessandro Di Federico; +Cc: qemu-devel, Taylor Simpson, Alex Bennée

On 1/12/23 05:28, Alessandro Di Federico wrote:
>      fpu/softfloat.c

Something I happened to notice while doing other triage:

     https://gitlab.com/qemu-project/qemu/-/issues/1375

This is an x86 problem that currently has no solution, but ought to be trivial with the 
changes to softfloat required for this project.


r~


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [RFC] Reducing NEED_CPU_H usage
  2023-01-15  1:52     ` Richard Henderson
@ 2023-03-02 17:44       ` Taylor Simpson
  0 siblings, 0 replies; 7+ messages in thread
From: Taylor Simpson @ 2023-03-02 17:44 UTC (permalink / raw)
  To: Richard Henderson, Alessandro Di Federico; +Cc: qemu-devel, Alex Bennée



> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Saturday, January 14, 2023 6:53 PM
> To: Alessandro Di Federico <ale@rev.ng>
> Cc: qemu-devel@nongnu.org; Taylor Simpson <tsimpson@quicinc.com>;
> Alex Bennée <alex.bennee@linaro.org>
> Subject: Re: [RFC] Reducing NEED_CPU_H usage
> 
> On 1/12/23 05:28, Alessandro Di Federico wrote:
> >      fpu/softfloat.c
> 
> Something I happened to notice while doing other triage:
> 
>      https://gitlab.com/qemu-project/qemu/-/issues/1375
> 
> This is an x86 problem that currently has no solution, but ought to be trivial
> with the changes to softfloat required for this project.

One other thing we'll need to deal with is command-line options.  In a heterogeneous system, options like -cpu and -d will need a way to direct different settings to each target.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-03-02 17:45 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-28 16:16 [RFC] Reducing NEED_CPU_H usage Alessandro Di Federico via
2022-12-29 19:13 ` Taylor Simpson
2023-01-10 19:56 ` Richard Henderson
2023-01-12 15:28   ` Alessandro Di Federico via
2023-01-12 22:40     ` Richard Henderson
2023-01-15  1:52     ` Richard Henderson
2023-03-02 17:44       ` Taylor Simpson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.