qemu-devel.nongnu.org archive mirror
* [Qemu-devel] QEMU aarch64 TCG target
@ 2013-03-14 15:57 Claudio Fontana
  2013-03-14 16:16 ` Peter Maydell
  2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
  0 siblings, 2 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-03-14 15:57 UTC (permalink / raw)
  To: qemu-devel

Hello all,

I am currently working on an aarch64 tcg target implementation,
based on the available gdb patches contributed by ARM and the results of the linaro toolchain.

I have implemented most opcodes, but I still need to clean things up and implement the
op_qemu_ld/st handling, which seems the most painful part; I also still need tcg_target_init and the prologue.

Is anybody else working on this?

Ciao,

Claudio

-- 
Claudio Fontana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] QEMU aarch64 TCG target
  2013-03-14 15:57 [Qemu-devel] QEMU aarch64 TCG target Claudio Fontana
@ 2013-03-14 16:16 ` Peter Maydell
  2013-05-06 12:56   ` [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64 Claudio Fontana
  2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-03-14 16:16 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: qemu-devel

On 14 March 2013 15:57, Claudio Fontana <claudio.fontana@huawei.com> wrote:
> I am currently working on an aarch64 tcg target implementation,
> based on the available gdb patches contributed by ARM and the results
> of the linaro toolchain.

Doing a target implementation based on the gdb/binutils
patches and not the actual documentation is going to be
enormously painful to review (to the point that I will almost
certainly just say "sorry, no"), because it will basically
be "you have the semantics of this wrong", "you have the
decoding wrong" all the way through for a whole pile of
corner cases. You need to be working from the actual ARM
documentation (which I regret is currently only available
under NDA).

See also the patchset that Alex Graf posted recently (which
is a bunch of framework code but not the actual decoder).

-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64
  2013-03-14 16:16 ` Peter Maydell
@ 2013-05-06 12:56   ` Claudio Fontana
  2013-05-06 13:27     ` Paolo Bonzini
  2013-05-06 13:42     ` [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64 Peter Maydell
  0 siblings, 2 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-06 12:56 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-devel

On 14.03.2013 17:16, Peter Maydell wrote:
> On 14 March 2013 15:57, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>> I am currently working on an aarch64 tcg target implementation,
>> based on the available gdb patches contributed by ARM and the results
>> of the linaro toolchain.
> 
> Doing a target implementation based on the gdb/binutils
> patches and not the actual documentation is going to be
> enormously painful to review (to the point that I will almost
> certainly just say "sorry, no"), because it will basically
> be "you have the semantics of this wrong", "you have the
> decoding wrong" all the way through for a whole pile of
> corner cases. You need to be working from the actual ARM
> documentation (which I regret is currently only available
> under NDA).
> 
> See also the patchset that Alex Graf posted recently (which
> is a bunch of framework code but not the actual decoder).
> 
> -- PMM
> 

Well, we have just completed a first working version of TCG support for aarch64 here;
it has been tested successfully on Foundation v8, running system emulation for various
targets (at the moment armv5/linux, armv7/linux, x86 FreeDOS, x86 Linux).

I understand that you have reservations on upstreaming this work for the reasons you explain above,
so for now it will be available to Huawei only. If anybody is interested, I will be happy to send the patches.

Now I have a question regarding the test images: I have seen various QEMU images at
wiki.qemu.org/Testing

I have tested with some of those, but I don't see an x86-64 test case;
is there a reference test kernel/image for x86-64?

Thanks,

Claudio Fontana
Server OS Architect
Huawei Technologies Duesseldorf GmbH
Riesstraße 25 - 80992 München

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64
  2013-05-06 12:56   ` [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64 Claudio Fontana
@ 2013-05-06 13:27     ` Paolo Bonzini
  2013-05-13 13:22       ` [Qemu-devel] [PATCH 0/3] ARM aarch64 TCG target Claudio Fontana
  2013-05-06 13:42     ` [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64 Peter Maydell
  1 sibling, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-05-06 13:27 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, qemu-devel

On 06/05/2013 14:56, Claudio Fontana wrote:
> On 14.03.2013 17:16, Peter Maydell wrote:
>> On 14 March 2013 15:57, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>> I am currently working on an aarch64 tcg target implementation,
>>> based on the available gdb patches contributed by ARM and the results
>>> of the linaro toolchain.
>>
>> Doing a target implementation based on the gdb/binutils
>> patches and not the actual documentation is going to be
>> enormously painful to review (to the point that I will almost
>> certainly just say "sorry, no"), because it will basically
>> be "you have the semantics of this wrong", "you have the
>> decoding wrong" all the way through for a whole pile of
>> corner cases. You need to be working from the actual ARM
>> documentation (which I regret is currently only available
>> under NDA).
>>
>> See also the patchset that Alex Graf posted recently (which
>> is a bunch of framework code but not the actual decoder).
>>
>> -- PMM
>>
> 
> Well, we have just completed a first working version of TCG support for aarch64 here;
> it has been tested successfully on Foundation v8, running system emulation for various
> targets (at the moment armv5/linux, armv7/linux, x86 FreeDOS, x86 Linux).
> 
> I understand that you have reservations on upstreaming this work for the reasons you explain above,
> so for now it will be available to Huawei only. If anybody is interested, I will be happy to send the patches.
> 
> Now I have a question regarding the test images: I have seen various QEMU images at
> wiki.qemu.org/Testing
> 
> I have tested with some of those, but I don't see an x86-64 test case;
> is there a reference test kernel/image for x86-64?

No, usually people just do a "smoke test" using their favorite distro
and/or Windows.

More complete integration testing of i386/x86-64 images is done with
virt-test, which supports a variety of distros.  The closest thing to a
reference image is virt-test's "JeOS" image at
http://lmr.fedorapeople.org/jeos/jeos-17-64.qcow2.7z (should probably be
added to the list...), currently based on Fedora 17.

Paolo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64
  2013-05-06 12:56   ` [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64 Claudio Fontana
  2013-05-06 13:27     ` Paolo Bonzini
@ 2013-05-06 13:42     ` Peter Maydell
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Maydell @ 2013-05-06 13:42 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: qemu-devel

On 6 May 2013 13:56, Claudio Fontana <claudio.fontana@huawei.com> wrote:
> On 14.03.2013 17:16, Peter Maydell wrote:
>> On 14 March 2013 15:57, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>> I am currently working on an aarch64 tcg target implementation,
>>> based on the available gdb patches contributed by ARM and the results
>>> of the linaro toolchain.
>>
>> Doing a target implementation based on the gdb/binutils
>> patches and not the actual documentation is going to be
>> enormously painful to review

> Well, we have just completed a first working version of TCG support
> for aarch64 here; it has been tested successfully on Foundation v8,
> running system emulation for various targets
> (at the moment armv5/linux, armv7/linux, x86 FreeDOS, x86 Linux).

Auugh. I've just realised I totally misread your initial email as
being a proposal for a QEMU target (ie target-*, to implement
guest AArch64 support), because up til now nobody at all has expressed
any interest in supporting QEMU on AArch64 hosts. My reasons for
preferring to use the official documentation for the guest support
are rather less applicable to adding host support.

> I understand that you have reservations on upstreaming this work
> for the reasons you explain above, so for now it will be available
> to Huawei only.

Since you've written it (and now I've realised my confusion!)
you may as well send the patches to qemu-devel, I think.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 0/3] ARM aarch64 TCG target
  2013-05-06 13:27     ` Paolo Bonzini
@ 2013-05-13 13:22       ` Claudio Fontana
  2013-05-13 13:28         ` [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64 Claudio Fontana
                           ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Peter Maydell


This series implements preliminary support for the ARM aarch64 TCG target.

Limitations of this initial implementation (TODOs) include:
 * unconditional lookups in TLBs in qemu_ld/st via C helper functions
 * most optional opcodes are not implemented yet
 * CONFIG_SOFTMMU only
 * only little endian qemu targets supported
 * icache flushing requires recent GCC (see the sketch below)
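
As a side note on the last item: icache flushing is expected to go through a
compiler builtin rather than hand-written cache maintenance. A minimal,
hypothetical sketch of what flush_icache_range (declared in tcg-target.h in
patch 3/3) might reduce to, assuming __builtin___clear_cache is available:

    static inline void flush_icache_range(tcg_target_ulong start,
                                          tcg_target_ulong stop)
    {
        /* let the compiler emit the cache maintenance and barriers;
           this builtin is the reason a recent GCC is required */
        __builtin___clear_cache((char *)start, (char *)stop);
    }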

Tested on an x86-64 physical machine running Foundation v8,
with a Linux 3.8.0-rc6+ minimal host system based on the Linaro v8
image 201301271620 for user space.

Tested guests: ARM v5 test image, i386 FreeDOS test image and
i386 Linux test image, all from the qemu-devel testing page.
Also tested x86-64/Linux and ARM v7/Linux guests built with buildroot.

Claudio Fontana (3):
  configure: permit compilation on arm aarch64
  include/elf.h: add aarch64 ELF machine and relocs
  tcg/aarch64: implement new TCG target for aarch64

 configure                |    8 +
 include/elf.h            |  128 ++++++
 include/exec/exec-all.h  |    5 +-
 tcg/aarch64/tcg-target.c | 1084 ++++++++++++++++++++++++++++++++++++++++++++++
 tcg/aarch64/tcg-target.h |  106 +++++
 5 files changed, 1330 insertions(+), 1 deletion(-)
 create mode 100644 tcg/aarch64/tcg-target.c
 create mode 100644 tcg/aarch64/tcg-target.h

-- 
1.8.1

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64
  2013-05-13 13:22       ` [Qemu-devel] [PATCH 0/3] ARM aarch64 TCG target Claudio Fontana
@ 2013-05-13 13:28         ` Claudio Fontana
  2013-05-13 18:29           ` Peter Maydell
  2013-05-13 13:31         ` [Qemu-devel] [PATCH 2/3] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
  2013-05-13 13:33         ` [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
  2 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-13 13:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Peter Maydell


support compiling on aarch64.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 configure | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/configure b/configure
index 9439f1c..9cc398c 100755
--- a/configure
+++ b/configure
@@ -384,6 +384,8 @@ elif check_define __s390__ ; then
   fi
 elif check_define __arm__ ; then
   cpu="arm"
+elif check_define __aarch64__ ; then
+  cpu="aarch64"
 elif check_define __hppa__ ; then
   cpu="hppa"
 else
@@ -406,6 +408,9 @@ case "$cpu" in
   armv*b|armv*l|arm)
     cpu="arm"
   ;;
+  aarch64)
+    cpu="aarch64"
+  ;;
   hppa|parisc|parisc64)
     cpu="hppa"
   ;;
@@ -4114,6 +4119,9 @@ if test "$linux" = "yes" ; then
   s390x)
     linux_arch=s390
     ;;
+  aarch64)
+    linux_arch=arm64
+    ;;
   *)
     # For most CPUs the kernel architecture name and QEMU CPU name match.
     linux_arch="$cpu"
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 2/3] include/elf.h: add aarch64 ELF machine and relocs
  2013-05-13 13:22       ` [Qemu-devel] [PATCH 0/3] ARM aarch64 TCG target Claudio Fontana
  2013-05-13 13:28         ` [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64 Claudio Fontana
@ 2013-05-13 13:31         ` Claudio Fontana
  2013-05-13 18:34           ` Peter Maydell
  2013-05-13 13:33         ` [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
  2 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-13 13:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Peter Maydell


we will use the 26bit relative relocations in the aarch64 tcg target.
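
For context, an AArch64 B or BL instruction encodes a signed 26-bit word
offset in its low 26 bits. A rough, illustrative sketch of how such a
relocation is applied (the real code is reloc_pc26 in patch 3/3; the helper
name here is made up):

    static void apply_pcrel26(uint32_t *insn, tcg_target_long target)
    {
        /* patch the 26-bit word-offset field of a B/BL in place */
        tcg_target_long offset = (target - (tcg_target_long)insn) / 4;
        *insn = (*insn & 0xfc000000) | (offset & 0x03ffffff);
    }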

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 include/elf.h | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/include/elf.h b/include/elf.h
index a21ea53..43f6c9b 100644
--- a/include/elf.h
+++ b/include/elf.h
@@ -129,6 +129,8 @@ typedef int64_t  Elf64_Sxword;
 
 #define EM_XTENSA   94      /* Tensilica Xtensa */
 
+#define EM_AARCH64  183
+
 /* This is the info that is needed to parse the dynamic section of the file */
 #define DT_NULL		0
 #define DT_NEEDED	1
@@ -616,6 +618,132 @@ typedef struct {
 /* Keep this the last entry.  */
 #define R_ARM_NUM		256
 
+#define R_AARCH64_NONE          256 /* also accept R_ARM_NONE (0) as null */
+/* static data relocations */
+#define R_AARCH64_ABS64         257
+#define R_AARCH64_ABS32         258
+#define R_AARCH64_ABS16         259
+#define R_AARCH64_PREL64        260
+#define R_AARCH64_PREL32        261
+#define R_AARCH64_PREL16        262
+/* static aarch64 group relocations */
+/* group relocs to create unsigned data value or address inline */
+#define R_AARCH64_MOVW_UABS_G0  263
+#define R_AARCH64_MOVW_UABS_G0_NC 264
+#define R_AARCH64_MOVW_UABS_G1  265
+#define R_AARCH64_MOVW_UABS_G1_NC 266
+#define R_AARCH64_MOVW_UABS_G2  267
+#define R_AARCH64_MOVW_UABS_G2_NC 268
+#define R_AARCH64_MOVW_UABS_G3  269
+/* group relocs to create signed data or offset value inline */
+#define R_AARCH64_MOVW_SABS_G0  270
+#define R_AARCH64_MOVW_SABS_G1  271
+#define R_AARCH64_MOVW_SABS_G2  272
+/* relocs to generate 19, 21, and 33 bit PC-relative addresses */
+#define R_AARCH64_LD_PREL_LO19 273
+#define R_AARCH64_ADR_PREL_LO21 274
+#define R_AARCH64_ADR_PREL_PG_HI21 275
+#define R_AARCH64_ADR_PREL_PG_HI21_NC 276
+#define R_AARCH64_ADD_ABS_LO12_NC 277
+#define R_AARCH64_LDST8_ABS_LO12_NC 278
+#define R_AARCH64_LDST16_ABS_LO12_NC 284
+#define R_AARCH64_LDST32_ABS_LO12_NC 285
+#define R_AARCH64_LDST64_ABS_LO12_NC 286
+#define R_AARCH64_LDST128_ABS_LO12_NC 299
+/* relocs for control-flow - all offsets as multiple of 4 */
+#define R_AARCH64_TSTBR14 279
+#define R_AARCH64_CONDBR19 280
+#define R_AARCH64_JUMP26 282
+#define R_AARCH64_CALL26 283
+/* group relocs to create pc-relative offset inline */
+#define R_AARCH64_MOVW_PREL_G0 287
+#define R_AARCH64_MOVW_PREL_G0_NC 288
+#define R_AARCH64_MOVW_PREL_G1 289
+#define R_AARCH64_MOVW_PREL_G1_NC 290
+#define R_AARCH64_MOVW_PREL_G2 291
+#define R_AARCH64_MOVW_PREL_G2_NC 292
+#define R_AARCH64_MOVW_PREL_G3 293
+/* group relocs to create a GOT-relative offset inline */
+#define R_AARCH64_MOVW_GOTOFF_G0 300
+#define R_AARCH64_MOVW_GOTOFF_G0_NC 301
+#define R_AARCH64_MOVW_GOTOFF_G1 302
+#define R_AARCH64_MOVW_GOTOFF_G1_NC 303
+#define R_AARCH64_MOVW_GOTOFF_G2 304
+#define R_AARCH64_MOVW_GOTOFF_G2_NC 305
+#define R_AARCH64_MOVW_GOTOFF_G3 306
+/* GOT-relative data relocs */
+#define R_AARCH64_GOTREL64 307
+#define R_AARCH64_GOTREL32 308
+/* GOT-relative instr relocs */
+#define R_AARCH64_GOT_LD_PREL19 309
+#define R_AARCH64_LD64_GOTOFF_LO15 310
+#define R_AARCH64_ADR_GOT_PAGE 311
+#define R_AARCH64_LD64_GOT_LO12_NC 312
+#define R_AARCH64_LD64_GOTPAGE_LO15 313
+/* General Dynamic TLS relocations */
+#define R_AARCH64_TLSGD_ADR_PREL21 512
+#define R_AARCH64_TLSGD_ADR_PAGE21 513
+#define R_AARCH64_TLSGD_ADD_LO12_NC 514
+#define R_AARCH64_TLSGD_MOVW_G1 515
+#define R_AARCH64_TLSGD_MOVW_G0_NC 516
+/* Local Dynamic TLS relocations */
+#define R_AARCH64_TLSLD_ADR_PREL21 517
+#define R_AARCH64_TLSLD_ADR_PAGE21 518
+#define R_AARCH64_TLSLD_ADD_LO12_NC 519
+#define R_AARCH64_TLSLD_MOVW_G1 520
+#define R_AARCH64_TLSLD_MOVW_G0_NC 521
+#define R_AARCH64_TLSLD_LD_PREL19 522
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G2 523
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G1 524
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G1_NC 525
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G0 526
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G0_NC 527
+#define R_AARCH64_TLSLD_ADD_DTPREL_HI12 528
+#define R_AARCH64_TLSLD_ADD_DTPREL_LO12 529
+#define R_AARCH64_TLSLD_ADD_DTPREL_LO12_NC 530
+#define R_AARCH64_TLSLD_LDST8_DTPREL_LO12 531
+#define R_AARCH64_TLSLD_LDST8_DTPREL_LO12_NC 532
+#define R_AARCH64_TLSLD_LDST16_DTPREL_LO12 533
+#define R_AARCH64_TLSLD_LDST16_DTPREL_LO12_NC 534
+#define R_AARCH64_TLSLD_LDST32_DTPREL_LO12 535
+#define R_AARCH64_TLSLD_LDST32_DTPREL_LO12_NC 536
+#define R_AARCH64_TLSLD_LDST64_DTPREL_LO12 537
+#define R_AARCH64_TLSLD_LDST64_DTPREL_LO12_NC 538
+/* initial exec TLS relocations */
+#define R_AARCH64_TLSIE_MOVW_GOTTPREL_G1 539
+#define R_AARCH64_TLSIE_MOVW_GOTTPREL_G0_NC 540
+#define R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 541
+#define R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC 542
+#define R_AARCH64_TLSIE_LD_GOTTPREL_PREL19 543
+/* local exec TLS relocations */
+#define R_AARCH64_TLSLE_MOVW_TPREL_G2 544
+#define R_AARCH64_TLSLE_MOVW_TPREL_G1 545
+#define R_AARCH64_TLSLE_MOVW_TPREL_G1_NC 546
+#define R_AARCH64_TLSLE_MOVW_TPREL_G0 547
+#define R_AARCH64_TLSLE_MOVW_TPREL_G0_NC 548
+#define R_AARCH64_TLSLE_ADD_TPREL_HI12 549
+#define R_AARCH64_TLSLE_ADD_TPREL_LO12 550
+#define R_AARCH64_TLSLE_ADD_TPREL_LO12_NC 551
+#define R_AARCH64_TLSLE_LDST8_TPREL_LO12 552
+#define R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC 553
+#define R_AARCH64_TLSLE_LDST16_TPREL_LO12 554
+#define R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC 555
+#define R_AARCH64_TLSLE_LDST32_TPREL_LO12 556
+#define R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC 557
+#define R_AARCH64_TLSLE_LDST64_TPREL_LO12 558
+#define R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC 559
+/* Dynamic Relocations */
+#define R_AARCH64_COPY 1024
+#define R_AARCH64_GLOB_DAT 1025
+#define R_AARCH64_JUMP_SLOT 1026
+#define R_AARCH64_RELATIVE 1027
+#define R_AARCH64_TLS_DTPREL64 1028
+#define R_AARCH64_TLS_DTPMOD64 1029
+#define R_AARCH64_TLS_TPREL64 1030
+#define R_AARCH64_TLS_DTPREL32 1031
+#define R_AARCH64_TLS_DTPMOD32 1032
+#define R_AARCH64_TLS_TPREL32 1033
+
 /* s390 relocations defined by the ABIs */
 #define R_390_NONE		0	/* No reloc.  */
 #define R_390_8			1	/* Direct 8 bit.  */
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-13 13:22       ` [Qemu-devel] [PATCH 0/3] ARM aarch64 TCG target Claudio Fontana
  2013-05-13 13:28         ` [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64 Claudio Fontana
  2013-05-13 13:31         ` [Qemu-devel] [PATCH 2/3] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
@ 2013-05-13 13:33         ` Claudio Fontana
  2013-05-13 18:28           ` Peter Maydell
  2013-05-13 19:49           ` Richard Henderson
  2 siblings, 2 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-13 13:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Peter Maydell


add preliminary support for TCG target aarch64.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 include/exec/exec-all.h  |    5 +-
 tcg/aarch64/tcg-target.c | 1084 ++++++++++++++++++++++++++++++++++++++++++++++
 tcg/aarch64/tcg-target.h |  106 +++++
 3 files changed, 1194 insertions(+), 1 deletion(-)
 create mode 100644 tcg/aarch64/tcg-target.c
 create mode 100644 tcg/aarch64/tcg-target.h

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 6362074..5c31863 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -128,7 +128,7 @@ static inline void tlb_flush(CPUArchState *env, int flush_global)
 
 #if defined(__arm__) || defined(_ARCH_PPC) \
     || defined(__x86_64__) || defined(__i386__) \
-    || defined(__sparc__) \
+    || defined(__sparc__) || defined(__aarch64__) \
     || defined(CONFIG_TCG_INTERPRETER)
 #define USE_DIRECT_JUMP
 #endif
@@ -230,6 +230,9 @@ static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
     *(uint32_t *)jmp_addr = addr - (jmp_addr + 4);
     /* no need to flush icache explicitly */
 }
+#elif defined(__aarch64__)
+void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
+#define tb_set_jmp_target1 aarch64_tb_set_jmp_target
 #elif defined(__arm__)
 static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
 {
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
new file mode 100644
index 0000000..f24a567
--- /dev/null
+++ b/tcg/aarch64/tcg-target.c
@@ -0,0 +1,1084 @@
+/*
+ * Initial TCG Implementation for aarch64
+ *
+ * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
+ * Written by Claudio Fontana
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.
+ *
+ * See the COPYING file in the top-level directory for details.
+ */
+
+#ifdef TARGET_WORDS_BIGENDIAN
+#error "Sorry, bigendian target not supported yet."
+#endif /* TARGET_WORDS_BIGENDIAN */
+
+#ifndef NDEBUG
+static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
+    "%x0", "%x1", "%x2", "%x3", "%x4", "%x5", "%x6", "%x7",
+    "%x8", "%x9", "%x10", "%x11", "%x12", "%x13", "%x14", "%x15",
+    "%x16", "%x17", "%x18", "%x19", "%x20", "%x21", "%x22", "%x23",
+    "%x24", "%x25", "%x26", "%x27", "%x28",
+    "%fp", /* frame pointer */
+    "%lr", /* link register */
+    "%sp",  /* stack pointer */
+};
+#endif /* NDEBUG */
+
+static const int tcg_target_reg_alloc_order[] = {
+    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23,
+    TCG_REG_X24, TCG_REG_X25, TCG_REG_X26, TCG_REG_X27,
+    TCG_REG_X28,
+
+    TCG_REG_X9, TCG_REG_X10, TCG_REG_X11, TCG_REG_X12,
+    TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
+
+    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
+    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
+};
+
+static const int tcg_target_call_iarg_regs[8] = {
+    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
+    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7
+};
+static const int tcg_target_call_oarg_regs[1] = {
+    TCG_REG_X0
+};
+
+static inline void reloc_pc26(void *code_ptr, tcg_target_long target)
+{
+    tcg_target_long offset;
+    offset = (target - (tcg_target_long)code_ptr) / 4;
+    offset &= 0x03ffffff;
+
+    /* mask away previous PC_REL26 parameter contents, then set offset */
+    *(uint32_t *)code_ptr &= 0xfc000000;
+    *(uint32_t *)code_ptr |= offset;
+}
+
+static inline void patch_reloc(uint8_t *code_ptr, int type,
+                               tcg_target_long value, tcg_target_long addend)
+{
+    switch (type) {
+    case R_AARCH64_JUMP26:
+    case R_AARCH64_CALL26:
+        reloc_pc26(code_ptr, value);
+        break;
+    default:
+        tcg_abort();
+    }
+}
+
+/* parse target specific constraints */
+static int target_parse_constraint(TCGArgConstraint *ct,
+                                   const char **pct_str)
+{
+    const char *ct_str; ct_str = *pct_str;
+
+    switch (ct_str[0]) {
+    case 'r':
+        ct->ct |= TCG_CT_REG;
+        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
+        break;
+    case 'l': /* qemu_ld / qemu_st address, data_reg */
+        ct->ct |= TCG_CT_REG;
+        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
+#ifdef CONFIG_SOFTMMU
+        /* x0 and x1 will be overwritten when reading the tlb entry,
+           and x2, and x3 for helper args, better to avoid using them. */
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X0);
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X1);
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X2);
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X3);
+#endif
+        break;
+    default:
+        return -1;
+    }
+
+    ct_str++;
+    *pct_str = ct_str;
+    return 0;
+}
+
+static inline int tcg_target_const_match(tcg_target_long val,
+                                         const TCGArgConstraint *arg_ct)
+{
+    int ct; ct = arg_ct->ct;
+
+    if (ct & TCG_CT_CONST)
+        return 1;
+
+    return 0;
+}
+
+enum aarch64_cond_code {
+    COND_EQ = 0x0,
+    COND_NE = 0x1,
+    COND_CS = 0x2,	/* Unsigned greater or equal */
+    COND_HS = 0x2,      /* ALIAS greater or equal */
+    COND_CC = 0x3,	/* Unsigned less than */
+    COND_LO = 0x3,	/* ALIAS Lower */
+    COND_MI = 0x4,	/* Negative */
+    COND_PL = 0x5,	/* Zero or greater */
+    COND_VS = 0x6,	/* Overflow */
+    COND_VC = 0x7,	/* No overflow */
+    COND_HI = 0x8,	/* Unsigned greater than */
+    COND_LS = 0x9,	/* Unsigned less or equal */
+    COND_GE = 0xa,
+    COND_LT = 0xb,
+    COND_GT = 0xc,
+    COND_LE = 0xd,
+    COND_AL = 0xe,
+    COND_NV = 0xf,
+};
+
+static const enum aarch64_cond_code tcg_cond_to_aarch64_cond[] = {
+    [TCG_COND_EQ] = COND_EQ,
+    [TCG_COND_NE] = COND_NE,
+    [TCG_COND_LT] = COND_LT,
+    [TCG_COND_GE] = COND_GE,
+    [TCG_COND_LE] = COND_LE,
+    [TCG_COND_GT] = COND_GT,
+    /* unsigned */
+    [TCG_COND_LTU] = COND_LO,
+    [TCG_COND_GTU] = COND_HI,
+    [TCG_COND_GEU] = COND_HS,
+    [TCG_COND_LEU] = COND_LS,
+};
+
+/* opcodes for LDR / STR instructions with base + simm9 addressing */
+enum aarch64_ldst_op_data { /* size of the data moved */
+    LDST_8 = 0x38,
+    LDST_16 = 0x78,
+    LDST_32 = 0xb8,
+    LDST_64 = 0xf8,
+};
+enum aarch64_ldst_op_type { /* type of operation */
+    LDST_ST = 0x0,    /* store */
+    LDST_LD = 0x4,    /* load */
+    LDST_LD_S_X = 0x8,  /* load and sign-extend into Xt */
+    LDST_LD_S_W = 0xc,  /* load and sign-extend into Wt */
+};
+
+enum aarch64_arith_opc {
+    ARITH_ADD = 0x0b,
+    ARITH_SUB = 0x4b,
+    ARITH_AND = 0x0a,
+    ARITH_OR = 0x2a,
+    ARITH_XOR = 0x4a
+};
+
+enum aarch64_srr_opc {
+    SRR_SHL = 0x0,
+    SRR_SHR = 0x4,
+    SRR_SAR = 0x8,
+    SRR_ROR = 0xc
+};
+
+static inline enum aarch64_ldst_op_data
+aarch64_ldst_get_data(TCGOpcode tcg_op)
+{
+    switch (tcg_op) {
+    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
+    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
+    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
+        return LDST_8;
+
+    case INDEX_op_ld16u_i32: case INDEX_op_ld16s_i32:
+    case INDEX_op_ld16u_i64: case INDEX_op_ld16s_i64:
+    case INDEX_op_st16_i32: case INDEX_op_st16_i64:
+        return LDST_16;
+
+    case INDEX_op_ld_i32: case INDEX_op_st_i32:
+    case INDEX_op_ld32u_i64: case INDEX_op_ld32s_i64:
+    case INDEX_op_st32_i64:
+        return LDST_32;
+
+    case INDEX_op_ld_i64: case INDEX_op_st_i64:
+        return LDST_64;
+
+    default:
+        tcg_abort();
+    }
+}
+
+static inline enum aarch64_ldst_op_type
+aarch64_ldst_get_type(TCGOpcode tcg_op)
+{
+    switch (tcg_op) {
+    case INDEX_op_st8_i32: case INDEX_op_st16_i32:
+    case INDEX_op_st8_i64: case INDEX_op_st16_i64:
+    case INDEX_op_st_i32:
+    case INDEX_op_st32_i64:
+    case INDEX_op_st_i64:
+        return LDST_ST;
+
+    case INDEX_op_ld8u_i32: case INDEX_op_ld16u_i32:
+    case INDEX_op_ld8u_i64: case INDEX_op_ld16u_i64:
+    case INDEX_op_ld_i32:
+    case INDEX_op_ld32u_i64:
+    case INDEX_op_ld_i64:
+        return LDST_LD;
+
+    case INDEX_op_ld8s_i32: case INDEX_op_ld16s_i32:
+        return LDST_LD_S_W;
+
+    case INDEX_op_ld8s_i64: case INDEX_op_ld16s_i64:
+    case INDEX_op_ld32s_i64:
+        return LDST_LD_S_X;
+
+    default:
+        tcg_abort();
+    }
+}
+
+static inline uint32_t tcg_in32(TCGContext *s)
+{
+    uint32_t v; v = *(uint32_t *)s->code_ptr;
+    return v;
+}
+
+static inline void tcg_out_ldst_9(TCGContext *s,
+                                  enum aarch64_ldst_op_data op_data,
+                                  enum aarch64_ldst_op_type op_type,
+                                  int rd, int rn, tcg_target_long offset)
+{
+    /* use LDUR with BASE register with 9bit signed unscaled offset */
+    unsigned int mod, off;
+
+    if (offset < 0) {
+        off = (256 + offset);
+        mod = 0x1;
+
+    } else {
+        off = offset;
+        mod = 0x0;
+    }
+
+    mod |= op_type;
+    tcg_out32(s, op_data << 24 | mod << 20 | off << 12 | rn << 5 | rd);
+}
+
+static inline void tcg_out_movr(TCGContext *s, int ext, int rd, int source)
+{
+    /* register to register move using MOV (shifted register with no shift) */
+    /* using MOV 0x2a0003e0 | (shift).. */
+    unsigned int base; base = ext ? 0xaa0003e0 : 0x2a0003e0;
+    tcg_out32(s, base | source << 16 | rd);
+}
+
+static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
+                                  uint32_t value)
+{
+    uint32_t half, base, movk = 0;
+    if (!value) {
+        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
+        return;
+    }
+    /* construct halfwords of the immediate with MOVZ with LSL */
+    /* using MOVZ 0x52800000 | extended reg.. */
+    base = ext ? 0xd2800000 : 0x52800000;
+
+    half = value & 0xffff;
+    if (half) {
+        tcg_out32(s, base | half << 5 | rd);
+        movk = 0x20000000; /* morph next MOVZ into MOVK */
+    }
+
+    half = value >> 16;
+    if (half) { /* add shift 0x00200000. Op can be MOVZ or MOVK */
+        tcg_out32(s, base | movk | 0x00200000 | half << 5 | rd);
+    }
+}
+
+static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
+{
+    uint32_t half, base, movk = 0, shift = 0;
+    if (!value) {
+        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
+        return;
+    }
+    /* construct halfwords of the immediate with MOVZ with LSL */
+    /* using MOVZ 0x52800000 | extended reg.. */
+    base = 0xd2800000;
+
+    while (value) {
+        half = value & 0xffff;
+        if (half) {
+            /* Op can be MOVZ or MOVK */
+            tcg_out32(s, base | movk | shift | half << 5 | rd);
+            if (!movk)
+                movk = 0x20000000; /* morph next MOVZs into MOVKs */
+        }
+        value >>= 16;
+        shift += 0x00200000;
+    }
+}
+
+static inline void tcg_out_ldst_r(TCGContext *s,
+                                  enum aarch64_ldst_op_data op_data,
+                                  enum aarch64_ldst_op_type op_type,
+                                  int rd, int base, int regoff)
+{
+    /* I can't explain the 0x6000, but objdump/gdb from linaro does that */
+    /* load from memory to register using base + 64bit register offset */
+    /* using f.e. STR Wt, [Xn, Xm] 0xb8600800|(regoff << 16)|(base << 5)|rd */
+    tcg_out32(s, 0x00206800
+              | op_data << 24 | op_type << 20 | regoff << 16 | base << 5 | rd);
+}
+
+/* solve the whole ldst problem */
+static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
+                                enum aarch64_ldst_op_type type,
+                                int rd, int rn, tcg_target_long offset)
+{
+    if (offset > -256 && offset < 256) {
+        tcg_out_ldst_9(s, data, type, rd, rn, offset);
+
+    } else {
+        tcg_out_movi64(s, TCG_REG_X8, offset);
+        tcg_out_ldst_r(s, data, type, rd, rn, TCG_REG_X8);
+    }
+}
+
+static inline void tcg_out_movi(TCGContext *s, TCGType type,
+                                TCGReg rd, tcg_target_long value)
+{
+    if (type == TCG_TYPE_I64)
+        tcg_out_movi64(s, rd, value);
+    else
+        tcg_out_movi32(s, 0, rd, value);
+}
+
+/* mov alias implemented with add immediate, useful to move to/from SP */
+static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
+{
+    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
+    unsigned int base; base = ext ? 0x91000000 : 0x11000000;
+    tcg_out32(s, base | rn << 5 | rd);
+}
+
+static inline void tcg_out_mov(TCGContext *s,
+                               TCGType type, TCGReg ret, TCGReg arg)
+{
+    if (ret != arg)
+        tcg_out_movr(s, type == TCG_TYPE_I64, ret, arg);
+}
+
+static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg arg,
+                              TCGReg arg1, tcg_target_long arg2)
+{
+    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_LD,
+                 arg, arg1, arg2);
+}
+
+static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
+                              TCGReg arg1, tcg_target_long arg2)
+{
+    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_ST,
+                 arg, arg1, arg2);
+}
+
+static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
+                                 int ext, int rd, int rn, int rm)
+{
+    /* Using shifted register arithmetic operations */
+    /* if extended register operation (64bit), just OR with 0x80 << 24 */
+    unsigned int base; base = ext ? (0x80 | opc) << 24 : opc << 24;
+    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
+}
+
+static inline void tcg_out_mul(TCGContext *s, int ext, int rd, int rn, int rm)
+{
+    /* Using MADD 0x1b000000 with Ra = wzr alias MUL 0x1b007c00 */
+    unsigned int base; base = ext ? 0x9b007c00 : 0x1b007c00;
+    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
+}
+
+static inline void tcg_out_shiftrot_reg(TCGContext *s,
+                                        enum aarch64_srr_opc opc, int ext,
+                                        int rd, int rn, int rm)
+{
+    /* using 2-source data processing instructions 0x1ac02000 */
+    unsigned int base; base = ext ? 0x9ac02000 : 0x1ac02000;
+    tcg_out32(s, base | rm << 16 | opc << 8 | rn << 5 | rd);
+}
+
+static inline void tcg_out_ubfm(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int a, unsigned int b)
+{
+    /* Using UBFM 0x53000000 Wd, Wn, a, b - Why ext has 4? */
+    unsigned int base; base = ext ? 0xd3400000 : 0x53000000;
+    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
+}
+
+static inline void tcg_out_sbfm(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int a, unsigned int b)
+{
+    /* Using SBFM 0x13000000 Wd, Wn, a, b - Why ext has 4? */
+    unsigned int base; base = ext ? 0x93400000 : 0x13000000;
+    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
+}
+
+static inline void tcg_out_extr(TCGContext *s, int ext,
+                                int rd, int rn, int rm, unsigned int a)
+{
+    /* Using EXTR 0x13800000 Wd, Wn, Wm, a - Why ext has 4? */
+    unsigned int base; base = ext ? 0x93c00000 : 0x13800000;
+    tcg_out32(s, base | rm << 16 | a << 10 | rn << 5 | rd);
+}
+
+static inline void tcg_out_shl(TCGContext *s, int ext,
+                               int rd, int rn, unsigned int m)
+{
+    int bits, max;
+    bits = ext ? 64 : 32; max = bits - 1;
+    tcg_out_ubfm(s, ext, rd, rn, bits - (m & max), max - (m & max));
+}
+
+static inline void tcg_out_shr(TCGContext *s, int ext,
+                               int rd, int rn, unsigned int m)
+{
+    int max; max = ext ? 63 : 31;
+    tcg_out_ubfm(s, ext, rd, rn, m & max, max);
+}
+
+static inline void tcg_out_sar(TCGContext *s, int ext,
+                               int rd, int rn, unsigned int m)
+{
+    int max; max = ext ? 63 : 31;
+    tcg_out_sbfm(s, ext, rd, rn, m & max, max);
+}
+
+static inline void tcg_out_rotr(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int m)
+{
+    int max; max = ext ? 63 : 31;
+    tcg_out_extr(s, ext, rd, rn, rn, m & max);
+}
+
+static inline void tcg_out_rotl(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int m)
+{
+    int bits, max;
+    bits = ext ? 64 : 32; max = bits - 1;
+    tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
+}
+
+static inline void tcg_out_cmp(TCGContext *s, int ext,
+                               int rn, int rm)
+{
+    /* Using CMP alias SUBS wzr, Wn, Wm */
+    unsigned int base; base = ext ? 0xeb00001f : 0x6b00001f;
+    tcg_out32(s, base | rm << 16 | rn << 5);
+}
+
+static inline void tcg_out_csel(TCGContext *s, int ext,
+                                int rd, int rn, int rm,
+                                enum aarch64_cond_code c)
+{
+    /* Using CSEL 0x1a800000 wd, wn, wm, c */
+    unsigned int base; base = ext ? 0x9a800000 : 0x1a800000;
+    tcg_out32(s, base | rm << 16 | c << 12 | rn << 5 | rd);
+}
+
+static inline void tcg_out_goto(TCGContext *s, tcg_target_long target)
+{
+    tcg_target_long offset;
+    offset = (target - (tcg_target_long)s->code_ptr) / 4;
+
+    if (offset <= -0x02000000 || offset >= 0x02000000) {
+        /* out of 26bit range */
+        tcg_abort();
+    }
+
+    tcg_out32(s, 0x14000000 | (offset & 0x03ffffff));
+}
+
+static inline void tcg_out_goto_noaddr(TCGContext *s)
+{
+    /* We pay attention here to not modify the branch target by
+       reading from the buffer. This ensures that caches and memory are
+       kept coherent during retranslation. */
+    uint32_t insn; insn = tcg_in32(s);
+    insn |= 0x14000000;
+    tcg_out32(s, insn);
+}
+
+/* offset is scaled and relative! Check range before calling! */
+static inline void tcg_out_goto_cond(TCGContext *s, TCGCond c,
+                                     tcg_target_long offset)
+{
+    tcg_out32(s, 0x54000000 | tcg_cond_to_aarch64_cond[c] | offset << 5);
+}
+
+static inline void tcg_out_callr(TCGContext *s, int reg)
+{
+    tcg_out32(s, 0xd63f0000 | reg << 5);
+}
+
+static inline void tcg_out_gotor(TCGContext *s, int reg)
+{
+    tcg_out32(s, 0xd61f0000 | reg << 5);
+}
+
+static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
+{
+    tcg_target_long offset;
+
+    offset = (target - (tcg_target_long)s->code_ptr) / 4;
+
+    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
+        tcg_out_movi64(s, TCG_REG_X8, target);
+        tcg_out_callr(s, TCG_REG_X8);
+
+    } else {
+        tcg_out32(s, 0x94000000 | (offset & 0x03ffffff));
+    }
+}
+
+static inline void tcg_out_ret(TCGContext *s)
+{
+    /* emit RET { LR } */
+    tcg_out32(s, 0xd65f03c0);
+}
+
+void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
+{
+    tcg_target_long target, offset;
+    target = (tcg_target_long)addr;
+    offset = (target - (tcg_target_long)jmp_addr) / 4;
+
+    if (offset <= -0x02000000 || offset >= 0x02000000) {
+        /* out of 26bit range */
+        tcg_abort();
+    }
+
+    patch_reloc((uint8_t *)jmp_addr, R_AARCH64_JUMP26, target, 0);
+    flush_icache_range(jmp_addr, jmp_addr + 4);
+}
+
+static inline void tcg_out_goto_label(TCGContext *s, int label_index)
+{
+    TCGLabel *l = &s->labels[label_index];
+
+    if (!l->has_value) {
+        tcg_out_reloc(s, s->code_ptr, R_AARCH64_JUMP26, label_index, 0);
+        tcg_out_goto_noaddr(s);
+
+    } else {
+        tcg_out_goto(s, l->u.value);
+    }
+}
+
+static inline void tcg_out_goto_label_cond(TCGContext *s, TCGCond c, int label_index)
+{
+    tcg_target_long offset;
+    /* backward conditional jump never seems to happen in practice,
+       so just always use the branch trampoline */
+    c = tcg_invert_cond(c);
+    offset = 2; /* skip current instr and the next */
+    tcg_out_goto_cond(s, c, offset);
+    tcg_out_goto_label(s, label_index); /* emit 26bit jump */
+}
+
+#ifdef CONFIG_SOFTMMU
+#include "exec/softmmu_defs.h"
+
+/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
+   int mmu_idx) */
+static const void * const qemu_ld_helpers[4] = {
+    helper_ldb_mmu,
+    helper_ldw_mmu,
+    helper_ldl_mmu,
+    helper_ldq_mmu,
+};
+
+/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
+   uintxx_t val, int mmu_idx) */
+static const void * const qemu_st_helpers[4] = {
+    helper_stb_mmu,
+    helper_stw_mmu,
+    helper_stl_mmu,
+    helper_stq_mmu,
+};
+
+#endif /* CONFIG_SOFTMMU */
+
+static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
+{
+    int addr_reg, data_reg;
+#ifdef CONFIG_SOFTMMU
+    int mem_index, s_bits;
+#endif
+    data_reg = args[0];
+    addr_reg = args[1];
+
+#ifdef CONFIG_SOFTMMU
+    mem_index = args[2];
+    s_bits = opc & 3;
+
+    /* Should generate something like the following:
+     *  shr x8, addr_reg, #TARGET_PAGE_BITS
+     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
+     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
+     */
+#  if CPU_TLB_BITS > 8
+#   error "CPU_TLB_BITS too large"
+#  endif
+
+    /* all arguments passed via registers */
+    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
+    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
+    tcg_out_movi32(s, 0, TCG_REG_X2, mem_index);
+
+    tcg_out_movi64(s, TCG_REG_X8, (uint64_t)qemu_ld_helpers[s_bits]);
+    tcg_out_callr(s, TCG_REG_X8);
+
+    if (opc & 0x04) { /* sign extend */
+        unsigned int bits; bits = 8 * (1 << s_bits) - 1;
+        tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
+
+    } else {
+        tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
+    }
+
+#else /* !CONFIG_SOFTMMU */
+    tcg_abort(); /* TODO */
+#endif
+}
+
+static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
+{
+    int addr_reg, data_reg;
+#ifdef CONFIG_SOFTMMU
+    int mem_index, s_bits;
+#endif
+    data_reg = args[0];
+    addr_reg = args[1];
+
+#ifdef CONFIG_SOFTMMU
+    mem_index = args[2];
+    s_bits = opc & 3;
+
+    /* Should generate something like the following:
+     *  shr x8, addr_reg, #TARGET_PAGE_BITS
+     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
+     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
+     */
+#  if CPU_TLB_BITS > 8
+#   error "CPU_TLB_BITS too large"
+#  endif
+
+    /* all arguments passed via registers */
+    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
+    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
+    tcg_out_movr(s, 1, TCG_REG_X2, data_reg);
+    tcg_out_movi32(s, 0, TCG_REG_X3, mem_index);
+
+    tcg_out_movi64(s, TCG_REG_X8, (uint64_t)qemu_st_helpers[s_bits]);
+    tcg_out_callr(s, TCG_REG_X8);
+
+#else /* !CONFIG_SOFTMMU */
+    tcg_abort(); /* TODO */
+#endif
+}
+
+static uint8_t *tb_ret_addr;
+
+/* callee stack use example:
+   stp     x29, x30, [sp,#-32]!
+   mov     x29, sp
+   stp     x1, x2, [sp,#16]
+   ...
+   ldp     x1, x2, [sp,#16]
+   ldp     x29, x30, [sp],#32
+   ret
+*/
+
+/* push r1 and r2, and alloc stack space for a total of
+   alloc_n elements (1 element = 16 bytes, must be between 1 and 31). */
+static inline void tcg_out_push_p(TCGContext *s,
+                                  TCGReg r1, TCGReg r2, int alloc_n)
+{
+    /* using indexed scaled simm7 STP 0x28800000 | (ext) | 0x01000000 (pre-idx)
+       | alloc_n * (-1) << 16 | r2 << 10 | sp(31) << 5 | r1 */
+    assert(alloc_n > 0 && alloc_n < 0x20);
+    alloc_n = (-alloc_n) & 0x3f;
+    tcg_out32(s, 0xa98003e0 | alloc_n << 16 | r2 << 10 | r1);
+}
+
+/* dealloc stack space for a total of alloc_n elements and pop r1, r2.  */
+static inline void tcg_out_pop_p(TCGContext *s,
+                                 TCGReg r1, TCGReg r2, int alloc_n)
+{
+    /* using indexed scaled simm7 LDP 0x28c00000 | (ext) | nothing (post-idx)
+       | alloc_n << 16 | r2 << 10 | sp(31) << 5 | r1 */
+    assert(alloc_n > 0 && alloc_n < 0x20);
+    tcg_out32(s, 0xa8c003e0 | alloc_n << 16 | r2 << 10 | r1);
+}
+
+static inline void tcg_out_store_p(TCGContext *s,
+                                   TCGReg r1, TCGReg r2, int idx)
+{
+    /* using register pair offset simm7 STP 0x29000000 | (ext)
+       | idx << 16 | r2 << 10 | FP(29) << 5 | r1 */
+    assert(idx > 0 && idx < 0x20);
+    tcg_out32(s, 0xa90003a0 | idx << 16 | r2 << 10 | r1);
+}
+
+static inline void tcg_out_load_p(TCGContext *s, TCGReg r1, TCGReg r2, int idx)
+{
+    /* using register pair offset simm7 LDP 0x29400000 | (ext)
+       | idx << 16 | r2 << 10 | FP(29) << 5 | r1 */
+    assert(idx > 0 && idx < 0x20);
+    tcg_out32(s, 0xa94003a0 | idx << 16 | r2 << 10 | r1);
+}
+
+static void tcg_out_op(TCGContext *s, TCGOpcode opc,
+                       const TCGArg *args, const int *const_args)
+{
+    int ext = 0;
+
+    switch (opc) {
+    case INDEX_op_exit_tb:
+        tcg_out_movi64(s, TCG_REG_X0, args[0]); /* load retval in X0 */
+        tcg_out_goto(s, (tcg_target_long)tb_ret_addr);
+        break;
+
+    case INDEX_op_goto_tb:
+#ifndef USE_DIRECT_JUMP
+#error "USE_DIRECT_JUMP required for aarch64"
+#endif
+        assert(s->tb_jmp_offset != NULL); /* consistency for USE_DIRECT_JUMP */
+        s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
+        /* actual branch destination will be patched by
+           aarch64_tb_set_jmp_target later, beware retranslation. */
+        tcg_out_goto_noaddr(s);
+        s->tb_next_offset[args[0]] = s->code_ptr - s->code_buf;
+        break;
+
+    case INDEX_op_call:
+        if (const_args[0])
+            tcg_out_call(s, args[0]);
+        else
+            tcg_out_callr(s, args[0]);
+        break;
+
+    case INDEX_op_br:
+        tcg_out_goto_label(s, args[0]);
+        break;
+
+    case INDEX_op_ld_i32:
+    case INDEX_op_ld_i64:
+    case INDEX_op_st_i32:
+    case INDEX_op_st_i64:
+    case INDEX_op_ld8u_i32:
+    case INDEX_op_ld8s_i32:
+    case INDEX_op_ld16u_i32:
+    case INDEX_op_ld16s_i32:
+    case INDEX_op_ld8u_i64:
+    case INDEX_op_ld8s_i64:
+    case INDEX_op_ld16u_i64:
+    case INDEX_op_ld16s_i64:
+    case INDEX_op_ld32u_i64:
+    case INDEX_op_ld32s_i64:
+    case INDEX_op_st8_i32:
+    case INDEX_op_st8_i64:
+    case INDEX_op_st16_i32:
+    case INDEX_op_st16_i64:
+    case INDEX_op_st32_i64:
+        tcg_out_ldst(s, aarch64_ldst_get_data(opc), aarch64_ldst_get_type(opc),
+                     args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_mov_i64: ext = 1;
+    case INDEX_op_mov_i32:
+        tcg_out_movr(s, ext, args[0], args[1]);
+        break;
+
+    case INDEX_op_movi_i64:
+        tcg_out_movi64(s, args[0], args[1]);
+        break;
+
+    case INDEX_op_movi_i32:
+        tcg_out_movi32(s, 0, args[0], args[1]);
+        break;
+
+    case INDEX_op_add_i64: ext = 1;
+    case INDEX_op_add_i32:
+        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_sub_i64: ext = 1;
+    case INDEX_op_sub_i32:
+        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_and_i64: ext = 1;
+    case INDEX_op_and_i32:
+        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_or_i64: ext = 1;
+    case INDEX_op_or_i32:
+        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_xor_i64: ext = 1;
+    case INDEX_op_xor_i32:
+        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_mul_i64: ext = 1;
+    case INDEX_op_mul_i32:
+        tcg_out_mul(s, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_shl_i64: ext = 1;
+    case INDEX_op_shl_i32:
+        if (const_args[2])      /* LSL / UBFM Wd, Wn, (32 - m) */
+            tcg_out_shl(s, ext, args[0], args[1], args[2]);
+        else                    /* LSL / LSLV */
+            tcg_out_shiftrot_reg(s, SRR_SHL, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_shr_i64: ext = 1;
+    case INDEX_op_shr_i32:
+        if (const_args[2])      /* LSR / UBFM Wd, Wn, m, 31 */
+            tcg_out_shr(s, ext, args[0], args[1], args[2]);
+        else                    /* LSR / LSRV */
+            tcg_out_shiftrot_reg(s, SRR_SHR, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_sar_i64: ext = 1;
+    case INDEX_op_sar_i32:
+        if (const_args[2])      /* ASR / SBFM Wd, Wn, m, 31 */
+            tcg_out_sar(s, ext, args[0], args[1], args[2]);
+        else                    /* ASR / ASRV */
+            tcg_out_shiftrot_reg(s, SRR_SAR, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_rotr_i64: ext = 1;
+    case INDEX_op_rotr_i32:
+        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, m */
+            tcg_out_rotr(s, ext, args[0], args[1], args[2]); /* XXX UNTESTED */
+        else                    /* ROR / RORV */
+            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_rotl_i64: ext = 1;
+    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
+        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, 32 - m */
+            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
+        else { /* no RSB in aarch64 unfortunately. */
+            /* XXX UNTESTED */
+            tcg_out_movi32(s, ext, TCG_REG_X8, ext ? 64 : 32);
+            tcg_out_arith(s, ARITH_SUB, ext, TCG_REG_X8, TCG_REG_X8, args[2]);
+            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], TCG_REG_X8);
+        }
+        break;
+
+    case INDEX_op_brcond_i64: ext = 1;
+    case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
+        tcg_out_cmp(s, ext, args[0], args[1]);
+        tcg_out_goto_label_cond(s, args[2], args[3]);
+        break;
+
+    case INDEX_op_setcond_i64: ext = 1;
+    case INDEX_op_setcond_i32:
+        tcg_out_movi32(s, ext, TCG_REG_X8, 0x01);
+        tcg_out_cmp(s, ext, args[1], args[2]);
+        tcg_out_csel(s, ext, args[0], TCG_REG_X8, TCG_REG_XZR,
+                     tcg_cond_to_aarch64_cond[args[3]]);
+        break;
+
+    case INDEX_op_qemu_ld8u:
+        tcg_out_qemu_ld(s, args, 0 | 0);
+        break;
+    case INDEX_op_qemu_ld8s:
+        tcg_out_qemu_ld(s, args, 4 | 0);
+        break;
+    case INDEX_op_qemu_ld16u:
+        tcg_out_qemu_ld(s, args, 0 | 1);
+        break;
+    case INDEX_op_qemu_ld16s:
+        tcg_out_qemu_ld(s, args, 4 | 1);
+        break;
+    case INDEX_op_qemu_ld32u:
+        tcg_out_qemu_ld(s, args, 0 | 2);
+        break;
+    case INDEX_op_qemu_ld32s:
+        tcg_out_qemu_ld(s, args, 4 | 2);
+        break;
+    case INDEX_op_qemu_ld32:
+        tcg_out_qemu_ld(s, args, 0 | 2);
+        break;
+    case INDEX_op_qemu_ld64:
+        tcg_out_qemu_ld(s, args, 0 | 3);
+        break;
+    case INDEX_op_qemu_st8:
+        tcg_out_qemu_st(s, args, 0);
+        break;
+    case INDEX_op_qemu_st16:
+        tcg_out_qemu_st(s, args, 1);
+        break;
+    case INDEX_op_qemu_st32:
+        tcg_out_qemu_st(s, args, 2);
+        break;
+    case INDEX_op_qemu_st64:
+        tcg_out_qemu_st(s, args, 3);
+        break;
+
+    default:
+        tcg_abort(); /* opcode not implemented */
+    }
+}
+
+static const TCGTargetOpDef aarch64_op_defs[] = {
+    { INDEX_op_exit_tb, { } },
+    { INDEX_op_goto_tb, { } },
+    { INDEX_op_call, { "ri" } },
+    { INDEX_op_br, { } },
+
+    { INDEX_op_mov_i32, { "r", "r" } },
+    { INDEX_op_mov_i64, { "r", "r" } },
+
+    { INDEX_op_movi_i32, { "r" } },
+    { INDEX_op_movi_i64, { "r" } },
+
+    { INDEX_op_ld8u_i32, { "r", "r" } },
+    { INDEX_op_ld8s_i32, { "r", "r" } },
+    { INDEX_op_ld16u_i32, { "r", "r" } },
+    { INDEX_op_ld16s_i32, { "r", "r" } },
+    { INDEX_op_ld_i32, { "r", "r" } },
+    { INDEX_op_ld8u_i64, { "r", "r" } },
+    { INDEX_op_ld8s_i64, { "r", "r" } },
+    { INDEX_op_ld16u_i64, { "r", "r" } },
+    { INDEX_op_ld16s_i64, { "r", "r" } },
+    { INDEX_op_ld32u_i64, { "r", "r" } },
+    { INDEX_op_ld32s_i64, { "r", "r" } },
+    { INDEX_op_ld_i64, { "r", "r" } },
+
+    { INDEX_op_st8_i32, { "r", "r" } },
+    { INDEX_op_st16_i32, { "r", "r" } },
+    { INDEX_op_st_i32, { "r", "r" } },
+    { INDEX_op_st8_i64, { "r", "r" } },
+    { INDEX_op_st16_i64, { "r", "r" } },
+    { INDEX_op_st32_i64, { "r", "r" } },
+    { INDEX_op_st_i64, { "r", "r" } },
+
+    { INDEX_op_add_i32, { "r", "r", "r" } },
+    { INDEX_op_add_i64, { "r", "r", "r" } },
+    { INDEX_op_sub_i32, { "r", "r", "r" } },
+    { INDEX_op_sub_i64, { "r", "r", "r" } },
+    { INDEX_op_mul_i32, { "r", "r", "r" } },
+    { INDEX_op_mul_i64, { "r", "r", "r" } },
+    { INDEX_op_and_i32, { "r", "r", "r" } },
+    { INDEX_op_and_i64, { "r", "r", "r" } },
+    { INDEX_op_or_i32, { "r", "r", "r" } },
+    { INDEX_op_or_i64, { "r", "r", "r" } },
+    { INDEX_op_xor_i32, { "r", "r", "r" } },
+    { INDEX_op_xor_i64, { "r", "r", "r" } },
+
+    { INDEX_op_shl_i32, { "r", "r", "ri" } },
+    { INDEX_op_shr_i32, { "r", "r", "ri" } },
+    { INDEX_op_sar_i32, { "r", "r", "ri" } },
+    { INDEX_op_rotl_i32, { "r", "r", "ri" } },
+    { INDEX_op_rotr_i32, { "r", "r", "ri" } },
+    { INDEX_op_shl_i64, { "r", "r", "ri" } },
+    { INDEX_op_shr_i64, { "r", "r", "ri" } },
+    { INDEX_op_sar_i64, { "r", "r", "ri" } },
+    { INDEX_op_rotl_i64, { "r", "r", "ri" } },
+    { INDEX_op_rotr_i64, { "r", "r", "ri" } },
+
+    { INDEX_op_brcond_i32, { "r", "r" } },
+    { INDEX_op_setcond_i32, { "r", "r", "r" } },
+    { INDEX_op_brcond_i64, { "r", "r" } },
+    { INDEX_op_setcond_i64, { "r", "r", "r" } },
+
+    { INDEX_op_qemu_ld8u, { "r", "l" } },
+    { INDEX_op_qemu_ld8s, { "r", "l" } },
+    { INDEX_op_qemu_ld16u, { "r", "l" } },
+    { INDEX_op_qemu_ld16s, { "r", "l" } },
+    { INDEX_op_qemu_ld32u, { "r", "l" } },
+    { INDEX_op_qemu_ld32s, { "r", "l" } },
+
+    { INDEX_op_qemu_ld32, { "r", "l" } },
+    { INDEX_op_qemu_ld64, { "r", "l" } },
+
+    { INDEX_op_qemu_st8, { "l", "l" } },
+    { INDEX_op_qemu_st16, { "l", "l" } },
+    { INDEX_op_qemu_st32, { "l", "l" } },
+    { INDEX_op_qemu_st64, { "l", "l" } },
+    { -1 },
+};
+
+static void tcg_target_init(TCGContext *s)
+{
+#if !defined(CONFIG_USER_ONLY)
+    /* fail safe */
+    if ((1ULL << CPU_TLB_ENTRY_BITS) != sizeof(CPUTLBEntry))
+        tcg_abort();
+#endif
+    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffff);
+    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffff);
+
+    tcg_regset_set32(tcg_target_call_clobber_regs, 0,
+                     (1 << TCG_REG_X0) | (1 << TCG_REG_X1) |
+                     (1 << TCG_REG_X2) | (1 << TCG_REG_X3) |
+                     (1 << TCG_REG_X4) | (1 << TCG_REG_X5) |
+                     (1 << TCG_REG_X6) | (1 << TCG_REG_X7) |
+                     (1 << TCG_REG_X8) | (1 << TCG_REG_X9) |
+                     (1 << TCG_REG_X10) | (1 << TCG_REG_X11) |
+                     (1 << TCG_REG_X12) | (1 << TCG_REG_X13) |
+                     (1 << TCG_REG_X14) | (1 << TCG_REG_X15) |
+                     (1 << TCG_REG_X16) | (1 << TCG_REG_X17) |
+                     (1 << TCG_REG_X18) | (1 << TCG_REG_LR));
+
+    tcg_regset_clear(s->reserved_regs);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X8);
+
+    tcg_add_target_add_op_defs(aarch64_op_defs);
+    tcg_set_frame(s, TCG_AREG0, offsetof(CPUArchState, temp_buf),
+                  CPU_TEMP_BUF_NLONGS * sizeof(long));
+}
+
+static void tcg_target_qemu_prologue(TCGContext *s)
+{
+    int r;
+    int frame_size; /* number of 16 byte items */
+
+    /* we need to save (FP, LR) and X19 to X28 */
+    frame_size = (1) + (TCG_REG_X27 - TCG_REG_X19) / 2 + 1;
+
+    /* push (fp, lr) and update sp to final frame size */
+    tcg_out_push_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
+
+    /* FP -> frame chain */
+    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
+
+    /* store callee-preserved regs x19..x28 */
+    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
+        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
+        tcg_out_store_p(s, r, r + 1, idx);
+    }
+
+    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
+    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
+
+    tb_ret_addr = s->code_ptr;
+
+    /* restore registers x19..x28 */
+    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
+        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
+        tcg_out_load_p(s, r, r + 1, idx);
+    }
+
+    /* pop (fp, lr), restore sp to previous frame, return */
+    tcg_out_pop_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
+    tcg_out_ret(s);
+}
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
new file mode 100644
index 0000000..f28af09
--- /dev/null
+++ b/tcg/aarch64/tcg-target.h
@@ -0,0 +1,106 @@
+/*
+ * Initial TCG Implementation for aarch64
+ *
+ * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
+ * Written by Claudio Fontana
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.
+ *
+ * See the COPYING file in the top-level directory for details.
+ */
+
+#ifndef TCG_TARGET_AARCH64
+#define TCG_TARGET_AARCH64 1
+
+#undef TCG_TARGET_WORDS_BIGENDIAN
+#undef TCG_TARGET_STACK_GROWSUP
+
+typedef enum {
+    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3, TCG_REG_X4,
+    TCG_REG_X5, TCG_REG_X6, TCG_REG_X7, TCG_REG_X8, TCG_REG_X9,
+    TCG_REG_X10, TCG_REG_X11, TCG_REG_X12, TCG_REG_X13, TCG_REG_X14,
+    TCG_REG_X15, TCG_REG_X16, TCG_REG_X17, TCG_REG_X18, TCG_REG_X19,
+    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23, TCG_REG_X24,
+    TCG_REG_X25, TCG_REG_X26, TCG_REG_X27, TCG_REG_X28,
+    TCG_REG_FP,  /* frame pointer */
+    TCG_REG_LR, /* link register */
+    TCG_REG_SP,  /* stack pointer or zero register */
+    TCG_REG_XZR = TCG_REG_SP /* same register number */
+    /* program counter is not directly accessible! */
+} TCGReg;
+
+#define TCG_TARGET_NB_REGS 32
+#define TCG_CT_CONST_ARM 0x100
+
+/* used for function call generation */
+#define TCG_REG_CALL_STACK		TCG_REG_SP
+#define TCG_TARGET_STACK_ALIGN		16
+#define TCG_TARGET_CALL_ALIGN_ARGS      1
+#define TCG_TARGET_CALL_STACK_OFFSET	0
+
+/* optional instructions */
+#define TCG_TARGET_HAS_div_i32          0
+#define TCG_TARGET_HAS_ext8s_i32        0
+#define TCG_TARGET_HAS_ext16s_i32       0
+#define TCG_TARGET_HAS_ext8u_i32        0
+#define TCG_TARGET_HAS_ext16u_i32       0
+#define TCG_TARGET_HAS_bswap16_i32      0
+#define TCG_TARGET_HAS_bswap32_i32      0
+#define TCG_TARGET_HAS_not_i32          0
+#define TCG_TARGET_HAS_neg_i32          0
+#define TCG_TARGET_HAS_rot_i32          1
+#define TCG_TARGET_HAS_andc_i32         0
+#define TCG_TARGET_HAS_orc_i32          0
+#define TCG_TARGET_HAS_eqv_i32          0
+#define TCG_TARGET_HAS_nand_i32         0
+#define TCG_TARGET_HAS_nor_i32          0
+#define TCG_TARGET_HAS_deposit_i32      0
+#define TCG_TARGET_HAS_movcond_i32      0
+#define TCG_TARGET_HAS_add2_i32         0
+#define TCG_TARGET_HAS_sub2_i32         0
+#define TCG_TARGET_HAS_mulu2_i32        0
+#define TCG_TARGET_HAS_muls2_i32        0
+
+#define TCG_TARGET_HAS_div_i64          0
+#define TCG_TARGET_HAS_ext8s_i64        0
+#define TCG_TARGET_HAS_ext16s_i64       0
+#define TCG_TARGET_HAS_ext32s_i64       0
+#define TCG_TARGET_HAS_ext8u_i64        0
+#define TCG_TARGET_HAS_ext16u_i64       0
+#define TCG_TARGET_HAS_ext32u_i64       0
+#define TCG_TARGET_HAS_bswap16_i64      0
+#define TCG_TARGET_HAS_bswap32_i64      0
+#define TCG_TARGET_HAS_bswap64_i64      0
+#define TCG_TARGET_HAS_not_i64          0
+#define TCG_TARGET_HAS_neg_i64          0
+#define TCG_TARGET_HAS_rot_i64          1
+#define TCG_TARGET_HAS_andc_i64         0
+#define TCG_TARGET_HAS_orc_i64          0
+#define TCG_TARGET_HAS_eqv_i64          0
+#define TCG_TARGET_HAS_nand_i64         0
+#define TCG_TARGET_HAS_nor_i64          0
+#define TCG_TARGET_HAS_deposit_i64      0
+#define TCG_TARGET_HAS_movcond_i64      0
+#define TCG_TARGET_HAS_add2_i64         0
+#define TCG_TARGET_HAS_sub2_i64         0
+#define TCG_TARGET_HAS_mulu2_i64        0
+#define TCG_TARGET_HAS_muls2_i64        0
+
+enum {
+    TCG_AREG0 = TCG_REG_X19,
+};
+
+static inline void flush_icache_range(tcg_target_ulong start,
+                                      tcg_target_ulong stop)
+{
+#if QEMU_GNUC_PREREQ(4, 1)
+    __builtin___clear_cache((char *)start, (char *)stop);
+#else
+    /* XXX should provide alternative with IC <ic_op>, Xt */
+#error "need GNUC >= 4.1, alternative not implemented yet."
+#endif
+
+}
+
+#endif /* TCG_TARGET_AARCH64 */
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-13 13:33         ` [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
@ 2013-05-13 18:28           ` Peter Maydell
  2013-05-14 12:01             ` Claudio Fontana
  2013-05-13 19:49           ` Richard Henderson
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-13 18:28 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson

On 13 May 2013 14:33, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>
> add preliminary support for TCG target aarch64.

Thanks for this patch. Some comments below.

> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
> ---
>  include/exec/exec-all.h  |    5 +-
>  tcg/aarch64/tcg-target.c | 1084 ++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/aarch64/tcg-target.h |  106 +++++
>  3 files changed, 1194 insertions(+), 1 deletion(-)
>  create mode 100644 tcg/aarch64/tcg-target.c
>  create mode 100644 tcg/aarch64/tcg-target.h
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 6362074..5c31863 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -128,7 +128,7 @@ static inline void tlb_flush(CPUArchState *env, int flush_global)
>
>  #if defined(__arm__) || defined(_ARCH_PPC) \
>      || defined(__x86_64__) || defined(__i386__) \
> -    || defined(__sparc__) \
> +    || defined(__sparc__) || defined(__aarch64__) \
>      || defined(CONFIG_TCG_INTERPRETER)
>  #define USE_DIRECT_JUMP
>  #endif
> @@ -230,6 +230,9 @@ static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
>      *(uint32_t *)jmp_addr = addr - (jmp_addr + 4);
>      /* no need to flush icache explicitly */
>  }
> +#elif defined(__aarch64__)
> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
> +#define tb_set_jmp_target1 aarch64_tb_set_jmp_target
>  #elif defined(__arm__)
>  static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
>  {
> diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
> new file mode 100644
> index 0000000..f24a567
> --- /dev/null
> +++ b/tcg/aarch64/tcg-target.c
> @@ -0,0 +1,1084 @@
> +/*
> + * Initial TCG Implementation for aarch64
> + *
> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
> + * Written by Claudio Fontana
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * (at your option) any later version.
> + *
> + * See the COPYING file in the top-level directory for details.
> + */
> +
> +#ifdef TARGET_WORDS_BIGENDIAN
> +#error "Sorry, bigendian target not supported yet."
> +#endif /* TARGET_WORDS_BIGENDIAN */
> +
> +#ifndef NDEBUG
> +static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
> +    "%x0", "%x1", "%x2", "%x3", "%x4", "%x5", "%x6", "%x7",
> +    "%x8", "%x9", "%x10", "%x11", "%x12", "%x13", "%x14", "%x15",
> +    "%x16", "%x17", "%x18", "%x19", "%x20", "%x21", "%x22", "%x23",
> +    "%x24", "%x25", "%x26", "%x27", "%x28",
> +    "%fp", /* frame pointer */
> +    "%lr", /* link register */
> +    "%sp",  /* stack pointer */
> +};
> +#endif /* NDEBUG */
> +
> +static const int tcg_target_reg_alloc_order[] = {
> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23,
> +    TCG_REG_X24, TCG_REG_X25, TCG_REG_X26, TCG_REG_X27,
> +    TCG_REG_X28,
> +
> +    TCG_REG_X9, TCG_REG_X10, TCG_REG_X11, TCG_REG_X12,
> +    TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
> +
> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,

This list seems to not have all the registers in it.
You can put the registers used for AREG0 and the temp
reg in here -- TCG will correctly not use them because
(a) AREG0 is allocated as a fixed register and (b)
the temp is put in the reserved-regs list in tcg_target_init.

It should be OK to use X16 and X17 as well, right?

> +};
> +
> +static const int tcg_target_call_iarg_regs[8] = {
> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7
> +};
> +static const int tcg_target_call_oarg_regs[1] = {
> +    TCG_REG_X0
> +};

This would be a good place to say:
#define TCG_REG_TMP TCG_REG_X8

and then use that instead of hard-coding X8 in various places
(compare the tcg/arm code which has recently made that change)

> +
> +static inline void reloc_pc26(void *code_ptr, tcg_target_long target)
> +{
> +    tcg_target_long offset;
> +    offset = (target - (tcg_target_long)code_ptr) / 4;
> +    offset &= 0x03ffffff;
> +
> +    /* mask away previous PC_REL26 parameter contents, then set offset */
> +    *(uint32_t *)code_ptr &= 0xfc000000;
> +    *(uint32_t *)code_ptr |= offset;

It's important that this function doesn't ever write
an intermediate value to the code area. In particular if
the code area is already a valid and relocated instruction
then we must never (even temporarily) write something other
than the same byte values over it. So you need to read in
the full 32 bits, modify it in a local variable and then
write the 32 bits back again.

This is necessary because when dealing with exceptions QEMU
will retranslate a block of code in-place; if it ever writes
incorrect values to memory then it's possible for us to
end up executing the incorrect values (due to split icache
and dcache).
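
For illustration only, an untested sketch of such a read-modify-write
variant, reusing the masks from the code quoted above, might be:

    static inline void reloc_pc26(void *code_ptr, tcg_target_long target)
    {
        tcg_target_long offset = (target - (tcg_target_long)code_ptr) / 4;
        /* read the whole instruction, patch the 26 bit offset in a local
           variable, then write the full 32 bits back in a single store */
        uint32_t insn = *(uint32_t *)code_ptr;
        insn = (insn & 0xfc000000) | (offset & 0x03ffffff);
        *(uint32_t *)code_ptr = insn;
    }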

> +}
> +
> +static inline void patch_reloc(uint8_t *code_ptr, int type,
> +                               tcg_target_long value, tcg_target_long addend)
> +{
> +    switch (type) {
> +    case R_AARCH64_JUMP26:
> +    case R_AARCH64_CALL26:
> +        reloc_pc26(code_ptr, value);
> +        break;
> +    default:
> +        tcg_abort();
> +    }
> +}
> +
> +/* parse target specific constraints */
> +static int target_parse_constraint(TCGArgConstraint *ct,
> +                                   const char **pct_str)
> +{
> +    const char *ct_str; ct_str = *pct_str;
> +
> +    switch (ct_str[0]) {
> +    case 'r':
> +        ct->ct |= TCG_CT_REG;
> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
> +        break;
> +    case 'l': /* qemu_ld / qemu_st address, data_reg */
> +        ct->ct |= TCG_CT_REG;
> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
> +#ifdef CONFIG_SOFTMMU
> +        /* x0 and x1 will be overwritten when reading the tlb entry,
> +           and x2, and x3 for helper args, better to avoid using them. */
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X0);
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X1);
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X2);
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X3);
> +#endif
> +        break;
> +    default:
> +        return -1;
> +    }
> +
> +    ct_str++;
> +    *pct_str = ct_str;
> +    return 0;
> +}
> +
> +static inline int tcg_target_const_match(tcg_target_long val,
> +                                         const TCGArgConstraint *arg_ct)
> +{
> +    int ct; ct = arg_ct->ct;

Please don't put multiple statements on one line.
Either "int ct = arg_ct->ct;" or put the assignment
on a line of its own.

> +
> +    if (ct & TCG_CT_CONST)
> +        return 1;

QEMU coding style requires braces for all if() statements.
You can run scripts/checkpatch.pl on your patches, and it
will pick up most of these nits. (I won't bother mentioning
other style problems checkpatch remarks on below, but there
are 38 issues in total.)

> +
> +    return 0;
> +}
> +
> +enum aarch64_cond_code {
> +    COND_EQ = 0x0,
> +    COND_NE = 0x1,
> +    COND_CS = 0x2,     /* Unsigned greater or equal */
> +    COND_HS = 0x2,      /* ALIAS greater or equal */
> +    COND_CC = 0x3,     /* Unsigned less than */
> +    COND_LO = 0x3,     /* ALIAS Lower */
> +    COND_MI = 0x4,     /* Negative */
> +    COND_PL = 0x5,     /* Zero or greater */
> +    COND_VS = 0x6,     /* Overflow */
> +    COND_VC = 0x7,     /* No overflow */
> +    COND_HI = 0x8,     /* Unsigned greater than */
> +    COND_LS = 0x9,     /* Unsigned less or equal */
> +    COND_GE = 0xa,
> +    COND_LT = 0xb,
> +    COND_GT = 0xc,
> +    COND_LE = 0xd,
> +    COND_AL = 0xe,
> +    COND_NV = 0xf,

Probably worth a comment that 'NV' doesn't mean 'never' here:
it behaves exactly like 'AL'.

> +};
> +
> +static const enum aarch64_cond_code tcg_cond_to_aarch64_cond[] = {
> +    [TCG_COND_EQ] = COND_EQ,
> +    [TCG_COND_NE] = COND_NE,
> +    [TCG_COND_LT] = COND_LT,
> +    [TCG_COND_GE] = COND_GE,
> +    [TCG_COND_LE] = COND_LE,
> +    [TCG_COND_GT] = COND_GT,
> +    /* unsigned */
> +    [TCG_COND_LTU] = COND_LO,
> +    [TCG_COND_GTU] = COND_HI,
> +    [TCG_COND_GEU] = COND_HS,
> +    [TCG_COND_LEU] = COND_LS,
> +};
> +
> +/* opcodes for LDR / STR instructions with base + simm9 addressing */
> +enum aarch64_ldst_op_data { /* size of the data moved */
> +    LDST_8 = 0x38,
> +    LDST_16 = 0x78,
> +    LDST_32 = 0xb8,
> +    LDST_64 = 0xf8,
> +};
> +enum aarch64_ldst_op_type { /* type of operation */
> +    LDST_ST = 0x0,    /* store */
> +    LDST_LD = 0x4,    /* load */
> +    LDST_LD_S_X = 0x8,  /* load and sign-extend into Xt */
> +    LDST_LD_S_W = 0xc,  /* load and sign-extend into Wt */
> +};
> +
> +enum aarch64_arith_opc {
> +    ARITH_ADD = 0x0b,
> +    ARITH_SUB = 0x4b,
> +    ARITH_AND = 0x0a,
> +    ARITH_OR = 0x2a,
> +    ARITH_XOR = 0x4a
> +};
> +
> +enum aarch64_srr_opc {
> +    SRR_SHL = 0x0,
> +    SRR_SHR = 0x4,
> +    SRR_SAR = 0x8,
> +    SRR_ROR = 0xc
> +};
> +
> +static inline enum aarch64_ldst_op_data
> +aarch64_ldst_get_data(TCGOpcode tcg_op)
> +{
> +    switch (tcg_op) {
> +    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
> +    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
> +    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
> +        return LDST_8;
> +
> +    case INDEX_op_ld16u_i32: case INDEX_op_ld16s_i32:
> +    case INDEX_op_ld16u_i64: case INDEX_op_ld16s_i64:
> +    case INDEX_op_st16_i32: case INDEX_op_st16_i64:
> +        return LDST_16;
> +
> +    case INDEX_op_ld_i32: case INDEX_op_st_i32:
> +    case INDEX_op_ld32u_i64: case INDEX_op_ld32s_i64:
> +    case INDEX_op_st32_i64:
> +        return LDST_32;
> +
> +    case INDEX_op_ld_i64: case INDEX_op_st_i64:
> +        return LDST_64;
> +
> +    default:
> +        tcg_abort();
> +    }
> +}
> +
> +static inline enum aarch64_ldst_op_type
> +aarch64_ldst_get_type(TCGOpcode tcg_op)
> +{
> +    switch (tcg_op) {
> +    case INDEX_op_st8_i32: case INDEX_op_st16_i32:
> +    case INDEX_op_st8_i64: case INDEX_op_st16_i64:
> +    case INDEX_op_st_i32:
> +    case INDEX_op_st32_i64:
> +    case INDEX_op_st_i64:
> +        return LDST_ST;
> +
> +    case INDEX_op_ld8u_i32: case INDEX_op_ld16u_i32:
> +    case INDEX_op_ld8u_i64: case INDEX_op_ld16u_i64:
> +    case INDEX_op_ld_i32:
> +    case INDEX_op_ld32u_i64:
> +    case INDEX_op_ld_i64:
> +        return LDST_LD;
> +
> +    case INDEX_op_ld8s_i32: case INDEX_op_ld16s_i32:
> +        return LDST_LD_S_W;
> +
> +    case INDEX_op_ld8s_i64: case INDEX_op_ld16s_i64:
> +    case INDEX_op_ld32s_i64:
> +        return LDST_LD_S_X;
> +
> +    default:
> +        tcg_abort();
> +    }
> +}
> +
> +static inline uint32_t tcg_in32(TCGContext *s)
> +{
> +    uint32_t v; v = *(uint32_t *)s->code_ptr;
> +    return v;
> +}
> +
> +static inline void tcg_out_ldst_9(TCGContext *s,
> +                                  enum aarch64_ldst_op_data op_data,
> +                                  enum aarch64_ldst_op_type op_type,
> +                                  int rd, int rn, tcg_target_long offset)
> +{
> +    /* use LDUR with BASE register with 9bit signed unscaled offset */
> +    unsigned int mod, off;
> +
> +    if (offset < 0) {
> +        off = (256 + offset);
> +        mod = 0x1;
> +
> +    } else {
> +        off = offset;
> +        mod = 0x0;
> +    }
> +
> +    mod |= op_type;
> +    tcg_out32(s, op_data << 24 | mod << 20 | off << 12 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_movr(TCGContext *s, int ext, int rd, int source)
> +{
> +    /* register to register move using MOV (shifted register with no shift) */
> +    /* using MOV 0x2a0003e0 | (shift).. */
> +    unsigned int base; base = ext ? 0xaa0003e0 : 0x2a0003e0;
> +    tcg_out32(s, base | source << 16 | rd);
> +}
> +
> +static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
> +                                  uint32_t value)
> +{
> +    uint32_t half, base, movk = 0;
> +    if (!value) {
> +        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
> +        return;
> +    }
> +    /* construct halfwords of the immediate with MOVZ with LSL */
> +    /* using MOVZ 0x52800000 | extended reg.. */
> +    base = ext ? 0xd2800000 : 0x52800000;
> +
> +    half = value & 0xffff;
> +    if (half) {
> +        tcg_out32(s, base | half << 5 | rd);
> +        movk = 0x20000000; /* morph next MOVZ into MOVK */
> +    }
> +
> +    half = value >> 16;
> +    if (half) { /* add shift 0x00200000. Op can be MOVZ or MOVK */
> +        tcg_out32(s, base | movk | 0x00200000 | half << 5 | rd);
> +    }
> +}
> +
> +static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
> +{
> +    uint32_t half, base, movk = 0, shift = 0;
> +    if (!value) {
> +        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
> +        return;
> +    }
> +    /* construct halfwords of the immediate with MOVZ with LSL */
> +    /* using MOVZ 0x52800000 | extended reg.. */
> +    base = 0xd2800000;
> +
> +    while (value) {
> +        half = value & 0xffff;
> +        if (half) {
> +            /* Op can be MOVZ or MOVK */
> +            tcg_out32(s, base | movk | shift | half << 5 | rd);
> +            if (!movk)
> +                movk = 0x20000000; /* morph next MOVZs into MOVKs */
> +        }
> +        value >>= 16;
> +        shift += 0x00200000;
> +    }

It should be possible to improve on this, but this will do for now.

> +}
> +
> +static inline void tcg_out_ldst_r(TCGContext *s,
> +                                  enum aarch64_ldst_op_data op_data,
> +                                  enum aarch64_ldst_op_type op_type,
> +                                  int rd, int base, int regoff)
> +{
> +    /* I can't explain the 0x6000, but objdump/gdb from linaro does that */

It is just the encoding for "no extend field" (or explicit "LSL");
check aarch64_ext_addr_regoff in the opcodes library (and NB that
AARCH64_MOD_UXTX is defined to be 6):
https://github.com/embecosm/sourceware/blob/master/opcodes/aarch64-dis.c

> +    /* load from memory to register using base + 64bit register offset */
> +    /* using f.e. STR Wt, [Xn, Xm] 0xb8600800|(regoff << 16)|(base << 5)|rd */
> +    tcg_out32(s, 0x00206800
> +              | op_data << 24 | op_type << 20 | regoff << 16 | base << 5 | rd);
> +}
> +
> +/* solve the whole ldst problem */
> +static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
> +                                enum aarch64_ldst_op_type type,
> +                                int rd, int rn, tcg_target_long offset)
> +{
> +    if (offset > -256 && offset < 256) {
> +        tcg_out_ldst_9(s, data, type, rd, rn, offset);
> +
> +    } else {
> +        tcg_out_movi64(s, TCG_REG_X8, offset);
> +        tcg_out_ldst_r(s, data, type, rd, rn, TCG_REG_X8);
> +    }
> +}
> +
> +static inline void tcg_out_movi(TCGContext *s, TCGType type,
> +                                TCGReg rd, tcg_target_long value)
> +{
> +    if (type == TCG_TYPE_I64)
> +        tcg_out_movi64(s, rd, value);
> +    else
> +        tcg_out_movi32(s, 0, rd, value);
> +}
> +
> +/* mov alias implemented with add immediate, useful to move to/from SP */
> +static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
> +{
> +    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
> +    unsigned int base; base = ext ? 0x91000000 : 0x11000000;
> +    tcg_out32(s, base | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_mov(TCGContext *s,
> +                               TCGType type, TCGReg ret, TCGReg arg)
> +{
> +    if (ret != arg)
> +        tcg_out_movr(s, type == TCG_TYPE_I64, ret, arg);
> +}
> +
> +static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg arg,
> +                              TCGReg arg1, tcg_target_long arg2)
> +{
> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_LD,
> +                 arg, arg1, arg2);
> +}
> +
> +static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> +                              TCGReg arg1, tcg_target_long arg2)
> +{
> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_ST,
> +                 arg, arg1, arg2);
> +}
> +
> +static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
> +                                 int ext, int rd, int rn, int rm)
> +{
> +    /* Using shifted register arithmetic operations */
> +    /* if extended registry operation (64bit) just or with 0x80 << 24 */
> +    unsigned int base; base = ext ? (0x80 | opc) << 24 : opc << 24;
> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_mul(TCGContext *s, int ext, int rd, int rn, int rm)
> +{
> +    /* Using MADD 0x1b000000 with Ra = wzr alias MUL 0x1b007c00 */
> +    unsigned int base; base = ext ? 0x9b007c00 : 0x1b007c00;
> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_shiftrot_reg(TCGContext *s,
> +                                        enum aarch64_srr_opc opc, int ext,
> +                                        int rd, int rn, int rm)
> +{
> +    /* using 2-source data processing instructions 0x1ac02000 */
> +    unsigned int base; base = ext ? 0x9ac02000 : 0x1ac02000;
> +    tcg_out32(s, base | rm << 16 | opc << 8 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_ubfm(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int a, unsigned int b)
> +{
> +    /* Using UBFM 0x53000000 Wd, Wn, a, b - Why ext has 4? */

It's required by the instruction encoding, that's all.

> +    unsigned int base; base = ext ? 0xd3400000 : 0x53000000;
> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_sbfm(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int a, unsigned int b)
> +{
> +    /* Using SBFM 0x13000000 Wd, Wn, a, b - Why ext has 4? */
> +    unsigned int base; base = ext ? 0x93400000 : 0x13000000;
> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_extr(TCGContext *s, int ext,
> +                                int rd, int rn, int rm, unsigned int a)
> +{
> +    /* Using EXTR 0x13800000 Wd, Wn, Wm, a - Why ext has 4? */
> +    unsigned int base; base = ext ? 0x93c00000 : 0x13800000;
> +    tcg_out32(s, base | rm << 16 | a << 10 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_shl(TCGContext *s, int ext,
> +                               int rd, int rn, unsigned int m)
> +{
> +    int bits, max;
> +    bits = ext ? 64 : 32; max = bits - 1;
> +    tcg_out_ubfm(s, ext, rd, rn, bits - (m & max), max - (m & max));
> +}
> +
> +static inline void tcg_out_shr(TCGContext *s, int ext,
> +                               int rd, int rn, unsigned int m)
> +{
> +    int max; max = ext ? 63 : 31;
> +    tcg_out_ubfm(s, ext, rd, rn, m & max, max);
> +}
> +
> +static inline void tcg_out_sar(TCGContext *s, int ext,
> +                               int rd, int rn, unsigned int m)
> +{
> +    int max; max = ext ? 63 : 31;
> +    tcg_out_sbfm(s, ext, rd, rn, m & max, max);
> +}
> +
> +static inline void tcg_out_rotr(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int m)
> +{
> +    int max; max = ext ? 63 : 31;
> +    tcg_out_extr(s, ext, rd, rn, rn, m & max);
> +}
> +
> +static inline void tcg_out_rotl(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int m)
> +{
> +    int bits, max;
> +    bits = ext ? 64 : 32; max = bits - 1;
> +    tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
> +}
> +
> +static inline void tcg_out_cmp(TCGContext *s, int ext,
> +                               int rn, int rm)
> +{
> +    /* Using CMP alias SUBS wzr, Wn, Wm */
> +    unsigned int base; base = ext ? 0xeb00001f : 0x6b00001f;
> +    tcg_out32(s, base | rm << 16 | rn << 5);
> +}
> +
> +static inline void tcg_out_csel(TCGContext *s, int ext,
> +                                int rd, int rn, int rm,
> +                                enum aarch64_cond_code c)
> +{
> +    /* Using CSEL 0x1a800000 wd, wn, wm, c */
> +    unsigned int base; base = ext ? 0x9a800000 : 0x1a800000;
> +    tcg_out32(s, base | rm << 16 | c << 12 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_goto(TCGContext *s, tcg_target_long target)
> +{
> +    tcg_target_long offset;
> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
> +        /* out of 26bit range */
> +        tcg_abort();
> +    }
> +
> +    tcg_out32(s, 0x14000000 | (offset & 0x03ffffff));
> +}
> +
> +static inline void tcg_out_goto_noaddr(TCGContext *s)
> +{
> +    /* We pay attention here to not modify the branch target by
> +       reading from the buffer. This ensure that caches and memory are
> +       kept coherent during retranslation. */
> +    uint32_t insn; insn = tcg_in32(s);

Don't put two statements on one line, please.

> +    insn |= 0x14000000;

If you're reading the whole 32 bit insn in then you need to
mask out the possible garbage in the instruction bits
before ORing in that 0x14000000. The first time around
you can't guarantee they are either zero or correct.
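
Untested sketch of what I mean, keeping your existing masks and opcodes:

    static inline void tcg_out_goto_noaddr(TCGContext *s)
    {
        /* mask away possible garbage in the high bits for the first
           translation, while keeping the offset bits for retranslation */
        uint32_t insn = tcg_in32(s);
        insn = (insn & 0x03ffffff) | 0x14000000;   /* B <offset> */
        tcg_out32(s, insn);
    }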

> +    tcg_out32(s, insn);
> +}
> +
> +/* offset is scaled and relative! Check range before calling! */
> +static inline void tcg_out_goto_cond(TCGContext *s, TCGCond c,
> +                                     tcg_target_long offset)
> +{
> +    tcg_out32(s, 0x54000000 | tcg_cond_to_aarch64_cond[c] | offset << 5);
> +}
> +
> +static inline void tcg_out_callr(TCGContext *s, int reg)
> +{
> +    tcg_out32(s, 0xd63f0000 | reg << 5);
> +}
> +
> +static inline void tcg_out_gotor(TCGContext *s, int reg)
> +{
> +    tcg_out32(s, 0xd61f0000 | reg << 5);
> +}
> +
> +static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
> +{
> +    tcg_target_long offset;
> +
> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
> +        tcg_out_movi64(s, TCG_REG_X8, target);
> +        tcg_out_callr(s, TCG_REG_X8);
> +
> +    } else {
> +        tcg_out32(s, 0x94000000 | (offset & 0x03ffffff));
> +    }
> +}
> +
> +static inline void tcg_out_ret(TCGContext *s)
> +{
> +    /* emit RET { LR } */
> +    tcg_out32(s, 0xd65f03c0);
> +}
> +
> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
> +{
> +    tcg_target_long target, offset;
> +    target = (tcg_target_long)addr;
> +    offset = (target - (tcg_target_long)jmp_addr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
> +        /* out of 26bit range */
> +        tcg_abort();
> +    }
> +
> +    patch_reloc((uint8_t *)jmp_addr, R_AARCH64_JUMP26, target, 0);
> +    flush_icache_range(jmp_addr, jmp_addr + 4);
> +}
> +
> +static inline void tcg_out_goto_label(TCGContext *s, int label_index)
> +{
> +    TCGLabel *l = &s->labels[label_index];
> +
> +    if (!l->has_value) {
> +        tcg_out_reloc(s, s->code_ptr, R_AARCH64_JUMP26, label_index, 0);
> +        tcg_out_goto_noaddr(s);
> +
> +    } else {
> +        tcg_out_goto(s, l->u.value);
> +    }
> +}
> +
> +static inline void tcg_out_goto_label_cond(TCGContext *s, TCGCond c, int label_index)
> +{
> +    tcg_target_long offset;
> +    /* backward conditional jump never seems to happen in practice,
> +       so just always use the branch trampoline */

I think I know what you mean here but this comment is a bit cryptic;
can you expand?

> +    c = tcg_invert_cond(c);
> +    offset = 2; /* skip current instr and the next */
> +    tcg_out_goto_cond(s, c, offset);
> +    tcg_out_goto_label(s, label_index); /* emit 26bit jump */
> +}
> +
> +#ifdef CONFIG_SOFTMMU
> +#include "exec/softmmu_defs.h"
> +
> +/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
> +   int mmu_idx) */
> +static const void * const qemu_ld_helpers[4] = {
> +    helper_ldb_mmu,
> +    helper_ldw_mmu,
> +    helper_ldl_mmu,
> +    helper_ldq_mmu,
> +};
> +
> +/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
> +   uintxx_t val, int mmu_idx) */
> +static const void * const qemu_st_helpers[4] = {
> +    helper_stb_mmu,
> +    helper_stw_mmu,
> +    helper_stl_mmu,
> +    helper_stq_mmu,
> +};
> +
> +#endif /* CONFIG_SOFTMMU */
> +
> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
> +{
> +    int addr_reg, data_reg;
> +#ifdef CONFIG_SOFTMMU
> +    int mem_index, s_bits;
> +#endif
> +    data_reg = args[0];
> +    addr_reg = args[1];
> +
> +#ifdef CONFIG_SOFTMMU
> +    mem_index = args[2];
> +    s_bits = opc & 3;
> +
> +    /* Should generate something like the following:
> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
> +     */

The comment says this, but you don't actually seem to have
the code to do it?

And there definitely needs to be a test somewhere in
your generated code for "did the TLB hit or miss?"

> +#  if CPU_TLB_BITS > 8
> +#   error "CPU_TLB_BITS too large"
> +#  endif
> +
> +    /* all arguments passed via registers */
> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
> +    tcg_out_movi32(s, 0, TCG_REG_X2, mem_index);
> +
> +    tcg_out_movi64(s, TCG_REG_X8, (uint64_t)qemu_ld_helpers[s_bits]);
> +    tcg_out_callr(s, TCG_REG_X8);
> +
> +    if (opc & 0x04) { /* sign extend */
> +        unsigned int bits; bits = 8 * (1 << s_bits) - 1;
> +        tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
> +
> +    } else {
> +        tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
> +    }
> +
> +#else /* !CONFIG_SOFTMMU */
> +    tcg_abort(); /* TODO */
> +#endif
> +}
> +
> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
> +{
> +    int addr_reg, data_reg;
> +#ifdef CONFIG_SOFTMMU
> +    int mem_index, s_bits;
> +#endif
> +    data_reg = args[0];
> +    addr_reg = args[1];
> +
> +#ifdef CONFIG_SOFTMMU
> +    mem_index = args[2];
> +    s_bits = opc & 3;
> +
> +    /* Should generate something like the following:
> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
> +     */
> +#  if CPU_TLB_BITS > 8
> +#   error "CPU_TLB_BITS too large"
> +#  endif
> +
> +    /* all arguments passed via registers */
> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
> +    tcg_out_movr(s, 1, TCG_REG_X2, data_reg);
> +    tcg_out_movi32(s, 0, TCG_REG_X3, mem_index);
> +
> +    tcg_out_movi64(s, TCG_REG_X8, (uint64_t)qemu_st_helpers[s_bits]);
> +    tcg_out_callr(s, TCG_REG_X8);
> +
> +#else /* !CONFIG_SOFTMMU */
> +    tcg_abort(); /* TODO */
> +#endif
> +}
> +
> +static uint8_t *tb_ret_addr;
> +
> +/* callee stack use example:
> +   stp     x29, x30, [sp,#-32]!
> +   mov     x29, sp
> +   stp     x1, x2, [sp,#16]
> +   ...
> +   ldp     x1, x2, [sp,#16]
> +   ldp     x29, x30, [sp],#32
> +   ret
> +*/
> +
> +/* push r1 and r2, and alloc stack space for a total of
> +   alloc_n elements (1 element=16 bytes, must be between 1 and 31. */
> +static inline void tcg_out_push_p(TCGContext *s,
> +                                  TCGReg r1, TCGReg r2, int alloc_n)

I think these function names would benefit from spelling
out "pair" rather than abbreviating it to "p".

> +{
> +    /* using indexed scaled simm7 STP 0x28800000 | (ext) | 0x01000000 (pre-idx)
> +       | alloc_n * (-1) << 16 | r2 << 10 | sp(31) << 5 | r1 */
> +    assert(alloc_n > 0 && alloc_n < 0x20);
> +    alloc_n = (-alloc_n) & 0x3f;
> +    tcg_out32(s, 0xa98003e0 | alloc_n << 16 | r2 << 10 | r1);
> +}
> +
> +/* dealloc stack space for a total of alloc_n elements and pop r1, r2.  */
> +static inline void tcg_out_pop_p(TCGContext *s,
> +                                 TCGReg r1, TCGReg r2, int alloc_n)
> +{
> +    /* using indexed scaled simm7 LDP 0x28c00000 | (ext) | nothing (post-idx)
> +       | alloc_n << 16 | r2 << 10 | sp(31) << 5 | r1 */
> +    assert(alloc_n > 0 && alloc_n < 0x20);
> +    tcg_out32(s, 0xa8c003e0 | alloc_n << 16 | r2 << 10 | r1);
> +}
> +
> +static inline void tcg_out_store_p(TCGContext *s,
> +                                   TCGReg r1, TCGReg r2, int idx)
> +{
> +    /* using register pair offset simm7 STP 0x29000000 | (ext)
> +       | idx << 16 | r2 << 10 | FP(29) << 5 | r1 */
> +    assert(idx > 0 && idx < 0x20);
> +    tcg_out32(s, 0xa90003a0 | idx << 16 | r2 << 10 | r1);
> +}
> +
> +static inline void tcg_out_load_p(TCGContext *s, TCGReg r1, TCGReg r2, int idx)
> +{
> +    /* using register pair offset simm7 LDP 0x29400000 | (ext)
> +       | idx << 16 | r2 << 10 | FP(29) << 5 | r1 */
> +    assert(idx > 0 && idx < 0x20);
> +    tcg_out32(s, 0xa94003a0 | idx << 16 | r2 << 10 | r1);
> +}
> +
> +static void tcg_out_op(TCGContext *s, TCGOpcode opc,
> +                       const TCGArg *args, const int *const_args)
> +{
> +    int ext = 0;
> +
> +    switch (opc) {
> +    case INDEX_op_exit_tb:
> +        tcg_out_movi64(s, TCG_REG_X0, args[0]); /* load retval in X0 */
> +        tcg_out_goto(s, (tcg_target_long)tb_ret_addr);
> +        break;
> +
> +    case INDEX_op_goto_tb:
> +#ifndef USE_DIRECT_JUMP
> +#error "USE_DIRECT_JUMP required for aarch64"
> +#endif
> +        assert(s->tb_jmp_offset != NULL); /* consistency for USE_DIRECT_JUMP */
> +        s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
> +        /* actual branch destination will be patched by
> +           aarch64_tb_set_jmp_target later, beware retranslation. */
> +        tcg_out_goto_noaddr(s);
> +        s->tb_next_offset[args[0]] = s->code_ptr - s->code_buf;
> +        break;
> +
> +    case INDEX_op_call:
> +        if (const_args[0])
> +            tcg_out_call(s, args[0]);
> +        else
> +            tcg_out_callr(s, args[0]);
> +        break;
> +
> +    case INDEX_op_br:
> +        tcg_out_goto_label(s, args[0]);
> +        break;
> +
> +    case INDEX_op_ld_i32:
> +    case INDEX_op_ld_i64:
> +    case INDEX_op_st_i32:
> +    case INDEX_op_st_i64:
> +    case INDEX_op_ld8u_i32:
> +    case INDEX_op_ld8s_i32:
> +    case INDEX_op_ld16u_i32:
> +    case INDEX_op_ld16s_i32:
> +    case INDEX_op_ld8u_i64:
> +    case INDEX_op_ld8s_i64:
> +    case INDEX_op_ld16u_i64:
> +    case INDEX_op_ld16s_i64:
> +    case INDEX_op_ld32u_i64:
> +    case INDEX_op_ld32s_i64:
> +    case INDEX_op_st8_i32:
> +    case INDEX_op_st8_i64:
> +    case INDEX_op_st16_i32:
> +    case INDEX_op_st16_i64:
> +    case INDEX_op_st32_i64:
> +        tcg_out_ldst(s, aarch64_ldst_get_data(opc), aarch64_ldst_get_type(opc),
> +                     args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_mov_i64: ext = 1;
> +    case INDEX_op_mov_i32:
> +        tcg_out_movr(s, ext, args[0], args[1]);
> +        break;
> +
> +    case INDEX_op_movi_i64:
> +        tcg_out_movi64(s, args[0], args[1]);
> +        break;
> +
> +    case INDEX_op_movi_i32:
> +        tcg_out_movi32(s, 0, args[0], args[1]);
> +        break;
> +
> +    case INDEX_op_add_i64: ext = 1;
> +    case INDEX_op_add_i32:
> +        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_sub_i64: ext = 1;
> +    case INDEX_op_sub_i32:
> +        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_and_i64: ext = 1;
> +    case INDEX_op_and_i32:
> +        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_or_i64: ext = 1;
> +    case INDEX_op_or_i32:
> +        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_xor_i64: ext = 1;
> +    case INDEX_op_xor_i32:
> +        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_mul_i64: ext = 1;
> +    case INDEX_op_mul_i32:
> +        tcg_out_mul(s, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_shl_i64: ext = 1;
> +    case INDEX_op_shl_i32:
> +        if (const_args[2])      /* LSL / UBFM Wd, Wn, (32 - m) */
> +            tcg_out_shl(s, ext, args[0], args[1], args[2]);
> +        else                    /* LSL / LSLV */
> +            tcg_out_shiftrot_reg(s, SRR_SHL, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_shr_i64: ext = 1;
> +    case INDEX_op_shr_i32:
> +        if (const_args[2])      /* LSR / UBFM Wd, Wn, m, 31 */
> +            tcg_out_shr(s, ext, args[0], args[1], args[2]);
> +        else                    /* LSR / LSRV */
> +            tcg_out_shiftrot_reg(s, SRR_SHR, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_sar_i64: ext = 1;
> +    case INDEX_op_sar_i32:
> +        if (const_args[2])      /* ASR / SBFM Wd, Wn, m, 31 */
> +            tcg_out_sar(s, ext, args[0], args[1], args[2]);
> +        else                    /* ASR / ASRV */
> +            tcg_out_shiftrot_reg(s, SRR_SAR, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_rotr_i64: ext = 1;
> +    case INDEX_op_rotr_i32:
> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, m */
> +            tcg_out_rotr(s, ext, args[0], args[1], args[2]); /* XXX UNTESTED */
> +        else                    /* ROR / RORV */
> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_rotl_i64: ext = 1;
> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, 32 - m */
> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
> +        else { /* no RSB in aarch64 unfortunately. */
> +            /* XXX UNTESTED */
> +            tcg_out_movi32(s, ext, TCG_REG_X8, ext ? 64 : 32);
> +            tcg_out_arith(s, ARITH_SUB, ext, TCG_REG_X8, TCG_REG_X8, args[2]);
> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], TCG_REG_X8);

I think you should either test this, or remove it [rot
support is optional so you could put it back in a later
patch].

> +        }
> +        break;
> +
> +    case INDEX_op_brcond_i64: ext = 1;
> +    case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
> +        tcg_out_cmp(s, ext, args[0], args[1]);
> +        tcg_out_goto_label_cond(s, args[2], args[3]);
> +        break;
> +
> +    case INDEX_op_setcond_i64: ext = 1;
> +    case INDEX_op_setcond_i32:
> +        tcg_out_movi32(s, ext, TCG_REG_X8, 0x01);
> +        tcg_out_cmp(s, ext, args[1], args[2]);
> +        tcg_out_csel(s, ext, args[0], TCG_REG_X8, TCG_REG_XZR,
> +                     tcg_cond_to_aarch64_cond[args[3]]);

Better to use CSET Xd, cond [which is an alias for
CSINC Xd, XZR, XZR, invert(cond)]
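
For example, a tcg_out_cset() helper could look something like the
untested sketch below; the constants 0x1a9f07e0 / 0x9a9f07e0 are my
reading of the CSINC base encoding 0x1a800400 with Rn = Rm = XZR folded
in, so worth double-checking against the spec:

    static inline void tcg_out_cset(TCGContext *s, int ext, int rd, TCGCond c)
    {
        /* CSET rd, c  is an alias of  CSINC rd, xzr, xzr, invert(c) */
        unsigned int base = ext ? 0x9a9f07e0 : 0x1a9f07e0;
        tcg_out32(s, base | tcg_cond_to_aarch64_cond[tcg_invert_cond(c)] << 12 | rd);
    }

That would reduce the setcond case to a tcg_out_cmp() followed by a
tcg_out_cset(), with no initial movi.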

> +        break;
> +
> +    case INDEX_op_qemu_ld8u:
> +        tcg_out_qemu_ld(s, args, 0 | 0);
> +        break;
> +    case INDEX_op_qemu_ld8s:
> +        tcg_out_qemu_ld(s, args, 4 | 0);
> +        break;
> +    case INDEX_op_qemu_ld16u:
> +        tcg_out_qemu_ld(s, args, 0 | 1);
> +        break;
> +    case INDEX_op_qemu_ld16s:
> +        tcg_out_qemu_ld(s, args, 4 | 1);
> +        break;
> +    case INDEX_op_qemu_ld32u:
> +        tcg_out_qemu_ld(s, args, 0 | 2);
> +        break;
> +    case INDEX_op_qemu_ld32s:
> +        tcg_out_qemu_ld(s, args, 4 | 2);
> +        break;
> +    case INDEX_op_qemu_ld32:
> +        tcg_out_qemu_ld(s, args, 0 | 2);
> +        break;
> +    case INDEX_op_qemu_ld64:
> +        tcg_out_qemu_ld(s, args, 0 | 3);
> +        break;
> +    case INDEX_op_qemu_st8:
> +        tcg_out_qemu_st(s, args, 0);
> +        break;
> +    case INDEX_op_qemu_st16:
> +        tcg_out_qemu_st(s, args, 1);
> +        break;
> +    case INDEX_op_qemu_st32:
> +        tcg_out_qemu_st(s, args, 2);
> +        break;
> +    case INDEX_op_qemu_st64:
> +        tcg_out_qemu_st(s, args, 3);
> +        break;
> +
> +    default:
> +        tcg_abort(); /* opcode not implemented */
> +    }
> +}
> +
> +static const TCGTargetOpDef aarch64_op_defs[] = {
> +    { INDEX_op_exit_tb, { } },
> +    { INDEX_op_goto_tb, { } },
> +    { INDEX_op_call, { "ri" } },
> +    { INDEX_op_br, { } },
> +
> +    { INDEX_op_mov_i32, { "r", "r" } },
> +    { INDEX_op_mov_i64, { "r", "r" } },
> +
> +    { INDEX_op_movi_i32, { "r" } },
> +    { INDEX_op_movi_i64, { "r" } },
> +
> +    { INDEX_op_ld8u_i32, { "r", "r" } },
> +    { INDEX_op_ld8s_i32, { "r", "r" } },
> +    { INDEX_op_ld16u_i32, { "r", "r" } },
> +    { INDEX_op_ld16s_i32, { "r", "r" } },
> +    { INDEX_op_ld_i32, { "r", "r" } },
> +    { INDEX_op_ld8u_i64, { "r", "r" } },
> +    { INDEX_op_ld8s_i64, { "r", "r" } },
> +    { INDEX_op_ld16u_i64, { "r", "r" } },
> +    { INDEX_op_ld16s_i64, { "r", "r" } },
> +    { INDEX_op_ld32u_i64, { "r", "r" } },
> +    { INDEX_op_ld32s_i64, { "r", "r" } },
> +    { INDEX_op_ld_i64, { "r", "r" } },
> +
> +    { INDEX_op_st8_i32, { "r", "r" } },
> +    { INDEX_op_st16_i32, { "r", "r" } },
> +    { INDEX_op_st_i32, { "r", "r" } },
> +    { INDEX_op_st8_i64, { "r", "r" } },
> +    { INDEX_op_st16_i64, { "r", "r" } },
> +    { INDEX_op_st32_i64, { "r", "r" } },
> +    { INDEX_op_st_i64, { "r", "r" } },
> +
> +    { INDEX_op_add_i32, { "r", "r", "r" } },
> +    { INDEX_op_add_i64, { "r", "r", "r" } },
> +    { INDEX_op_sub_i32, { "r", "r", "r" } },
> +    { INDEX_op_sub_i64, { "r", "r", "r" } },
> +    { INDEX_op_mul_i32, { "r", "r", "r" } },
> +    { INDEX_op_mul_i64, { "r", "r", "r" } },
> +    { INDEX_op_and_i32, { "r", "r", "r" } },
> +    { INDEX_op_and_i64, { "r", "r", "r" } },
> +    { INDEX_op_or_i32, { "r", "r", "r" } },
> +    { INDEX_op_or_i64, { "r", "r", "r" } },
> +    { INDEX_op_xor_i32, { "r", "r", "r" } },
> +    { INDEX_op_xor_i64, { "r", "r", "r" } },
> +
> +    { INDEX_op_shl_i32, { "r", "r", "ri" } },
> +    { INDEX_op_shr_i32, { "r", "r", "ri" } },
> +    { INDEX_op_sar_i32, { "r", "r", "ri" } },
> +    { INDEX_op_rotl_i32, { "r", "r", "ri" } },
> +    { INDEX_op_rotr_i32, { "r", "r", "ri" } },
> +    { INDEX_op_shl_i64, { "r", "r", "ri" } },
> +    { INDEX_op_shr_i64, { "r", "r", "ri" } },
> +    { INDEX_op_sar_i64, { "r", "r", "ri" } },
> +    { INDEX_op_rotl_i64, { "r", "r", "ri" } },
> +    { INDEX_op_rotr_i64, { "r", "r", "ri" } },
> +
> +    { INDEX_op_brcond_i32, { "r", "r" } },
> +    { INDEX_op_setcond_i32, { "r", "r", "r" } },
> +    { INDEX_op_brcond_i64, { "r", "r" } },
> +    { INDEX_op_setcond_i64, { "r", "r", "r" } },
> +
> +    { INDEX_op_qemu_ld8u, { "r", "l" } },
> +    { INDEX_op_qemu_ld8s, { "r", "l" } },
> +    { INDEX_op_qemu_ld16u, { "r", "l" } },
> +    { INDEX_op_qemu_ld16s, { "r", "l" } },
> +    { INDEX_op_qemu_ld32u, { "r", "l" } },
> +    { INDEX_op_qemu_ld32s, { "r", "l" } },
> +
> +    { INDEX_op_qemu_ld32, { "r", "l" } },
> +    { INDEX_op_qemu_ld64, { "r", "l" } },
> +
> +    { INDEX_op_qemu_st8, { "l", "l" } },
> +    { INDEX_op_qemu_st16, { "l", "l" } },
> +    { INDEX_op_qemu_st32, { "l", "l" } },
> +    { INDEX_op_qemu_st64, { "l", "l" } },
> +    { -1 },
> +};
> +
> +static void tcg_target_init(TCGContext *s)
> +{
> +#if !defined(CONFIG_USER_ONLY)
> +    /* fail safe */
> +    if ((1ULL << CPU_TLB_ENTRY_BITS) != sizeof(CPUTLBEntry))
> +        tcg_abort();
> +#endif
> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffff);
> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffff);
> +
> +    tcg_regset_set32(tcg_target_call_clobber_regs, 0,
> +                     (1 << TCG_REG_X0) | (1 << TCG_REG_X1) |
> +                     (1 << TCG_REG_X2) | (1 << TCG_REG_X3) |
> +                     (1 << TCG_REG_X4) | (1 << TCG_REG_X5) |
> +                     (1 << TCG_REG_X6) | (1 << TCG_REG_X7) |
> +                     (1 << TCG_REG_X8) | (1 << TCG_REG_X9) |
> +                     (1 << TCG_REG_X10) | (1 << TCG_REG_X11) |
> +                     (1 << TCG_REG_X12) | (1 << TCG_REG_X13) |
> +                     (1 << TCG_REG_X14) | (1 << TCG_REG_X15) |
> +                     (1 << TCG_REG_X16) | (1 << TCG_REG_X17) |
> +                     (1 << TCG_REG_X18) | (1 << TCG_REG_LR));
> +
> +    tcg_regset_clear(s->reserved_regs);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X8);
> +
> +    tcg_add_target_add_op_defs(aarch64_op_defs);
> +    tcg_set_frame(s, TCG_AREG0, offsetof(CPUArchState, temp_buf),
> +                  CPU_TEMP_BUF_NLONGS * sizeof(long));

tcg_set_frame() should be called in the prologue generation
function, not here. Also, please don't use temp_buf; it is
going to go away shortly, as per this patch:
 http://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03859.html

> +}
> +
> +static void tcg_target_qemu_prologue(TCGContext *s)
> +{
> +    int r;
> +    int frame_size; /* number of 16 byte items */
> +
> +    /* we need to save (FP, LR) and X19 to X28 */
> +    frame_size = (1) + (TCG_REG_X27 - TCG_REG_X19) / 2 + 1;

The comment says "X19 to X28" and the code does X27 - X19:
which is right?

Why the brackets round the first '1' ?

> +
> +    /* push (fp, lr) and update sp to final frame size */
> +    tcg_out_push_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
> +
> +    /* FP -> frame chain */
> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
> +
> +    /* store callee-preserved regs x19..x28 */
> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
> +        tcg_out_store_p(s, r, r + 1, idx);
> +    }
> +
> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
> +
> +    tb_ret_addr = s->code_ptr;
> +
> +    /* restore registers x19..x28 */
> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
> +        tcg_out_load_p(s, r, r + 1, idx);
> +    }
> +
> +    /* pop (fp, lr), restore sp to previous frame, return */
> +    tcg_out_pop_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
> +    tcg_out_ret(s);
> +}
> diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
> new file mode 100644
> index 0000000..f28af09
> --- /dev/null
> +++ b/tcg/aarch64/tcg-target.h
> @@ -0,0 +1,106 @@
> +/*
> + * Initial TCG Implementation for aarch64
> + *
> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
> + * Written by Claudio Fontana
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * (at your option) any later version.
> + *
> + * See the COPYING file in the top-level directory for details.
> + */
> +
> +#ifndef TCG_TARGET_AARCH64
> +#define TCG_TARGET_AARCH64 1
> +
> +#undef TCG_TARGET_WORDS_BIGENDIAN
> +#undef TCG_TARGET_STACK_GROWSUP
> +
> +typedef enum {
> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3, TCG_REG_X4,
> +    TCG_REG_X5, TCG_REG_X6, TCG_REG_X7, TCG_REG_X8, TCG_REG_X9,
> +    TCG_REG_X10, TCG_REG_X11, TCG_REG_X12, TCG_REG_X13, TCG_REG_X14,
> +    TCG_REG_X15, TCG_REG_X16, TCG_REG_X17, TCG_REG_X18, TCG_REG_X19,
> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23, TCG_REG_X24,
> +    TCG_REG_X25, TCG_REG_X26, TCG_REG_X27, TCG_REG_X28,
> +    TCG_REG_FP,  /* frame pointer */
> +    TCG_REG_LR, /* link register */
> +    TCG_REG_SP,  /* stack pointer or zero register */
> +    TCG_REG_XZR = TCG_REG_SP /* same register number */
> +    /* program counter is not directly accessible! */
> +} TCGReg;
> +
> +#define TCG_TARGET_NB_REGS 32
> +#define TCG_CT_CONST_ARM 0x100

This define is never used. (Eventually you'll want to define
some constraints for particular kinds of constant and some
TCG_CT_CONST_* defines to go with them but for now we don't
need either.)

> +
> +/* used for function call generation */
> +#define TCG_REG_CALL_STACK             TCG_REG_SP
> +#define TCG_TARGET_STACK_ALIGN         16
> +#define TCG_TARGET_CALL_ALIGN_ARGS      1
> +#define TCG_TARGET_CALL_STACK_OFFSET   0
> +
> +/* optional instructions */
> +#define TCG_TARGET_HAS_div_i32          0
> +#define TCG_TARGET_HAS_ext8s_i32        0
> +#define TCG_TARGET_HAS_ext16s_i32       0
> +#define TCG_TARGET_HAS_ext8u_i32        0
> +#define TCG_TARGET_HAS_ext16u_i32       0
> +#define TCG_TARGET_HAS_bswap16_i32      0
> +#define TCG_TARGET_HAS_bswap32_i32      0
> +#define TCG_TARGET_HAS_not_i32          0
> +#define TCG_TARGET_HAS_neg_i32          0
> +#define TCG_TARGET_HAS_rot_i32          1
> +#define TCG_TARGET_HAS_andc_i32         0
> +#define TCG_TARGET_HAS_orc_i32          0
> +#define TCG_TARGET_HAS_eqv_i32          0
> +#define TCG_TARGET_HAS_nand_i32         0
> +#define TCG_TARGET_HAS_nor_i32          0
> +#define TCG_TARGET_HAS_deposit_i32      0
> +#define TCG_TARGET_HAS_movcond_i32      0
> +#define TCG_TARGET_HAS_add2_i32         0
> +#define TCG_TARGET_HAS_sub2_i32         0
> +#define TCG_TARGET_HAS_mulu2_i32        0
> +#define TCG_TARGET_HAS_muls2_i32        0
> +
> +#define TCG_TARGET_HAS_div_i64          0
> +#define TCG_TARGET_HAS_ext8s_i64        0
> +#define TCG_TARGET_HAS_ext16s_i64       0
> +#define TCG_TARGET_HAS_ext32s_i64       0
> +#define TCG_TARGET_HAS_ext8u_i64        0
> +#define TCG_TARGET_HAS_ext16u_i64       0
> +#define TCG_TARGET_HAS_ext32u_i64       0
> +#define TCG_TARGET_HAS_bswap16_i64      0
> +#define TCG_TARGET_HAS_bswap32_i64      0
> +#define TCG_TARGET_HAS_bswap64_i64      0
> +#define TCG_TARGET_HAS_not_i64          0
> +#define TCG_TARGET_HAS_neg_i64          0
> +#define TCG_TARGET_HAS_rot_i64          1
> +#define TCG_TARGET_HAS_andc_i64         0
> +#define TCG_TARGET_HAS_orc_i64          0
> +#define TCG_TARGET_HAS_eqv_i64          0
> +#define TCG_TARGET_HAS_nand_i64         0
> +#define TCG_TARGET_HAS_nor_i64          0
> +#define TCG_TARGET_HAS_deposit_i64      0
> +#define TCG_TARGET_HAS_movcond_i64      0
> +#define TCG_TARGET_HAS_add2_i64         0
> +#define TCG_TARGET_HAS_sub2_i64         0
> +#define TCG_TARGET_HAS_mulu2_i64        0
> +#define TCG_TARGET_HAS_muls2_i64        0
> +
> +enum {
> +    TCG_AREG0 = TCG_REG_X19,
> +};
> +
> +static inline void flush_icache_range(tcg_target_ulong start,
> +                                      tcg_target_ulong stop)
> +{
> +#if QEMU_GNUC_PREREQ(4, 1)
> +    __builtin___clear_cache((char *)start, (char *)stop);
> +#else
> +    /* XXX should provide alternative with IC <ic_op>, Xt */
> +#error "need GNUC >= 4.1, alternative not implemented yet."
> +#endif

I think we can just assume a GCC new enough to support
__builtin___clear_cache(). Nobody's going to be compiling
aarch64 code with a gcc that old, because they didn't
support the architecture at all. You can drop the #if/#else
completely.
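
i.e. the whole thing would reduce to just:

    static inline void flush_icache_range(tcg_target_ulong start,
                                          tcg_target_ulong stop)
    {
        __builtin___clear_cache((char *)start, (char *)stop);
    }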

> +
> +}
> +
> +#endif /* TCG_TARGET_AARCH64 */
> --
> 1.8.1
>
>

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64
  2013-05-13 13:28         ` [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64 Claudio Fontana
@ 2013-05-13 18:29           ` Peter Maydell
  2013-05-14  8:19             ` Claudio Fontana
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-13 18:29 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel

On 13 May 2013 14:28, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>
> support compiling on aarch64.
>
> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>

This looks good, but it should be the last patch in the series,
so we don't allow the support to be enabled until the code
that implements it is present.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] include/elf.h: add aarch64 ELF machine and relocs
  2013-05-13 13:31         ` [Qemu-devel] [PATCH 2/3] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
@ 2013-05-13 18:34           ` Peter Maydell
  2013-05-14  8:24             ` Claudio Fontana
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-13 18:34 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel

On 13 May 2013 14:31, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>
> we will use the 26bit relative relocations in the aarch64 tcg target.

This patch looks OK, but can I ask you to just neaten up
the #defines by making the column of values line up?
(use spaces, not hardcoded tabs).

> @@ -616,6 +618,132 @@ typedef struct {
>  /* Keep this the last entry.  */
>  #define R_ARM_NUM              256

It's kind of obvious that we're doing aarch64 relocs from
here on, but it would be nice to just put a comment
here to explicitly separate the aarch64 relocs from the
32 bit ARM ones:

/* ARM AArch64 relocation types */

> +#define R_AARCH64_NONE          256 /* also accept R_ARM_NONE (0) as null */
> +/* static data relocations */
> +#define R_AARCH64_ABS64         257
> +#define R_AARCH64_ABS32         258

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-13 13:33         ` [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
  2013-05-13 18:28           ` Peter Maydell
@ 2013-05-13 19:49           ` Richard Henderson
  2013-05-14 14:05             ` Claudio Fontana
  1 sibling, 1 reply; 60+ messages in thread
From: Richard Henderson @ 2013-05-13 19:49 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Peter Maydell

On 05/13/2013 06:33 AM, Claudio Fontana wrote:
> +enum aarch64_cond_code {
> +    COND_EQ = 0x0,
> +    COND_NE = 0x1,
> +    COND_CS = 0x2,	/* Unsigned greater or equal */
> +    COND_HS = 0x2,      /* ALIAS greater or equal */

Clearer to define aliases as COND_HS = COND_CS.
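
For example:

    COND_CS = 0x2,     /* Unsigned greater or equal */
    COND_HS = COND_CS, /* ALIAS greater or equal */
    COND_CC = 0x3,     /* Unsigned less than */
    COND_LO = COND_CC, /* ALIAS Lower */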

> +static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
> +{
> +    uint32_t half, base, movk = 0, shift = 0;
> +    if (!value) {
> +        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
> +        return;
> +    }
> +    /* construct halfwords of the immediate with MOVZ with LSL */
> +    /* using MOVZ 0x52800000 | extended reg.. */
> +    base = 0xd2800000;
> +
> +    while (value) {
> +        half = value & 0xffff;
> +        if (half) {
> +            /* Op can be MOVZ or MOVK */
> +            tcg_out32(s, base | movk | shift | half << 5 | rd);
> +            if (!movk)
> +                movk = 0x20000000; /* morph next MOVZs into MOVKs */
> +        }
> +        value >>= 16;
> +        shift += 0x00200000;

You'll almost certainly want to try ADRP+ADD before decomposing into 3-4
mov[zk] instructions.
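
For values close enough to the generated code (e.g. helper addresses), a
hypothetical ADRP+ADD emitter might look roughly like the sketch below.
Caveats: untested; the 0x90000000 ADRP encoding is my reading of the spec
(the 0x91000000 ADD-immediate encoding is the one already used in
tcg_out_movr_sp); and it only covers the +-4GB ADRP range, so it would
still need a MOVZ/MOVK fallback:

    static inline void tcg_out_adrp_add(TCGContext *s, int rd,
                                        tcg_target_long target)
    {
        tcg_target_long src_page = (tcg_target_long)s->code_ptr & ~0xfff;
        tcg_target_long pages = ((target & ~0xfff) - src_page) >> 12;
        /* ADRP rd, <target page>:  1|immlo(2)|10000|immhi(19)|rd */
        tcg_out32(s, 0x90000000 | (pages & 3) << 29
                  | ((pages >> 2) & 0x7ffff) << 5 | rd);
        /* ADD rd, rd, #(target & 0xfff) */
        tcg_out32(s, 0x91000000 | (target & 0xfff) << 10 | rd << 5 | rd);
    }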

> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
> +{
> +    tcg_target_long target, offset;
> +    target = (tcg_target_long)addr;
> +    offset = (target - (tcg_target_long)jmp_addr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
> +        /* out of 26bit range */
> +        tcg_abort();
> +    }

See MAX_CODE_GEN_BUFFER_SIZE in translate-all.c.  Set this value to 128MB and
then all cross-TB branches will be in range, and the abort won't trigger.
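
Concretely, something like this in translate-all.c, following the style
of the existing per-host entries there (value as suggested above):

    #elif defined(__aarch64__)
    # define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)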

> +static inline void tcg_out_goto_label_cond(TCGContext *s, TCGCond c, int label_index)
> +{
> +    tcg_target_long offset;
> +    /* backward conditional jump never seems to happen in practice,
> +       so just always use the branch trampoline */
> +    c = tcg_invert_cond(c);
> +    offset = 2; /* skip current instr and the next */
> +    tcg_out_goto_cond(s, c, offset);
> +    tcg_out_goto_label(s, label_index); /* emit 26bit jump */
> +}

Conditional branch range is +-1MB.  You'll never see a TB that large.  You
don't need to emit a branch-across-branch.

> +    /* Should generate something like the following:
> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
> +     */
> +#  if CPU_TLB_BITS > 8
> +#   error "CPU_TLB_BITS too large"
> +#  endif

I wonder if using UBFM to extract the TLB bits and BFM with XZR to clear the
middle bits wouldn't be better, as you wouldn't be restricted on the size of
CPU_TLB_BITS.  AFAICS it would be the same number of instructions.

> +    case INDEX_op_mov_i64: ext = 1;
> +    case INDEX_op_mov_i32:
> +        tcg_out_movr(s, ext, args[0], args[1]);
> +        break;

See how the i386 backend uses macros to reduce the typing with these sorts of
paired opcodes.
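
Roughly along these lines (glue() is QEMU's token-pasting macro; the macro
name here is made up):

    #define CASE_32_64(x)                        \
        case glue(glue(INDEX_op_, x), _i64):     \
            ext = 1; /* fall through */          \
        case glue(glue(INDEX_op_, x), _i32)

    ...
    CASE_32_64(add):
        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
        break;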

> +    case INDEX_op_rotl_i64: ext = 1;
> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, 32 - m */
> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
> +        else { /* no RSB in aarch64 unfortunately. */
> +            /* XXX UNTESTED */
> +            tcg_out_movi32(s, ext, TCG_REG_X8, ext ? 64 : 32);

But A64 does have shift counts that truncate to the width of the operation.
That means the high bits of the count register may contain garbage, so you
can compute this merely as ROR = -ROL, ignoring the 32/64 distinction.
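
I.e. with the patch's own helpers, something like (untested):

    /* rotl(x, n) == rotr(x, -n); RORV only looks at the low 5/6 bits of
       the count, so the garbage in the high bits of the negation is fine */
    tcg_out_arith(s, ARITH_SUB, ext, TCG_REG_X8, TCG_REG_XZR, args[2]);
    tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], TCG_REG_X8);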

> +    case INDEX_op_setcond_i64: ext = 1;
> +    case INDEX_op_setcond_i32:
> +        tcg_out_movi32(s, ext, TCG_REG_X8, 0x01);
> +        tcg_out_cmp(s, ext, args[1], args[2]);
> +        tcg_out_csel(s, ext, args[0], TCG_REG_X8, TCG_REG_XZR,
> +                     tcg_cond_to_aarch64_cond[args[3]]);

See CSINC Wd,Wzr,Wzr,cond.  No need for the initial movi.
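
A possible helper (the 0x1a9f07e0/0x9a9f07e0 bases should be CSINC with both
source registers set to xzr, but double-check against the ARM ARM):

    static inline void tcg_out_cset(TCGContext *s, int ext, int rd, TCGCond c)
    {
        /* CSET rd, cond  ==  CSINC rd, xzr, xzr, invert(cond) */
        unsigned int base = ext ? 0x9a9f07e0 : 0x1a9f07e0;
        tcg_out32(s, base | tcg_cond_to_aarch64_cond[tcg_invert_cond(c)] << 12 | rd);
    }

so setcond becomes just tcg_out_cmp() followed by tcg_out_cset().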

> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffff);
> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffff);

Only half of your registers are marked available.
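
Presumably you want all 32 of them, i.e. something like:

    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffffffff);
    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffffffff);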


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64
  2013-05-13 18:29           ` Peter Maydell
@ 2013-05-14  8:19             ` Claudio Fontana
  0 siblings, 0 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-14  8:19 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel

On 13.05.2013 20:29, Peter Maydell wrote:
> On 13 May 2013 14:28, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>
>> support compiling on aarch64.
>>
>> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
> 
> This looks good, but it should be the last patch in the series,
> so we don't allow the support to be enabled until the code
> that implements it is present.
> 
> thanks
> -- PMM

Of course: enabling the support before the code is present would break bisection. Will fix by putting it last.

-- 
Claudio Fontana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] include/elf.h: add aarch64 ELF machine and relocs
  2013-05-13 18:34           ` Peter Maydell
@ 2013-05-14  8:24             ` Claudio Fontana
  0 siblings, 0 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-14  8:24 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel

On 13.05.2013 20:34, Peter Maydell wrote:
> On 13 May 2013 14:31, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>
>> we will use the 26bit relative relocations in the aarch64 tcg target.
> 
> This patch looks OK, but can I ask you to just neaten up
> the #defines by making the column of values line up?
> (use spaces, not hardcoded tabs).
> 
>> @@ -616,6 +618,132 @@ typedef struct {
>>  /* Keep this the last entry.  */
>>  #define R_ARM_NUM              256
> 
> It's kind of obvious that we're doing aarch64 relocs from
> here on, but it would be nice to just put a comment
> here to explicitly separate the aarch64 relocs from the
> 32 bit ARM ones:
> 
> /* ARM AArch64 relocation types */
> 
>> +#define R_AARCH64_NONE          256 /* also accept R_ARM_NONE (0) as null */
>> +/* static data relocations */
>> +#define R_AARCH64_ABS64         257
>> +#define R_AARCH64_ABS32         258
> 
> thanks
> -- PMM
> 

I agree with your comments above, will change accordingly.

-- 
Claudio Fontana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-13 18:28           ` Peter Maydell
@ 2013-05-14 12:01             ` Claudio Fontana
  2013-05-14 12:25               ` Peter Maydell
  2013-05-14 12:41               ` Laurent Desnogues
  0 siblings, 2 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-14 12:01 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson

On 13.05.2013 20:28, Peter Maydell wrote:
> On 13 May 2013 14:33, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>
>> add preliminary support for TCG target aarch64.
> 
> Thanks for this patch. Some comments below.
> 
>> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
>> ---
>>  include/exec/exec-all.h  |    5 +-
>>  tcg/aarch64/tcg-target.c | 1084 ++++++++++++++++++++++++++++++++++++++++++++++
>>  tcg/aarch64/tcg-target.h |  106 +++++
>>  3 files changed, 1194 insertions(+), 1 deletion(-)
>>  create mode 100644 tcg/aarch64/tcg-target.c
>>  create mode 100644 tcg/aarch64/tcg-target.h
>>
>> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
>> index 6362074..5c31863 100644
>> --- a/include/exec/exec-all.h
>> +++ b/include/exec/exec-all.h
>> @@ -128,7 +128,7 @@ static inline void tlb_flush(CPUArchState *env, int flush_global)
>>
>>  #if defined(__arm__) || defined(_ARCH_PPC) \
>>      || defined(__x86_64__) || defined(__i386__) \
>> -    || defined(__sparc__) \
>> +    || defined(__sparc__) || defined(__aarch64__) \
>>      || defined(CONFIG_TCG_INTERPRETER)
>>  #define USE_DIRECT_JUMP
>>  #endif
>> @@ -230,6 +230,9 @@ static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
>>      *(uint32_t *)jmp_addr = addr - (jmp_addr + 4);
>>      /* no need to flush icache explicitly */
>>  }
>> +#elif defined(__aarch64__)
>> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
>> +#define tb_set_jmp_target1 aarch64_tb_set_jmp_target
>>  #elif defined(__arm__)
>>  static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
>>  {
>> diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
>> new file mode 100644
>> index 0000000..f24a567
>> --- /dev/null
>> +++ b/tcg/aarch64/tcg-target.c
>> @@ -0,0 +1,1084 @@
>> +/*
>> + * Initial TCG Implementation for aarch64
>> + *
>> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
>> + * Written by Claudio Fontana
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * (at your option) any later version.
>> + *
>> + * See the COPYING file in the top-level directory for details.
>> + */
>> +
>> +#ifdef TARGET_WORDS_BIGENDIAN
>> +#error "Sorry, bigendian target not supported yet."
>> +#endif /* TARGET_WORDS_BIGENDIAN */
>> +
>> +#ifndef NDEBUG
>> +static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>> +    "%x0", "%x1", "%x2", "%x3", "%x4", "%x5", "%x6", "%x7",
>> +    "%x8", "%x9", "%x10", "%x11", "%x12", "%x13", "%x14", "%x15",
>> +    "%x16", "%x17", "%x18", "%x19", "%x20", "%x21", "%x22", "%x23",
>> +    "%x24", "%x25", "%x26", "%x27", "%x28",
>> +    "%fp", /* frame pointer */
>> +    "%lr", /* link register */
>> +    "%sp",  /* stack pointer */
>> +};
>> +#endif /* NDEBUG */
>> +
>> +static const int tcg_target_reg_alloc_order[] = {
>> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23,
>> +    TCG_REG_X24, TCG_REG_X25, TCG_REG_X26, TCG_REG_X27,
>> +    TCG_REG_X28,
>> +
>> +    TCG_REG_X9, TCG_REG_X10, TCG_REG_X11, TCG_REG_X12,
>> +    TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
>> +
>> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
>> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
> 
> This list seems to not have all the registers in it.
> You can put the registers used for AREG0 and the temp
> reg in here -- TCG will correctly not use them because
> (a) AREG0 is allocated as a fixed register and (b)
> the temp is put in the reserved-regs list in tcg_target_init.
> 
> It should be OK to use X16 and X17 as well, right?

I see, I can add AREG0 (X19) and temp (X8) to the list then.  

I got cold feet about using X16 and X17 when I noticed that they may be used by libgthread and other system libraries,
and because of their definitions as IP0 and IP1 ("can be used by call veneers and PLT code").
But if you are sure they are safe to use I can add them to the set as temporary registers.

I skipped X18 because of its definition as the "platform register".
If you think that's a groundless fear, I can add that to the list as well.

> 
>> +};
>> +
>> +static const int tcg_target_call_iarg_regs[8] = {
>> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
>> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7
>> +};
>> +static const int tcg_target_call_oarg_regs[1] = {
>> +    TCG_REG_X0
>> +};
> 
> This would be a good place to say:
> #define TCG_REG_TMP TCG_REG_X8
> 
> and then use that instead of hard-coding X8 in various places
> (compare the tcg/arm code which has recently made that change)

Okay.

> 
>> +
>> +static inline void reloc_pc26(void *code_ptr, tcg_target_long target)
>> +{
>> +    tcg_target_long offset;
>> +    offset = (target - (tcg_target_long)code_ptr) / 4;
>> +    offset &= 0x03ffffff;
>> +
>> +    /* mask away previous PC_REL26 parameter contents, then set offset */
>> +    *(uint32_t *)code_ptr &= 0xfc000000;
>> +    *(uint32_t *)code_ptr |= offset;
> 
> It's important that this function doesn't ever write
> an intermediate value to the code area. In particular if
> the code area is already a valid and relocated instruction
> then we must never (even temporarily) write something other
> than the same byte values over it. So you need to read in
> the full 32 bits, modify it in a local variable and then
> write the 32 bits back again.
> 
> This is necessary because when dealing with exceptions QEMU
> will retranslate a block of code in-place; if it ever writes
> incorrect values to memory then it's possible for us to
> end up executing the incorrect values (due to split icache
> and dcache).

OK, will change accordingly.

> 
>> +}
>> +
>> +static inline void patch_reloc(uint8_t *code_ptr, int type,
>> +                               tcg_target_long value, tcg_target_long addend)
>> +{
>> +    switch (type) {
>> +    case R_AARCH64_JUMP26:
>> +    case R_AARCH64_CALL26:
>> +        reloc_pc26(code_ptr, value);
>> +        break;
>> +    default:
>> +        tcg_abort();
>> +    }
>> +}
>> +
>> +/* parse target specific constraints */
>> +static int target_parse_constraint(TCGArgConstraint *ct,
>> +                                   const char **pct_str)
>> +{
>> +    const char *ct_str; ct_str = *pct_str;
>> +
>> +    switch (ct_str[0]) {
>> +    case 'r':
>> +        ct->ct |= TCG_CT_REG;
>> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
>> +        break;
>> +    case 'l': /* qemu_ld / qemu_st address, data_reg */
>> +        ct->ct |= TCG_CT_REG;
>> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
>> +#ifdef CONFIG_SOFTMMU
>> +        /* x0 and x1 will be overwritten when reading the tlb entry,
>> +           and x2, and x3 for helper args, better to avoid using them. */
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X0);
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X1);
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X2);
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X3);
>> +#endif
>> +        break;
>> +    default:
>> +        return -1;
>> +    }
>> +
>> +    ct_str++;
>> +    *pct_str = ct_str;
>> +    return 0;
>> +}
>> +
>> +static inline int tcg_target_const_match(tcg_target_long val,
>> +                                         const TCGArgConstraint *arg_ct)
>> +{
>> +    int ct; ct = arg_ct->ct;
> 
> Please don't put multiple statements on one line.
> Either "int ct = arg_ct->ct;" or put the assignment
> on a line of its own.

Ok.

> 
>> +
>> +    if (ct & TCG_CT_CONST)
>> +        return 1;
> 
> QEMU coding style requires braces for all if() statements.
> You can run scripts/checkpatch.pl on your patches, and it
> will pick up most of these nits. (I won't bother mentioning
> other style problems checkpatch remarks on below, but there
> are 38 issues in total.)

Ok, will run the whole thing through the script and change accordingly.

> 
>> +
>> +    return 0;
>> +}
>> +
>> +enum aarch64_cond_code {
>> +    COND_EQ = 0x0,
>> +    COND_NE = 0x1,
>> +    COND_CS = 0x2,     /* Unsigned greater or equal */
>> +    COND_HS = 0x2,      /* ALIAS greater or equal */
>> +    COND_CC = 0x3,     /* Unsigned less than */
>> +    COND_LO = 0x3,     /* ALIAS Lower */
>> +    COND_MI = 0x4,     /* Negative */
>> +    COND_PL = 0x5,     /* Zero or greater */
>> +    COND_VS = 0x6,     /* Overflow */
>> +    COND_VC = 0x7,     /* No overflow */
>> +    COND_HI = 0x8,     /* Unsigned greater than */
>> +    COND_LS = 0x9,     /* Unsigned less or equal */
>> +    COND_GE = 0xa,
>> +    COND_LT = 0xb,
>> +    COND_GT = 0xc,
>> +    COND_LE = 0xd,
>> +    COND_AL = 0xe,
>> +    COND_NV = 0xf,
> 
> Probably worth a comment that 'NV' doesn't mean 'never' here:
> it behaves exactly like 'AL'.

Ok.

> 
>> +};
>> +
>> +static const enum aarch64_cond_code tcg_cond_to_aarch64_cond[] = {
>> +    [TCG_COND_EQ] = COND_EQ,
>> +    [TCG_COND_NE] = COND_NE,
>> +    [TCG_COND_LT] = COND_LT,
>> +    [TCG_COND_GE] = COND_GE,
>> +    [TCG_COND_LE] = COND_LE,
>> +    [TCG_COND_GT] = COND_GT,
>> +    /* unsigned */
>> +    [TCG_COND_LTU] = COND_LO,
>> +    [TCG_COND_GTU] = COND_HI,
>> +    [TCG_COND_GEU] = COND_HS,
>> +    [TCG_COND_LEU] = COND_LS,
>> +};
>> +
>> +/* opcodes for LDR / STR instructions with base + simm9 addressing */
>> +enum aarch64_ldst_op_data { /* size of the data moved */
>> +    LDST_8 = 0x38,
>> +    LDST_16 = 0x78,
>> +    LDST_32 = 0xb8,
>> +    LDST_64 = 0xf8,
>> +};
>> +enum aarch64_ldst_op_type { /* type of operation */
>> +    LDST_ST = 0x0,    /* store */
>> +    LDST_LD = 0x4,    /* load */
>> +    LDST_LD_S_X = 0x8,  /* load and sign-extend into Xt */
>> +    LDST_LD_S_W = 0xc,  /* load and sign-extend into Wt */
>> +};
>> +
>> +enum aarch64_arith_opc {
>> +    ARITH_ADD = 0x0b,
>> +    ARITH_SUB = 0x4b,
>> +    ARITH_AND = 0x0a,
>> +    ARITH_OR = 0x2a,
>> +    ARITH_XOR = 0x4a
>> +};
>> +
>> +enum aarch64_srr_opc {
>> +    SRR_SHL = 0x0,
>> +    SRR_SHR = 0x4,
>> +    SRR_SAR = 0x8,
>> +    SRR_ROR = 0xc
>> +};
>> +
>> +static inline enum aarch64_ldst_op_data
>> +aarch64_ldst_get_data(TCGOpcode tcg_op)
>> +{
>> +    switch (tcg_op) {
>> +    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
>> +    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
>> +    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
>> +        return LDST_8;
>> +
>> +    case INDEX_op_ld16u_i32: case INDEX_op_ld16s_i32:
>> +    case INDEX_op_ld16u_i64: case INDEX_op_ld16s_i64:
>> +    case INDEX_op_st16_i32: case INDEX_op_st16_i64:
>> +        return LDST_16;
>> +
>> +    case INDEX_op_ld_i32: case INDEX_op_st_i32:
>> +    case INDEX_op_ld32u_i64: case INDEX_op_ld32s_i64:
>> +    case INDEX_op_st32_i64:
>> +        return LDST_32;
>> +
>> +    case INDEX_op_ld_i64: case INDEX_op_st_i64:
>> +        return LDST_64;
>> +
>> +    default:
>> +        tcg_abort();
>> +    }
>> +}
>> +
>> +static inline enum aarch64_ldst_op_type
>> +aarch64_ldst_get_type(TCGOpcode tcg_op)
>> +{
>> +    switch (tcg_op) {
>> +    case INDEX_op_st8_i32: case INDEX_op_st16_i32:
>> +    case INDEX_op_st8_i64: case INDEX_op_st16_i64:
>> +    case INDEX_op_st_i32:
>> +    case INDEX_op_st32_i64:
>> +    case INDEX_op_st_i64:
>> +        return LDST_ST;
>> +
>> +    case INDEX_op_ld8u_i32: case INDEX_op_ld16u_i32:
>> +    case INDEX_op_ld8u_i64: case INDEX_op_ld16u_i64:
>> +    case INDEX_op_ld_i32:
>> +    case INDEX_op_ld32u_i64:
>> +    case INDEX_op_ld_i64:
>> +        return LDST_LD;
>> +
>> +    case INDEX_op_ld8s_i32: case INDEX_op_ld16s_i32:
>> +        return LDST_LD_S_W;
>> +
>> +    case INDEX_op_ld8s_i64: case INDEX_op_ld16s_i64:
>> +    case INDEX_op_ld32s_i64:
>> +        return LDST_LD_S_X;
>> +
>> +    default:
>> +        tcg_abort();
>> +    }
>> +}
>> +
>> +static inline uint32_t tcg_in32(TCGContext *s)
>> +{
>> +    uint32_t v; v = *(uint32_t *)s->code_ptr;
>> +    return v;
>> +}
>> +
>> +static inline void tcg_out_ldst_9(TCGContext *s,
>> +                                  enum aarch64_ldst_op_data op_data,
>> +                                  enum aarch64_ldst_op_type op_type,
>> +                                  int rd, int rn, tcg_target_long offset)
>> +{
>> +    /* use LDUR with BASE register with 9bit signed unscaled offset */
>> +    unsigned int mod, off;
>> +
>> +    if (offset < 0) {
>> +        off = (256 + offset);
>> +        mod = 0x1;
>> +
>> +    } else {
>> +        off = offset;
>> +        mod = 0x0;
>> +    }
>> +
>> +    mod |= op_type;
>> +    tcg_out32(s, op_data << 24 | mod << 20 | off << 12 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_movr(TCGContext *s, int ext, int rd, int source)
>> +{
>> +    /* register to register move using MOV (shifted register with no shift) */
>> +    /* using MOV 0x2a0003e0 | (shift).. */
>> +    unsigned int base; base = ext ? 0xaa0003e0 : 0x2a0003e0;
>> +    tcg_out32(s, base | source << 16 | rd);
>> +}
>> +
>> +static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
>> +                                  uint32_t value)
>> +{
>> +    uint32_t half, base, movk = 0;
>> +    if (!value) {
>> +        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
>> +        return;
>> +    }
>> +    /* construct halfwords of the immediate with MOVZ with LSL */
>> +    /* using MOVZ 0x52800000 | extended reg.. */
>> +    base = ext ? 0xd2800000 : 0x52800000;
>> +
>> +    half = value & 0xffff;
>> +    if (half) {
>> +        tcg_out32(s, base | half << 5 | rd);
>> +        movk = 0x20000000; /* morph next MOVZ into MOVK */
>> +    }
>> +
>> +    half = value >> 16;
>> +    if (half) { /* add shift 0x00200000. Op can be MOVZ or MOVK */
>> +        tcg_out32(s, base | movk | 0x00200000 | half << 5 | rd);
>> +    }
>> +}
>> +
>> +static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
>> +{
>> +    uint32_t half, base, movk = 0, shift = 0;
>> +    if (!value) {
>> +        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
>> +        return;
>> +    }
>> +    /* construct halfwords of the immediate with MOVZ with LSL */
>> +    /* using MOVZ 0x52800000 | extended reg.. */
>> +    base = 0xd2800000;
>> +
>> +    while (value) {
>> +        half = value & 0xffff;
>> +        if (half) {
>> +            /* Op can be MOVZ or MOVK */
>> +            tcg_out32(s, base | movk | shift | half << 5 | rd);
>> +            if (!movk)
>> +                movk = 0x20000000; /* morph next MOVZs into MOVKs */
>> +        }
>> +        value >>= 16;
>> +        shift += 0x00200000;
>> +    }
> 
> It should be possible to improve on this, but this will do for now.

Yes, it should be possible to do better as an incremental change.

>> +}
>> +
>> +static inline void tcg_out_ldst_r(TCGContext *s,
>> +                                  enum aarch64_ldst_op_data op_data,
>> +                                  enum aarch64_ldst_op_type op_type,
>> +                                  int rd, int base, int regoff)
>> +{
>> +    /* I can't explain the 0x6000, but objdump/gdb from linaro does that */
> 
> It is just the encoding for "no extend field" (or explicit "LSL");
> check aarch64_ext_addr_regoff in the opcodes library (and NB that
> AARCH64_MOD_UXTX is defined to be 6):
> https://github.com/embecosm/sourceware/blob/master/opcodes/aarch64-dis.c

I see.

> 
>> +    /* load from memory to register using base + 64bit register offset */
>> +    /* using f.e. STR Wt, [Xn, Xm] 0xb8600800|(regoff << 16)|(base << 5)|rd */
>> +    tcg_out32(s, 0x00206800
>> +              | op_data << 24 | op_type << 20 | regoff << 16 | base << 5 | rd);
>> +}
>> +
>> +/* solve the whole ldst problem */
>> +static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
>> +                                enum aarch64_ldst_op_type type,
>> +                                int rd, int rn, tcg_target_long offset)
>> +{
>> +    if (offset > -256 && offset < 256) {
>> +        tcg_out_ldst_9(s, data, type, rd, rn, offset);
>> +
>> +    } else {
>> +        tcg_out_movi64(s, TCG_REG_X8, offset);
>> +        tcg_out_ldst_r(s, data, type, rd, rn, TCG_REG_X8);
>> +    }
>> +}
>> +
>> +static inline void tcg_out_movi(TCGContext *s, TCGType type,
>> +                                TCGReg rd, tcg_target_long value)
>> +{
>> +    if (type == TCG_TYPE_I64)
>> +        tcg_out_movi64(s, rd, value);
>> +    else
>> +        tcg_out_movi32(s, 0, rd, value);
>> +}
>> +
>> +/* mov alias implemented with add immediate, useful to move to/from SP */
>> +static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
>> +{
>> +    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
>> +    unsigned int base; base = ext ? 0x91000000 : 0x11000000;
>> +    tcg_out32(s, base | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_mov(TCGContext *s,
>> +                               TCGType type, TCGReg ret, TCGReg arg)
>> +{
>> +    if (ret != arg)
>> +        tcg_out_movr(s, type == TCG_TYPE_I64, ret, arg);
>> +}
>> +
>> +static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg arg,
>> +                              TCGReg arg1, tcg_target_long arg2)
>> +{
>> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_LD,
>> +                 arg, arg1, arg2);
>> +}
>> +
>> +static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
>> +                              TCGReg arg1, tcg_target_long arg2)
>> +{
>> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_ST,
>> +                 arg, arg1, arg2);
>> +}
>> +
>> +static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
>> +                                 int ext, int rd, int rn, int rm)
>> +{
>> +    /* Using shifted register arithmetic operations */
>> +    /* if extended registry operation (64bit) just or with 0x80 << 24 */
>> +    unsigned int base; base = ext ? (0x80 | opc) << 24 : opc << 24;
>> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_mul(TCGContext *s, int ext, int rd, int rn, int rm)
>> +{
>> +    /* Using MADD 0x1b000000 with Ra = wzr alias MUL 0x1b007c00 */
>> +    unsigned int base; base = ext ? 0x9b007c00 : 0x1b007c00;
>> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_shiftrot_reg(TCGContext *s,
>> +                                        enum aarch64_srr_opc opc, int ext,
>> +                                        int rd, int rn, int rm)
>> +{
>> +    /* using 2-source data processing instructions 0x1ac02000 */
>> +    unsigned int base; base = ext ? 0x9ac02000 : 0x1ac02000;
>> +    tcg_out32(s, base | rm << 16 | opc << 8 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_ubfm(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int a, unsigned int b)
>> +{
>> +    /* Using UBFM 0x53000000 Wd, Wn, a, b - Why ext has 4? */
> 
> It's required by the instruction encoding, that's all.

I see.

>> +    unsigned int base; base = ext ? 0xd3400000 : 0x53000000;
>> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_sbfm(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int a, unsigned int b)
>> +{
>> +    /* Using SBFM 0x13000000 Wd, Wn, a, b - Why ext has 4? */
>> +    unsigned int base; base = ext ? 0x93400000 : 0x13000000;
>> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_extr(TCGContext *s, int ext,
>> +                                int rd, int rn, int rm, unsigned int a)
>> +{
>> +    /* Using EXTR 0x13800000 Wd, Wn, Wm, a - Why ext has 4? */
>> +    unsigned int base; base = ext ? 0x93c00000 : 0x13800000;
>> +    tcg_out32(s, base | rm << 16 | a << 10 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_shl(TCGContext *s, int ext,
>> +                               int rd, int rn, unsigned int m)
>> +{
>> +    int bits, max;
>> +    bits = ext ? 64 : 32; max = bits - 1;
>> +    tcg_out_ubfm(s, ext, rd, rn, bits - (m & max), max - (m & max));
>> +}
>> +
>> +static inline void tcg_out_shr(TCGContext *s, int ext,
>> +                               int rd, int rn, unsigned int m)
>> +{
>> +    int max; max = ext ? 63 : 31;
>> +    tcg_out_ubfm(s, ext, rd, rn, m & max, max);
>> +}
>> +
>> +static inline void tcg_out_sar(TCGContext *s, int ext,
>> +                               int rd, int rn, unsigned int m)
>> +{
>> +    int max; max = ext ? 63 : 31;
>> +    tcg_out_sbfm(s, ext, rd, rn, m & max, max);
>> +}
>> +
>> +static inline void tcg_out_rotr(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int m)
>> +{
>> +    int max; max = ext ? 63 : 31;
>> +    tcg_out_extr(s, ext, rd, rn, rn, m & max);
>> +}
>> +
>> +static inline void tcg_out_rotl(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int m)
>> +{
>> +    int bits, max;
>> +    bits = ext ? 64 : 32; max = bits - 1;
>> +    tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
>> +}
>> +
>> +static inline void tcg_out_cmp(TCGContext *s, int ext,
>> +                               int rn, int rm)
>> +{
>> +    /* Using CMP alias SUBS wzr, Wn, Wm */
>> +    unsigned int base; base = ext ? 0xeb00001f : 0x6b00001f;
>> +    tcg_out32(s, base | rm << 16 | rn << 5);
>> +}
>> +
>> +static inline void tcg_out_csel(TCGContext *s, int ext,
>> +                                int rd, int rn, int rm,
>> +                                enum aarch64_cond_code c)
>> +{
>> +    /* Using CSEL 0x1a800000 wd, wn, wm, c */
>> +    unsigned int base; base = ext ? 0x9a800000 : 0x1a800000;
>> +    tcg_out32(s, base | rm << 16 | c << 12 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_goto(TCGContext *s, tcg_target_long target)
>> +{
>> +    tcg_target_long offset;
>> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
>> +        /* out of 26bit range */
>> +        tcg_abort();
>> +    }
>> +
>> +    tcg_out32(s, 0x14000000 | (offset & 0x03ffffff));
>> +}
>> +
>> +static inline void tcg_out_goto_noaddr(TCGContext *s)
>> +{
>> +    /* We pay attention here to not modify the branch target by
>> +       reading from the buffer. This ensure that caches and memory are
>> +       kept coherent during retranslation. */
>> +    uint32_t insn; insn = tcg_in32(s);
> 
> Don't put two statements on one line, please.

Ack.
 
>> +    insn |= 0x14000000;
> 
> If you're reading the whole 32 bit insn in then you need to
> mask out the possible garbage in the instruction bits
> before ORing in that 0x14000000. The first time around
> you can't guarantee they are either zero or correct.

I understand, will fix.

>> +    tcg_out32(s, insn);
>> +}
>> +
>> +/* offset is scaled and relative! Check range before calling! */
>> +static inline void tcg_out_goto_cond(TCGContext *s, TCGCond c,
>> +                                     tcg_target_long offset)
>> +{
>> +    tcg_out32(s, 0x54000000 | tcg_cond_to_aarch64_cond[c] | offset << 5);
>> +}
>> +
>> +static inline void tcg_out_callr(TCGContext *s, int reg)
>> +{
>> +    tcg_out32(s, 0xd63f0000 | reg << 5);
>> +}
>> +
>> +static inline void tcg_out_gotor(TCGContext *s, int reg)
>> +{
>> +    tcg_out32(s, 0xd61f0000 | reg << 5);
>> +}
>> +
>> +static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
>> +{
>> +    tcg_target_long offset;
>> +
>> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
>> +        tcg_out_movi64(s, TCG_REG_X8, target);
>> +        tcg_out_callr(s, TCG_REG_X8);
>> +
>> +    } else {
>> +        tcg_out32(s, 0x94000000 | (offset & 0x03ffffff));
>> +    }
>> +}
>> +
>> +static inline void tcg_out_ret(TCGContext *s)
>> +{
>> +    /* emit RET { LR } */
>> +    tcg_out32(s, 0xd65f03c0);
>> +}
>> +
>> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
>> +{
>> +    tcg_target_long target, offset;
>> +    target = (tcg_target_long)addr;
>> +    offset = (target - (tcg_target_long)jmp_addr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
>> +        /* out of 26bit range */
>> +        tcg_abort();
>> +    }
>> +
>> +    patch_reloc((uint8_t *)jmp_addr, R_AARCH64_JUMP26, target, 0);
>> +    flush_icache_range(jmp_addr, jmp_addr + 4);
>> +}
>> +
>> +static inline void tcg_out_goto_label(TCGContext *s, int label_index)
>> +{
>> +    TCGLabel *l = &s->labels[label_index];
>> +
>> +    if (!l->has_value) {
>> +        tcg_out_reloc(s, s->code_ptr, R_AARCH64_JUMP26, label_index, 0);
>> +        tcg_out_goto_noaddr(s);
>> +
>> +    } else {
>> +        tcg_out_goto(s, l->u.value);
>> +    }
>> +}
>> +
>> +static inline void tcg_out_goto_label_cond(TCGContext *s, TCGCond c, int label_index)
>> +{
>> +    tcg_target_long offset;
>> +    /* backward conditional jump never seems to happen in practice,
>> +       so just always use the branch trampoline */
> 
> I think I know what you mean here but this comment is a bit cryptic;
> can you expand?

I basically had some code in place before that would emit a conditional jump
if the label already had a value and was in range of the conditional jump instruction.

While doing some coverage testing though, I realized that all conditional jumps
emitted were forward jumps, where the destination was unknown,
so the [pseudo]code block

"if (hasvalue(label) AND in_range(label)) { emit_cond_jump(label); }"

was never triggered.
So the safest option seemed to be to invert the condition and do a short (2 instr) conditional jump forward,
followed by an unconditional 26bit relative jump.

If we are sure that the destinations of the conditional jumps are always in the 19bit range
(+-1MiB from PC), then we could always just use the conditional jump.
In this case, I propose to add it to the list of incremental improvements.
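
If we do go that way later, a rough, untested sketch (the 19bit reloc type and
the _noaddr helper are hypothetical, mirroring the existing 26bit ones):

    static inline void tcg_out_goto_label_cond(TCGContext *s, TCGCond c,
                                               int label_index)
    {
        TCGLabel *l = &s->labels[label_index];

        if (!l->has_value) {
            /* needs a reloc_pc19() counterpart to reloc_pc26() in patch_reloc */
            tcg_out_reloc(s, s->code_ptr, R_AARCH64_CONDBR19, label_index, 0);
            tcg_out_goto_cond_noaddr(s, c); /* hypothetical helper */
        } else {
            /* offset assumed to be within the +-1MiB B.cond range */
            tcg_out_goto_cond(s, c,
                              (l->u.value - (tcg_target_long)s->code_ptr) / 4);
        }
    }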

> 
>> +    c = tcg_invert_cond(c);
>> +    offset = 2; /* skip current instr and the next */
>> +    tcg_out_goto_cond(s, c, offset);
>> +    tcg_out_goto_label(s, label_index); /* emit 26bit jump */
>> +}
>> +
>> +#ifdef CONFIG_SOFTMMU
>> +#include "exec/softmmu_defs.h"
>> +
>> +/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
>> +   int mmu_idx) */
>> +static const void * const qemu_ld_helpers[4] = {
>> +    helper_ldb_mmu,
>> +    helper_ldw_mmu,
>> +    helper_ldl_mmu,
>> +    helper_ldq_mmu,
>> +};
>> +
>> +/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
>> +   uintxx_t val, int mmu_idx) */
>> +static const void * const qemu_st_helpers[4] = {
>> +    helper_stb_mmu,
>> +    helper_stw_mmu,
>> +    helper_stl_mmu,
>> +    helper_stq_mmu,
>> +};
>> +
>> +#endif /* CONFIG_SOFTMMU */
>> +
>> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
>> +{
>> +    int addr_reg, data_reg;
>> +#ifdef CONFIG_SOFTMMU
>> +    int mem_index, s_bits;
>> +#endif
>> +    data_reg = args[0];
>> +    addr_reg = args[1];
>> +
>> +#ifdef CONFIG_SOFTMMU
>> +    mem_index = args[2];
>> +    s_bits = opc & 3;
>> +
>> +    /* Should generate something like the following:
>> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
>> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
>> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
>> +     */
> 
> The comment says this, but you don't actually seem to have
> the code to do it?
> 
> And there definitely needs to be a test somewhere in
> your generated code for "did the TLB hit or miss?"

Yes, it's basically a TODO comment, adapted from the ARM TCG target.
Right now we always go through the C helpers, which is of course slower.

> 
>> +#  if CPU_TLB_BITS > 8
>> +#   error "CPU_TLB_BITS too large"
>> +#  endif
>> +
>> +    /* all arguments passed via registers */
>> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
>> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
>> +    tcg_out_movi32(s, 0, TCG_REG_X2, mem_index);
>> +
>> +    tcg_out_movi64(s, TCG_REG_X8, (uint64_t)qemu_ld_helpers[s_bits]);
>> +    tcg_out_callr(s, TCG_REG_X8);
>> +
>> +    if (opc & 0x04) { /* sign extend */
>> +        unsigned int bits; bits = 8 * (1 << s_bits) - 1;
>> +        tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
>> +
>> +    } else {
>> +        tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
>> +    }
>> +
>> +#else /* !CONFIG_SOFTMMU */
>> +    tcg_abort(); /* TODO */
>> +#endif
>> +}
>> +
>> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
>> +{
>> +    int addr_reg, data_reg;
>> +#ifdef CONFIG_SOFTMMU
>> +    int mem_index, s_bits;
>> +#endif
>> +    data_reg = args[0];
>> +    addr_reg = args[1];
>> +
>> +#ifdef CONFIG_SOFTMMU
>> +    mem_index = args[2];
>> +    s_bits = opc & 3;
>> +
>> +    /* Should generate something like the following:
>> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
>> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
>> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
>> +     */
>> +#  if CPU_TLB_BITS > 8
>> +#   error "CPU_TLB_BITS too large"
>> +#  endif
>> +
>> +    /* all arguments passed via registers */
>> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
>> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
>> +    tcg_out_movr(s, 1, TCG_REG_X2, data_reg);
>> +    tcg_out_movi32(s, 0, TCG_REG_X3, mem_index);
>> +
>> +    tcg_out_movi64(s, TCG_REG_X8, (uint64_t)qemu_st_helpers[s_bits]);
>> +    tcg_out_callr(s, TCG_REG_X8);
>> +
>> +#else /* !CONFIG_SOFTMMU */
>> +    tcg_abort(); /* TODO */
>> +#endif
>> +}
>> +
>> +static uint8_t *tb_ret_addr;
>> +
>> +/* callee stack use example:
>> +   stp     x29, x30, [sp,#-32]!
>> +   mov     x29, sp
>> +   stp     x1, x2, [sp,#16]
>> +   ...
>> +   ldp     x1, x2, [sp,#16]
>> +   ldp     x29, x30, [sp],#32
>> +   ret
>> +*/
>> +
>> +/* push r1 and r2, and alloc stack space for a total of
>> +   alloc_n elements (1 element=16 bytes, must be between 1 and 31. */
>> +static inline void tcg_out_push_p(TCGContext *s,
>> +                                  TCGReg r1, TCGReg r2, int alloc_n)
> 
> I think these function names would benefit from spelling
> out "pair" rather than abbreviating it to "p".

Ok, no damage in changing it.

> 
>> +{
>> +    /* using indexed scaled simm7 STP 0x28800000 | (ext) | 0x01000000 (pre-idx)
>> +       | alloc_n * (-1) << 16 | r2 << 10 | sp(31) << 5 | r1 */
>> +    assert(alloc_n > 0 && alloc_n < 0x20);
>> +    alloc_n = (-alloc_n) & 0x3f;
>> +    tcg_out32(s, 0xa98003e0 | alloc_n << 16 | r2 << 10 | r1);
>> +}
>> +
>> +/* dealloc stack space for a total of alloc_n elements and pop r1, r2.  */
>> +static inline void tcg_out_pop_p(TCGContext *s,
>> +                                 TCGReg r1, TCGReg r2, int alloc_n)
>> +{
>> +    /* using indexed scaled simm7 LDP 0x28c00000 | (ext) | nothing (post-idx)
>> +       | alloc_n << 16 | r2 << 10 | sp(31) << 5 | r1 */
>> +    assert(alloc_n > 0 && alloc_n < 0x20);
>> +    tcg_out32(s, 0xa8c003e0 | alloc_n << 16 | r2 << 10 | r1);
>> +}
>> +
>> +static inline void tcg_out_store_p(TCGContext *s,
>> +                                   TCGReg r1, TCGReg r2, int idx)
>> +{
>> +    /* using register pair offset simm7 STP 0x29000000 | (ext)
>> +       | idx << 16 | r2 << 10 | FP(29) << 5 | r1 */
>> +    assert(idx > 0 && idx < 0x20);
>> +    tcg_out32(s, 0xa90003a0 | idx << 16 | r2 << 10 | r1);
>> +}
>> +
>> +static inline void tcg_out_load_p(TCGContext *s, TCGReg r1, TCGReg r2, int idx)
>> +{
>> +    /* using register pair offset simm7 LDP 0x29400000 | (ext)
>> +       | idx << 16 | r2 << 10 | FP(29) << 5 | r1 */
>> +    assert(idx > 0 && idx < 0x20);
>> +    tcg_out32(s, 0xa94003a0 | idx << 16 | r2 << 10 | r1);
>> +}
>> +
>> +static void tcg_out_op(TCGContext *s, TCGOpcode opc,
>> +                       const TCGArg *args, const int *const_args)
>> +{
>> +    int ext = 0;
>> +
>> +    switch (opc) {
>> +    case INDEX_op_exit_tb:
>> +        tcg_out_movi64(s, TCG_REG_X0, args[0]); /* load retval in X0 */
>> +        tcg_out_goto(s, (tcg_target_long)tb_ret_addr);
>> +        break;
>> +
>> +    case INDEX_op_goto_tb:
>> +#ifndef USE_DIRECT_JUMP
>> +#error "USE_DIRECT_JUMP required for aarch64"
>> +#endif
>> +        assert(s->tb_jmp_offset != NULL); /* consistency for USE_DIRECT_JUMP */
>> +        s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
>> +        /* actual branch destination will be patched by
>> +           aarch64_tb_set_jmp_target later, beware retranslation. */
>> +        tcg_out_goto_noaddr(s);
>> +        s->tb_next_offset[args[0]] = s->code_ptr - s->code_buf;
>> +        break;
>> +
>> +    case INDEX_op_call:
>> +        if (const_args[0])
>> +            tcg_out_call(s, args[0]);
>> +        else
>> +            tcg_out_callr(s, args[0]);
>> +        break;
>> +
>> +    case INDEX_op_br:
>> +        tcg_out_goto_label(s, args[0]);
>> +        break;
>> +
>> +    case INDEX_op_ld_i32:
>> +    case INDEX_op_ld_i64:
>> +    case INDEX_op_st_i32:
>> +    case INDEX_op_st_i64:
>> +    case INDEX_op_ld8u_i32:
>> +    case INDEX_op_ld8s_i32:
>> +    case INDEX_op_ld16u_i32:
>> +    case INDEX_op_ld16s_i32:
>> +    case INDEX_op_ld8u_i64:
>> +    case INDEX_op_ld8s_i64:
>> +    case INDEX_op_ld16u_i64:
>> +    case INDEX_op_ld16s_i64:
>> +    case INDEX_op_ld32u_i64:
>> +    case INDEX_op_ld32s_i64:
>> +    case INDEX_op_st8_i32:
>> +    case INDEX_op_st8_i64:
>> +    case INDEX_op_st16_i32:
>> +    case INDEX_op_st16_i64:
>> +    case INDEX_op_st32_i64:
>> +        tcg_out_ldst(s, aarch64_ldst_get_data(opc), aarch64_ldst_get_type(opc),
>> +                     args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_mov_i64: ext = 1;
>> +    case INDEX_op_mov_i32:
>> +        tcg_out_movr(s, ext, args[0], args[1]);
>> +        break;
>> +
>> +    case INDEX_op_movi_i64:
>> +        tcg_out_movi64(s, args[0], args[1]);
>> +        break;
>> +
>> +    case INDEX_op_movi_i32:
>> +        tcg_out_movi32(s, 0, args[0], args[1]);
>> +        break;
>> +
>> +    case INDEX_op_add_i64: ext = 1;
>> +    case INDEX_op_add_i32:
>> +        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_sub_i64: ext = 1;
>> +    case INDEX_op_sub_i32:
>> +        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_and_i64: ext = 1;
>> +    case INDEX_op_and_i32:
>> +        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_or_i64: ext = 1;
>> +    case INDEX_op_or_i32:
>> +        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_xor_i64: ext = 1;
>> +    case INDEX_op_xor_i32:
>> +        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_mul_i64: ext = 1;
>> +    case INDEX_op_mul_i32:
>> +        tcg_out_mul(s, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_shl_i64: ext = 1;
>> +    case INDEX_op_shl_i32:
>> +        if (const_args[2])      /* LSL / UBFM Wd, Wn, (32 - m) */
>> +            tcg_out_shl(s, ext, args[0], args[1], args[2]);
>> +        else                    /* LSL / LSLV */
>> +            tcg_out_shiftrot_reg(s, SRR_SHL, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_shr_i64: ext = 1;
>> +    case INDEX_op_shr_i32:
>> +        if (const_args[2])      /* LSR / UBFM Wd, Wn, m, 31 */
>> +            tcg_out_shr(s, ext, args[0], args[1], args[2]);
>> +        else                    /* LSR / LSRV */
>> +            tcg_out_shiftrot_reg(s, SRR_SHR, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_sar_i64: ext = 1;
>> +    case INDEX_op_sar_i32:
>> +        if (const_args[2])      /* ASR / SBFM Wd, Wn, m, 31 */
>> +            tcg_out_sar(s, ext, args[0], args[1], args[2]);
>> +        else                    /* ASR / ASRV */
>> +            tcg_out_shiftrot_reg(s, SRR_SAR, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_rotr_i64: ext = 1;
>> +    case INDEX_op_rotr_i32:
>> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, m */
>> +            tcg_out_rotr(s, ext, args[0], args[1], args[2]); /* XXX UNTESTED */
>> +        else                    /* ROR / RORV */
>> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_rotl_i64: ext = 1;
>> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
>> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, 32 - m */
>> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
>> +        else { /* no RSB in aarch64 unfortunately. */
>> +            /* XXX UNTESTED */
>> +            tcg_out_movi32(s, ext, TCG_REG_X8, ext ? 64 : 32);
>> +            tcg_out_arith(s, ARITH_SUB, ext, TCG_REG_X8, TCG_REG_X8, args[2]);
>> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], TCG_REG_X8);
> 
> I think you should either test this, or remove it [rot
> support is optional so you could put it back in a later
> patch].

Yes, I agree. I could not find an image which triggered that code path for register rotation amounts.

> 
>> +        }
>> +        break;
>> +
>> +    case INDEX_op_brcond_i64: ext = 1;
>> +    case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
>> +        tcg_out_cmp(s, ext, args[0], args[1]);
>> +        tcg_out_goto_label_cond(s, args[2], args[3]);
>> +        break;
>> +
>> +    case INDEX_op_setcond_i64: ext = 1;
>> +    case INDEX_op_setcond_i32:
>> +        tcg_out_movi32(s, ext, TCG_REG_X8, 0x01);
>> +        tcg_out_cmp(s, ext, args[1], args[2]);
>> +        tcg_out_csel(s, ext, args[0], TCG_REG_X8, TCG_REG_XZR,
>> +                     tcg_cond_to_aarch64_cond[args[3]]);
> 
> Better to use CSET Xd, cond [which is an alias for
> CSINC Xd, XZR, XZR, invert(cond)]

Ok.

> 
>> +        break;
>> +
>> +    case INDEX_op_qemu_ld8u:
>> +        tcg_out_qemu_ld(s, args, 0 | 0);
>> +        break;
>> +    case INDEX_op_qemu_ld8s:
>> +        tcg_out_qemu_ld(s, args, 4 | 0);
>> +        break;
>> +    case INDEX_op_qemu_ld16u:
>> +        tcg_out_qemu_ld(s, args, 0 | 1);
>> +        break;
>> +    case INDEX_op_qemu_ld16s:
>> +        tcg_out_qemu_ld(s, args, 4 | 1);
>> +        break;
>> +    case INDEX_op_qemu_ld32u:
>> +        tcg_out_qemu_ld(s, args, 0 | 2);
>> +        break;
>> +    case INDEX_op_qemu_ld32s:
>> +        tcg_out_qemu_ld(s, args, 4 | 2);
>> +        break;
>> +    case INDEX_op_qemu_ld32:
>> +        tcg_out_qemu_ld(s, args, 0 | 2);
>> +        break;
>> +    case INDEX_op_qemu_ld64:
>> +        tcg_out_qemu_ld(s, args, 0 | 3);
>> +        break;
>> +    case INDEX_op_qemu_st8:
>> +        tcg_out_qemu_st(s, args, 0);
>> +        break;
>> +    case INDEX_op_qemu_st16:
>> +        tcg_out_qemu_st(s, args, 1);
>> +        break;
>> +    case INDEX_op_qemu_st32:
>> +        tcg_out_qemu_st(s, args, 2);
>> +        break;
>> +    case INDEX_op_qemu_st64:
>> +        tcg_out_qemu_st(s, args, 3);
>> +        break;
>> +
>> +    default:
>> +        tcg_abort(); /* opcode not implemented */
>> +    }
>> +}
>> +
>> +static const TCGTargetOpDef aarch64_op_defs[] = {
>> +    { INDEX_op_exit_tb, { } },
>> +    { INDEX_op_goto_tb, { } },
>> +    { INDEX_op_call, { "ri" } },
>> +    { INDEX_op_br, { } },
>> +
>> +    { INDEX_op_mov_i32, { "r", "r" } },
>> +    { INDEX_op_mov_i64, { "r", "r" } },
>> +
>> +    { INDEX_op_movi_i32, { "r" } },
>> +    { INDEX_op_movi_i64, { "r" } },
>> +
>> +    { INDEX_op_ld8u_i32, { "r", "r" } },
>> +    { INDEX_op_ld8s_i32, { "r", "r" } },
>> +    { INDEX_op_ld16u_i32, { "r", "r" } },
>> +    { INDEX_op_ld16s_i32, { "r", "r" } },
>> +    { INDEX_op_ld_i32, { "r", "r" } },
>> +    { INDEX_op_ld8u_i64, { "r", "r" } },
>> +    { INDEX_op_ld8s_i64, { "r", "r" } },
>> +    { INDEX_op_ld16u_i64, { "r", "r" } },
>> +    { INDEX_op_ld16s_i64, { "r", "r" } },
>> +    { INDEX_op_ld32u_i64, { "r", "r" } },
>> +    { INDEX_op_ld32s_i64, { "r", "r" } },
>> +    { INDEX_op_ld_i64, { "r", "r" } },
>> +
>> +    { INDEX_op_st8_i32, { "r", "r" } },
>> +    { INDEX_op_st16_i32, { "r", "r" } },
>> +    { INDEX_op_st_i32, { "r", "r" } },
>> +    { INDEX_op_st8_i64, { "r", "r" } },
>> +    { INDEX_op_st16_i64, { "r", "r" } },
>> +    { INDEX_op_st32_i64, { "r", "r" } },
>> +    { INDEX_op_st_i64, { "r", "r" } },
>> +
>> +    { INDEX_op_add_i32, { "r", "r", "r" } },
>> +    { INDEX_op_add_i64, { "r", "r", "r" } },
>> +    { INDEX_op_sub_i32, { "r", "r", "r" } },
>> +    { INDEX_op_sub_i64, { "r", "r", "r" } },
>> +    { INDEX_op_mul_i32, { "r", "r", "r" } },
>> +    { INDEX_op_mul_i64, { "r", "r", "r" } },
>> +    { INDEX_op_and_i32, { "r", "r", "r" } },
>> +    { INDEX_op_and_i64, { "r", "r", "r" } },
>> +    { INDEX_op_or_i32, { "r", "r", "r" } },
>> +    { INDEX_op_or_i64, { "r", "r", "r" } },
>> +    { INDEX_op_xor_i32, { "r", "r", "r" } },
>> +    { INDEX_op_xor_i64, { "r", "r", "r" } },
>> +
>> +    { INDEX_op_shl_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_shr_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_sar_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_rotl_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_rotr_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_shl_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_shr_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_sar_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_rotl_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_rotr_i64, { "r", "r", "ri" } },
>> +
>> +    { INDEX_op_brcond_i32, { "r", "r" } },
>> +    { INDEX_op_setcond_i32, { "r", "r", "r" } },
>> +    { INDEX_op_brcond_i64, { "r", "r" } },
>> +    { INDEX_op_setcond_i64, { "r", "r", "r" } },
>> +
>> +    { INDEX_op_qemu_ld8u, { "r", "l" } },
>> +    { INDEX_op_qemu_ld8s, { "r", "l" } },
>> +    { INDEX_op_qemu_ld16u, { "r", "l" } },
>> +    { INDEX_op_qemu_ld16s, { "r", "l" } },
>> +    { INDEX_op_qemu_ld32u, { "r", "l" } },
>> +    { INDEX_op_qemu_ld32s, { "r", "l" } },
>> +
>> +    { INDEX_op_qemu_ld32, { "r", "l" } },
>> +    { INDEX_op_qemu_ld64, { "r", "l" } },
>> +
>> +    { INDEX_op_qemu_st8, { "l", "l" } },
>> +    { INDEX_op_qemu_st16, { "l", "l" } },
>> +    { INDEX_op_qemu_st32, { "l", "l" } },
>> +    { INDEX_op_qemu_st64, { "l", "l" } },
>> +    { -1 },
>> +};
>> +
>> +static void tcg_target_init(TCGContext *s)
>> +{
>> +#if !defined(CONFIG_USER_ONLY)
>> +    /* fail safe */
>> +    if ((1ULL << CPU_TLB_ENTRY_BITS) != sizeof(CPUTLBEntry))
>> +        tcg_abort();
>> +#endif
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffff);
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffff);
>> +
>> +    tcg_regset_set32(tcg_target_call_clobber_regs, 0,
>> +                     (1 << TCG_REG_X0) | (1 << TCG_REG_X1) |
>> +                     (1 << TCG_REG_X2) | (1 << TCG_REG_X3) |
>> +                     (1 << TCG_REG_X4) | (1 << TCG_REG_X5) |
>> +                     (1 << TCG_REG_X6) | (1 << TCG_REG_X7) |
>> +                     (1 << TCG_REG_X8) | (1 << TCG_REG_X9) |
>> +                     (1 << TCG_REG_X10) | (1 << TCG_REG_X11) |
>> +                     (1 << TCG_REG_X12) | (1 << TCG_REG_X13) |
>> +                     (1 << TCG_REG_X14) | (1 << TCG_REG_X15) |
>> +                     (1 << TCG_REG_X16) | (1 << TCG_REG_X17) |
>> +                     (1 << TCG_REG_X18) | (1 << TCG_REG_LR));
>> +
>> +    tcg_regset_clear(s->reserved_regs);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X8);
>> +
>> +    tcg_add_target_add_op_defs(aarch64_op_defs);
>> +    tcg_set_frame(s, TCG_AREG0, offsetof(CPUArchState, temp_buf),
>> +                  CPU_TEMP_BUF_NLONGS * sizeof(long));
> 
> tcg_set_frame() should be called in the prologue generation
> function, not here. Also, please don't use temp_buf, it is
> going to go away shortly, as per this patch:
>  http://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03859.html
> 

I understand that you want to get rid of temp_buf;
that would mean if I understand correctly using the stack for that end;
I feel a bit uneasy about the mechanics, though, and about the stability of the result:
could we add this to the TODO list for a follow-up change,
so that we can have a working version now and put the follow-up through a whole new round of testing?

>> +}
>> +
>> +static void tcg_target_qemu_prologue(TCGContext *s)
>> +{
>> +    int r;
>> +    int frame_size; /* number of 16 byte items */
>> +
>> +    /* we need to save (FP, LR) and X19 to X28 */
>> +    frame_size = (1) + (TCG_REG_X27 - TCG_REG_X19) / 2 + 1;
> 
> The comment says "X19 to X28" and the code does X27 - X19:
> which is right?

The comment is right, and the code technically works, but it is misleading for the reader.
I will rewrite the callee-saved registers' contribution to the frame size (in 16 byte elements) as:

(TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
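
i.e. the full computation would then read something like:

    /* (FP, LR) pair plus the x19..x28 callee-saved pairs: 6 * 16 = 96 bytes */
    frame_size = 1 + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;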

> 
> Why the brackets round the first '1' ?

It represents the (FP, LR) pair, so the parentheses should help the reader notice that it refers
to the (FP, LR) pair mentioned in the comment just above.
It clearly failed, so I will try to align the comment and the code better.

> 
>> +
>> +    /* push (fp, lr) and update sp to final frame size */
>> +    tcg_out_push_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
>> +
>> +    /* FP -> frame chain */
>> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
>> +
>> +    /* store callee-preserved regs x19..x28 */
>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>> +        tcg_out_store_p(s, r, r + 1, idx);
>> +    }
>> +
>> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
>> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
>> +
>> +    tb_ret_addr = s->code_ptr;
>> +
>> +    /* restore registers x19..x28 */
>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>> +        tcg_out_load_p(s, r, r + 1, idx);
>> +    }
>> +
>> +    /* pop (fp, lr), restore sp to previous frame, return */
>> +    tcg_out_pop_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
>> +    tcg_out_ret(s);
>> +}
>> diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
>> new file mode 100644
>> index 0000000..f28af09
>> --- /dev/null
>> +++ b/tcg/aarch64/tcg-target.h
>> @@ -0,0 +1,106 @@
>> +/*
>> + * Initial TCG Implementation for aarch64
>> + *
>> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
>> + * Written by Claudio Fontana
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * (at your option) any later version.
>> + *
>> + * See the COPYING file in the top-level directory for details.
>> + */
>> +
>> +#ifndef TCG_TARGET_AARCH64
>> +#define TCG_TARGET_AARCH64 1
>> +
>> +#undef TCG_TARGET_WORDS_BIGENDIAN
>> +#undef TCG_TARGET_STACK_GROWSUP
>> +
>> +typedef enum {
>> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3, TCG_REG_X4,
>> +    TCG_REG_X5, TCG_REG_X6, TCG_REG_X7, TCG_REG_X8, TCG_REG_X9,
>> +    TCG_REG_X10, TCG_REG_X11, TCG_REG_X12, TCG_REG_X13, TCG_REG_X14,
>> +    TCG_REG_X15, TCG_REG_X16, TCG_REG_X17, TCG_REG_X18, TCG_REG_X19,
>> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23, TCG_REG_X24,
>> +    TCG_REG_X25, TCG_REG_X26, TCG_REG_X27, TCG_REG_X28,
>> +    TCG_REG_FP,  /* frame pointer */
>> +    TCG_REG_LR, /* link register */
>> +    TCG_REG_SP,  /* stack pointer or zero register */
>> +    TCG_REG_XZR = TCG_REG_SP /* same register number */
>> +    /* program counter is not directly accessible! */
>> +} TCGReg;
>> +
>> +#define TCG_TARGET_NB_REGS 32
>> +#define TCG_CT_CONST_ARM 0x100
> 
> This define is never used. (Eventually you'll want to define
> some constraints for particular kinds of constant and some
> TCG_CT_CONST_* defines to go with them but for now we don't
> need either.)

Yes, that was my intention. I will remove it for now.

>> +
>> +/* used for function call generation */
>> +#define TCG_REG_CALL_STACK             TCG_REG_SP
>> +#define TCG_TARGET_STACK_ALIGN         16
>> +#define TCG_TARGET_CALL_ALIGN_ARGS      1
>> +#define TCG_TARGET_CALL_STACK_OFFSET   0
>> +
>> +/* optional instructions */
>> +#define TCG_TARGET_HAS_div_i32          0
>> +#define TCG_TARGET_HAS_ext8s_i32        0
>> +#define TCG_TARGET_HAS_ext16s_i32       0
>> +#define TCG_TARGET_HAS_ext8u_i32        0
>> +#define TCG_TARGET_HAS_ext16u_i32       0
>> +#define TCG_TARGET_HAS_bswap16_i32      0
>> +#define TCG_TARGET_HAS_bswap32_i32      0
>> +#define TCG_TARGET_HAS_not_i32          0
>> +#define TCG_TARGET_HAS_neg_i32          0
>> +#define TCG_TARGET_HAS_rot_i32          1
>> +#define TCG_TARGET_HAS_andc_i32         0
>> +#define TCG_TARGET_HAS_orc_i32          0
>> +#define TCG_TARGET_HAS_eqv_i32          0
>> +#define TCG_TARGET_HAS_nand_i32         0
>> +#define TCG_TARGET_HAS_nor_i32          0
>> +#define TCG_TARGET_HAS_deposit_i32      0
>> +#define TCG_TARGET_HAS_movcond_i32      0
>> +#define TCG_TARGET_HAS_add2_i32         0
>> +#define TCG_TARGET_HAS_sub2_i32         0
>> +#define TCG_TARGET_HAS_mulu2_i32        0
>> +#define TCG_TARGET_HAS_muls2_i32        0
>> +
>> +#define TCG_TARGET_HAS_div_i64          0
>> +#define TCG_TARGET_HAS_ext8s_i64        0
>> +#define TCG_TARGET_HAS_ext16s_i64       0
>> +#define TCG_TARGET_HAS_ext32s_i64       0
>> +#define TCG_TARGET_HAS_ext8u_i64        0
>> +#define TCG_TARGET_HAS_ext16u_i64       0
>> +#define TCG_TARGET_HAS_ext32u_i64       0
>> +#define TCG_TARGET_HAS_bswap16_i64      0
>> +#define TCG_TARGET_HAS_bswap32_i64      0
>> +#define TCG_TARGET_HAS_bswap64_i64      0
>> +#define TCG_TARGET_HAS_not_i64          0
>> +#define TCG_TARGET_HAS_neg_i64          0
>> +#define TCG_TARGET_HAS_rot_i64          1
>> +#define TCG_TARGET_HAS_andc_i64         0
>> +#define TCG_TARGET_HAS_orc_i64          0
>> +#define TCG_TARGET_HAS_eqv_i64          0
>> +#define TCG_TARGET_HAS_nand_i64         0
>> +#define TCG_TARGET_HAS_nor_i64          0
>> +#define TCG_TARGET_HAS_deposit_i64      0
>> +#define TCG_TARGET_HAS_movcond_i64      0
>> +#define TCG_TARGET_HAS_add2_i64         0
>> +#define TCG_TARGET_HAS_sub2_i64         0
>> +#define TCG_TARGET_HAS_mulu2_i64        0
>> +#define TCG_TARGET_HAS_muls2_i64        0
>> +
>> +enum {
>> +    TCG_AREG0 = TCG_REG_X19,
>> +};
>> +
>> +static inline void flush_icache_range(tcg_target_ulong start,
>> +                                      tcg_target_ulong stop)
>> +{
>> +#if QEMU_GNUC_PREREQ(4, 1)
>> +    __builtin___clear_cache((char *)start, (char *)stop);
>> +#else
>> +    /* XXX should provide alternative with IC <ic_op>, Xt */
>> +#error "need GNUC >= 4.1, alternative not implemented yet."
>> +#endif
> 
> I think we can just assume a GCC new enough to support
> __builtin___clear_cache(). Nobody's going to be compiling
> aarch64 code with a gcc that old, because they didn't
> support the architecture at all. You can drop the #if/#else
> completely.

Ok. Should we not support non-GCC compilers though?
It should not be terrible to emit the IC instruction;
I planned to do it (or let somebody else contribute it) as a next step.
But if the project assumption is to be GCC-only, then we can drop the #if completely.

>> +
>> +}
>> +
>> +#endif /* TCG_TARGET_AARCH64 */
>> --
>> 1.8.1
>>
>>
> 
> thanks
> -- PMM
> 

Thanks,

-- 
Claudio Fontana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-14 12:01             ` Claudio Fontana
@ 2013-05-14 12:25               ` Peter Maydell
  2013-05-14 15:19                 ` Richard Henderson
  2013-05-14 12:41               ` Laurent Desnogues
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-14 12:25 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson

On 14 May 2013 13:01, Claudio Fontana <claudio.fontana@huawei.com> wrote:
> On 13.05.2013 20:28, Peter Maydell wrote:
>> On 13 May 2013 14:33, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>>
>>> add preliminary support for TCG target aarch64.
>>
>> Thanks for this patch. Some comments below.
>>
>>> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> ---
>>>  include/exec/exec-all.h  |    5 +-
>>>  tcg/aarch64/tcg-target.c | 1084 ++++++++++++++++++++++++++++++++++++++++++++++

Incidentally a 1000 line patch is pretty tedious to review
(even if most of it is OK, you have to go back over
the whole thing again when some small part changes), so it
might be worth splitting it up a little if there's a
reasonable split possible.

>> This list seems to not have all the registers in it.
>> You can put the registers used for AREG0 and the temp
>> reg in here -- TCG will correctly not use them because
>> (a) AREG0 is allocated as a fixed register and (b)
>> the temp is put in the reserved-regs list in tcg_target_init.
>>
>> It should be OK to use X16 and X17 as well, right?
>
> I see, I can add AREG0 (X19) and temp (X8) to the list then.
>
> I got cold feet about using X16 and X17 when I
> saw them being used, possibly by libgthread and other
> system libraries, and because of their definitions as IP0
> and IP1 ("can be used by call veneers and PLT code").

If they can be used by call veneers then it must
be safe to use them inside a function (as caller-saved
registers), because our caller can't be expecting
them to be preserved.

> But if you are sure they are safe to use I can add
> them to the set as temporary registers.

> I skipped X18 because of its definition as the
> "platform register".
> If you think that's a groundless fear, I can add that
> to the list as well.

I guess for cross-OS portability we should avoid it
(and so you should add it to the reserved_regs set
in tcg_target_init()). I don't think it matters
whether it appears in this array or not if it's
reserved. tcg/arm puts SP in the reg-alloc-order
array, for example, even though if we ever used it
we'd die horribly.
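
For what it's worth, a rough sketch of what tcg_target_init() might
reserve (register names as in the posted patch; the exact set is an
assumption here, not taken from the patch):

    /* sketch: registers the allocator must never hand out */
    tcg_regset_clear(s->reserved_regs);
    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);   /* stack pointer */
    tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);   /* frame pointer */
    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18);  /* platform register */
    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X8);   /* scratch/temp */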

>>> +#ifdef CONFIG_SOFTMMU
>>> +    mem_index = args[2];
>>> +    s_bits = opc & 3;
>>> +
>>> +    /* Should generate something like the following:
>>> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
>>> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
>>> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
>>> +     */
>>
>> The comment says this, but you don't actually seem to have
>> the code to do it?
>>
>> And there definitely needs to be a test somewhere in
>> your generated code for "did the TLB hit or miss?"
>
> Yes, it's basically a TODO comment, adapted from the ARM TCG target.
> Right now we always go through the C helpers, which is of course slower.

In the ARM TCG target case, the comment is describing
what the following code actually generates (and the
bit which isn't implemented is marked as "not implemented
yet"). In your case the comment is neither describing
what you actually emit nor what you ought in an ideal
world to emit.

>>> +    case INDEX_op_rotl_i64: ext = 1;
>>> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
>>> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, 32 - m */
>>> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
>>> +        else { /* no RSB in aarch64 unfortunately. */
>>> +            /* XXX UNTESTED */
>>> +            tcg_out_movi32(s, ext, TCG_REG_X8, ext ? 64 : 32);
>>> +            tcg_out_arith(s, ARITH_SUB, ext, TCG_REG_X8, TCG_REG_X8, args[2]);
>>> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], TCG_REG_X8);
>>
>> I think you should either test this, or remove it [rot
>> support is optional so you could put it back in a later
>> patch].
>
> Yes, I agree. I could not find an image which triggered that
> code path for register rotation amounts.

Try PPC : rlwmn will generate a rotl (as will other insns).

>> tcg_set_frame() should be called in the prologue generation
>> function, not here. Also, please don't use temp_buf, it is
>> going to go away shortly, as per this patch:
>>  http://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03859.html
>>
>
> I understand that you want to get rid of temp_buf;
> if I understand correctly, that would mean using the stack
> for that purpose. I feel a bit uneasy about the mechanics,
> though, and about the stability of the result:
> could we add this to the TODO list for a later change,
> so that we can have a working version now and put the
> follow-up version through a whole new round of testing?

Sorry, no. At the moment no in-tree TCG target uses temp_buf,
and it's very likely that a patch to remove it will land
before your TCG target is added, at which point your code
will no longer compile. (This is also part of a general
principle where we don't let progressive code cleanups
go 'backwards' by admitting new code which still uses the
old deprecated mechanisms.)
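
If it helps, the change itself is small; a rough sketch (the offset is
only a placeholder here, not taken from any patch) of what the prologue
could do instead of relying on temp_buf:

    /* sketch: point TCG's spill area at stack space set aside by the
       prologue; frame_off stands for wherever that space was reserved */
    tcg_set_frame(s, TCG_REG_SP, frame_off,
                  CPU_TEMP_BUF_NLONGS * sizeof(long));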

>>> +{
>>> +#if QEMU_GNUC_PREREQ(4, 1)
>>> +    __builtin___clear_cache((char *)start, (char *)stop);
>>> +#else
>>> +    /* XXX should provide alternative with IC <ic_op>, Xt */
>>> +#error "need GNUC >= 4.1, alternative not implemented yet."
>>> +#endif
>>
>> I think we can just assume a GCC new enough to support
>> __builtin___clear_cache(). Nobody's going to be compiling
>> aarch64 code with a gcc that old, because they didn't
>> support the architecture at all. You can drop the #if/#else
>> completely.
>
> Ok. Should we not support non-GCC compilers though?

We do (clang, for instance) -- they just have to support
the same set of builtins as gcc.

> It should not be terrible to emit the IC instruction,

I really don't like handwritten assembly if it can
possibly be avoided -- it's a maintenance pain. This
is exactly what the gcc builtins are for, and we
use them fairly extensively. The 32 bit ARM code
only has the hand-written version because it predates
gcc versions which support this particular builtin.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-14 12:01             ` Claudio Fontana
  2013-05-14 12:25               ` Peter Maydell
@ 2013-05-14 12:41               ` Laurent Desnogues
  1 sibling, 0 replies; 60+ messages in thread
From: Laurent Desnogues @ 2013-05-14 12:41 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Paolo Bonzini

On Tue, May 14, 2013 at 2:01 PM, Claudio Fontana
<claudio.fontana@huawei.com> wrote:
[...]
>>> +static void tcg_target_qemu_prologue(TCGContext *s)
>>> +{
>>> +    int r;
>>> +    int frame_size; /* number of 16 byte items */
>>> +
>>> +    /* we need to save (FP, LR) and X19 to X28 */
>>> +    frame_size = (1) + (TCG_REG_X27 - TCG_REG_X19) / 2 + 1;
>>
>> The comment says "X19 to X28" and the code does X27 - X19:
>> which is right?
>
> The comment is right, and the code technically works, but it is misleading for the reader.
> I will rewrite the callee-saved registers' contribution to the frame size (in 16 byte elements) as:
>
> (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;

Shouldn't that be like this?

((TCG_REG_X28 - TCG_REG_X19 + 1) + 1) / 2 + 1;

The last +1 is for FP,LR as you explained.
The first +1 is needed to count the number of regs in the
interval [x19,x28].
The second +1 is needed because if the number of regs
is odd, you want to round up and not down.

Here that'd give us (9+1+1)/2+1 = 6.

Of course that's nitpicking because the callee-saved regs
shouldn't change :-)

>>
>> Why the brackets round the first '1' ?
>
> It represents the (FP, LR) pair, so the () should help the reader notice that it refers
> to the (FP, LR) pair mentioned in the comment just above.
> It clearly failed, so I will try to align the comment and the code better.
>
>>
>>> +
>>> +    /* push (fp, lr) and update sp to final frame size */
>>> +    tcg_out_push_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
>>> +
>>> +    /* FP -> frame chain */
>>> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
>>> +
>>> +    /* store callee-preserved regs x19..x28 */
>>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {

Shouldn't the comparison be against TCG_REG_X28?  It'd
be more readable.

>>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>>> +        tcg_out_store_p(s, r, r + 1, idx);
>>> +    }
>>> +
>>> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
>>> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
>>> +
>>> +    tb_ret_addr = s->code_ptr;
>>> +
>>> +    /* restore registers x19..x28 */
>>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {

Ditto.

Thanks,

Laurent

>>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>>> +        tcg_out_load_p(s, r, r + 1, idx);
>>> +    }
>>> +
>>> +    /* pop (fp, lr), restore sp to previous frame, return */
>>> +    tcg_out_pop_p(s, TCG_REG_FP, TCG_REG_LR, frame_size);
>>> +    tcg_out_ret(s);
>>> +}

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-13 19:49           ` Richard Henderson
@ 2013-05-14 14:05             ` Claudio Fontana
  2013-05-14 15:16               ` Richard Henderson
  0 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-14 14:05 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Paolo Bonzini, qemu-devel, Peter Maydell

On 13.05.2013 21:49, Richard Henderson wrote:
> On 05/13/2013 06:33 AM, Claudio Fontana wrote:
>> +enum aarch64_cond_code {
>> +    COND_EQ = 0x0,
>> +    COND_NE = 0x1,
>> +    COND_CS = 0x2,	/* Unsigned greater or equal */
>> +    COND_HS = 0x2,      /* ALIAS greater or equal */
> 
> Clearer to define aliases as COND_HS = COND_CS.

I agree, will change.

>> +static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
>> +{
>> +    uint32_t half, base, movk = 0, shift = 0;
>> +    if (!value) {
>> +        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
>> +        return;
>> +    }
>> +    /* construct halfwords of the immediate with MOVZ with LSL */
>> +    /* using MOVZ 0x52800000 | extended reg.. */
>> +    base = 0xd2800000;
>> +
>> +    while (value) {
>> +        half = value & 0xffff;
>> +        if (half) {
>> +            /* Op can be MOVZ or MOVK */
>> +            tcg_out32(s, base | movk | shift | half << 5 | rd);
>> +            if (!movk)
>> +                movk = 0x20000000; /* morph next MOVZs into MOVKs */
>> +        }
>> +        value >>= 16;
>> +        shift += 0x00200000;
> 
> You'll almost certainly want to try ADRP+ADD before decomposing into 3-4 mov[zk]
> instructions.

I don't know the ADRP instruction, but this movi can be improved.
Right now it needs 1 to 4 mov[zk] instructions, depending on the value:
one for each 16-bit halfword of the 64-bit value that is not 0000h.
I was thinking of experimenting with this in a follow-up patch.
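
As an aside, that "one MOVZ/MOVK per non-zero 16-bit halfword" rule is
easy to write down; a small sketch (the helper name is invented here,
it is not part of the patch):

    /* sketch: number of MOVZ/MOVK insns needed for a 64-bit immediate,
       one per non-zero 16-bit halfword, and at least one when value == 0 */
    static int movi64_insn_count(uint64_t value)
    {
        int i, count = 0;
        for (i = 0; i < 4; i++) {
            if ((value >> (16 * i)) & 0xffff) {
                count++;
            }
        }
        return count ? count : 1;
    }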

> 
>> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
>> +{
>> +    tcg_target_long target, offset;
>> +    target = (tcg_target_long)addr;
>> +    offset = (target - (tcg_target_long)jmp_addr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
>> +        /* out of 26bit range */
>> +        tcg_abort();
>> +    }
> 
> See MAX_CODE_GEN_BUFFER_SIZE in translate-all.c.  Set this value to 128MB and
> then all cross-TB branches will be in range, and the abort won't trigger.

That's great, I missed that. I will add a change to that effect in translate-all.c.
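
For reference, the change amounts to one more case in the
MAX_CODE_GEN_BUFFER_SIZE ladder in translate-all.c; a sketch, with the
128MB figure taken from the suggestion above rather than from a patch:

    #elif defined(__aarch64__)
    # define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)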

> 
>> +static inline void tcg_out_goto_label_cond(TCGContext *s, TCGCond c, int label_index)
>> +{
>> +    tcg_target_long offset;
>> +    /* backward conditional jump never seems to happen in practice,
>> +       so just always use the branch trampoline */
>> +    c = tcg_invert_cond(c);
>> +    offset = 2; /* skip current instr and the next */
>> +    tcg_out_goto_cond(s, c, offset);
>> +    tcg_out_goto_label(s, label_index); /* emit 26bit jump */
>> +}
> 
> Conditional branch range is +-1MB.  You'll never see a TB that large.  You
> don't need to emit a branch-across-branch.

Is there maybe a way to do it right even in the corner case where we have
a huge list of hundreds of thousands of instructions without jumps and then a conditional jump?
Are we _guaranteed_ to never see that large a TB with some kind of define,
similarly to MAX_CODE_GEN_BUFFER_SIZE?

I know, it's a corner case that would only trigger in a very strange program, but
I would propose to add this topic to the TODO list for successive patches.

> 
>> +    /* Should generate something like the following:
>> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
>> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
>> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
>> +     */
>> +#  if CPU_TLB_BITS > 8
>> +#   error "CPU_TLB_BITS too large"
>> +#  endif
> 
> I wonder if using UBFM to extract the TLB bits and BFM with XZR to clear the
> middle bits wouldn't be better, as you wouldn't be restricted on the size of
> CPU_TLB_BITS.  AFAICS it would be the same number of instructions.

Hmm..
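
For reference, the extraction step being suggested could look roughly
like this, reusing the UBFM helper from the posted patch (only the
index extraction is shown; the scaled add onto env is a separate step):

    /* sketch: UBFX-style extract of the TLB index bits of addr_reg into x0 */
    tcg_out_ubfm(s, 1, TCG_REG_X0, addr_reg,
                 TARGET_PAGE_BITS, TARGET_PAGE_BITS + CPU_TLB_BITS - 1);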

> 
>> +    case INDEX_op_mov_i64: ext = 1;
>> +    case INDEX_op_mov_i32:
>> +        tcg_out_movr(s, ext, args[0], args[1]);
>> +        break;
> 
> See how the i386 backend uses macros to reduce the typing with these sorts of
> paired opcodes.

I saw the glue thing, and I try to stay away from that kind of preprocessor use,
as it makes it harder for newcomers to dig in, since it breaks gid / grepping
for symbols, among other things.
I instead used the editor and regexps to generate the list.
Maybe this could be patched up later if that seems to be the consensus?

> 
>> +    case INDEX_op_rotl_i64: ext = 1;
>> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
>> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, 32 - m */
>> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
>> +        else { /* no RSB in aarch64 unfortunately. */
>> +            /* XXX UNTESTED */
>> +            tcg_out_movi32(s, ext, TCG_REG_X8, ext ? 64 : 32);
> 
> But A64 does have shift counts that truncate to the width of the operation.
> Which means that the high bits may contain garbage, which means that you can
> compute this merely as ROR = -ROL, ignoring the 32/64.

I see.
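
A minimal sketch of that, using the helpers from the posted patch (the
SUB from XZR gives the negated count, and RORV truncates the count to
the operation width, so the high bits don't matter):

    /* sketch: rotl by a register amount as ror by the negated amount */
    tcg_out_arith(s, ARITH_SUB, ext, TCG_REG_X8, TCG_REG_XZR, args[2]);
    tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], TCG_REG_X8);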

> 
>> +    case INDEX_op_setcond_i64: ext = 1;
>> +    case INDEX_op_setcond_i32:
>> +        tcg_out_movi32(s, ext, TCG_REG_X8, 0x01);
>> +        tcg_out_cmp(s, ext, args[1], args[2]);
>> +        tcg_out_csel(s, ext, args[0], TCG_REG_X8, TCG_REG_XZR,
>> +                     tcg_cond_to_aarch64_cond[args[3]]);
> 
> See CSINC Wd,Wzr,Wzr,cond.  No need for the initial movi.

Yes, that was mentioned also by Peter, will change.

> 
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffff);
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffff);
> 
> Only half of your registers are marked available.

Oops, will fix. Thanks,

-- 
Claudio Fontana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-14 14:05             ` Claudio Fontana
@ 2013-05-14 15:16               ` Richard Henderson
  2013-05-14 16:26                 ` Richard Henderson
  0 siblings, 1 reply; 60+ messages in thread
From: Richard Henderson @ 2013-05-14 15:16 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Peter Maydell

On 05/14/2013 07:05 AM, Claudio Fontana wrote:
>> Conditional branch range is +-1MB.  You'll never see a TB that large.  You
>> don't need to emit a branch-across-branch.
> 
> Is there maybe a way to do it right even in the corner case where we have
> a huge list of hundreds of thousands of instructions without jumps and then a conditional jump?
> Are we _guaranteed_ to never see that large a TB with some kind of define,
> similarly to MAX_CODE_GEN_BUFFER_SIZE?

There are three mechanisms that all limit TB size:
  (1) OPC_MAX_SIZE, limiting the number of opcodes emitted,
  (2) CF_COUNT_MASK, limiting the number of instructions translated,
  (3) Instruction pointer crossing a page boundary, where we end a TB
      and re-verify the page protection bits of the new page.

Nr 1 is probably the most significant, since it most directly relates to
the number of output instructions, and thus the resulting TB size.


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-14 12:25               ` Peter Maydell
@ 2013-05-14 15:19                 ` Richard Henderson
  2013-05-16 14:39                   ` Claudio Fontana
  0 siblings, 1 reply; 60+ messages in thread
From: Richard Henderson @ 2013-05-14 15:19 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, Claudio Fontana, qemu-devel

On 05/14/2013 05:25 AM, Peter Maydell wrote:
>> Yes, I agree. I could not find an image which triggered that
>> > code path for register rotation amounts.
> Try PPC : rlwmn will generate a rotl (as will other insns).
> 

rlwmn will only generate constant rotations; at issue are
variable rotations.

Those ought to be generatable with current sources and x86
32-bit or 64-bit rotate insns, though.  That cleanup was done
during this release cycle, so if Claudio's testing was on a
previous release...


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-14 15:16               ` Richard Henderson
@ 2013-05-14 16:26                 ` Richard Henderson
  0 siblings, 0 replies; 60+ messages in thread
From: Richard Henderson @ 2013-05-14 16:26 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Peter Maydell

On 05/14/2013 08:16 AM, Richard Henderson wrote:
> On 05/14/2013 07:05 AM, Claudio Fontana wrote:
>>> Conditional branch range is +-1MB.  You'll never see a TB that large.  You
>>> don't need to emit a branch-across-branch.
>>
>> Is there maybe a way to do it right even in the corner case where we have
>> a huge list of hundreds of thousands of instructions without jumps and then a conditional jump?
>> Are we _guaranteed_ to never see that large a TB with some kind of define,
>> similarly to MAX_CODE_GEN_BUFFER_SIZE?
> 
> There are three mechanisms that all limit TB size:
>   (1) OPC_MAX_SIZE, limiting the number of opcodes emitted,
>   (2) CF_COUNT_MASK, limiting the number of instructions translated,
>   (3) Instruction pointer crossing a page boundary, where we end a TB
>       and re-verify the page protection bits of the new page.
> 
> Nr 1 is probably the most significant, since it most directly relates to
> the number of output instructions, and thus the resulting TB size.

BTW, for comparison, tcg/s390/tcg-target.c works well enough with just 16 bits
on the relative branch insns; eight times smaller than your 19 bits.


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-14 15:19                 ` Richard Henderson
@ 2013-05-16 14:39                   ` Claudio Fontana
  0 siblings, 0 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-16 14:39 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, qemu-devel, Paolo Bonzini

On 14.05.2013 17:19, Richard Henderson wrote:
> On 05/14/2013 05:25 AM, Peter Maydell wrote:
>>> Yes, I agree. I could not find an image which triggered that
>>>> code path for register rotation amounts.
>> Try PPC : rlwmn will generate a rotl (as will other insns).
>>
> 
> rlwmn will only generate constant rotations; at issue are
> variable rotations.
> 
> Those ought to be generatable with current sources and x86
> 32-bit or 64-bit rotate insns, though.  That cleanup was done
> during this release cycle, so if Claudio's testing was on a
> previous release...

Indeed, I was able to test that codepath today after rebasing on current QEMU.

We are working on a new patchset that tries to incorporate the changes discussed up to now.

Thanks,

Claudio

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2
  2013-03-14 15:57 [Qemu-devel] QEMU aarch64 TCG target Claudio Fontana
  2013-03-14 16:16 ` Peter Maydell
@ 2013-05-23  8:09 ` Claudio Fontana
  2013-05-23  8:14   ` [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
                     ` (4 more replies)
  1 sibling, 5 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-23  8:09 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson


This series implements preliminary support for the ARM aarch64 TCG target.

Limitations of this initial implementation (TODOs) include:

 * missing TLB lookup in qemu_ld/st [C helpers always called].
   An incremental patch, which requires this series, is coming up
   from teammate Jani Kokkonen to implement this.
 * most optional opcodes are not implemented yet (only rotation done).
 * CONFIG_SOFTMMU only
 * only little endian qemu targets supported

Tested on an x86-64 physical machine running Foundation v8,
with a minimal Linux 3.8.0-rc6+ host system based on the linaro v8
image 201301271620 for user space.

Tested guests: arm v5 test image, i386 FreeDOS test image,
i386 linux test image, all from qemu-devel testing page.
Also tested on x86-64/linux built with buildroot,
and on arm v7/linux built with buildroot as well.

checkpatch emits a false positive for the last patch regarding
missing braces which are actually there. I suspect it is
because of a comment.

checkpatch also complains about the labeled statements in
the switch, which I think are in fact good for readability.

Claudio Fontana (4):
  include/elf.h: add aarch64 ELF machine and relocs
  tcg/aarch64: implement new TCG target for aarch64
  configure: permit compilation on arm aarch64
  tcg/aarch64: implement more low level ops in preparation of tlb lookup

 configure                |    8 +
 include/elf.h            |  129 +++++
 include/exec/exec-all.h  |    5 +-
 tcg/aarch64/tcg-target.c | 1203 ++++++++++++++++++++++++++++++++++++++++++++++
 tcg/aarch64/tcg-target.h |   99 ++++
 translate-all.c          |    2 +
 6 files changed, 1445 insertions(+), 1 deletion(-)
 create mode 100644 tcg/aarch64/tcg-target.c
 create mode 100644 tcg/aarch64/tcg-target.h

-- 
1.8.1

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs
  2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
@ 2013-05-23  8:14   ` Claudio Fontana
  2013-05-23 13:18     ` Peter Maydell
  2013-05-28  8:09     ` Laurent Desnogues
  2013-05-23  8:18   ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-23  8:14 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson


we will use the 26bit relative relocs in the aarch64 tcg target.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 include/elf.h | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)

diff --git a/include/elf.h b/include/elf.h
index a21ea53..cf0d3e2 100644
--- a/include/elf.h
+++ b/include/elf.h
@@ -129,6 +129,8 @@ typedef int64_t  Elf64_Sxword;
 
 #define EM_XTENSA   94      /* Tensilica Xtensa */
 
+#define EM_AARCH64  183
+
 /* This is the info that is needed to parse the dynamic section of the file */
 #define DT_NULL		0
 #define DT_NEEDED	1
@@ -616,6 +618,133 @@ typedef struct {
 /* Keep this the last entry.  */
 #define R_ARM_NUM		256
 
+/* ARM Aarch64 relocation types */
+#define R_AARCH64_NONE                256 /* also accepts R_ARM_NONE (0) */
+/* static data relocations */
+#define R_AARCH64_ABS64               257
+#define R_AARCH64_ABS32               258
+#define R_AARCH64_ABS16               259
+#define R_AARCH64_PREL64              260
+#define R_AARCH64_PREL32              261
+#define R_AARCH64_PREL16              262
+/* static aarch64 group relocations */
+/* group relocs to create unsigned data value or address inline */
+#define R_AARCH64_MOVW_UABS_G0        263
+#define R_AARCH64_MOVW_UABS_G0_NC     264
+#define R_AARCH64_MOVW_UABS_G1        265
+#define R_AARCH64_MOVW_UABS_G1_NC     266
+#define R_AARCH64_MOVW_UABS_G2        267
+#define R_AARCH64_MOVW_UABS_G2_NC     268
+#define R_AARCH64_MOVW_UABS_G3        269
+/* group relocs to create signed data or offset value inline */
+#define R_AARCH64_MOVW_SABS_G0        270
+#define R_AARCH64_MOVW_SABS_G1        271
+#define R_AARCH64_MOVW_SABS_G2        272
+/* relocs to generate 19, 21, and 33 bit PC-relative addresses */
+#define R_AARCH64_LD_PREL_LO19        273
+#define R_AARCH64_ADR_PREL_LO21       274
+#define R_AARCH64_ADR_PREL_PG_HI21    275
+#define R_AARCH64_ADR_PREL_PG_HI21_NC 276
+#define R_AARCH64_ADD_ABS_LO12_NC     277
+#define R_AARCH64_LDST8_ABS_LO12_NC   278
+#define R_AARCH64_LDST16_ABS_LO12_NC  284
+#define R_AARCH64_LDST32_ABS_LO12_NC  285
+#define R_AARCH64_LDST64_ABS_LO12_NC  286
+#define R_AARCH64_LDST128_ABS_LO12_NC 299
+/* relocs for control-flow - all offsets as multiple of 4 */
+#define R_AARCH64_TSTBR14             279
+#define R_AARCH64_CONDBR19            280
+#define R_AARCH64_JUMP26              282
+#define R_AARCH64_CALL26              283
+/* group relocs to create pc-relative offset inline */
+#define R_AARCH64_MOVW_PREL_G0        287
+#define R_AARCH64_MOVW_PREL_G0_NC     288
+#define R_AARCH64_MOVW_PREL_G1        289
+#define R_AARCH64_MOVW_PREL_G1_NC     290
+#define R_AARCH64_MOVW_PREL_G2        291
+#define R_AARCH64_MOVW_PREL_G2_NC     292
+#define R_AARCH64_MOVW_PREL_G3        293
+/* group relocs to create a GOT-relative offset inline */
+#define R_AARCH64_MOVW_GOTOFF_G0      300
+#define R_AARCH64_MOVW_GOTOFF_G0_NC   301
+#define R_AARCH64_MOVW_GOTOFF_G1      302
+#define R_AARCH64_MOVW_GOTOFF_G1_NC   303
+#define R_AARCH64_MOVW_GOTOFF_G2      304
+#define R_AARCH64_MOVW_GOTOFF_G2_NC   305
+#define R_AARCH64_MOVW_GOTOFF_G3      306
+/* GOT-relative data relocs */
+#define R_AARCH64_GOTREL64            307
+#define R_AARCH64_GOTREL32            308
+/* GOT-relative instr relocs */
+#define R_AARCH64_GOT_LD_PREL19       309
+#define R_AARCH64_LD64_GOTOFF_LO15    310
+#define R_AARCH64_ADR_GOT_PAGE        311
+#define R_AARCH64_LD64_GOT_LO12_NC    312
+#define R_AARCH64_LD64_GOTPAGE_LO15   313
+/* General Dynamic TLS relocations */
+#define R_AARCH64_TLSGD_ADR_PREL21            512
+#define R_AARCH64_TLSGD_ADR_PAGE21            513
+#define R_AARCH64_TLSGD_ADD_LO12_NC           514
+#define R_AARCH64_TLSGD_MOVW_G1               515
+#define R_AARCH64_TLSGD_MOVW_G0_NC            516
+/* Local Dynamic TLS relocations */
+#define R_AARCH64_TLSLD_ADR_PREL21            517
+#define R_AARCH64_TLSLD_ADR_PAGE21            518
+#define R_AARCH64_TLSLD_ADD_LO12_NC           519
+#define R_AARCH64_TLSLD_MOVW_G1               520
+#define R_AARCH64_TLSLD_MOVW_G0_NC            521
+#define R_AARCH64_TLSLD_LD_PREL19             522
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G2        523
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G1        524
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G1_NC     525
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G0        526
+#define R_AARCH64_TLSLD_MOVW_DTPREL_G0_NC     527
+#define R_AARCH64_TLSLD_ADD_DTPREL_HI12       528
+#define R_AARCH64_TLSLD_ADD_DTPREL_LO12       529
+#define R_AARCH64_TLSLD_ADD_DTPREL_LO12_NC    530
+#define R_AARCH64_TLSLD_LDST8_DTPREL_LO12     531
+#define R_AARCH64_TLSLD_LDST8_DTPREL_LO12_NC  532
+#define R_AARCH64_TLSLD_LDST16_DTPREL_LO12    533
+#define R_AARCH64_TLSLD_LDST16_DTPREL_LO12_NC 534
+#define R_AARCH64_TLSLD_LDST32_DTPREL_LO12    535
+#define R_AARCH64_TLSLD_LDST32_DTPREL_LO12_NC 536
+#define R_AARCH64_TLSLD_LDST64_DTPREL_LO12    537
+#define R_AARCH64_TLSLD_LDST64_DTPREL_LO12_NC 538
+/* initial exec TLS relocations */
+#define R_AARCH64_TLSIE_MOVW_GOTTPREL_G1      539
+#define R_AARCH64_TLSIE_MOVW_GOTTPREL_G0_NC   540
+#define R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21   541
+#define R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC 542
+#define R_AARCH64_TLSIE_LD_GOTTPREL_PREL19    543
+/* local exec TLS relocations */
+#define R_AARCH64_TLSLE_MOVW_TPREL_G2         544
+#define R_AARCH64_TLSLE_MOVW_TPREL_G1         545
+#define R_AARCH64_TLSLE_MOVW_TPREL_G1_NC      546
+#define R_AARCH64_TLSLE_MOVW_TPREL_G0         547
+#define R_AARCH64_TLSLE_MOVW_TPREL_G0_NC      548
+#define R_AARCH64_TLSLE_ADD_TPREL_HI12        549
+#define R_AARCH64_TLSLE_ADD_TPREL_LO12        550
+#define R_AARCH64_TLSLE_ADD_TPREL_LO12_NC     551
+#define R_AARCH64_TLSLE_LDST8_TPREL_LO12      552
+#define R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC   553
+#define R_AARCH64_TLSLE_LDST16_TPREL_LO12     554
+#define R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC  555
+#define R_AARCH64_TLSLE_LDST32_TPREL_LO12     556
+#define R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC  557
+#define R_AARCH64_TLSLE_LDST64_TPREL_LO12     558
+#define R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC  559
+/* Dynamic Relocations */
+#define R_AARCH64_COPY         1024
+#define R_AARCH64_GLOB_DAT     1025
+#define R_AARCH64_JUMP_SLOT    1026
+#define R_AARCH64_RELATIVE     1027
+#define R_AARCH64_TLS_DTPREL64 1028
+#define R_AARCH64_TLS_DTPMOD64 1029
+#define R_AARCH64_TLS_TPREL64  1030
+#define R_AARCH64_TLS_DTPREL32 1031
+#define R_AARCH64_TLS_DTPMOD32 1032
+#define R_AARCH64_TLS_TPREL32  1033
+
 /* s390 relocations defined by the ABIs */
 #define R_390_NONE		0	/* No reloc.  */
 #define R_390_8			1	/* Direct 8 bit.  */
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
  2013-05-23  8:14   ` [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
@ 2013-05-23  8:18   ` Claudio Fontana
  2013-05-23 16:29     ` Richard Henderson
                       ` (3 more replies)
  2013-05-23  8:19   ` [Qemu-devel] [PATCH 3/4] configure: permit compilation on arm aarch64 Claudio Fontana
                     ` (2 subsequent siblings)
  4 siblings, 4 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-23  8:18 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson


add preliminary support for TCG target aarch64.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 include/exec/exec-all.h  |    5 +-
 tcg/aarch64/tcg-target.c | 1185 ++++++++++++++++++++++++++++++++++++++++++++++
 tcg/aarch64/tcg-target.h |   99 ++++
 translate-all.c          |    2 +
 4 files changed, 1290 insertions(+), 1 deletion(-)
 create mode 100644 tcg/aarch64/tcg-target.c
 create mode 100644 tcg/aarch64/tcg-target.h

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 6362074..5c31863 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -128,7 +128,7 @@ static inline void tlb_flush(CPUArchState *env, int flush_global)
 
 #if defined(__arm__) || defined(_ARCH_PPC) \
     || defined(__x86_64__) || defined(__i386__) \
-    || defined(__sparc__) \
+    || defined(__sparc__) || defined(__aarch64__) \
     || defined(CONFIG_TCG_INTERPRETER)
 #define USE_DIRECT_JUMP
 #endif
@@ -230,6 +230,9 @@ static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
     *(uint32_t *)jmp_addr = addr - (jmp_addr + 4);
     /* no need to flush icache explicitly */
 }
+#elif defined(__aarch64__)
+void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
+#define tb_set_jmp_target1 aarch64_tb_set_jmp_target
 #elif defined(__arm__)
 static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
 {
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
new file mode 100644
index 0000000..da859c7
--- /dev/null
+++ b/tcg/aarch64/tcg-target.c
@@ -0,0 +1,1185 @@
+/*
+ * Initial TCG Implementation for aarch64
+ *
+ * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
+ * Written by Claudio Fontana
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.
+ *
+ * See the COPYING file in the top-level directory for details.
+ */
+
+#ifdef TARGET_WORDS_BIGENDIAN
+#error "Sorry, bigendian target not supported yet."
+#endif /* TARGET_WORDS_BIGENDIAN */
+
+#ifndef NDEBUG
+static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
+    "%x0", "%x1", "%x2", "%x3", "%x4", "%x5", "%x6", "%x7",
+    "%x8", "%x9", "%x10", "%x11", "%x12", "%x13", "%x14", "%x15",
+    "%x16", "%x17", "%x18", "%x19", "%x20", "%x21", "%x22", "%x23",
+    "%x24", "%x25", "%x26", "%x27", "%x28",
+    "%fp", /* frame pointer */
+    "%lr", /* link register */
+    "%sp",  /* stack pointer */
+};
+#endif /* NDEBUG */
+
+static const int tcg_target_reg_alloc_order[] = {
+    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23,
+    TCG_REG_X24, TCG_REG_X25, TCG_REG_X26, TCG_REG_X27,
+    TCG_REG_X28,
+
+    TCG_REG_X9, TCG_REG_X10, TCG_REG_X11, TCG_REG_X12,
+    TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
+    TCG_REG_X16, TCG_REG_X17,
+
+    TCG_REG_X18, TCG_REG_X19, /* will not use these, see tcg_target_init */
+
+    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
+    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
+
+    TCG_REG_X8, /* will not use, see tcg_target_init */
+};
+
+static const int tcg_target_call_iarg_regs[8] = {
+    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
+    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7
+};
+static const int tcg_target_call_oarg_regs[1] = {
+    TCG_REG_X0
+};
+
+#define TCG_REG_TMP TCG_REG_X8
+
+static inline void reloc_pc26(void *code_ptr, tcg_target_long target)
+{
+    tcg_target_long offset; uint32_t insn;
+    offset = (target - (tcg_target_long)code_ptr) / 4;
+    offset &= 0x03ffffff;
+    /* read instruction, mask away previous PC_REL26 parameter contents,
+       set the proper offset, then write back the instruction. */
+    insn = *(uint32_t *)code_ptr;
+    insn = (insn & 0xfc000000) | offset;
+    *(uint32_t *)code_ptr = insn;
+}
+
+static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
+{
+    tcg_target_long offset; uint32_t insn;
+    offset = (target - (tcg_target_long)code_ptr) / 4;
+    offset &= 0x07ffff;
+    /* read instruction, mask away previous PC_REL19 parameter contents,
+       set the proper offset, then write back the instruction. */
+    insn = *(uint32_t *)code_ptr;
+    insn = (insn & 0xff00001f) | offset << 5; /* lower 5 bits = condition */
+    *(uint32_t *)code_ptr = insn;
+}
+
+static inline void patch_reloc(uint8_t *code_ptr, int type,
+                               tcg_target_long value, tcg_target_long addend)
+{
+    switch (type) {
+    case R_AARCH64_JUMP26:
+    case R_AARCH64_CALL26:
+        reloc_pc26(code_ptr, value);
+        break;
+    case R_AARCH64_CONDBR19:
+        reloc_pc19(code_ptr, value);
+        break;
+
+    default:
+        tcg_abort();
+    }
+}
+
+/* parse target specific constraints */
+static int target_parse_constraint(TCGArgConstraint *ct,
+                                   const char **pct_str)
+{
+    const char *ct_str = *pct_str;
+
+    switch (ct_str[0]) {
+    case 'r':
+        ct->ct |= TCG_CT_REG;
+        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
+        break;
+    case 'l': /* qemu_ld / qemu_st address, data_reg */
+        ct->ct |= TCG_CT_REG;
+        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
+#ifdef CONFIG_SOFTMMU
+        /* x0 and x1 will be overwritten when reading the tlb entry,
+           and x2, and x3 for helper args, better to avoid using them. */
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X0);
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X1);
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X2);
+        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X3);
+#endif
+        break;
+    default:
+        return -1;
+    }
+
+    ct_str++;
+    *pct_str = ct_str;
+    return 0;
+}
+
+static inline int tcg_target_const_match(tcg_target_long val,
+                                         const TCGArgConstraint *arg_ct)
+{
+    int ct = arg_ct->ct;
+
+    if (ct & TCG_CT_CONST) {
+        return 1;
+    }
+
+    return 0;
+}
+
+enum aarch64_cond_code {
+    COND_EQ = 0x0,
+    COND_NE = 0x1,
+    COND_CS = 0x2,     /* Unsigned greater or equal */
+    COND_HS = COND_CS, /* ALIAS greater or equal */
+    COND_CC = 0x3,     /* Unsigned less than */
+    COND_LO = COND_CC, /* ALIAS Lower */
+    COND_MI = 0x4,     /* Negative */
+    COND_PL = 0x5,     /* Zero or greater */
+    COND_VS = 0x6,     /* Overflow */
+    COND_VC = 0x7,     /* No overflow */
+    COND_HI = 0x8,     /* Unsigned greater than */
+    COND_LS = 0x9,     /* Unsigned less or equal */
+    COND_GE = 0xa,
+    COND_LT = 0xb,
+    COND_GT = 0xc,
+    COND_LE = 0xd,
+    COND_AL = 0xe,
+    COND_NV = 0xf, /* behaves like COND_AL here */
+};
+
+static const enum aarch64_cond_code tcg_cond_to_aarch64[] = {
+    [TCG_COND_EQ] = COND_EQ,
+    [TCG_COND_NE] = COND_NE,
+    [TCG_COND_LT] = COND_LT,
+    [TCG_COND_GE] = COND_GE,
+    [TCG_COND_LE] = COND_LE,
+    [TCG_COND_GT] = COND_GT,
+    /* unsigned */
+    [TCG_COND_LTU] = COND_LO,
+    [TCG_COND_GTU] = COND_HI,
+    [TCG_COND_GEU] = COND_HS,
+    [TCG_COND_LEU] = COND_LS,
+};
+
+/* opcodes for LDR / STR instructions with base + simm9 addressing */
+enum aarch64_ldst_op_data { /* size of the data moved */
+    LDST_8 = 0x38,
+    LDST_16 = 0x78,
+    LDST_32 = 0xb8,
+    LDST_64 = 0xf8,
+};
+enum aarch64_ldst_op_type { /* type of operation */
+    LDST_ST = 0x0,    /* store */
+    LDST_LD = 0x4,    /* load */
+    LDST_LD_S_X = 0x8,  /* load and sign-extend into Xt */
+    LDST_LD_S_W = 0xc,  /* load and sign-extend into Wt */
+};
+
+enum aarch64_arith_opc {
+    ARITH_ADD = 0x0b,
+    ARITH_SUB = 0x4b,
+    ARITH_AND = 0x0a,
+    ARITH_OR = 0x2a,
+    ARITH_XOR = 0x4a
+};
+
+enum aarch64_srr_opc {
+    SRR_SHL = 0x0,
+    SRR_SHR = 0x4,
+    SRR_SAR = 0x8,
+    SRR_ROR = 0xc
+};
+
+static inline enum aarch64_ldst_op_data
+aarch64_ldst_get_data(TCGOpcode tcg_op)
+{
+    switch (tcg_op) {
+    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
+    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
+    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
+        return LDST_8;
+
+    case INDEX_op_ld16u_i32: case INDEX_op_ld16s_i32:
+    case INDEX_op_ld16u_i64: case INDEX_op_ld16s_i64:
+    case INDEX_op_st16_i32: case INDEX_op_st16_i64:
+        return LDST_16;
+
+    case INDEX_op_ld_i32: case INDEX_op_st_i32:
+    case INDEX_op_ld32u_i64: case INDEX_op_ld32s_i64:
+    case INDEX_op_st32_i64:
+        return LDST_32;
+
+    case INDEX_op_ld_i64: case INDEX_op_st_i64:
+        return LDST_64;
+
+    default:
+        tcg_abort();
+    }
+}
+
+static inline enum aarch64_ldst_op_type
+aarch64_ldst_get_type(TCGOpcode tcg_op)
+{
+    switch (tcg_op) {
+    case INDEX_op_st8_i32: case INDEX_op_st16_i32:
+    case INDEX_op_st8_i64: case INDEX_op_st16_i64:
+    case INDEX_op_st_i32:
+    case INDEX_op_st32_i64:
+    case INDEX_op_st_i64:
+        return LDST_ST;
+
+    case INDEX_op_ld8u_i32: case INDEX_op_ld16u_i32:
+    case INDEX_op_ld8u_i64: case INDEX_op_ld16u_i64:
+    case INDEX_op_ld_i32:
+    case INDEX_op_ld32u_i64:
+    case INDEX_op_ld_i64:
+        return LDST_LD;
+
+    case INDEX_op_ld8s_i32: case INDEX_op_ld16s_i32:
+        return LDST_LD_S_W;
+
+    case INDEX_op_ld8s_i64: case INDEX_op_ld16s_i64:
+    case INDEX_op_ld32s_i64:
+        return LDST_LD_S_X;
+
+    default:
+        tcg_abort();
+    }
+}
+
+static inline uint32_t tcg_in32(TCGContext *s)
+{
+    uint32_t v = *(uint32_t *)s->code_ptr;
+    return v;
+}
+
+static inline void tcg_out_ldst_9(TCGContext *s,
+                                  enum aarch64_ldst_op_data op_data,
+                                  enum aarch64_ldst_op_type op_type,
+                                  int rd, int rn, tcg_target_long offset)
+{
+    /* use LDUR with BASE register with 9bit signed unscaled offset */
+    unsigned int mod, off;
+
+    if (offset < 0) {
+        off = (256 + offset);
+        mod = 0x1;
+
+    } else {
+        off = offset;
+        mod = 0x0;
+    }
+
+    mod |= op_type;
+    tcg_out32(s, op_data << 24 | mod << 20 | off << 12 | rn << 5 | rd);
+}
+
+static inline void tcg_out_movr(TCGContext *s, int ext, int rd, int source)
+{
+    /* register to register move using MOV (shifted register with no shift) */
+    /* using MOV 0x2a0003e0 | (shift).. */
+    unsigned int base = ext ? 0xaa0003e0 : 0x2a0003e0;
+    tcg_out32(s, base | source << 16 | rd);
+}
+
+static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
+                                  uint32_t value)
+{
+    uint32_t half, base, movk = 0;
+    if (!value) {
+        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
+        return;
+    }
+    /* construct halfwords of the immediate with MOVZ with LSL */
+    /* using MOVZ 0x52800000 | extended reg.. */
+    base = ext ? 0xd2800000 : 0x52800000;
+
+    half = value & 0xffff;
+    if (half) {
+        tcg_out32(s, base | half << 5 | rd);
+        movk = 0x20000000; /* morph next MOVZ into MOVK */
+    }
+
+    half = value >> 16;
+    if (half) { /* add shift 0x00200000. Op can be MOVZ or MOVK */
+        tcg_out32(s, base | movk | 0x00200000 | half << 5 | rd);
+    }
+}
+
+static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
+{
+    uint32_t half, base, movk = 0, shift = 0;
+    if (!value) {
+        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
+        return;
+    }
+    /* construct halfwords of the immediate with MOVZ with LSL */
+    /* using MOVZ 0x52800000 | extended reg.. */
+    base = 0xd2800000;
+
+    while (value) {
+        half = value & 0xffff;
+        if (half) {
+            /* Op can be MOVZ or MOVK */
+            tcg_out32(s, base | movk | shift | half << 5 | rd);
+            if (!movk) {
+                movk = 0x20000000; /* morph next MOVZs into MOVKs */
+            }
+        }
+        value >>= 16;
+        shift += 0x00200000;
+    }
+}
+
+static inline void tcg_out_ldst_r(TCGContext *s,
+                                  enum aarch64_ldst_op_data op_data,
+                                  enum aarch64_ldst_op_type op_type,
+                                  int rd, int base, int regoff)
+{
+    /* load from memory to register using base + 64bit register offset */
+    /* using f.e. STR Wt, [Xn, Xm] 0xb8600800|(regoff << 16)|(base << 5)|rd */
+    /* the 0x6000 is for the "no extend field" */
+    tcg_out32(s, 0x00206800
+              | op_data << 24 | op_type << 20 | regoff << 16 | base << 5 | rd);
+}
+
+/* solve the whole ldst problem */
+static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
+                                enum aarch64_ldst_op_type type,
+                                int rd, int rn, tcg_target_long offset)
+{
+    if (offset > -256 && offset < 256) {
+        tcg_out_ldst_9(s, data, type, rd, rn, offset);
+
+    } else {
+        tcg_out_movi64(s, TCG_REG_TMP, offset);
+        tcg_out_ldst_r(s, data, type, rd, rn, TCG_REG_TMP);
+    }
+}
+
+static inline void tcg_out_movi(TCGContext *s, TCGType type,
+                                TCGReg rd, tcg_target_long value)
+{
+    if (type == TCG_TYPE_I64) {
+        tcg_out_movi64(s, rd, value);
+    } else {
+        tcg_out_movi32(s, 0, rd, value);
+    }
+}
+
+/* mov alias implemented with add immediate, useful to move to/from SP */
+static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
+{
+    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
+    unsigned int base = ext ? 0x91000000 : 0x11000000;
+    tcg_out32(s, base | rn << 5 | rd);
+}
+
+static inline void tcg_out_mov(TCGContext *s,
+                               TCGType type, TCGReg ret, TCGReg arg)
+{
+    if (ret != arg) {
+        tcg_out_movr(s, type == TCG_TYPE_I64, ret, arg);
+    }
+}
+
+static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg arg,
+                              TCGReg arg1, tcg_target_long arg2)
+{
+    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_LD,
+                 arg, arg1, arg2);
+}
+
+static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
+                              TCGReg arg1, tcg_target_long arg2)
+{
+    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_ST,
+                 arg, arg1, arg2);
+}
+
+static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
+                                 int ext, int rd, int rn, int rm)
+{
+    /* Using shifted register arithmetic operations */
+    /* if extended registry operation (64bit) just or with 0x80 << 24 */
+    unsigned int base = ext ? (0x80 | opc) << 24 : opc << 24;
+    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
+}
+
+static inline void tcg_out_mul(TCGContext *s, int ext, int rd, int rn, int rm)
+{
+    /* Using MADD 0x1b000000 with Ra = wzr alias MUL 0x1b007c00 */
+    unsigned int base = ext ? 0x9b007c00 : 0x1b007c00;
+    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
+}
+
+static inline void tcg_out_shiftrot_reg(TCGContext *s,
+                                        enum aarch64_srr_opc opc, int ext,
+                                        int rd, int rn, int rm)
+{
+    /* using 2-source data processing instructions 0x1ac02000 */
+    unsigned int base = ext ? 0x9ac02000 : 0x1ac02000;
+    tcg_out32(s, base | rm << 16 | opc << 8 | rn << 5 | rd);
+}
+
+static inline void tcg_out_ubfm(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int a, unsigned int b)
+{
+    /* Using UBFM 0x53000000 Wd, Wn, a, b - ext encoding requires the 0x4 */
+    unsigned int base = ext ? 0xd3400000 : 0x53000000;
+    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
+}
+
+static inline void tcg_out_sbfm(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int a, unsigned int b)
+{
+    /* Using SBFM 0x13000000 Wd, Wn, a, b - ext encoding requires the 0x4 */
+    unsigned int base = ext ? 0x93400000 : 0x13000000;
+    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
+}
+
+static inline void tcg_out_extr(TCGContext *s, int ext,
+                                int rd, int rn, int rm, unsigned int a)
+{
+    /* Using EXTR 0x13800000 Wd, Wn, Wm, a - ext encoding requires the 0x4 */
+    unsigned int base = ext ? 0x93c00000 : 0x13800000;
+    tcg_out32(s, base | rm << 16 | a << 10 | rn << 5 | rd);
+}
+
+static inline void tcg_out_shl(TCGContext *s, int ext,
+                               int rd, int rn, unsigned int m)
+{
+    int bits, max;
+    bits = ext ? 64 : 32;
+    max = bits - 1;
+    tcg_out_ubfm(s, ext, rd, rn, bits - (m & max), max - (m & max));
+}
+
+static inline void tcg_out_shr(TCGContext *s, int ext,
+                               int rd, int rn, unsigned int m)
+{
+    int max = ext ? 63 : 31;
+    tcg_out_ubfm(s, ext, rd, rn, m & max, max);
+}
+
+static inline void tcg_out_sar(TCGContext *s, int ext,
+                               int rd, int rn, unsigned int m)
+{
+    int max = ext ? 63 : 31;
+    tcg_out_sbfm(s, ext, rd, rn, m & max, max);
+}
+
+static inline void tcg_out_rotr(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int m)
+{
+    int max = ext ? 63 : 31;
+    tcg_out_extr(s, ext, rd, rn, rn, m & max);
+}
+
+static inline void tcg_out_rotl(TCGContext *s, int ext,
+                                int rd, int rn, unsigned int m)
+{
+    int bits, max;
+    bits = ext ? 64 : 32;
+    max = bits - 1;
+    tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
+}
+
+static inline void tcg_out_cmp(TCGContext *s, int ext,
+                               int rn, int rm)
+{
+    /* Using CMP alias SUBS wzr, Wn, Wm */
+    unsigned int base = ext ? 0xeb00001f : 0x6b00001f;
+    tcg_out32(s, base | rm << 16 | rn << 5);
+}
+
+static inline void tcg_out_cset(TCGContext *s, int ext,
+                                int rd, TCGCond c)
+{
+    /* Using CSET alias of CSINC 0x1a800400 Xd, XZR, XZR, invert(cond) */
+    unsigned int base = ext ? 0x9a9f07e0 : 0x1a9f07e0;
+    tcg_out32(s, base | tcg_cond_to_aarch64[tcg_invert_cond(c)] << 12 | rd);
+}
+
+static inline void tcg_out_goto(TCGContext *s, tcg_target_long target)
+{
+    tcg_target_long offset;
+    offset = (target - (tcg_target_long)s->code_ptr) / 4;
+
+    if (offset <= -0x02000000 || offset >= 0x02000000) {
+        /* out of 26bit range */
+        tcg_abort();
+    }
+
+    tcg_out32(s, 0x14000000 | (offset & 0x03ffffff));
+}
+
+static inline void tcg_out_goto_noaddr(TCGContext *s)
+{
+    /* We pay attention here to not modify the branch target by
+       reading from the buffer. This ensure that caches and memory are
+       kept coherent during retranslation.
+       Mask away possible garbage in the high bits for the first translation,
+       while keeping the offset bits for retranslation. */
+    uint32_t insn;
+    insn = (tcg_in32(s) & 0x03ffffff) | 0x14000000;
+    tcg_out32(s, insn);
+}
+
+static inline void tcg_out_goto_cond_noaddr(TCGContext *s, TCGCond c)
+{
+    /* see comments in tcg_out_goto_noaddr */
+    uint32_t insn;
+    insn = tcg_in32(s) & (0x07ffff << 5);
+    insn |= 0x54000000 | tcg_cond_to_aarch64[c];
+    tcg_out32(s, insn);
+}
+
+static inline void tcg_out_goto_cond(TCGContext *s, TCGCond c,
+                                     tcg_target_long target)
+{
+    tcg_target_long offset;
+    offset = (target - (tcg_target_long)s->code_ptr) / 4;
+
+    if (offset <= -0x3ffff || offset >= 0x3ffff) {
+        /* out of 19bit range */
+        tcg_abort();
+    }
+
+    offset &= 0x7ffff;
+    tcg_out32(s, 0x54000000 | tcg_cond_to_aarch64[c] | offset << 5);
+}
+
+static inline void tcg_out_callr(TCGContext *s, int reg)
+{
+    tcg_out32(s, 0xd63f0000 | reg << 5);
+}
+
+static inline void tcg_out_gotor(TCGContext *s, int reg)
+{
+    tcg_out32(s, 0xd61f0000 | reg << 5);
+}
+
+static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
+{
+    tcg_target_long offset;
+
+    offset = (target - (tcg_target_long)s->code_ptr) / 4;
+
+    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
+        tcg_out_movi64(s, TCG_REG_TMP, target);
+        tcg_out_callr(s, TCG_REG_TMP);
+
+    } else {
+        tcg_out32(s, 0x94000000 | (offset & 0x03ffffff));
+    }
+}
+
+/* test a register against a bit pattern made of pattern_n repeated 1s.
+   For example, to test against 0111b (0x07), pass pattern_n = 3 */
+static inline void tcg_out_tst(TCGContext *s, int ext, int rn,
+                               tcg_target_ulong pattern_n)
+{
+    /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f. Ext requires 4. */
+    unsigned int base = ext ? 0xf240001f : 0x7200001f;
+    tcg_out32(s, base | (pattern_n - 1) << 10 | rn << 5);
+}
+
+static inline void tcg_out_ret(TCGContext *s)
+{
+    /* emit RET { LR } */
+    tcg_out32(s, 0xd65f03c0);
+}
+
+void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
+{
+    tcg_target_long target, offset;
+    target = (tcg_target_long)addr;
+    offset = (target - (tcg_target_long)jmp_addr) / 4;
+
+    if (offset <= -0x02000000 || offset >= 0x02000000) {
+        /* out of 26bit range */
+        tcg_abort();
+    }
+
+    patch_reloc((uint8_t *)jmp_addr, R_AARCH64_JUMP26, target, 0);
+    flush_icache_range(jmp_addr, jmp_addr + 4);
+}
+
+static inline void tcg_out_goto_label(TCGContext *s, int label_index)
+{
+    TCGLabel *l = &s->labels[label_index];
+
+    if (!l->has_value) {
+        tcg_out_reloc(s, s->code_ptr, R_AARCH64_JUMP26, label_index, 0);
+        tcg_out_goto_noaddr(s);
+
+    } else {
+        tcg_out_goto(s, l->u.value);
+    }
+}
+
+static inline void tcg_out_goto_label_cond(TCGContext *s,
+                                           TCGCond c, int label_index)
+{
+    TCGLabel *l = &s->labels[label_index];
+
+    if (!l->has_value) {
+        tcg_out_reloc(s, s->code_ptr, R_AARCH64_CONDBR19, label_index, 0);
+        tcg_out_goto_cond_noaddr(s, c);
+
+    } else {
+        tcg_out_goto_cond(s, c, l->u.value);
+    }
+}
+
+#ifdef CONFIG_SOFTMMU
+#include "exec/softmmu_defs.h"
+
+/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
+   int mmu_idx) */
+static const void * const qemu_ld_helpers[4] = {
+    helper_ldb_mmu,
+    helper_ldw_mmu,
+    helper_ldl_mmu,
+    helper_ldq_mmu,
+};
+
+/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
+   uintxx_t val, int mmu_idx) */
+static const void * const qemu_st_helpers[4] = {
+    helper_stb_mmu,
+    helper_stw_mmu,
+    helper_stl_mmu,
+    helper_stq_mmu,
+};
+
+#endif /* CONFIG_SOFTMMU */
+
+static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
+{
+    int addr_reg, data_reg;
+#ifdef CONFIG_SOFTMMU
+    int mem_index, s_bits;
+#endif
+    data_reg = args[0];
+    addr_reg = args[1];
+
+#ifdef CONFIG_SOFTMMU
+    mem_index = args[2];
+    s_bits = opc & 3;
+
+    /* TODO: insert TLB lookup here */
+
+#  if CPU_TLB_BITS > 8
+#   error "CPU_TLB_BITS too large"
+#  endif
+
+    /* all arguments passed via registers */
+    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
+    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
+    tcg_out_movi32(s, 0, TCG_REG_X2, mem_index);
+
+    tcg_out_movi64(s, TCG_REG_TMP, (uint64_t)qemu_ld_helpers[s_bits]);
+    tcg_out_callr(s, TCG_REG_TMP);
+
+    if (opc & 0x04) { /* sign extend */
+        unsigned int bits; bits = 8 * (1 << s_bits) - 1;
+        tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
+
+    } else {
+        tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
+    }
+
+#else /* !CONFIG_SOFTMMU */
+    tcg_abort(); /* TODO */
+#endif
+}
+
+static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
+{
+    int addr_reg, data_reg;
+#ifdef CONFIG_SOFTMMU
+    int mem_index, s_bits;
+#endif
+    data_reg = args[0];
+    addr_reg = args[1];
+
+#ifdef CONFIG_SOFTMMU
+    mem_index = args[2];
+    s_bits = opc & 3;
+
+    /* TODO: here we should generate something like the following:
+     *  shr x8, addr_reg, #TARGET_PAGE_BITS
+     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
+     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
+     *  test ... XXX
+     */
+#  if CPU_TLB_BITS > 8
+#   error "CPU_TLB_BITS too large"
+#  endif
+
+    /* all arguments passed via registers */
+    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
+    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
+    tcg_out_movr(s, 1, TCG_REG_X2, data_reg);
+    tcg_out_movi32(s, 0, TCG_REG_X3, mem_index);
+
+    tcg_out_movi64(s, TCG_REG_TMP, (uint64_t)qemu_st_helpers[s_bits]);
+    tcg_out_callr(s, TCG_REG_TMP);
+
+#else /* !CONFIG_SOFTMMU */
+    tcg_abort(); /* TODO */
+#endif
+}
+
+static uint8_t *tb_ret_addr;
+
+/* callee stack use example:
+   stp     x29, x30, [sp,#-32]!
+   mov     x29, sp
+   stp     x1, x2, [sp,#16]
+   ...
+   ldp     x1, x2, [sp,#16]
+   ldp     x29, x30, [sp],#32
+   ret
+*/
+
+/* push r1 and r2, and alloc stack space for a total of
+   alloc_n elements (1 element=16 bytes, must be between 1 and 31. */
+static inline void tcg_out_push_pair(TCGContext *s,
+                                     TCGReg r1, TCGReg r2, int alloc_n)
+{
+    /* using indexed scaled simm7 STP 0x28800000 | (ext) | 0x01000000 (pre-idx)
+       | alloc_n * (-1) << 16 | r2 << 10 | sp(31) << 5 | r1 */
+    assert(alloc_n > 0 && alloc_n < 0x20);
+    alloc_n = (-alloc_n) & 0x3f;
+    tcg_out32(s, 0xa98003e0 | alloc_n << 16 | r2 << 10 | r1);
+}
+
+/* dealloc stack space for a total of alloc_n elements and pop r1, r2.  */
+static inline void tcg_out_pop_pair(TCGContext *s,
+                                 TCGReg r1, TCGReg r2, int alloc_n)
+{
+    /* using indexed scaled simm7 LDP 0x28c00000 | (ext) | nothing (post-idx)
+       | alloc_n << 16 | r2 << 10 | sp(31) << 5 | r1 */
+    assert(alloc_n > 0 && alloc_n < 0x20);
+    tcg_out32(s, 0xa8c003e0 | alloc_n << 16 | r2 << 10 | r1);
+}
+
+static inline void tcg_out_store_pair(TCGContext *s,
+                                      TCGReg r1, TCGReg r2, int idx)
+{
+    /* using register pair offset simm7 STP 0x29000000 | (ext)
+       | idx << 16 | r2 << 10 | fp(29) << 5 | r1 */
+    assert(idx > 0 && idx < 0x20);
+    tcg_out32(s, 0xa90003a0 | idx << 16 | r2 << 10 | r1);
+}
+
+static inline void tcg_out_load_pair(TCGContext *s,
+                                     TCGReg r1, TCGReg r2, int idx)
+{
+    /* using register pair offset simm7 LDP 0x29400000 | (ext)
+       | idx << 16 | r2 << 10 | fp(29) << 5 | r1 */
+    assert(idx > 0 && idx < 0x20);
+    tcg_out32(s, 0xa94003a0 | idx << 16 | r2 << 10 | r1);
+}
+
+static void tcg_out_op(TCGContext *s, TCGOpcode opc,
+                       const TCGArg *args, const int *const_args)
+{
+    int ext = 0;
+
+    switch (opc) {
+    case INDEX_op_exit_tb:
+        tcg_out_movi64(s, TCG_REG_X0, args[0]); /* load retval in X0 */
+        tcg_out_goto(s, (tcg_target_long)tb_ret_addr);
+        break;
+
+    case INDEX_op_goto_tb:
+#ifndef USE_DIRECT_JUMP
+#error "USE_DIRECT_JUMP required for aarch64"
+#endif
+        assert(s->tb_jmp_offset != NULL); /* consistency for USE_DIRECT_JUMP */
+        s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
+        /* actual branch destination will be patched by
+           aarch64_tb_set_jmp_target later, beware retranslation. */
+        tcg_out_goto_noaddr(s);
+        s->tb_next_offset[args[0]] = s->code_ptr - s->code_buf;
+        break;
+
+    case INDEX_op_call:
+        if (const_args[0]) {
+            tcg_out_call(s, args[0]);
+        } else {
+            tcg_out_callr(s, args[0]);
+        }
+        break;
+
+    case INDEX_op_br:
+        tcg_out_goto_label(s, args[0]);
+        break;
+
+    case INDEX_op_ld_i32:
+    case INDEX_op_ld_i64:
+    case INDEX_op_st_i32:
+    case INDEX_op_st_i64:
+    case INDEX_op_ld8u_i32:
+    case INDEX_op_ld8s_i32:
+    case INDEX_op_ld16u_i32:
+    case INDEX_op_ld16s_i32:
+    case INDEX_op_ld8u_i64:
+    case INDEX_op_ld8s_i64:
+    case INDEX_op_ld16u_i64:
+    case INDEX_op_ld16s_i64:
+    case INDEX_op_ld32u_i64:
+    case INDEX_op_ld32s_i64:
+    case INDEX_op_st8_i32:
+    case INDEX_op_st8_i64:
+    case INDEX_op_st16_i32:
+    case INDEX_op_st16_i64:
+    case INDEX_op_st32_i64:
+        tcg_out_ldst(s, aarch64_ldst_get_data(opc), aarch64_ldst_get_type(opc),
+                     args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_mov_i64: ext = 1;
+    case INDEX_op_mov_i32:
+        tcg_out_movr(s, ext, args[0], args[1]);
+        break;
+
+    case INDEX_op_movi_i64:
+        tcg_out_movi64(s, args[0], args[1]);
+        break;
+
+    case INDEX_op_movi_i32:
+        tcg_out_movi32(s, 0, args[0], args[1]);
+        break;
+
+    case INDEX_op_add_i64: ext = 1;
+    case INDEX_op_add_i32:
+        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_sub_i64: ext = 1;
+    case INDEX_op_sub_i32:
+        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_and_i64: ext = 1;
+    case INDEX_op_and_i32:
+        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_or_i64: ext = 1;
+    case INDEX_op_or_i32:
+        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_xor_i64: ext = 1;
+    case INDEX_op_xor_i32:
+        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_mul_i64: ext = 1;
+    case INDEX_op_mul_i32:
+        tcg_out_mul(s, ext, args[0], args[1], args[2]);
+        break;
+
+    case INDEX_op_shl_i64: ext = 1;
+    case INDEX_op_shl_i32:
+        if (const_args[2]) {    /* LSL / UBFM Wd, Wn, (32 - m) */
+            tcg_out_shl(s, ext, args[0], args[1], args[2]);
+        } else {                /* LSL / LSLV */
+            tcg_out_shiftrot_reg(s, SRR_SHL, ext, args[0], args[1], args[2]);
+        }
+        break;
+
+    case INDEX_op_shr_i64: ext = 1;
+    case INDEX_op_shr_i32:
+        if (const_args[2]) {    /* LSR / UBFM Wd, Wn, m, 31 */
+            tcg_out_shr(s, ext, args[0], args[1], args[2]);
+        } else {                /* LSR / LSRV */
+            tcg_out_shiftrot_reg(s, SRR_SHR, ext, args[0], args[1], args[2]);
+        }
+        break;
+
+    case INDEX_op_sar_i64: ext = 1;
+    case INDEX_op_sar_i32:
+        if (const_args[2]) {    /* ASR / SBFM Wd, Wn, m, 31 */
+            tcg_out_sar(s, ext, args[0], args[1], args[2]);
+        } else {                /* ASR / ASRV */
+            tcg_out_shiftrot_reg(s, SRR_SAR, ext, args[0], args[1], args[2]);
+        }
+        break;
+
+    case INDEX_op_rotr_i64: ext = 1;
+    case INDEX_op_rotr_i32:
+        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, m */
+            tcg_out_rotr(s, ext, args[0], args[1], args[2]);
+        } else {                /* ROR / RORV */
+            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
+        }
+        break;
+
+    case INDEX_op_rotl_i64: ext = 1;
+    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
+        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, 32 - m */
+            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
+        } else {
+            tcg_out_arith(s, ARITH_SUB, ext, args[2], TCG_REG_XZR, args[2]);
+            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
+        }
+        break;
+
+    case INDEX_op_brcond_i64: ext = 1;
+    case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
+        tcg_out_cmp(s, ext, args[0], args[1]);
+        tcg_out_goto_label_cond(s, args[2], args[3]);
+        break;
+
+    case INDEX_op_setcond_i64: ext = 1;
+    case INDEX_op_setcond_i32:
+        tcg_out_cmp(s, ext, args[1], args[2]);
+        tcg_out_cset(s, ext, args[0], args[3]);
+        break;
+
+    case INDEX_op_qemu_ld8u:
+        tcg_out_qemu_ld(s, args, 0 | 0);
+        break;
+    case INDEX_op_qemu_ld8s:
+        tcg_out_qemu_ld(s, args, 4 | 0);
+        break;
+    case INDEX_op_qemu_ld16u:
+        tcg_out_qemu_ld(s, args, 0 | 1);
+        break;
+    case INDEX_op_qemu_ld16s:
+        tcg_out_qemu_ld(s, args, 4 | 1);
+        break;
+    case INDEX_op_qemu_ld32u:
+        tcg_out_qemu_ld(s, args, 0 | 2);
+        break;
+    case INDEX_op_qemu_ld32s:
+        tcg_out_qemu_ld(s, args, 4 | 2);
+        break;
+    case INDEX_op_qemu_ld32:
+        tcg_out_qemu_ld(s, args, 0 | 2);
+        break;
+    case INDEX_op_qemu_ld64:
+        tcg_out_qemu_ld(s, args, 0 | 3);
+        break;
+    case INDEX_op_qemu_st8:
+        tcg_out_qemu_st(s, args, 0);
+        break;
+    case INDEX_op_qemu_st16:
+        tcg_out_qemu_st(s, args, 1);
+        break;
+    case INDEX_op_qemu_st32:
+        tcg_out_qemu_st(s, args, 2);
+        break;
+    case INDEX_op_qemu_st64:
+        tcg_out_qemu_st(s, args, 3);
+        break;
+
+    default:
+        tcg_abort(); /* opcode not implemented */
+    }
+}
+
+static const TCGTargetOpDef aarch64_op_defs[] = {
+    { INDEX_op_exit_tb, { } },
+    { INDEX_op_goto_tb, { } },
+    { INDEX_op_call, { "ri" } },
+    { INDEX_op_br, { } },
+
+    { INDEX_op_mov_i32, { "r", "r" } },
+    { INDEX_op_mov_i64, { "r", "r" } },
+
+    { INDEX_op_movi_i32, { "r" } },
+    { INDEX_op_movi_i64, { "r" } },
+
+    { INDEX_op_ld8u_i32, { "r", "r" } },
+    { INDEX_op_ld8s_i32, { "r", "r" } },
+    { INDEX_op_ld16u_i32, { "r", "r" } },
+    { INDEX_op_ld16s_i32, { "r", "r" } },
+    { INDEX_op_ld_i32, { "r", "r" } },
+    { INDEX_op_ld8u_i64, { "r", "r" } },
+    { INDEX_op_ld8s_i64, { "r", "r" } },
+    { INDEX_op_ld16u_i64, { "r", "r" } },
+    { INDEX_op_ld16s_i64, { "r", "r" } },
+    { INDEX_op_ld32u_i64, { "r", "r" } },
+    { INDEX_op_ld32s_i64, { "r", "r" } },
+    { INDEX_op_ld_i64, { "r", "r" } },
+
+    { INDEX_op_st8_i32, { "r", "r" } },
+    { INDEX_op_st16_i32, { "r", "r" } },
+    { INDEX_op_st_i32, { "r", "r" } },
+    { INDEX_op_st8_i64, { "r", "r" } },
+    { INDEX_op_st16_i64, { "r", "r" } },
+    { INDEX_op_st32_i64, { "r", "r" } },
+    { INDEX_op_st_i64, { "r", "r" } },
+
+    { INDEX_op_add_i32, { "r", "r", "r" } },
+    { INDEX_op_add_i64, { "r", "r", "r" } },
+    { INDEX_op_sub_i32, { "r", "r", "r" } },
+    { INDEX_op_sub_i64, { "r", "r", "r" } },
+    { INDEX_op_mul_i32, { "r", "r", "r" } },
+    { INDEX_op_mul_i64, { "r", "r", "r" } },
+    { INDEX_op_and_i32, { "r", "r", "r" } },
+    { INDEX_op_and_i64, { "r", "r", "r" } },
+    { INDEX_op_or_i32, { "r", "r", "r" } },
+    { INDEX_op_or_i64, { "r", "r", "r" } },
+    { INDEX_op_xor_i32, { "r", "r", "r" } },
+    { INDEX_op_xor_i64, { "r", "r", "r" } },
+
+    { INDEX_op_shl_i32, { "r", "r", "ri" } },
+    { INDEX_op_shr_i32, { "r", "r", "ri" } },
+    { INDEX_op_sar_i32, { "r", "r", "ri" } },
+    { INDEX_op_rotl_i32, { "r", "r", "ri" } },
+    { INDEX_op_rotr_i32, { "r", "r", "ri" } },
+    { INDEX_op_shl_i64, { "r", "r", "ri" } },
+    { INDEX_op_shr_i64, { "r", "r", "ri" } },
+    { INDEX_op_sar_i64, { "r", "r", "ri" } },
+    { INDEX_op_rotl_i64, { "r", "r", "ri" } },
+    { INDEX_op_rotr_i64, { "r", "r", "ri" } },
+
+    { INDEX_op_brcond_i32, { "r", "r" } },
+    { INDEX_op_setcond_i32, { "r", "r", "r" } },
+    { INDEX_op_brcond_i64, { "r", "r" } },
+    { INDEX_op_setcond_i64, { "r", "r", "r" } },
+
+    { INDEX_op_qemu_ld8u, { "r", "l" } },
+    { INDEX_op_qemu_ld8s, { "r", "l" } },
+    { INDEX_op_qemu_ld16u, { "r", "l" } },
+    { INDEX_op_qemu_ld16s, { "r", "l" } },
+    { INDEX_op_qemu_ld32u, { "r", "l" } },
+    { INDEX_op_qemu_ld32s, { "r", "l" } },
+
+    { INDEX_op_qemu_ld32, { "r", "l" } },
+    { INDEX_op_qemu_ld64, { "r", "l" } },
+
+    { INDEX_op_qemu_st8, { "l", "l" } },
+    { INDEX_op_qemu_st16, { "l", "l" } },
+    { INDEX_op_qemu_st32, { "l", "l" } },
+    { INDEX_op_qemu_st64, { "l", "l" } },
+    { -1 },
+};
+
+static void tcg_target_init(TCGContext *s)
+{
+#if !defined(CONFIG_USER_ONLY)
+    /* fail safe */
+    if ((1ULL << CPU_TLB_ENTRY_BITS) != sizeof(CPUTLBEntry)) {
+        tcg_abort();
+    }
+#endif
+    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffffffff);
+    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffffffff);
+
+    tcg_regset_set32(tcg_target_call_clobber_regs, 0,
+                     (1 << TCG_REG_X0) | (1 << TCG_REG_X1) |
+                     (1 << TCG_REG_X2) | (1 << TCG_REG_X3) |
+                     (1 << TCG_REG_X4) | (1 << TCG_REG_X5) |
+                     (1 << TCG_REG_X6) | (1 << TCG_REG_X7) |
+                     (1 << TCG_REG_X8) | (1 << TCG_REG_X9) |
+                     (1 << TCG_REG_X10) | (1 << TCG_REG_X11) |
+                     (1 << TCG_REG_X12) | (1 << TCG_REG_X13) |
+                     (1 << TCG_REG_X14) | (1 << TCG_REG_X15) |
+                     (1 << TCG_REG_X16) | (1 << TCG_REG_X17) |
+                     (1 << TCG_REG_X18) | (1 << TCG_REG_LR));
+
+    tcg_regset_clear(s->reserved_regs);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
+
+    tcg_add_target_add_op_defs(aarch64_op_defs);
+}
+
+static inline void tcg_out_addi(TCGContext *s,
+                                int ext, int rd, int rn, unsigned int aimm)
+{
+    /* add immediate aimm unsigned 12bit value (we use LSL 0 - no shift) */
+    /* using ADD 0x11000000 | (ext) | (aimm << 10) | (rn << 5) | rd */
+    unsigned int base = ext ? 0x91000000 : 0x11000000;
+    assert(aimm <= 0xfff);
+    tcg_out32(s, base | (aimm << 10) | (rn << 5) | rd);
+}
+
+static inline void tcg_out_subi(TCGContext *s,
+                                int ext, int rd, int rn, unsigned int aimm)
+{
+    /* sub immediate aimm unsigned 12bit value (we use LSL 0 - no shift) */
+    /* using SUB 0x51000000 | (ext) | (aimm << 10) | (rn << 5) | rd */
+    unsigned int base = ext ? 0xd1000000 : 0x51000000;
+    assert(aimm <= 0xfff);
+    tcg_out32(s, base | (aimm << 10) | (rn << 5) | rd);
+}
+
+static void tcg_target_qemu_prologue(TCGContext *s)
+{
+    /* NB: frame sizes are in 16 byte stack units! */
+    int frame_size_callee_saved, frame_size_tcg_locals;
+    int r;
+
+    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
+    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
+
+    /* frame size requirement for TCG local variables */
+    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
+        + CPU_TEMP_BUF_NLONGS * sizeof(long)
+        + (TCG_TARGET_STACK_ALIGN - 1);
+    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
+    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
+
+    /* push (FP, LR) and update sp */
+    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
+
+    /* FP -> callee_saved */
+    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
+
+    /* store callee-preserved regs x19..x28 using FP -> callee_saved */
+    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
+        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
+        tcg_out_store_pair(s, r, r + 1, idx);
+    }
+
+    /* make stack space for TCG locals */
+    tcg_out_subi(s, 1, TCG_REG_SP, TCG_REG_SP,
+                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
+    /* inform TCG about how to find TCG locals with register, offset, size */
+    tcg_set_frame(s, TCG_REG_SP, TCG_STATIC_CALL_ARGS_SIZE,
+                  CPU_TEMP_BUF_NLONGS * sizeof(long));
+
+    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
+    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
+
+    tb_ret_addr = s->code_ptr;
+
+    /* remove TCG locals stack space */
+    tcg_out_addi(s, 1, TCG_REG_SP, TCG_REG_SP,
+                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
+
+    /* restore registers x19..x28.
+       FP must be preserved, so it still points to callee_saved area */
+    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
+        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
+        tcg_out_load_pair(s, r, r + 1, idx);
+    }
+
+    /* pop (FP, LR), restore SP to previous frame, return */
+    tcg_out_pop_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
+    tcg_out_ret(s);
+}
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
new file mode 100644
index 0000000..075ab2a
--- /dev/null
+++ b/tcg/aarch64/tcg-target.h
@@ -0,0 +1,99 @@
+/*
+ * Initial TCG Implementation for aarch64
+ *
+ * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
+ * Written by Claudio Fontana
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.
+ *
+ * See the COPYING file in the top-level directory for details.
+ */
+
+#ifndef TCG_TARGET_AARCH64
+#define TCG_TARGET_AARCH64 1
+
+#undef TCG_TARGET_WORDS_BIGENDIAN
+#undef TCG_TARGET_STACK_GROWSUP
+
+typedef enum {
+    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3, TCG_REG_X4,
+    TCG_REG_X5, TCG_REG_X6, TCG_REG_X7, TCG_REG_X8, TCG_REG_X9,
+    TCG_REG_X10, TCG_REG_X11, TCG_REG_X12, TCG_REG_X13, TCG_REG_X14,
+    TCG_REG_X15, TCG_REG_X16, TCG_REG_X17, TCG_REG_X18, TCG_REG_X19,
+    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23, TCG_REG_X24,
+    TCG_REG_X25, TCG_REG_X26, TCG_REG_X27, TCG_REG_X28,
+    TCG_REG_FP,  /* frame pointer */
+    TCG_REG_LR, /* link register */
+    TCG_REG_SP,  /* stack pointer or zero register */
+    TCG_REG_XZR = TCG_REG_SP /* same register number */
+    /* program counter is not directly accessible! */
+} TCGReg;
+
+#define TCG_TARGET_NB_REGS 32
+
+/* used for function call generation */
+#define TCG_REG_CALL_STACK              TCG_REG_SP
+#define TCG_TARGET_STACK_ALIGN          16
+#define TCG_TARGET_CALL_ALIGN_ARGS      1
+#define TCG_TARGET_CALL_STACK_OFFSET    0
+
+/* optional instructions */
+#define TCG_TARGET_HAS_div_i32          0
+#define TCG_TARGET_HAS_ext8s_i32        0
+#define TCG_TARGET_HAS_ext16s_i32       0
+#define TCG_TARGET_HAS_ext8u_i32        0
+#define TCG_TARGET_HAS_ext16u_i32       0
+#define TCG_TARGET_HAS_bswap16_i32      0
+#define TCG_TARGET_HAS_bswap32_i32      0
+#define TCG_TARGET_HAS_not_i32          0
+#define TCG_TARGET_HAS_neg_i32          0
+#define TCG_TARGET_HAS_rot_i32          1
+#define TCG_TARGET_HAS_andc_i32         0
+#define TCG_TARGET_HAS_orc_i32          0
+#define TCG_TARGET_HAS_eqv_i32          0
+#define TCG_TARGET_HAS_nand_i32         0
+#define TCG_TARGET_HAS_nor_i32          0
+#define TCG_TARGET_HAS_deposit_i32      0
+#define TCG_TARGET_HAS_movcond_i32      0
+#define TCG_TARGET_HAS_add2_i32         0
+#define TCG_TARGET_HAS_sub2_i32         0
+#define TCG_TARGET_HAS_mulu2_i32        0
+#define TCG_TARGET_HAS_muls2_i32        0
+
+#define TCG_TARGET_HAS_div_i64          0
+#define TCG_TARGET_HAS_ext8s_i64        0
+#define TCG_TARGET_HAS_ext16s_i64       0
+#define TCG_TARGET_HAS_ext32s_i64       0
+#define TCG_TARGET_HAS_ext8u_i64        0
+#define TCG_TARGET_HAS_ext16u_i64       0
+#define TCG_TARGET_HAS_ext32u_i64       0
+#define TCG_TARGET_HAS_bswap16_i64      0
+#define TCG_TARGET_HAS_bswap32_i64      0
+#define TCG_TARGET_HAS_bswap64_i64      0
+#define TCG_TARGET_HAS_not_i64          0
+#define TCG_TARGET_HAS_neg_i64          0
+#define TCG_TARGET_HAS_rot_i64          1
+#define TCG_TARGET_HAS_andc_i64         0
+#define TCG_TARGET_HAS_orc_i64          0
+#define TCG_TARGET_HAS_eqv_i64          0
+#define TCG_TARGET_HAS_nand_i64         0
+#define TCG_TARGET_HAS_nor_i64          0
+#define TCG_TARGET_HAS_deposit_i64      0
+#define TCG_TARGET_HAS_movcond_i64      0
+#define TCG_TARGET_HAS_add2_i64         0
+#define TCG_TARGET_HAS_sub2_i64         0
+#define TCG_TARGET_HAS_mulu2_i64        0
+#define TCG_TARGET_HAS_muls2_i64        0
+
+enum {
+    TCG_AREG0 = TCG_REG_X19,
+};
+
+static inline void flush_icache_range(tcg_target_ulong start,
+                                      tcg_target_ulong stop)
+{
+    __builtin___clear_cache((char *)start, (char *)stop);
+}
+
+#endif /* TCG_TARGET_AARCH64 */
diff --git a/translate-all.c b/translate-all.c
index da93608..9d265bf 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -461,6 +461,8 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
 #elif defined(__sparc__)
 # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
+#elif defined(__aarch64__)
+# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
 #elif defined(__arm__)
 # define MAX_CODE_GEN_BUFFER_SIZE  (16u * 1024 * 1024)
 #elif defined(__s390x__)
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 3/4] configure: permit compilation on arm aarch64
  2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
  2013-05-23  8:14   ` [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
  2013-05-23  8:18   ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
@ 2013-05-23  8:19   ` Claudio Fontana
  2013-05-23 13:24     ` Peter Maydell
  2013-05-23  8:22   ` [Qemu-devel] [PATCH 4/4] tcg/aarch64: more ops in preparation of tlb lookup Claudio Fontana
  2013-05-23 12:37   ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Andreas Färber
  4 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-23  8:19 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson


support compiling on aarch64.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 configure | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/configure b/configure
index 9439f1c..9cc398c 100755
--- a/configure
+++ b/configure
@@ -384,6 +384,8 @@ elif check_define __s390__ ; then
   fi
 elif check_define __arm__ ; then
   cpu="arm"
+elif check_define __aarch64__ ; then
+  cpu="aarch64"
 elif check_define __hppa__ ; then
   cpu="hppa"
 else
@@ -406,6 +408,9 @@ case "$cpu" in
   armv*b|armv*l|arm)
     cpu="arm"
   ;;
+  aarch64)
+    cpu="aarch64"
+  ;;
   hppa|parisc|parisc64)
     cpu="hppa"
   ;;
@@ -4114,6 +4119,9 @@ if test "$linux" = "yes" ; then
   s390x)
     linux_arch=s390
     ;;
+  aarch64)
+    linux_arch=arm64
+    ;;
   *)
     # For most CPUs the kernel architecture name and QEMU CPU name match.
     linux_arch="$cpu"
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH 4/4] tcg/aarch64: more ops in preparation of tlb lookup
  2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
                     ` (2 preceding siblings ...)
  2013-05-23  8:19   ` [Qemu-devel] [PATCH 3/4] configure: permit compilation on arm aarch64 Claudio Fontana
@ 2013-05-23  8:22   ` Claudio Fontana
  2013-05-23 12:37   ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Andreas Färber
  4 siblings, 0 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-23  8:22 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson


add SUBS to the arithmetic instructions and add a shift parameter to
all arithmetic instructions, so we can make use of shifted registers.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 tcg/aarch64/tcg-target.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index da859c7..5440659 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -190,6 +190,7 @@ enum aarch64_ldst_op_type { /* type of operation */
 enum aarch64_arith_opc {
     ARITH_ADD = 0x0b,
     ARITH_SUB = 0x4b,
+    ARITH_SUBS = 0x6b,
     ARITH_AND = 0x0a,
     ARITH_OR = 0x2a,
     ARITH_XOR = 0x4a
@@ -410,12 +411,20 @@ static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
 }
 
 static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
-                                 int ext, int rd, int rn, int rm)
+                                 int ext, int rd, int rn, int rm, int shift_imm)
 {
     /* Using shifted register arithmetic operations */
     /* if extended registry operation (64bit) just or with 0x80 << 24 */
-    unsigned int base = ext ? (0x80 | opc) << 24 : opc << 24;
-    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
+    unsigned int shift, base = ext ? (0x80 | opc) << 24 : opc << 24;
+    if (shift_imm == 0) {
+        shift = 0;
+    } else if (shift_imm > 0) {
+        shift = shift_imm << 10 | 1 << 22;
+    } else /* (shift_imm < 0) */ {
+        shift = (-shift_imm) << 10;
+    }
+
+    tcg_out32(s, base | rm << 16 | shift | rn << 5 | rd);
 }
 
 static inline void tcg_out_mul(TCGContext *s, int ext, int rd, int rn, int rm)
@@ -597,6 +606,15 @@ static inline void tcg_out_tst(TCGContext *s, int ext, int rn,
     tcg_out32(s, base | (pattern_n - 1) << 10 | rn << 5);
 }
 
+/* and a register with a bit pattern, similarly to TST, no flags change */
+static inline void tcg_out_andi(TCGContext *s, int ext, int rd,
+                                int rn, tcg_target_ulong pattern_n)
+{
+    /* using AND 0x12000000. Ext requires 4. */
+    unsigned int base = ext ? 0x92400000 : 0x12000000;
+    tcg_out32(s, base | (pattern_n - 1) << 10 | rn << 5);
+}
+
 static inline void tcg_out_ret(TCGContext *s)
 {
     /* emit RET { LR } */
@@ -870,27 +888,27 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 
     case INDEX_op_add_i64: ext = 1;
     case INDEX_op_add_i32:
-        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_sub_i64: ext = 1;
     case INDEX_op_sub_i32:
-        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_and_i64: ext = 1;
     case INDEX_op_and_i32:
-        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_or_i64: ext = 1;
     case INDEX_op_or_i32:
-        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_xor_i64: ext = 1;
     case INDEX_op_xor_i32:
-        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_mul_i64: ext = 1;
@@ -939,7 +957,7 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
         if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, 32 - m */
             tcg_out_rotl(s, ext, args[0], args[1], args[2]);
         } else {
-            tcg_out_arith(s, ARITH_SUB, ext, args[2], TCG_REG_XZR, args[2]);
+            tcg_out_arith(s, ARITH_SUB, ext, args[2], TCG_REG_XZR, args[2], 0);
             tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
         }
         break;
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2
  2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
                     ` (3 preceding siblings ...)
  2013-05-23  8:22   ` [Qemu-devel] [PATCH 4/4] tcg/aarch64: more ops in preparation of tlb lookup Claudio Fontana
@ 2013-05-23 12:37   ` Andreas Färber
  2013-05-23 12:50     ` Peter Maydell
  4 siblings, 1 reply; 60+ messages in thread
From: Andreas Färber @ 2013-05-23 12:37 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Paolo Bonzini

Hi,

Am 23.05.2013 10:09, schrieb Claudio Fontana:
> 
> This series implements preliminary support for the ARM aarch64 TCG target.
[snip]

Generally, please post patch series without --in-reply-to= and use
--subject-prefix="PATCH v2" etc. plus a change log in the cover letter
to distinguish iterations.

http://wiki.qemu.org/Contribute/SubmitAPatch

If Big Endian targets are not yet supported, should this rather be an
RFC? Or is that just about some unimplemented opcodes?

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2
  2013-05-23 12:37   ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Andreas Färber
@ 2013-05-23 12:50     ` Peter Maydell
  2013-05-23 12:53       ` Andreas Färber
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-23 12:50 UTC (permalink / raw)
  To: Andreas Färber
  Cc: Paolo Bonzini, Claudio Fontana, qemu-devel, Richard Henderson

On 23 May 2013 13:37, Andreas Färber <afaerber@suse.de> wrote:
> If Big Endian targets are not yet supported, should this rather be an
> RFC? Or is that just about some unimplemented opcodes?

I'm happy for us to wait until an actual big-endian system
running Linux appears before we worry about it. #error if anybody
tries it is perfectly fine. (I would be surprised if we got
it right for the 32-bit big-endian hosts, for that matter:
there simply aren't really any systems out there running
big-endian Linux ARM on which you could test it.)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2
  2013-05-23 12:50     ` Peter Maydell
@ 2013-05-23 12:53       ` Andreas Färber
  2013-05-23 13:03         ` Peter Maydell
  0 siblings, 1 reply; 60+ messages in thread
From: Andreas Färber @ 2013-05-23 12:53 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Paolo Bonzini, Claudio Fontana, qemu-devel, Richard Henderson

Am 23.05.2013 14:50, schrieb Peter Maydell:
> On 23 May 2013 13:37, Andreas Färber <afaerber@suse.de> wrote:
>> If Big Endian targets are not yet supported, should this rather be an
>> RFC? Or is that just about some unimplemented opcodes?
> 
> I'm happy for us to wait until an actual big-endian system
> running Linux appears before we worry about it. #error if anybody
> tries it is perfectly fine. (I would be surprised if we got
> it right for the 32 bit bigendian hosts, for that matter:
> there simply aren't really any systems out there running
> big-endian Linux ARM which you could test it on.)

I was worried about Big Endian QEMU targets (ppc, sparc, etc.), not
about Big Endian ARM hosts. If only half our targets are supported by a
TCG backend, then the default target list is busted, whether hardcoded
or default-configs-generated.

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2
  2013-05-23 12:53       ` Andreas Färber
@ 2013-05-23 13:03         ` Peter Maydell
  2013-05-23 13:27           ` Claudio Fontana
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-23 13:03 UTC (permalink / raw)
  To: Andreas Färber
  Cc: Paolo Bonzini, Claudio Fontana, qemu-devel, Richard Henderson

On 23 May 2013 13:53, Andreas Färber <afaerber@suse.de> wrote:
> Am 23.05.2013 14:50, schrieb Peter Maydell:
>> I'm happy for us to wait until an actual big-endian system
>> running Linux appears before we worry about it.

> I was worried about Big Endian QEMU targets (ppc, sparc, etc.), not
> about Big Endian ARM hosts.

Oops, yes; I agree. Guest big-endian support is pretty trivial
(all you need to do is byteswap on load and store) and easily
testable (run a BE guest), so the TCG backend should just
implement it from the start.

-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs
  2013-05-23  8:14   ` [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
@ 2013-05-23 13:18     ` Peter Maydell
  2013-05-28  8:09     ` Laurent Desnogues
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Maydell @ 2013-05-23 13:18 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson

On 23 May 2013 09:14, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>
> we will use the 26bit relative relocs in the aarch64 tcg target.
>
> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] configure: permit compilation on arm aarch64
  2013-05-23  8:19   ` [Qemu-devel] [PATCH 3/4] configure: permit compilation on arm aarch64 Claudio Fontana
@ 2013-05-23 13:24     ` Peter Maydell
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Maydell @ 2013-05-23 13:24 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson

On 23 May 2013 09:19, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>
> support compiling on aarch64.
>
> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2
  2013-05-23 13:03         ` Peter Maydell
@ 2013-05-23 13:27           ` Claudio Fontana
  0 siblings, 0 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-23 13:27 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Paolo Bonzini, Richard Henderson, Andreas Färber, qemu-devel

On 23.05.2013 15:03, Peter Maydell wrote:
> On 23 May 2013 13:53, Andreas Färber <afaerber@suse.de> wrote:
>> Am 23.05.2013 14:50, schrieb Peter Maydell:
>>> I'm happy for us to wait until an actual big-endian system
>>> running Linux appears before we worry about it.
> 
>> I was worried about Big Endian QEMU targets (ppc, sparc, etc.), not
>> about Big Endian ARM hosts.
> 
> Oops, yes; I agree. Guest big-endian support is pretty trivial
> (all you need to do is byteswap on load and store) and easily
> testable (run a BE guest), so the TCG backend should just
> implement it from the start.
> 
> -- PMM

Ok, I will implement in version 3 of the series.

Thanks,

Claudio

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-23  8:18   ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
@ 2013-05-23 16:29     ` Richard Henderson
  2013-05-24  8:53       ` Claudio Fontana
  2013-05-23 16:39     ` Peter Maydell
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 60+ messages in thread
From: Richard Henderson @ 2013-05-23 16:29 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, qemu-devel, Paolo Bonzini

On 05/23/2013 01:18 AM, Claudio Fontana wrote:
> +static inline void patch_reloc(uint8_t *code_ptr, int type,
> +                               tcg_target_long value, tcg_target_long addend)
> +{
> +    switch (type) {
> +    case R_AARCH64_JUMP26:
> +    case R_AARCH64_CALL26:
> +        reloc_pc26(code_ptr, value);
> +        break;
> +    case R_AARCH64_CONDBR19:
> +        reloc_pc19(code_ptr, value);
> +        break;

The addend operand may always be zero atm, but honor it anyway.
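
In other words, something along these lines (just a sketch):

    case R_AARCH64_JUMP26:
    case R_AARCH64_CALL26:
        reloc_pc26(code_ptr, value + addend);
        break;
    case R_AARCH64_CONDBR19:
        reloc_pc19(code_ptr, value + addend);
        break;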

> +static inline void tcg_out_ldst_9(TCGContext *s,
> +                                  enum aarch64_ldst_op_data op_data,
> +                                  enum aarch64_ldst_op_type op_type,
> +                                  int rd, int rn, tcg_target_long offset)

Universally, use TCGReg for arguments that must be registers.

> +static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
> +                                  uint32_t value)
> +{
> +    uint32_t half, base, movk = 0;
> +    if (!value) {
> +        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
> +        return;
> +    }

No real need to special case zero; it's just an extra test slowing down the
compiler.

> +    /* construct halfwords of the immediate with MOVZ with LSL */
> +    /* using MOVZ 0x52800000 | extended reg.. */
> +    base = ext ? 0xd2800000 : 0x52800000;

Why is ext an argument to movi32?  Don't we know just because of the name that
we only case about 32-bit data?  And thus you should always be writing to the
Wn registers, which automatically zero the high bits of the Xn register?

Although honestly, if you wanted to keep "ext", you could just merge the two
movi routines.  For most tcg targets that's what we already do -- the arm port
from whence you copied this is the outlier, because it wanted to add a
condition argument.

> +/* solve the whole ldst problem */
> +static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
> +                                enum aarch64_ldst_op_type type,
> +                                int rd, int rn, tcg_target_long offset)
> +{
> +    if (offset > -256 && offset < 256) {
> +        tcg_out_ldst_9(s, data, type, rd, rn, offset);

Ouch, that's not much room.  You'll overflow that regularly getting to the
various register slots in env.  You really are going to want to be able to make
use of the scaled 12-bit offset.

That said, this is certainly ok for now.
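
For illustration, a predicate along these lines could decide when the
scaled form is usable (sketch only; s_bits is assumed to be log2 of the
access size in bytes):

static inline bool ldst_offset_fits_uimm12(tcg_target_long offset, int s_bits)
{
    /* LDR/STR "unsigned offset" form: 12-bit immediate scaled by the
       access size, so the offset must be naturally aligned and in range */
    return offset >= 0
        && (offset & ((1 << s_bits) - 1)) == 0
        && (offset >> s_bits) < 0x1000;
}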

> +/* mov alias implemented with add immediate, useful to move to/from SP */
> +static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
> +{
> +    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
> +    unsigned int base = ext ? 0x91000000 : 0x11000000;
> +    tcg_out32(s, base | rn << 5 | rd);
> +}

Any reason not to just make this the move register function?  That is,
assuming there's a real reason you set up that frame pointer...

It's starting to look like "ext" could be set to 0x80000000 (or really a
symbolic alias of that) and written as x | ext everywhere, instead of
conditionals.  At least in most cases.

> +    /* all arguments passed via registers */
> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);

addr_reg almost certainly needs to be zero-extended for 32-bit guests, easily
done by setting ext = 0 here.

> +        unsigned int bits; bits = 8 * (1 << s_bits) - 1;

  unsigned int bits = ...

> +#else /* !CONFIG_SOFTMMU */
> +    tcg_abort(); /* TODO */
> +#endif

This really is even easier: zero-extend (if needed), add GUEST_BASE (often held
in a reserved register for 64-bit targets), perform the load/store.  And of
course for aarch64, the add can be done via reg+reg addressing.

See e.g. ppc64 for how to conditionally reserve a register containing GUEST_BASE.

> +    case INDEX_op_rotl_i64: ext = 1;
> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
> +        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, 32 - m */
> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
> +        } else {
> +            tcg_out_arith(s, ARITH_SUB, ext, args[2], TCG_REG_XZR, args[2]);
> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);

You can't clobber the args[2] register here.  You need to use the TMP register.

And fwiw, you can always use ext = 0 for that negation, since we don't care
about the high bits.
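
I.e. something like this for the register case (sketch):

        } else {
            /* negate the rotate amount into the scratch register instead
               of clobbering args[2]; ext = 0 is enough since only the low
               bits of the count matter */
            tcg_out_arith(s, ARITH_SUB, 0, TCG_REG_TMP, TCG_REG_XZR, args[2]);
            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1],
                                 TCG_REG_TMP);
        }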

> +    case INDEX_op_setcond_i64: ext = 1;
> +    case INDEX_op_setcond_i32:
> +        tcg_out_cmp(s, ext, args[1], args[2]);
> +        tcg_out_cset(s, ext, args[0], args[3]);
> +        break;

ext = 0 for the cset, since the result is always 0/1.
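
I.e. (sketch):

    case INDEX_op_setcond_i64: ext = 1;
    case INDEX_op_setcond_i32:
        tcg_out_cmp(s, ext, args[1], args[2]);
        tcg_out_cset(s, 0, args[0], args[3]); /* 0/1 result, ext not needed */
        break;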

> +static void tcg_target_qemu_prologue(TCGContext *s)
> +{
> +    /* NB: frame sizes are in 16 byte stack units! */
> +    int frame_size_callee_saved, frame_size_tcg_locals;
> +    int r;
> +
> +    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
> +    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
> +
> +    /* frame size requirement for TCG local variables */
> +    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
> +        + CPU_TEMP_BUF_NLONGS * sizeof(long)
> +        + (TCG_TARGET_STACK_ALIGN - 1);
> +    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
> +    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
> +
> +    /* push (FP, LR) and update sp */
> +    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
> +
> +    /* FP -> callee_saved */
> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);

You initialize FP, but you don't reserve the register, so it's going to get
clobbered.  We don't actually use the frame pointer in the translated code, so
I don't think there's any call to actually initialize it either.


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-23  8:18   ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
  2013-05-23 16:29     ` Richard Henderson
@ 2013-05-23 16:39     ` Peter Maydell
  2013-05-24  8:51       ` Claudio Fontana
  2013-05-27  9:47     ` Laurent Desnogues
  2013-05-28 13:14     ` Laurent Desnogues
  3 siblings, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-23 16:39 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Paolo Bonzini, qemu-devel, Richard Henderson

On 23 May 2013 09:18, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>
> add preliminary support for TCG target aarch64.

Richard's handling the technical bits of the review, so
just some minor style nits here.

I tested this on the foundation model and was able to boot
a 32-bit-ARM kernel.

> +static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
> +{
> +    tcg_target_long offset; uint32_t insn;
> +    offset = (target - (tcg_target_long)code_ptr) / 4;
> +    offset &= 0x07ffff;
> +    /* read instruction, mask away previous PC_REL19 parameter contents,
> +       set the proper offset, then write back the instruction. */
> +    insn = *(uint32_t *)code_ptr;
> +    insn = (insn & 0xff00001f) | offset << 5; /* lower 5 bits = condition */

You can say
    insn = deposit32(insn, 5, 19, offset);
here rather than doing
    offset &= 0x07ffff;
    insn = (insn & 0xff00001f) | offset << 5;

(might as well also use deposit32 for consistency in the pc26 function.)
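
Roughly like this, for example (sketch, assuming the deposit32() helper
from include/qemu/bitops.h is available here):

static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
{
    tcg_target_long offset = (target - (tcg_target_long)code_ptr) / 4;
    uint32_t insn = *(uint32_t *)code_ptr;
    /* keep the low 5 bits (condition), deposit the 19-bit offset at bit 5 */
    *(uint32_t *)code_ptr = deposit32(insn, 5, 19, offset);
}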

> +static inline enum aarch64_ldst_op_data
> +aarch64_ldst_get_data(TCGOpcode tcg_op)
> +{
> +    switch (tcg_op) {
> +    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
> +    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
> +    case INDEX_op_st8_i32: case INDEX_op_st8_i64:

One case per line, please (here and elsewhere).

> +static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
> +{
> +    tcg_target_long offset;
> +
> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
> +        tcg_out_movi64(s, TCG_REG_TMP, target);
> +        tcg_out_callr(s, TCG_REG_TMP);
> +

Stray blank line.

> +    case INDEX_op_mov_i64: ext = 1;

Please don't put code on the same line as a case statement.
Also fall-through cases should have an explicit /* fall through */
comment (except in the case where there is no code at all
between one case statement and the next).

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-23 16:39     ` Peter Maydell
@ 2013-05-24  8:51       ` Claudio Fontana
  2013-05-27  9:10         ` Claudio Fontana
  0 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-24  8:51 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, Jani Kokkonen, qemu-devel, Richard Henderson

On 23.05.2013 18:39, Peter Maydell wrote:
> On 23 May 2013 09:18, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>
>> add preliminary support for TCG target aarch64.
> 
> Richard's handling the technical bits of the review, so
> just some minor style nits here.
> 
> I tested this on the foundation model and was able to boot
> a 32-bit-ARM kernel.
> 
>> +static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
>> +{
>> +    tcg_target_long offset; uint32_t insn;
>> +    offset = (target - (tcg_target_long)code_ptr) / 4;
>> +    offset &= 0x07ffff;
>> +    /* read instruction, mask away previous PC_REL19 parameter contents,
>> +       set the proper offset, then write back the instruction. */
>> +    insn = *(uint32_t *)code_ptr;
>> +    insn = (insn & 0xff00001f) | offset << 5; /* lower 5 bits = condition */
> 
> You can say
>     insn = deposit32(insn, 5, 19, offset);
> here rather than doing
>     offset &= 0x07ffff;
>     insn = (insn & 0xff00001f) | offset << 5;
> 
> (might as well also use deposit32 for consistency in the pc26 function.)

Ok, I'll make use of it.

>> +static inline enum aarch64_ldst_op_data
>> +aarch64_ldst_get_data(TCGOpcode tcg_op)
>> +{
>> +    switch (tcg_op) {
>> +    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
>> +    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
>> +    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
> 
> One case per line, please (here and elsewhere).

Will comply.

>> +static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
>> +{
>> +    tcg_target_long offset;
>> +
>> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
>> +        tcg_out_movi64(s, TCG_REG_TMP, target);
>> +        tcg_out_callr(s, TCG_REG_TMP);
>> +
> 
> Stray blank line.

I should remove this \n I assume. Ok.

>> +    case INDEX_op_mov_i64: ext = 1;
> 
> Please don't put code on the same line as a case statement.
> Also fall-through cases should have an explicit /* fall through */
> comment (except in the case where there is no code at all
> between one case statement and the next).

Will change for the next version.

> thanks
> -- PMM
> 

Thank you,

Claudio

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-23 16:29     ` Richard Henderson
@ 2013-05-24  8:53       ` Claudio Fontana
  2013-05-24 17:02         ` Richard Henderson
  0 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-24  8:53 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, Jani Kokkonen, qemu-devel, Paolo Bonzini

On 23.05.2013 18:29, Richard Henderson wrote:
> On 05/23/2013 01:18 AM, Claudio Fontana wrote:
>> +static inline void patch_reloc(uint8_t *code_ptr, int type,
>> +                               tcg_target_long value, tcg_target_long addend)
>> +{
>> +    switch (type) {
>> +    case R_AARCH64_JUMP26:
>> +    case R_AARCH64_CALL26:
>> +        reloc_pc26(code_ptr, value);
>> +        break;
>> +    case R_AARCH64_CONDBR19:
>> +        reloc_pc19(code_ptr, value);
>> +        break;
> 
> The addend operand may always be zero atm, but honor it anyway.

Ok, I will add addend to value.

> 
>> +static inline void tcg_out_ldst_9(TCGContext *s,
>> +                                  enum aarch64_ldst_op_data op_data,
>> +                                  enum aarch64_ldst_op_type op_type,
>> +                                  int rd, int rn, tcg_target_long offset)
> 
> Universally, use TCGReg for arguments that must be registers.

Good catch, thanks.

>> +static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
>> +                                  uint32_t value)
>> +{
>> +    uint32_t half, base, movk = 0;
>> +    if (!value) {
>> +        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
>> +        return;
>> +    }
> 
> No real need to special case zero; it's just an extra test slowing down the
> compiler.

Yes, we need to handle the special case zero.
Otherwise no instruction at all would be emitted for value 0.

>> +    /* construct halfwords of the immediate with MOVZ with LSL */
>> +    /* using MOVZ 0x52800000 | extended reg.. */
>> +    base = ext ? 0xd2800000 : 0x52800000;
> 
> Why is ext an argument to movi32?  Don't we know just because of the name that
> we only case about 32-bit data?  And thus you should always be writing to the
> Wn registers, which automatically zero the high bits of the Xn register?
> 
> Although honestly, if you wanted to keep "ext", you could just merge the two
> movi routines.  For most tcg targets that's what we already do -- the arm port
> from whence you copied this is the outlier, because it wanted to add a
> condition argument.

I think the idea to merge the two is worth considering; it will reduce confusion.
I will give it a try, and review 32/64 movi use in the whole thing.

I actually don't know whether to prefer ext=0 or ext=1,
in the sense that it would be useful to know whether using the extended registers
with a small constant is performance-wise preferable to using the 32bit operation,
and relying on 0-extension. See also the rotation comment below.

>> +/* solve the whole ldst problem */
>> +static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
>> +                                enum aarch64_ldst_op_type type,
>> +                                int rd, int rn, tcg_target_long offset)
>> +{
>> +    if (offset > -256 && offset < 256) {
>> +        tcg_out_ldst_9(s, data, type, rd, rn, offset);
> 
> Ouch, that's not much room.  You'll overflow that regularly getting to the
> various register slots in env.  You really are going to want to be able to make
> use of the scaled 12-bit offset.
> 
> That said, this is certainly ok for now.
> 
>> +/* mov alias implemented with add immediate, useful to move to/from SP */
>> +static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
>> +{
>> +    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
>> +    unsigned int base = ext ? 0x91000000 : 0x11000000;
>> +    tcg_out32(s, base | rn << 5 | rd);
>> +}
> 
> Any reason not to just make this the move register function?  That is,
> assuming there's a real reason you set up that frame pointer...

The reason I have separate functions is to keep both possibilities: moving
XZR to a register, and moving to/from SP.
The frame pointer is not specifically involved in movr_sp.

There is reason to use FP, although we could use SP instead, in my view.
Using FP to reach the local callee-saved data seems more obvious to me,
and less surprising for the reader. It also makes the code in the prologue
easier to understand, as we can restore SP without losing the ability
to reach the callee-saved data, keeping the order of operations consistent.

> It's starting to look like "ext" could be set to 0x80000000 (or really a
> symbolic alias of that) and written as x | ext everwhere, instead of
> conditionals.  At least in most cases.

Hmm it could be, although I would need two defines,
since there seem to be two formats in which ext is encoded, the one with the
(0x80 << 24) and the one with the 0x4.

>> +    /* all arguments passed via registers */
>> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
>> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
> 
> addr_reg almost certainly needs to be zero-extended for 32-bit guests, easily
> done by setting ext = 0 here.

I can easily put an #ifdef just to be sure.

>> +        unsigned int bits; bits = 8 * (1 << s_bits) - 1;
> 
>   unsigned int bits = ...
> 
>> +#else /* !CONFIG_SOFTMMU */
>> +    tcg_abort(); /* TODO */
>> +#endif
> 
> This really is even easier: zero-extend (if needed), add GUEST_BASE (often held
> in a reserved register for 64-bit targets), perform the load/store.  And of
> course for aarch64, the add can be done via reg+reg addressing.
> 
> See e.g. ppc64 for how to conditionally reserve a register containing GUEST_BASE.
> 

That's ok, but can we keep user mode for a separate patch? Testing is the reason.

>> +    case INDEX_op_rotl_i64: ext = 1;
>> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
>> +        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, 32 - m */
>> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
>> +        } else {
>> +            tcg_out_arith(s, ARITH_SUB, ext, args[2], TCG_REG_XZR, args[2]);
>> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
> 
> You can't clobber the args[2] register here.  You need to use the TMP register.

Ok.

> 
> And fwiw, you can always use ext = 0 for that negation, since we don't care
> about the high bits.

Yes. See comment above about movi32/movi64.

>> +    case INDEX_op_setcond_i64: ext = 1;
>> +    case INDEX_op_setcond_i32:
>> +        tcg_out_cmp(s, ext, args[1], args[2]);
>> +        tcg_out_cset(s, ext, args[0], args[3]);
>> +        break;
> 
> ext = 0 for the cset, since the result is always 0/1.

Another instance of the ext/noext: here you seem to assume that using ext = 0 with small constants is better,
and I assume better here means better performance. Should I assume that that is generally the case?

>> +static void tcg_target_qemu_prologue(TCGContext *s)
>> +{
>> +    /* NB: frame sizes are in 16 byte stack units! */
>> +    int frame_size_callee_saved, frame_size_tcg_locals;
>> +    int r;
>> +
>> +    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
>> +    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
>> +
>> +    /* frame size requirement for TCG local variables */
>> +    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
>> +        + CPU_TEMP_BUF_NLONGS * sizeof(long)
>> +        + (TCG_TARGET_STACK_ALIGN - 1);
>> +    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
>> +    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
>> +
>> +    /* push (FP, LR) and update sp */
>> +    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
>> +
>> +    /* FP -> callee_saved */
>> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
> 
> You initialize FP, but you don't reserve the register, so it's going to get
> clobbered.  We don't actually use the frame pointer in the translated code, so
> I don't think there's any call to actually initialize it either.

The FP is not going to be clobbered, not by code here and not by called code.

It is not going to be clobbered between our use before the jump and after the
jump, because all the called functions need to preserve FP as mandated by the
calling conventions.

It is not going to be clobbered from the point of view of our caller,
because we save (FP, LR) along with (X19, X20) .. (X27, X28) and restore them
before returning.

We use FP to point to the callee_saved registers, and to move to/from them
in the tcg_out_store_pair and tcg_out_load_pair functions.

> r~
> 

Claudio

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-24  8:53       ` Claudio Fontana
@ 2013-05-24 17:02         ` Richard Henderson
  2013-05-24 17:08           ` Peter Maydell
  2013-05-27 11:43           ` Claudio Fontana
  0 siblings, 2 replies; 60+ messages in thread
From: Richard Henderson @ 2013-05-24 17:02 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, Jani Kokkonen, qemu-devel, Paolo Bonzini

On 05/24/2013 01:53 AM, Claudio Fontana wrote:
>> No real need to special case zero; it's just an extra test slowing down the
>> compiler.
> 
> Yes, we need to handle the special case zero.
> Otherwise no instruction at all would be emitted for value 0.

Hmm, true.  Although I'd been thinking more along the lines of
arranging the code such that we'd use movz to set the zero.
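
E.g. a single merged tcg_out_movi along these lines (sketch only, untested,
reusing the MOVZ/MOVK encodings from the posted patch; MOVK is MOVZ with
bit 29 set):

static void tcg_out_movi(TCGContext *s, int ext, TCGReg rd, uint64_t value)
{
    /* MOVZ 0x52800000, or 0xd2800000 for the 64-bit form */
    uint32_t base = ext ? 0xd2800000 : 0x52800000;
    int hw;

    /* always emit the first MOVZ, so value == 0 needs no special case */
    tcg_out32(s, base | (value & 0xffff) << 5 | rd);
    for (value >>= 16, hw = 1; value; value >>= 16, hw++) {
        if (value & 0xffff) {
            /* MOVK for each remaining non-zero halfword */
            tcg_out32(s, base | 0x20000000 | hw << 21
                         | (value & 0xffff) << 5 | rd);
        }
    }
}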

> I actually don't know whether to prefer ext=0 or ext=1,
> in the sense that it would be useful to know whether using the extended registers
> with a small constant is performance-wise preferable to using the 32bit operation,
> and relying on 0-extension. See also the rotation comment below.

From the ARMv8 ISA overview:

# Rationale: [...] By maintaining this semantic information in the instruction
# set, implementations can exploit this information to avoid expending energy
# or cycles to compute, forward and store the unused upper 32 bits of such
# data types. Implementations are free to exploit this freedom in whatever way
# they choose to save energy.

>> addr_reg almost certainly needs to be zero-extended for 32-bit guests, easily
>> done by setting ext = 0 here.
> 
> I can easily put an #ifdef just to be sure.

No ifdef, just the TARGET_LONG_BITS == 64 comparison works.
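
I.e. (sketch):

    tcg_out_movr(s, TARGET_LONG_BITS == 64, TCG_REG_X1, addr_reg);

so a 32-bit guest address gets zero-extended by the 32-bit register move.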

>> You initialize FP, but you don't reserve the register, so it's going to get
>> clobbered.  We don't actually use the frame pointer in the translated code, so
>> I don't think there's any call to actually initialize it either.
> 
> The FP is not going to be clobbered, not by code here and not by called code.
> 
> It is not going to be clobbered between our use before the jump and after the
> jump, because all the called functions need to preserve FP as mandated by the
> calling conventions.
> 
> It is not going to be clobbered from the point of view of our caller,
> because we save (FP, LR) along with (X19, X20) .. (X27, X28) and restore them
> before returning.

Ah, well, I didn't see it mentioned here,

> +    tcg_regset_clear(s->reserved_regs);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */

but hadn't noticed that it's not listed in the reg_alloc_order.

> We use FP to point to the callee_saved registers, and to move to/from them
> in the tcg_out_store_pair and tcg_out_load_pair functions.

I hadn't noticed you'd hard-coded FP into the load/store_pair functions.
Let's *really* not do that.  Even if we decide to continue using it, let's
pass it in explicitly.

But I don't see that you're really gaining anything in the prologue from
using FP instead of SP.  It seems like a waste of a register to me.


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-24 17:02         ` Richard Henderson
@ 2013-05-24 17:08           ` Peter Maydell
  2013-05-24 17:17             ` Richard Henderson
  2013-05-27 11:43           ` Claudio Fontana
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-24 17:08 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Paolo Bonzini, Jani Kokkonen, Claudio Fontana, qemu-devel

On 24 May 2013 18:02, Richard Henderson <rth@twiddle.net> wrote:
> On 05/24/2013 01:53 AM, Claudio Fontana wrote:
>> We use FP to point to the callee_saved registers, and to move to/from them
>> in the tcg_out_store_pair and tcg_out_load_pair functions.
>
> I hadn't noticed you'd hard-coded FP into the load/store_pair functions.
> Let's *really* not do that.  Even if we decide to continue using it, let's
> pass it in explicitly.
>
> But I don't see that you're really gaining anything in the prologue from
> using FP instead of SP.  It seems like a waste of a register to me.

Where's the waste? The procedure calling standard mandates that we
set FP up, so it's not like we can use it as a general purpose
register anywhere. I agree that we shouldn't hardcode tcg_out_store_pair
to use FP as a base, but there's no particular reason not to use
it at this point in the prologue since it happens to be convenient.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-24 17:08           ` Peter Maydell
@ 2013-05-24 17:17             ` Richard Henderson
  2013-05-24 17:28               ` Peter Maydell
  0 siblings, 1 reply; 60+ messages in thread
From: Richard Henderson @ 2013-05-24 17:17 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, Jani Kokkonen, Claudio Fontana, qemu-devel

On 05/24/2013 10:08 AM, Peter Maydell wrote:
> Where's the waste? The procedure calling standard mandates that we
> set FP up, so it's not like we can use it as a general purpose
> register anywhere.

Well, the calling standard is another document that's not available yet, so
obviously I don't know the rationale for that decision.  But it does seem like
a register performing no useful function...


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-24 17:17             ` Richard Henderson
@ 2013-05-24 17:28               ` Peter Maydell
  2013-05-24 17:54                 ` Richard Henderson
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2013-05-24 17:28 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Paolo Bonzini, Jani Kokkonen, Claudio Fontana, qemu-devel

On 24 May 2013 18:17, Richard Henderson <rth@twiddle.net> wrote:
> On 05/24/2013 10:08 AM, Peter Maydell wrote:
>> Where's the waste? The procedure calling standard mandates that we
>> set FP up, so it's not like we can use it as a general purpose
>> register anywhere.
>
> Well, the calling standard is another document that's not available
> yet

Nope, it's been available for ages, along with the ELF and DWARF
specs and the C++ ABI:
  http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0055b/index.html

> so obviously I don't know the rationale for that decision.  But
> it does seem like a register performing no useful function...

It does what a frame pointer usually does, ie permits the debugger
(and other tools) to unwind the stack.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-24 17:28               ` Peter Maydell
@ 2013-05-24 17:54                 ` Richard Henderson
  0 siblings, 0 replies; 60+ messages in thread
From: Richard Henderson @ 2013-05-24 17:54 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, Jani Kokkonen, Claudio Fontana, qemu-devel

On 05/24/2013 10:28 AM, Peter Maydell wrote:
> It does what a frame pointer usually does, ie permits the debugger
> (and other tools) to unwind the stack.
> 

And is there perchance a reason we've been dropping the frame pointer from new
ABIs, like x86_64?  On-the-side unwind information does the job as well.

Which reminds me, I've been meaning to add the jit unwind info to the arm tcg
port at some point...


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-24  8:51       ` Claudio Fontana
@ 2013-05-27  9:10         ` Claudio Fontana
  2013-05-27 10:40           ` Peter Maydell
  2013-05-27 17:05           ` Richard Henderson
  0 siblings, 2 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-27  9:10 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Jani Kokkonen, qemu-devel, Richard Henderson

(removing Paolo from CC as agreed with him)

On 24.05.2013 10:51, Claudio Fontana wrote:
> On 23.05.2013 18:39, Peter Maydell wrote:
>> On 23 May 2013 09:18, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>>
>>> add preliminary support for TCG target aarch64.
>>
>> Richard's handling the technical bits of the review, so
>> just some minor style nits here.
>>
>> I tested this on the foundation model and was able to boot
>> a 32-bit-ARM kernel.
>>
>>> +static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
>>> +{
>>> +    tcg_target_long offset; uint32_t insn;
>>> +    offset = (target - (tcg_target_long)code_ptr) / 4;
>>> +    offset &= 0x07ffff;
>>> +    /* read instruction, mask away previous PC_REL19 parameter contents,
>>> +       set the proper offset, then write back the instruction. */
>>> +    insn = *(uint32_t *)code_ptr;
>>> +    insn = (insn & 0xff00001f) | offset << 5; /* lower 5 bits = condition */
>>
>> You can say
>>     insn = deposit32(insn, 5, 19, offset);
>> here rather than doing
>>     offset &= 0x07ffff;
>>     insn = (insn & 0xff00001f) | offset << 5;
>>
>> (might as well also use deposit32 for consistency in the pc26 function.)
> 
> Ok, I'll make use of it.
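
For what it's worth, the helper could then shrink to something like this (an
untested sketch of the deposit32 variant, not the actual next version):

    static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
    {
        tcg_target_long offset = (target - (tcg_target_long)code_ptr) / 4;
        uint32_t insn = *(uint32_t *)code_ptr;
        /* deposit32 masks the offset down to 19 bits and inserts it at bit
           position 5, leaving the other bits of the instruction untouched */
        *(uint32_t *)code_ptr = deposit32(insn, 5, 19, offset);
    }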
> 
>>> +static inline enum aarch64_ldst_op_data
>>> +aarch64_ldst_get_data(TCGOpcode tcg_op)
>>> +{
>>> +    switch (tcg_op) {
>>> +    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
>>> +    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
>>> +    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
>>
>> One case per line, please (here and elsewhere).
> 
> Will comply.
> 
>>> +static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
>>> +{
>>> +    tcg_target_long offset;
>>> +
>>> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
>>> +
>>> +    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
>>> +        tcg_out_movi64(s, TCG_REG_TMP, target);
>>> +        tcg_out_callr(s, TCG_REG_TMP);
>>> +
>>
>> Stray blank line.
> 
> I should remove this \n I assume. Ok.
> 
>>> +    case INDEX_op_mov_i64: ext = 1;
>>
>> Please don't put code on the same line as a case statement.
>> Also fall-through cases should have an explicit /* fall through */
>> comment (except in the case where there is no code at all
>> between one case statement and the next).

Would it be acceptable to put a comment at the beginning of the function
describing ext use, to avoid a series of /* fall through */ comments?

Like this:

/* ext will be set in the switch below, which will fall through
   to the common code. It triggers the use of extended registers
   where appropriate. */

and then going:

case INDEX_op_something_64:
    ext = 1;
case INDEX_op_something_32:
    the_actual_meat(s, ext, ...);
    break;

> 
> Will change for the next version.
> 
>> thanks
>> -- PMM
>>

Claudio

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-23  8:18   ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
  2013-05-23 16:29     ` Richard Henderson
  2013-05-23 16:39     ` Peter Maydell
@ 2013-05-27  9:47     ` Laurent Desnogues
  2013-05-27 10:13       ` Claudio Fontana
  2013-05-28 13:14     ` Laurent Desnogues
  3 siblings, 1 reply; 60+ messages in thread
From: Laurent Desnogues @ 2013-05-27  9:47 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Paolo Bonzini

Hi,

basically pointing out what I pointed out for v1.

On Thu, May 23, 2013 at 10:18 AM, Claudio Fontana
<claudio.fontana@huawei.com> wrote:
>
> add preliminary support for TCG target aarch64.
>
> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
> ---
>  include/exec/exec-all.h  |    5 +-
>  tcg/aarch64/tcg-target.c | 1185 ++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/aarch64/tcg-target.h |   99 ++++
>  translate-all.c          |    2 +
>  4 files changed, 1290 insertions(+), 1 deletion(-)
>  create mode 100644 tcg/aarch64/tcg-target.c
>  create mode 100644 tcg/aarch64/tcg-target.h
>
[...]
> +++ b/tcg/aarch64/tcg-target.c
[...]
> +static void tcg_target_qemu_prologue(TCGContext *s)
> +{
> +    /* NB: frame sizes are in 16 byte stack units! */
> +    int frame_size_callee_saved, frame_size_tcg_locals;
> +    int r;
> +
> +    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
> +    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;

Please add a comment about this computation.

> +
> +    /* frame size requirement for TCG local variables */
> +    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
> +        + CPU_TEMP_BUF_NLONGS * sizeof(long)
> +        + (TCG_TARGET_STACK_ALIGN - 1);
> +    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
> +    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
> +
> +    /* push (FP, LR) and update sp */
> +    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
> +
> +    /* FP -> callee_saved */
> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
> +
> +    /* store callee-preserved regs x19..x28 using FP -> callee_saved */
> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {

TCG_REG_X27 -> TCG_REG_X28.

> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
> +        tcg_out_store_pair(s, r, r + 1, idx);
> +    }
> +
> +    /* make stack space for TCG locals */
> +    tcg_out_subi(s, 1, TCG_REG_SP, TCG_REG_SP,
> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
> +    /* inform TCG about how to find TCG locals with register, offset, size */
> +    tcg_set_frame(s, TCG_REG_SP, TCG_STATIC_CALL_ARGS_SIZE,
> +                  CPU_TEMP_BUF_NLONGS * sizeof(long));
> +
> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
> +
> +    tb_ret_addr = s->code_ptr;
> +
> +    /* remove TCG locals stack space */
> +    tcg_out_addi(s, 1, TCG_REG_SP, TCG_REG_SP,
> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
> +
> +    /* restore registers x19..x28.
> +       FP must be preserved, so it still points to callee_saved area */
> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {

TCG_REG_X27 -> TCG_REG_X28.

Thanks,

Laurent

> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
> +        tcg_out_load_pair(s, r, r + 1, idx);
> +    }
> +
> +    /* pop (FP, LR), restore SP to previous frame, return */
> +    tcg_out_pop_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
> +    tcg_out_ret(s);
> +}
> diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
> new file mode 100644
> index 0000000..075ab2a
> --- /dev/null
> +++ b/tcg/aarch64/tcg-target.h
> @@ -0,0 +1,99 @@
> +/*
> + * Initial TCG Implementation for aarch64
> + *
> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
> + * Written by Claudio Fontana
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * (at your option) any later version.
> + *
> + * See the COPYING file in the top-level directory for details.
> + */
> +
> +#ifndef TCG_TARGET_AARCH64
> +#define TCG_TARGET_AARCH64 1
> +
> +#undef TCG_TARGET_WORDS_BIGENDIAN
> +#undef TCG_TARGET_STACK_GROWSUP
> +
> +typedef enum {
> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3, TCG_REG_X4,
> +    TCG_REG_X5, TCG_REG_X6, TCG_REG_X7, TCG_REG_X8, TCG_REG_X9,
> +    TCG_REG_X10, TCG_REG_X11, TCG_REG_X12, TCG_REG_X13, TCG_REG_X14,
> +    TCG_REG_X15, TCG_REG_X16, TCG_REG_X17, TCG_REG_X18, TCG_REG_X19,
> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23, TCG_REG_X24,
> +    TCG_REG_X25, TCG_REG_X26, TCG_REG_X27, TCG_REG_X28,
> +    TCG_REG_FP,  /* frame pointer */
> +    TCG_REG_LR, /* link register */
> +    TCG_REG_SP,  /* stack pointer or zero register */
> +    TCG_REG_XZR = TCG_REG_SP /* same register number */
> +    /* program counter is not directly accessible! */
> +} TCGReg;
> +
> +#define TCG_TARGET_NB_REGS 32
> +
> +/* used for function call generation */
> +#define TCG_REG_CALL_STACK              TCG_REG_SP
> +#define TCG_TARGET_STACK_ALIGN          16
> +#define TCG_TARGET_CALL_ALIGN_ARGS      1
> +#define TCG_TARGET_CALL_STACK_OFFSET    0
> +
> +/* optional instructions */
> +#define TCG_TARGET_HAS_div_i32          0
> +#define TCG_TARGET_HAS_ext8s_i32        0
> +#define TCG_TARGET_HAS_ext16s_i32       0
> +#define TCG_TARGET_HAS_ext8u_i32        0
> +#define TCG_TARGET_HAS_ext16u_i32       0
> +#define TCG_TARGET_HAS_bswap16_i32      0
> +#define TCG_TARGET_HAS_bswap32_i32      0
> +#define TCG_TARGET_HAS_not_i32          0
> +#define TCG_TARGET_HAS_neg_i32          0
> +#define TCG_TARGET_HAS_rot_i32          1
> +#define TCG_TARGET_HAS_andc_i32         0
> +#define TCG_TARGET_HAS_orc_i32          0
> +#define TCG_TARGET_HAS_eqv_i32          0
> +#define TCG_TARGET_HAS_nand_i32         0
> +#define TCG_TARGET_HAS_nor_i32          0
> +#define TCG_TARGET_HAS_deposit_i32      0
> +#define TCG_TARGET_HAS_movcond_i32      0
> +#define TCG_TARGET_HAS_add2_i32         0
> +#define TCG_TARGET_HAS_sub2_i32         0
> +#define TCG_TARGET_HAS_mulu2_i32        0
> +#define TCG_TARGET_HAS_muls2_i32        0
> +
> +#define TCG_TARGET_HAS_div_i64          0
> +#define TCG_TARGET_HAS_ext8s_i64        0
> +#define TCG_TARGET_HAS_ext16s_i64       0
> +#define TCG_TARGET_HAS_ext32s_i64       0
> +#define TCG_TARGET_HAS_ext8u_i64        0
> +#define TCG_TARGET_HAS_ext16u_i64       0
> +#define TCG_TARGET_HAS_ext32u_i64       0
> +#define TCG_TARGET_HAS_bswap16_i64      0
> +#define TCG_TARGET_HAS_bswap32_i64      0
> +#define TCG_TARGET_HAS_bswap64_i64      0
> +#define TCG_TARGET_HAS_not_i64          0
> +#define TCG_TARGET_HAS_neg_i64          0
> +#define TCG_TARGET_HAS_rot_i64          1
> +#define TCG_TARGET_HAS_andc_i64         0
> +#define TCG_TARGET_HAS_orc_i64          0
> +#define TCG_TARGET_HAS_eqv_i64          0
> +#define TCG_TARGET_HAS_nand_i64         0
> +#define TCG_TARGET_HAS_nor_i64          0
> +#define TCG_TARGET_HAS_deposit_i64      0
> +#define TCG_TARGET_HAS_movcond_i64      0
> +#define TCG_TARGET_HAS_add2_i64         0
> +#define TCG_TARGET_HAS_sub2_i64         0
> +#define TCG_TARGET_HAS_mulu2_i64        0
> +#define TCG_TARGET_HAS_muls2_i64        0
> +
> +enum {
> +    TCG_AREG0 = TCG_REG_X19,
> +};
> +
> +static inline void flush_icache_range(tcg_target_ulong start,
> +                                      tcg_target_ulong stop)
> +{
> +    __builtin___clear_cache((char *)start, (char *)stop);
> +}
> +
> +#endif /* TCG_TARGET_AARCH64 */
> diff --git a/translate-all.c b/translate-all.c
> index da93608..9d265bf 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -461,6 +461,8 @@ static inline PageDesc *page_find(tb_page_addr_t index)
>  # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
>  #elif defined(__sparc__)
>  # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
> +#elif defined(__aarch64__)
> +# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
>  #elif defined(__arm__)
>  # define MAX_CODE_GEN_BUFFER_SIZE  (16u * 1024 * 1024)
>  #elif defined(__s390x__)
> --
> 1.8.1
>
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27  9:47     ` Laurent Desnogues
@ 2013-05-27 10:13       ` Claudio Fontana
  2013-05-27 10:28         ` Laurent Desnogues
  0 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-27 10:13 UTC (permalink / raw)
  To: Laurent Desnogues; +Cc: Peter Maydell, qemu-devel, Richard Henderson

Hello,

On 27.05.2013 11:47, Laurent Desnogues wrote:
> Hi,
> 
> basically pointing out what I pointed out for v1.
> 
> On Thu, May 23, 2013 at 10:18 AM, Claudio Fontana
> <claudio.fontana@huawei.com> wrote:
>>
>> add preliminary support for TCG target aarch64.
>>
>> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
>> ---
>>  include/exec/exec-all.h  |    5 +-
>>  tcg/aarch64/tcg-target.c | 1185 ++++++++++++++++++++++++++++++++++++++++++++++
>>  tcg/aarch64/tcg-target.h |   99 ++++
>>  translate-all.c          |    2 +
>>  4 files changed, 1290 insertions(+), 1 deletion(-)
>>  create mode 100644 tcg/aarch64/tcg-target.c
>>  create mode 100644 tcg/aarch64/tcg-target.h
>>
> [...]
>> +++ b/tcg/aarch64/tcg-target.c
> [...]
>> +static void tcg_target_qemu_prologue(TCGContext *s)
>> +{
>> +    /* NB: frame sizes are in 16 byte stack units! */
>> +    int frame_size_callee_saved, frame_size_tcg_locals;
>> +    int r;
>> +
>> +    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
>> +    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
> 
> Please add a comment about this computation.

Frame sizes are calculated in 16-byte units, which is the first comment.
What this computation does is commented on the line above the computation.
Each unit represents a pair: (FP, LR), (X19, X20) .. (X27, X28).
The calculation just counts the number of units to save; there seems to be little else to comment on, as far as I can see.
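
That said, if a slightly more explicit comment helps, something along these
lines could go right above the computation (the wording is just a suggestion):

    /* one 16-byte unit for the (FP, LR) pair plus one unit for each of the
       five pairs (X19, X20) .. (X27, X28):
       1 + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1 = 1 + 4 + 1 = 6 units */
    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;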

>> +
>> +    /* frame size requirement for TCG local variables */
>> +    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
>> +        + CPU_TEMP_BUF_NLONGS * sizeof(long)
>> +        + (TCG_TARGET_STACK_ALIGN - 1);
>> +    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
>> +    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
>> +
>> +    /* push (FP, LR) and update sp */
>> +    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
>> +
>> +    /* FP -> callee_saved */
>> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
>> +
>> +    /* store callee-preserved regs x19..x28 using FP -> callee_saved */
>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
> 
> TCG_REG_X27 -> TCG_REG_X28.

That would be a mistake: we are using X19 as the first element of the pair (X19, X20),
so the last element is identified by X27 as the first element of the pair (X27, X28).
Note the r += 2, and the fact that we are storing r and r+1, as you can see...
 
>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>> +        tcg_out_store_pair(s, r, r + 1, idx);

... here.

>> +    }
>> +
>> +    /* make stack space for TCG locals */
>> +    tcg_out_subi(s, 1, TCG_REG_SP, TCG_REG_SP,
>> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
>> +    /* inform TCG about how to find TCG locals with register, offset, size */
>> +    tcg_set_frame(s, TCG_REG_SP, TCG_STATIC_CALL_ARGS_SIZE,
>> +                  CPU_TEMP_BUF_NLONGS * sizeof(long));
>> +
>> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
>> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
>> +
>> +    tb_ret_addr = s->code_ptr;
>> +
>> +    /* remove TCG locals stack space */
>> +    tcg_out_addi(s, 1, TCG_REG_SP, TCG_REG_SP,
>> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
>> +
>> +    /* restore registers x19..x28.
>> +       FP must be preserved, so it still points to callee_saved area */
>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
> 
> TCG_REG_X27 -> TCG_REG_X28.
> Thanks,
> 
> Laurent
> 
>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>> +        tcg_out_load_pair(s, r, r + 1, idx);
>> +    }
>> +
>> +    /* pop (FP, LR), restore SP to previous frame, return */
>> +    tcg_out_pop_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
>> +    tcg_out_ret(s);
>> +}
>> diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
>> new file mode 100644
>> index 0000000..075ab2a
>> --- /dev/null
>> +++ b/tcg/aarch64/tcg-target.h
>> @@ -0,0 +1,99 @@
>> +/*
>> + * Initial TCG Implementation for aarch64
>> + *
>> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
>> + * Written by Claudio Fontana
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * (at your option) any later version.
>> + *
>> + * See the COPYING file in the top-level directory for details.
>> + */
>> +
>> +#ifndef TCG_TARGET_AARCH64
>> +#define TCG_TARGET_AARCH64 1
>> +
>> +#undef TCG_TARGET_WORDS_BIGENDIAN
>> +#undef TCG_TARGET_STACK_GROWSUP
>> +
>> +typedef enum {
>> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3, TCG_REG_X4,
>> +    TCG_REG_X5, TCG_REG_X6, TCG_REG_X7, TCG_REG_X8, TCG_REG_X9,
>> +    TCG_REG_X10, TCG_REG_X11, TCG_REG_X12, TCG_REG_X13, TCG_REG_X14,
>> +    TCG_REG_X15, TCG_REG_X16, TCG_REG_X17, TCG_REG_X18, TCG_REG_X19,
>> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23, TCG_REG_X24,
>> +    TCG_REG_X25, TCG_REG_X26, TCG_REG_X27, TCG_REG_X28,
>> +    TCG_REG_FP,  /* frame pointer */
>> +    TCG_REG_LR, /* link register */
>> +    TCG_REG_SP,  /* stack pointer or zero register */
>> +    TCG_REG_XZR = TCG_REG_SP /* same register number */
>> +    /* program counter is not directly accessible! */
>> +} TCGReg;
>> +
>> +#define TCG_TARGET_NB_REGS 32
>> +
>> +/* used for function call generation */
>> +#define TCG_REG_CALL_STACK              TCG_REG_SP
>> +#define TCG_TARGET_STACK_ALIGN          16
>> +#define TCG_TARGET_CALL_ALIGN_ARGS      1
>> +#define TCG_TARGET_CALL_STACK_OFFSET    0
>> +
>> +/* optional instructions */
>> +#define TCG_TARGET_HAS_div_i32          0
>> +#define TCG_TARGET_HAS_ext8s_i32        0
>> +#define TCG_TARGET_HAS_ext16s_i32       0
>> +#define TCG_TARGET_HAS_ext8u_i32        0
>> +#define TCG_TARGET_HAS_ext16u_i32       0
>> +#define TCG_TARGET_HAS_bswap16_i32      0
>> +#define TCG_TARGET_HAS_bswap32_i32      0
>> +#define TCG_TARGET_HAS_not_i32          0
>> +#define TCG_TARGET_HAS_neg_i32          0
>> +#define TCG_TARGET_HAS_rot_i32          1
>> +#define TCG_TARGET_HAS_andc_i32         0
>> +#define TCG_TARGET_HAS_orc_i32          0
>> +#define TCG_TARGET_HAS_eqv_i32          0
>> +#define TCG_TARGET_HAS_nand_i32         0
>> +#define TCG_TARGET_HAS_nor_i32          0
>> +#define TCG_TARGET_HAS_deposit_i32      0
>> +#define TCG_TARGET_HAS_movcond_i32      0
>> +#define TCG_TARGET_HAS_add2_i32         0
>> +#define TCG_TARGET_HAS_sub2_i32         0
>> +#define TCG_TARGET_HAS_mulu2_i32        0
>> +#define TCG_TARGET_HAS_muls2_i32        0
>> +
>> +#define TCG_TARGET_HAS_div_i64          0
>> +#define TCG_TARGET_HAS_ext8s_i64        0
>> +#define TCG_TARGET_HAS_ext16s_i64       0
>> +#define TCG_TARGET_HAS_ext32s_i64       0
>> +#define TCG_TARGET_HAS_ext8u_i64        0
>> +#define TCG_TARGET_HAS_ext16u_i64       0
>> +#define TCG_TARGET_HAS_ext32u_i64       0
>> +#define TCG_TARGET_HAS_bswap16_i64      0
>> +#define TCG_TARGET_HAS_bswap32_i64      0
>> +#define TCG_TARGET_HAS_bswap64_i64      0
>> +#define TCG_TARGET_HAS_not_i64          0
>> +#define TCG_TARGET_HAS_neg_i64          0
>> +#define TCG_TARGET_HAS_rot_i64          1
>> +#define TCG_TARGET_HAS_andc_i64         0
>> +#define TCG_TARGET_HAS_orc_i64          0
>> +#define TCG_TARGET_HAS_eqv_i64          0
>> +#define TCG_TARGET_HAS_nand_i64         0
>> +#define TCG_TARGET_HAS_nor_i64          0
>> +#define TCG_TARGET_HAS_deposit_i64      0
>> +#define TCG_TARGET_HAS_movcond_i64      0
>> +#define TCG_TARGET_HAS_add2_i64         0
>> +#define TCG_TARGET_HAS_sub2_i64         0
>> +#define TCG_TARGET_HAS_mulu2_i64        0
>> +#define TCG_TARGET_HAS_muls2_i64        0
>> +
>> +enum {
>> +    TCG_AREG0 = TCG_REG_X19,
>> +};
>> +
>> +static inline void flush_icache_range(tcg_target_ulong start,
>> +                                      tcg_target_ulong stop)
>> +{
>> +    __builtin___clear_cache((char *)start, (char *)stop);
>> +}
>> +
>> +#endif /* TCG_TARGET_AARCH64 */
>> diff --git a/translate-all.c b/translate-all.c
>> index da93608..9d265bf 100644
>> --- a/translate-all.c
>> +++ b/translate-all.c
>> @@ -461,6 +461,8 @@ static inline PageDesc *page_find(tb_page_addr_t index)
>>  # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
>>  #elif defined(__sparc__)
>>  # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
>> +#elif defined(__aarch64__)
>> +# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
>>  #elif defined(__arm__)
>>  # define MAX_CODE_GEN_BUFFER_SIZE  (16u * 1024 * 1024)
>>  #elif defined(__s390x__)
>> --
>> 1.8.1

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27 10:13       ` Claudio Fontana
@ 2013-05-27 10:28         ` Laurent Desnogues
  0 siblings, 0 replies; 60+ messages in thread
From: Laurent Desnogues @ 2013-05-27 10:28 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, qemu-devel, Richard Henderson

On Mon, May 27, 2013 at 12:13 PM, Claudio Fontana
<claudio.fontana@huawei.com> wrote:
> Hello,
>
> On 27.05.2013 11:47, Laurent Desnogues wrote:
>> Hi,
>>
>> basically pointing out what I pointed out for v1.
>>
>> On Thu, May 23, 2013 at 10:18 AM, Claudio Fontana
>> <claudio.fontana@huawei.com> wrote:
>>>
>>> add preliminary support for TCG target aarch64.
>>>
>>> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> ---
>>>  include/exec/exec-all.h  |    5 +-
>>>  tcg/aarch64/tcg-target.c | 1185 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  tcg/aarch64/tcg-target.h |   99 ++++
>>>  translate-all.c          |    2 +
>>>  4 files changed, 1290 insertions(+), 1 deletion(-)
>>>  create mode 100644 tcg/aarch64/tcg-target.c
>>>  create mode 100644 tcg/aarch64/tcg-target.h
>>>
>> [...]
>>> +++ b/tcg/aarch64/tcg-target.c
>> [...]
>>> +static void tcg_target_qemu_prologue(TCGContext *s)
>>> +{
>>> +    /* NB: frame sizes are in 16 byte stack units! */
>>> +    int frame_size_callee_saved, frame_size_tcg_locals;
>>> +    int r;
>>> +
>>> +    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
>>> +    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
>>
>> Please add a comment about this computation.
>
> Frame sizes are calculated in 16-byte units, which is the first comment.
> What this computation does is commented on the line above the computation.
> Each unit represents a pair: (FP, LR), (X19, X20) .. (X27, X28).
> The calculation just counts the number of units to save; there seems to be little else to comment on, as far as I can see.

You really should read my original message, where I explained
why your computation is odd (pun intended) ;-)

>>> +
>>> +    /* frame size requirement for TCG local variables */
>>> +    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
>>> +        + CPU_TEMP_BUF_NLONGS * sizeof(long)
>>> +        + (TCG_TARGET_STACK_ALIGN - 1);
>>> +    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
>>> +    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
>>> +
>>> +    /* push (FP, LR) and update sp */
>>> +    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
>>> +
>>> +    /* FP -> callee_saved */
>>> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
>>> +
>>> +    /* store callee-preserved regs x19..x28 using FP -> callee_saved */
>>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
>>
>> TCG_REG_X27 -> TCG_REG_X28.
>
> That would be a mistake: we are using X19 as the first element of the pair (X19, X20),
> so the last element is identified by X27 as the first element of the pair (X27, X28).
> Note the r += 2, and the fact that we are storing r and r+1, as you can see...

Sorry, but using TCG_REG_X28 wouldn't be a mistake; try it by
hand.

Now imagine the PCS had said X19-X29 should be saved.
If you had used X28 as the end-of-loop test, the code would have
been wrong.

My comments aren't pointing at real bugs; they just try to improve
readability. You can safely ignore them, as the PCS isn't likely to
change :)
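
Just to make the equivalence concrete (reusing the store loop from the patch):

    /* with the current X19..X28 set the iterations are identical whether
       the test is r <= TCG_REG_X27 or r <= TCG_REG_X28: r takes the
       values X19, X21, X23, X25, X27 in both cases */
    for (r = TCG_REG_X19; r <= TCG_REG_X28; r += 2) {
        tcg_out_store_pair(s, r, r + 1, (r - TCG_REG_X19) / 2 + 1);
    }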


Laurent


>>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>>> +        tcg_out_store_pair(s, r, r + 1, idx);
>
> ... here.
>
>>> +    }
>>> +
>>> +    /* make stack space for TCG locals */
>>> +    tcg_out_subi(s, 1, TCG_REG_SP, TCG_REG_SP,
>>> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
>>> +    /* inform TCG about how to find TCG locals with register, offset, size */
>>> +    tcg_set_frame(s, TCG_REG_SP, TCG_STATIC_CALL_ARGS_SIZE,
>>> +                  CPU_TEMP_BUF_NLONGS * sizeof(long));
>>> +
>>> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
>>> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
>>> +
>>> +    tb_ret_addr = s->code_ptr;
>>> +
>>> +    /* remove TCG locals stack space */
>>> +    tcg_out_addi(s, 1, TCG_REG_SP, TCG_REG_SP,
>>> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
>>> +
>>> +    /* restore registers x19..x28.
>>> +       FP must be preserved, so it still points to callee_saved area */
>>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
>>
>> TCG_REG_X27 -> TCG_REG_X28.
>> Thanks,
>>
>> Laurent
>>
>>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>>> +        tcg_out_load_pair(s, r, r + 1, idx);
>>> +    }
>>> +
>>> +    /* pop (FP, LR), restore SP to previous frame, return */
>>> +    tcg_out_pop_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
>>> +    tcg_out_ret(s);
>>> +}
>>> diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
>>> new file mode 100644
>>> index 0000000..075ab2a
>>> --- /dev/null
>>> +++ b/tcg/aarch64/tcg-target.h
>>> @@ -0,0 +1,99 @@
>>> +/*
>>> + * Initial TCG Implementation for aarch64
>>> + *
>>> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
>>> + * Written by Claudio Fontana
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>> + * (at your option) any later version.
>>> + *
>>> + * See the COPYING file in the top-level directory for details.
>>> + */
>>> +
>>> +#ifndef TCG_TARGET_AARCH64
>>> +#define TCG_TARGET_AARCH64 1
>>> +
>>> +#undef TCG_TARGET_WORDS_BIGENDIAN
>>> +#undef TCG_TARGET_STACK_GROWSUP
>>> +
>>> +typedef enum {
>>> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3, TCG_REG_X4,
>>> +    TCG_REG_X5, TCG_REG_X6, TCG_REG_X7, TCG_REG_X8, TCG_REG_X9,
>>> +    TCG_REG_X10, TCG_REG_X11, TCG_REG_X12, TCG_REG_X13, TCG_REG_X14,
>>> +    TCG_REG_X15, TCG_REG_X16, TCG_REG_X17, TCG_REG_X18, TCG_REG_X19,
>>> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23, TCG_REG_X24,
>>> +    TCG_REG_X25, TCG_REG_X26, TCG_REG_X27, TCG_REG_X28,
>>> +    TCG_REG_FP,  /* frame pointer */
>>> +    TCG_REG_LR, /* link register */
>>> +    TCG_REG_SP,  /* stack pointer or zero register */
>>> +    TCG_REG_XZR = TCG_REG_SP /* same register number */
>>> +    /* program counter is not directly accessible! */
>>> +} TCGReg;
>>> +
>>> +#define TCG_TARGET_NB_REGS 32
>>> +
>>> +/* used for function call generation */
>>> +#define TCG_REG_CALL_STACK              TCG_REG_SP
>>> +#define TCG_TARGET_STACK_ALIGN          16
>>> +#define TCG_TARGET_CALL_ALIGN_ARGS      1
>>> +#define TCG_TARGET_CALL_STACK_OFFSET    0
>>> +
>>> +/* optional instructions */
>>> +#define TCG_TARGET_HAS_div_i32          0
>>> +#define TCG_TARGET_HAS_ext8s_i32        0
>>> +#define TCG_TARGET_HAS_ext16s_i32       0
>>> +#define TCG_TARGET_HAS_ext8u_i32        0
>>> +#define TCG_TARGET_HAS_ext16u_i32       0
>>> +#define TCG_TARGET_HAS_bswap16_i32      0
>>> +#define TCG_TARGET_HAS_bswap32_i32      0
>>> +#define TCG_TARGET_HAS_not_i32          0
>>> +#define TCG_TARGET_HAS_neg_i32          0
>>> +#define TCG_TARGET_HAS_rot_i32          1
>>> +#define TCG_TARGET_HAS_andc_i32         0
>>> +#define TCG_TARGET_HAS_orc_i32          0
>>> +#define TCG_TARGET_HAS_eqv_i32          0
>>> +#define TCG_TARGET_HAS_nand_i32         0
>>> +#define TCG_TARGET_HAS_nor_i32          0
>>> +#define TCG_TARGET_HAS_deposit_i32      0
>>> +#define TCG_TARGET_HAS_movcond_i32      0
>>> +#define TCG_TARGET_HAS_add2_i32         0
>>> +#define TCG_TARGET_HAS_sub2_i32         0
>>> +#define TCG_TARGET_HAS_mulu2_i32        0
>>> +#define TCG_TARGET_HAS_muls2_i32        0
>>> +
>>> +#define TCG_TARGET_HAS_div_i64          0
>>> +#define TCG_TARGET_HAS_ext8s_i64        0
>>> +#define TCG_TARGET_HAS_ext16s_i64       0
>>> +#define TCG_TARGET_HAS_ext32s_i64       0
>>> +#define TCG_TARGET_HAS_ext8u_i64        0
>>> +#define TCG_TARGET_HAS_ext16u_i64       0
>>> +#define TCG_TARGET_HAS_ext32u_i64       0
>>> +#define TCG_TARGET_HAS_bswap16_i64      0
>>> +#define TCG_TARGET_HAS_bswap32_i64      0
>>> +#define TCG_TARGET_HAS_bswap64_i64      0
>>> +#define TCG_TARGET_HAS_not_i64          0
>>> +#define TCG_TARGET_HAS_neg_i64          0
>>> +#define TCG_TARGET_HAS_rot_i64          1
>>> +#define TCG_TARGET_HAS_andc_i64         0
>>> +#define TCG_TARGET_HAS_orc_i64          0
>>> +#define TCG_TARGET_HAS_eqv_i64          0
>>> +#define TCG_TARGET_HAS_nand_i64         0
>>> +#define TCG_TARGET_HAS_nor_i64          0
>>> +#define TCG_TARGET_HAS_deposit_i64      0
>>> +#define TCG_TARGET_HAS_movcond_i64      0
>>> +#define TCG_TARGET_HAS_add2_i64         0
>>> +#define TCG_TARGET_HAS_sub2_i64         0
>>> +#define TCG_TARGET_HAS_mulu2_i64        0
>>> +#define TCG_TARGET_HAS_muls2_i64        0
>>> +
>>> +enum {
>>> +    TCG_AREG0 = TCG_REG_X19,
>>> +};
>>> +
>>> +static inline void flush_icache_range(tcg_target_ulong start,
>>> +                                      tcg_target_ulong stop)
>>> +{
>>> +    __builtin___clear_cache((char *)start, (char *)stop);
>>> +}
>>> +
>>> +#endif /* TCG_TARGET_AARCH64 */
>>> diff --git a/translate-all.c b/translate-all.c
>>> index da93608..9d265bf 100644
>>> --- a/translate-all.c
>>> +++ b/translate-all.c
>>> @@ -461,6 +461,8 @@ static inline PageDesc *page_find(tb_page_addr_t index)
>>>  # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
>>>  #elif defined(__sparc__)
>>>  # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
>>> +#elif defined(__aarch64__)
>>> +# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
>>>  #elif defined(__arm__)
>>>  # define MAX_CODE_GEN_BUFFER_SIZE  (16u * 1024 * 1024)
>>>  #elif defined(__s390x__)
>>> --
>>> 1.8.1
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27  9:10         ` Claudio Fontana
@ 2013-05-27 10:40           ` Peter Maydell
  2013-05-27 17:05           ` Richard Henderson
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Maydell @ 2013-05-27 10:40 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Jani Kokkonen, qemu-devel, Richard Henderson

On 27 May 2013 10:10, Claudio Fontana <claudio.fontana@huawei.com> wrote:
> Would it be acceptable to put a comment at the beginning of the function
> describing ext use, to avoid a series of /* fall through */ comments?

The 'fall through' comments are for the benefit of automatic checking
and linting tools, not just human readers, so you can't abbreviate them,
I'm afraid.
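
In practice that means each fall-through point needs to spell it out, e.g.
(reusing the placeholder names from your example):

    case INDEX_op_something_64:
        ext = 1;
        /* fall through */
    case INDEX_op_something_32:
        the_actual_meat(s, ext, ...);
        break;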

thanks
-- PMM

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-24 17:02         ` Richard Henderson
  2013-05-24 17:08           ` Peter Maydell
@ 2013-05-27 11:43           ` Claudio Fontana
  2013-05-27 18:47             ` Richard Henderson
  1 sibling, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-27 11:43 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, Jani Kokkonen, qemu-devel

On 24.05.2013 19:02, Richard Henderson wrote:
> On 05/24/2013 01:53 AM, Claudio Fontana wrote:
>>> No real need to special case zero; it's just an extra test slowing down the
>>> compiler.
>>
>> Yes, we need to handle the special case zero.
>> Otherwise no instruction at all would be emitted for value 0.
> 
> Hmm, true.  Although I'd been thinking more along the lines of
> arranging the code such that we'd use movz to set the zero.

I think we need to keep treating zero specially if we want to keep the optimization where we don't emit needless MOVK instructions for half-words of value 0000h.

I can however make one single function out of movi32 and movi64, it could look like this:

if (!value) {
    tcg_out_movr(s, 0, rd, TCG_REG_ZXR);
    return;
}

base = (value > 0xffffffff) ? 0xd2800000 : 0x52800000;

while (value) {
    /* etc etc */
}

>> I actually don't know whether to prefer ext=0 or ext=1,
>> in the sense that it would be useful to know whether using the extended registers
>> with a small constant is performance-wise preferable to using the 32bit operation,
>> and relying on 0-extension. See also the rotation comment below.
> 
> From the armv8 isa overview:
> 
> # Rationale: [...] By maintaining this semantic information in the instruction
> # set, implementations can exploit this information to avoid expending energy
> # or cycles to compute, forward and store the unused upper 32 bits of such
> # data types. Implementations are free to exploit this freedom in whatever way
> # they choose to save energy.

I did not notice that; that solves the issue.

>>> addr_reg almost certainly needs to be zero-extended for 32-bit guests, easily
>>> done by setting ext = 0 here.
>>
>> I can easily put an #ifdef just to be sure.
> 
> No ifdef, just the TARGET_LONG_BITS == 64 comparison works.
> 
>>> You initialize FP, but you don't reserve the register, so it's going to get
>>> clobbered.  We don't actually use the frame pointer in the translated code, so
>>> I don't think there's any call to actually initialize it either.
>>
>> The FP is not going to be clobbered, not by code here and not by called code.
>>
>> It is not going to be clobbered between our use before the jump and after the
>> jump, because all the called functions need to preserve FP as mandated by the
>> calling conventions.
>>
>> It is not going to be clobbered from the point of view of our caller,
>> because we save (FP, LR) along with (X19, X20) .. (X27, X28) and restore them
>> before returning.
> 
> Ah, well, I didn't see it mentioned here,
> 
>> +    tcg_regset_clear(s->reserved_regs);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
> 
> but hadn't noticed that it's not listed in the reg_alloc_order.
> 
>> We use FP to point to the callee_saved registers, and to move to/from them
>> in the tcg_out_store_pair and tcg_out_load_pair functions.
> 
> I hadn't noticed you'd hard-coded FP into the load/store_pair functions.
> Let's *really* not do that.  Even if we decide to continue using it, let's
> pass it in explicitly.
> 
> But I don't see that you're really gaining anything in the prologue from
> using FP instead of SP.  It seems like a waste of a register to me.
> 
> 
> r~
> 


-- 
Claudio Fontana
Server OS Architect
Huawei Technologies Duesseldorf GmbH
Riesstraße 25 - 80992 München

office: +49 89 158834 4135
mobile: +49 15253060158

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27  9:10         ` Claudio Fontana
  2013-05-27 10:40           ` Peter Maydell
@ 2013-05-27 17:05           ` Richard Henderson
  1 sibling, 0 replies; 60+ messages in thread
From: Richard Henderson @ 2013-05-27 17:05 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, Jani Kokkonen, qemu-devel

On 2013-05-27 02:10, Claudio Fontana wrote:
>>>> +    case INDEX_op_mov_i64: ext = 1;
>>>
>>> Please don't put code on the same line as a case statement.
>>> Also fall-through cases should have an explicit /* fall through */
>>> comment (except in the case where there is no code at all
>>> between one case statement and the next).
>
> Would it be acceptable to put a comment at the beginning of the function
> describing ext use, to avoid a series of /* fall through */ comments?
>
> Like this:
>
> /* ext will be set in the switch below, which will fall through
>     to the common code. It triggers the use of extended registers
>     where appropriate. */
>
> and then going:
>
> case INDEX_op_something_64:
>      ext = 1;
> case INDEX_op_something_32:
>      the_actual_meat(s, ext, ...);
>      break;

I'll again suggest using the macro expansion that the i386 port uses.
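
If memory serves, the i386 port does it with a small macro, roughly like this
(a sketch from memory, not copied verbatim from tcg/i386/tcg-target.c):

    #define OP_32_64(x)                          \
            case glue(glue(INDEX_op_, x), _i64): \
            case glue(glue(INDEX_op_, x), _i32)

The opcode switch can then just write "OP_32_64(add):" in front of the shared
code, so the 32-bit and 64-bit cases sit together and no fall-through comment
is needed between them.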


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27 11:43           ` Claudio Fontana
@ 2013-05-27 18:47             ` Richard Henderson
  2013-05-27 21:14               ` [Qemu-devel] [PATCH 3/3] " Laurent Desnogues
  2013-05-28  7:17               ` [Qemu-devel] [PATCH 2/4] " Claudio Fontana
  0 siblings, 2 replies; 60+ messages in thread
From: Richard Henderson @ 2013-05-27 18:47 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, Jani Kokkonen, qemu-devel

On 2013-05-27 04:43, Claudio Fontana wrote:
>> Hmm, true.  Although I'd been thinking more along the lines of
>> arranging the code such that we'd use movz to set the zero.
>
> I think we need to keep treating zero specially if we want to keep the optimization where we don't emit needless MOVK instructions for half-words of value 0000h.
>
> I can however make one single function out of movi32 and movi64, it could look like this:
>
> if (!value) {
>      tcg_out_movr(s, 0, rd, TCG_REG_ZXR);
>      return;
> }
>
> base = (value > 0xffffffff) ? 0xd2800000 : 0x52800000;
>
> while (value) {
>      /* etc etc */
> }


     if (type == TCG_TYPE_I32) {
         value = (uint32_t)value;
         ext = 0;
     } else if (value <= 0xffffffff) {
         ext = 0;
     } else {
         ext = 0x80000000;
     }

     base = 0x52800000;  /* MOVZ */
     do {
         int shift = ctz64(value) & (63 & -16);
         int half = (value >> shift) & 0xffff;
         tcg_out32(s, base | ext | half << 5 | rd);
         value &= ~(0xffffUL << shift);
         base = 0x72800000;  /* MOVK */
     } while (value != 0);


Since we go through the loop at least once, we emit the movz for zero input. 
No need for any extra tests.  And using ctz we can iterate fewer times.
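
As an example (taking rd = x0, and noting that shift / 16 still has to be
encoded into the hw field, bits [22:21], of the instruction), a value such as
0xdeadbeef00001234 would take three iterations and emit:

    movz x0, #0x1234
    movk x0, #0xbeef, lsl #32
    movk x0, #0xdead, lsl #48

i.e. the all-zero half-word at bits [31:16] is skipped entirely.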


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27 18:47             ` Richard Henderson
@ 2013-05-27 21:14               ` Laurent Desnogues
  2013-05-28 13:01                 ` Claudio Fontana
  2013-05-28  7:17               ` [Qemu-devel] [PATCH 2/4] " Claudio Fontana
  1 sibling, 1 reply; 60+ messages in thread
From: Laurent Desnogues @ 2013-05-27 21:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Jani Kokkonen, Claudio Fontana, qemu-devel

On Monday, May 27, 2013, Richard Henderson <rth@twiddle.net> wrote:
> On 2013-05-27 04:43, Claudio Fontana wrote:
>>>
>>> Hmm, true.  Although I'd been thinking more along the lines of
>>> arranging the code such that we'd use movz to set the zero.
>>
>> I think we need to keep treating zero specially if we want to keep the
>> optimization where we don't emit needless MOVK instructions for half-words
>> of value 0000h.
>>
>> I can however make one single function out of movi32 and movi64, it
>> could look like this:
>>
>> if (!value) {
>>      tcg_out_movr(s, 0, rd, TCG_REG_ZXR);
>>      return;
>> }
>>
>> base = (value > 0xffffffff) ? 0xd2800000 : 0x52800000;
>>
>> while (value) {
>>      /* etc etc */
>> }
>
>
>     if (type == TCG_TYPE_I32) {
>         value = (uint32_t)value;
>         ext = 0;
>     } else if (value <= 0xffffffff) {
>         ext = 0;
>     } else {
>         ext = 0x80000000;
>     }
>
>     base = 0x52800000;  /* MOVZ */
>     do {
>         int shift = ctz64(value) & (63 & -16);
>         int half = (value >> shift) & 0xffff;
>         tcg_out32(s, base | ext | half << 5 | rd);
>         value &= ~(0xffffUL << shift);
>         base = 0x72800000;  /* MOVK */
>     } while (value != 0);
>
>
> Since we go through the loop at least once, we emit the movz for zero
> input. No need for any extra tests.  And using ctz we can iterate fewer
> times.

You could probably go one step further and use the logical
immediate encoding.  Look at build_immediate_table in binutils
opcodes/aarch64-opc.c.  The problem would be the use of
binary search; perhaps one could come up with some perfect
hash function.


Laurent

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27 18:47             ` Richard Henderson
  2013-05-27 21:14               ` [Qemu-devel] [PATCH 3/3] " Laurent Desnogues
@ 2013-05-28  7:17               ` Claudio Fontana
  2013-05-28 14:52                 ` Richard Henderson
  1 sibling, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-28  7:17 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, Jani Kokkonen, qemu-devel

On 27.05.2013 20:47, Richard Henderson wrote:
> On 2013-05-27 04:43, Claudio Fontana wrote:
>>> Hmm, true.  Although I'd been thinking more along the lines of
>>> arranging the code such that we'd use movz to set the zero.
>>
>> I think we need to keep treating zero specially if we want to keep the optimization where we don't emit needless MOVK instructions for half-words of value 0000h.
>>
>> I can however make one single function out of movi32 and movi64, it could look like this:
>>
>> if (!value) {
>>      tcg_out_movr(s, 0, rd, TCG_REG_ZXR);
>>      return;
>> }
>>
>> base = (value > 0xffffffff) ? 0xd2800000 : 0x52800000;
>>
>> while (value) {
>>      /* etc etc */
>> }
> 
> 
>     if (type == TCG_TYPE_I32) {
>         value = (uint32_t)value;
>         ext = 0;
>     } else if (value <= 0xffffffff) {
>         ext = 0;
>     } else {
>         ext = 0x80000000;
>     }

The check for type is probably unnecessary, since we don't gain anything (we still have to check something once), so I'd rather use a uint64_t parameter and then just check for value < 0xffffffff.
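
Putting the two together, the combined helper might end up looking something
like this (an untested sketch; the function name and the ctz64 helper from
host-utils.h are assumptions on my side, and the hw field, shift / 16 in bits
[22:21], is encoded explicitly):

    static void tcg_out_movi_imm(TCGContext *s, TCGReg rd, uint64_t value)
    {
        uint32_t base = 0x52800000;                        /* MOVZ */
        uint32_t ext = (value > 0xffffffffULL) ? 0x80000000 : 0;

        do {
            int shift = ctz64(value) & (63 & -16);
            int half = (value >> shift) & 0xffff;
            tcg_out32(s, base | ext | (shift / 16) << 21 | half << 5 | rd);
            value &= ~(0xffffULL << shift);
            base = 0x72800000;                             /* MOVK */
        } while (value != 0);
    }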

> 
>     base = 0x52800000;  /* MOVZ */
>     do {
>         int shift = ctz64(value) & (63 & -16);
>         int half = (value >> shift) & 0xffff;
>         tcg_out32(s, base | ext | half << 5 | rd);
>         value &= ~(0xffffUL << shift);
>         base = 0x72800000;  /* MOVK */
>     } while (value != 0);
> 
> 
> Since we go through the loop at least once, we emit the movz for zero input. No need for any extra tests.  And using ctz we can iterate fewer times.

Of course, doh. I'll make use of do..while. 

Thanks,

Claudio

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs
  2013-05-23  8:14   ` [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
  2013-05-23 13:18     ` Peter Maydell
@ 2013-05-28  8:09     ` Laurent Desnogues
  1 sibling, 0 replies; 60+ messages in thread
From: Laurent Desnogues @ 2013-05-28  8:09 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Paolo Bonzini

Hello,

On Thu, May 23, 2013 at 10:14 AM, Claudio Fontana
<claudio.fontana@huawei.com> wrote:
>
> we will use the 26bit relative relocs in the aarch64 tcg target.

Is there really any point in adding all of the relocation types?
i386 doesn't, mips doesn't, x86_64 doesn't.  I didn't check the
others.

I guess we can at least get rid of dynamic relocs.

Thanks,

Laurent

> Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
> ---
>  include/elf.h | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 129 insertions(+)
>
> diff --git a/include/elf.h b/include/elf.h
> index a21ea53..cf0d3e2 100644
> --- a/include/elf.h
> +++ b/include/elf.h
> @@ -129,6 +129,8 @@ typedef int64_t  Elf64_Sxword;
>
>  #define EM_XTENSA   94      /* Tensilica Xtensa */
>
> +#define EM_AARCH64  183
> +
>  /* This is the info that is needed to parse the dynamic section of the file */
>  #define DT_NULL                0
>  #define DT_NEEDED      1
> @@ -616,6 +618,133 @@ typedef struct {
>  /* Keep this the last entry.  */
>  #define R_ARM_NUM              256
>
> +/* ARM Aarch64 relocation types */
> +#define R_AARCH64_NONE                256 /* also accepts R_ARM_NONE (0) */
> +/* static data relocations */
> +#define R_AARCH64_ABS64               257
> +#define R_AARCH64_ABS32               258
> +#define R_AARCH64_ABS16               259
> +#define R_AARCH64_PREL64              260
> +#define R_AARCH64_PREL32              261
> +#define R_AARCH64_PREL16              262
> +/* static aarch64 group relocations */
> +/* group relocs to create unsigned data value or address inline */
> +#define R_AARCH64_MOVW_UABS_G0        263
> +#define R_AARCH64_MOVW_UABS_G0_NC     264
> +#define R_AARCH64_MOVW_UABS_G1        265
> +#define R_AARCH64_MOVW_UABS_G1_NC     266
> +#define R_AARCH64_MOVW_UABS_G2        267
> +#define R_AARCH64_MOVW_UABS_G2_NC     268
> +#define R_AARCH64_MOVW_UABS_G3        269
> +/* group relocs to create signed data or offset value inline */
> +#define R_AARCH64_MOVW_SABS_G0        270
> +#define R_AARCH64_MOVW_SABS_G1        271
> +#define R_AARCH64_MOVW_SABS_G2        272
> +/* relocs to generate 19, 21, and 33 bit PC-relative addresses */
> +#define R_AARCH64_LD_PREL_LO19        273
> +#define R_AARCH64_ADR_PREL_LO21       274
> +#define R_AARCH64_ADR_PREL_PG_HI21    275
> +#define R_AARCH64_ADR_PREL_PG_HI21_NC 276
> +#define R_AARCH64_ADD_ABS_LO12_NC     277
> +#define R_AARCH64_LDST8_ABS_LO12_NC   278
> +#define R_AARCH64_LDST16_ABS_LO12_NC  284
> +#define R_AARCH64_LDST32_ABS_LO12_NC  285
> +#define R_AARCH64_LDST64_ABS_LO12_NC  286
> +#define R_AARCH64_LDST128_ABS_LO12_NC 299
> +/* relocs for control-flow - all offsets as multiple of 4 */
> +#define R_AARCH64_TSTBR14             279
> +#define R_AARCH64_CONDBR19            280
> +#define R_AARCH64_JUMP26              282
> +#define R_AARCH64_CALL26              283
> +/* group relocs to create pc-relative offset inline */
> +#define R_AARCH64_MOVW_PREL_G0        287
> +#define R_AARCH64_MOVW_PREL_G0_NC     288
> +#define R_AARCH64_MOVW_PREL_G1        289
> +#define R_AARCH64_MOVW_PREL_G1_NC     290
> +#define R_AARCH64_MOVW_PREL_G2        291
> +#define R_AARCH64_MOVW_PREL_G2_NC     292
> +#define R_AARCH64_MOVW_PREL_G3        293
> +/* group relocs to create a GOT-relative offset inline */
> +#define R_AARCH64_MOVW_GOTOFF_G0      300
> +#define R_AARCH64_MOVW_GOTOFF_G0_NC   301
> +#define R_AARCH64_MOVW_GOTOFF_G1      302
> +#define R_AARCH64_MOVW_GOTOFF_G1_NC   303
> +#define R_AARCH64_MOVW_GOTOFF_G2      304
> +#define R_AARCH64_MOVW_GOTOFF_G2_NC   305
> +#define R_AARCH64_MOVW_GOTOFF_G3      306
> +/* GOT-relative data relocs */
> +#define R_AARCH64_GOTREL64            307
> +#define R_AARCH64_GOTREL32            308
> +/* GOT-relative instr relocs */
> +#define R_AARCH64_GOT_LD_PREL19       309
> +#define R_AARCH64_LD64_GOTOFF_LO15    310
> +#define R_AARCH64_ADR_GOT_PAGE        311
> +#define R_AARCH64_LD64_GOT_LO12_NC    312
> +#define R_AARCH64_LD64_GOTPAGE_LO15   313
> +/* General Dynamic TLS relocations */
> +#define R_AARCH64_TLSGD_ADR_PREL21            512
> +#define R_AARCH64_TLSGD_ADR_PAGE21            513
> +#define R_AARCH64_TLSGD_ADD_LO12_NC           514
> +#define R_AARCH64_TLSGD_MOVW_G1               515
> +#define R_AARCH64_TLSGD_MOVW_G0_NC            516
> +/* Local Dynamic TLS relocations */
> +#define R_AARCH64_TLSLD_ADR_PREL21            517
> +#define R_AARCH64_TLSLD_ADR_PAGE21            518
> +#define R_AARCH64_TLSLD_ADD_LO12_NC           519
> +#define R_AARCH64_TLSLD_MOVW_G1               520
> +#define R_AARCH64_TLSLD_MOVW_G0_NC            521
> +#define R_AARCH64_TLSLD_LD_PREL19             522
> +#define R_AARCH64_TLSLD_MOVW_DTPREL_G2        523
> +#define R_AARCH64_TLSLD_MOVW_DTPREL_G1        524
> +#define R_AARCH64_TLSLD_MOVW_DTPREL_G1_NC     525
> +#define R_AARCH64_TLSLD_MOVW_DTPREL_G0        526
> +#define R_AARCH64_TLSLD_MOVW_DTPREL_G0_NC     527
> +#define R_AARCH64_TLSLD_ADD_DTPREL_HI12       528
> +#define R_AARCH64_TLSLD_ADD_DTPREL_LO12       529
> +#define R_AARCH64_TLSLD_ADD_DTPREL_LO12_NC    530
> +#define R_AARCH64_TLSLD_LDST8_DTPREL_LO12     531
> +#define R_AARCH64_TLSLD_LDST8_DTPREL_LO12_NC  532
> +#define R_AARCH64_TLSLD_LDST16_DTPREL_LO12    533
> +#define R_AARCH64_TLSLD_LDST16_DTPREL_LO12_NC 534
> +#define R_AARCH64_TLSLD_LDST32_DTPREL_LO12    535
> +#define R_AARCH64_TLSLD_LDST32_DTPREL_LO12_NC 536
> +#define R_AARCH64_TLSLD_LDST64_DTPREL_LO12    537
> +#define R_AARCH64_TLSLD_LDST64_DTPREL_LO12_NC 538
> +/* initial exec TLS relocations */
> +#define R_AARCH64_TLSIE_MOVW_GOTTPREL_G1      539
> +#define R_AARCH64_TLSIE_MOVW_GOTTPREL_G0_NC   540
> +#define R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21   541
> +#define R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC 542
> +#define R_AARCH64_TLSIE_LD_GOTTPREL_PREL19    543
> +/* local exec TLS relocations */
> +#define R_AARCH64_TLSLE_MOVW_TPREL_G2         544
> +#define R_AARCH64_TLSLE_MOVW_TPREL_G1         545
> +#define R_AARCH64_TLSLE_MOVW_TPREL_G1_NC      546
> +#define R_AARCH64_TLSLE_MOVW_TPREL_G0         547
> +#define R_AARCH64_TLSLE_MOVW_TPREL_G0_NC      548
> +#define R_AARCH64_TLSLE_ADD_TPREL_HI12        549
> +#define R_AARCH64_TLSLE_ADD_TPREL_LO12        550
> +#define R_AARCH64_TLSLE_ADD_TPREL_LO12_NC     551
> +#define R_AARCH64_TLSLE_LDST8_TPREL_LO12      552
> +#define R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC   553
> +#define R_AARCH64_TLSLE_LDST16_TPREL_LO12     554
> +#define R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC  555
> +#define R_AARCH64_TLSLE_LDST32_TPREL_LO12     556
> +#define R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC  557
> +#define R_AARCH64_TLSLE_LDST64_TPREL_LO12     558
> +#define R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC  559
> +/* Dynamic Relocations */
> +#define R_AARCH64_COPY         1024
> +#define R_AARCH64_GLOB_DAT     1025
> +#define R_AARCH64_JUMP_SLOT    1026
> +#define R_AARCH64_RELATIVE     1027
> +#define R_AARCH64_TLS_DTPREL64 1028
> +#define R_AARCH64_TLS_DTPMOD64 1029
> +#define R_AARCH64_TLS_TPREL64  1030
> +#define R_AARCH64_TLS_DTPREL32 1031
> +#define R_AARCH64_TLS_DTPMOD32 1032
> +#define R_AARCH64_TLS_TPREL32  1033
> +
>  /* s390 relocations defined by the ABIs */
>  #define R_390_NONE             0       /* No reloc.  */
>  #define R_390_8                        1       /* Direct 8 bit.  */
> --
> 1.8.1
>
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-27 21:14               ` [Qemu-devel] [PATCH 3/3] " Laurent Desnogues
@ 2013-05-28 13:01                 ` Claudio Fontana
  2013-05-28 13:09                   ` Laurent Desnogues
  0 siblings, 1 reply; 60+ messages in thread
From: Claudio Fontana @ 2013-05-28 13:01 UTC (permalink / raw)
  To: Laurent Desnogues
  Cc: Peter Maydell, Jani Kokkonen, qemu-devel, Richard Henderson

On 27.05.2013 23:14, Laurent Desnogues wrote:
> 
> 
> On Monday, May 27, 2013, Richard Henderson <rth@twiddle.net> wrote:
>> On 2013-05-27 04:43, Claudio Fontana wrote:
>>>>
>>>> Hmm, true.  Although I'd been thinking more along the lines of
>>>> arranging the code such that we'd use movz to set the zero.
>>>
>>> I think we need to keep treating zero specially if we want to keep the optimization where we don't emit needless MOVK instructions for half-words of value 0000h.
>>>
>>> I can however make one single function out of movi32 and movi64, it could look like this:
>>>
>>> if (!value) {
>>>      tcg_out_movr(s, 0, rd, TCG_REG_ZXR);
>>>      return;
>>> }
>>>
>>> base = (value > 0xffffffff) ? 0xd2800000 : 0x52800000;
>>>
>>> while (value) {
>>>      /* etc etc */
>>> }
>>
>>
>>     if (type == TCG_TYPE_I32) {
>>         value = (uint32_t)value;
>>         ext = 0;
>>     } else if (value <= 0xffffffff) {
>>         ext = 0;
>>     } else {
>>         ext = 0x80000000;
>>     }
>>
>>     base = 0x52800000;  /* MOVZ */
>>     do {
>>         int shift = ctz64(value) & (63 & -16);
>>         int half = (value >> shift) & 0xffff;
>>         tcg_out32(s, base | ext | half << 5 | rd);
>>         value &= ~(0xffffUL << shift);
>>         base = 0x72800000;  /* MOVK */
>>     } while (value != 0);
>>
>>
>> Since we go through the loop at least once, we emit the movz for zero input. No need for any extra tests.  And using ctz we can iterate fewer times.
> 
> You could probably go one step further and use the logical
> immediate encoding.  Look at build_immediate_table in binutils
> opcodes/aarch64-opc.c.  The problem would be the use of
> binary search; perhaps one could come up with some perfect
> hash function.
> 
> 
> Laurent

If it's OK, I would go with a variation of Richard's approach.
A more sophisticated approach, like the one you suggest, could be added in a follow-up patch.

Thanks,

Claudio

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64
  2013-05-28 13:01                 ` Claudio Fontana
@ 2013-05-28 13:09                   ` Laurent Desnogues
  0 siblings, 0 replies; 60+ messages in thread
From: Laurent Desnogues @ 2013-05-28 13:09 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Peter Maydell, Jani Kokkonen, qemu-devel, Richard Henderson

On Tue, May 28, 2013 at 3:01 PM, Claudio Fontana
<claudio.fontana@huawei.com> wrote:
> On 27.05.2013 23:14, Laurent Desnogues wrote:
>>
>>
>> On Monday, May 27, 2013, Richard Henderson <rth@twiddle.net <mailto:rth@twiddle.net>> wrote:
>>> On 2013-05-27 04:43, Claudio Fontana wrote:
>>>>>
>>>>> Hmm, true.  Although I'd been thinking more along the lines of
>>>>> arranging the code such that we'd use movz to set the zero.
>>>>
>>>> I think we need to keep treating zero specially if we want to keep the optimization where we don't emit needless MOVK instructions for half-words of value 0000h.
>>>>
>>>> I can however make one single function out of movi32 and movi64, it could look like this:
>>>>
>>>> if (!value) {
>>>>      tcg_out_movr(s, 0, rd, TCG_REG_ZXR);
>>>>      return;
>>>> }
>>>>
>>>> base = (value > 0xffffffff) ? 0xd2800000 : 0x52800000;
>>>>
>>>> while (value) {
>>>>      /* etc etc */
>>>> }
>>>
>>>
>>>     if (type == TCG_TYPE_I32) {
>>>         value = (uint32_t)value;
>>>         ext = 0;
>>>     } else if (value <= 0xffffffff) {
>>>         ext = 0;
>>>     } else {
>>>         ext = 0x80000000;
>>>     }
>>>
>>>     base = 0x52800000;  /* MOVZ */
>>>     do {
>>>         int shift = ctz64(value) & (63 & -16);
>>>         int half = (value >> shift) & 0xffff;
>>>         tcg_out32(s, base | ext | half << 5 | rd);
>>>         value &= ~(0xffffUL << shift);
>>>         base = 0x72800000;  /* MOVK */
>>>     } while (value != 0);
>>>
>>>
>>> Since we go through the loop at least once, we emit the movz for zero input. No need for any extra tests.  And using ctz we can iterate fewer times.
>>
>> You could probably go one step further and use the logical
>> immediate encoding.  Look at build_immediate_table in binutils
>> opcodes/aarch64-opc.c.  The problem would be the use of
>> binary search; perhaps one could come up with some perfect
>> hash function.
>>
>>
>> Laurent
>
> If it's OK, I would go with a variation of Richard's approach.
> A more sophisticated approach, like the one you suggest, could be added in a subsequent patch.


I definitely agree.


Laurent

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-23  8:18   ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
                       ` (2 preceding siblings ...)
  2013-05-27  9:47     ` Laurent Desnogues
@ 2013-05-28 13:14     ` Laurent Desnogues
  2013-05-28 14:37       ` Claudio Fontana
  3 siblings, 1 reply; 60+ messages in thread
From: Laurent Desnogues @ 2013-05-28 13:14 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, qemu-devel, Richard Henderson

Hi Claudio,

here are some minor tweaks and comments.

Your work is very interesting and will form a good basis for further
improvements. Thanks!

> diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
> new file mode 100644
> index 0000000..da859c7
> --- /dev/null
> +++ b/tcg/aarch64/tcg-target.c
> @@ -0,0 +1,1185 @@
> +/*
> + * Initial TCG Implementation for aarch64
> + *
> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
> + * Written by Claudio Fontana
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * (at your option) any later version.
> + *
> + * See the COPYING file in the top-level directory for details.
> + */
> +
> +#ifdef TARGET_WORDS_BIGENDIAN
> +#error "Sorry, bigendian target not supported yet."
> +#endif /* TARGET_WORDS_BIGENDIAN */
> +
> +#ifndef NDEBUG
> +static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
> +    "%x0", "%x1", "%x2", "%x3", "%x4", "%x5", "%x6", "%x7",
> +    "%x8", "%x9", "%x10", "%x11", "%x12", "%x13", "%x14", "%x15",
> +    "%x16", "%x17", "%x18", "%x19", "%x20", "%x21", "%x22", "%x23",
> +    "%x24", "%x25", "%x26", "%x27", "%x28",
> +    "%fp", /* frame pointer */
> +    "%lr", /* link register */
> +    "%sp",  /* stack pointer */
> +};
> +#endif /* NDEBUG */
> +
> +static const int tcg_target_reg_alloc_order[] = {
> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23,
> +    TCG_REG_X24, TCG_REG_X25, TCG_REG_X26, TCG_REG_X27,
> +    TCG_REG_X28,
> +
> +    TCG_REG_X9, TCG_REG_X10, TCG_REG_X11, TCG_REG_X12,
> +    TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
> +    TCG_REG_X16, TCG_REG_X17,
> +
> +    TCG_REG_X18, TCG_REG_X19, /* will not use these, see tcg_target_init */
> +
> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
> +
> +    TCG_REG_X8, /* will not use, see tcg_target_init */
> +};
> +
> +static const int tcg_target_call_iarg_regs[8] = {
> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7
> +};
> +static const int tcg_target_call_oarg_regs[1] = {
> +    TCG_REG_X0
> +};
> +
> +#define TCG_REG_TMP TCG_REG_X8
> +
> +static inline void reloc_pc26(void *code_ptr, tcg_target_long target)
> +{
> +    tcg_target_long offset; uint32_t insn;
> +    offset = (target - (tcg_target_long)code_ptr) / 4;
> +    offset &= 0x03ffffff;
> +    /* read instruction, mask away previous PC_REL26 parameter contents,
> +       set the proper offset, then write back the instruction. */
> +    insn = *(uint32_t *)code_ptr;
> +    insn = (insn & 0xfc000000) | offset;
> +    *(uint32_t *)code_ptr = insn;
> +}
> +
> +static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
> +{
> +    tcg_target_long offset; uint32_t insn;
> +    offset = (target - (tcg_target_long)code_ptr) / 4;
> +    offset &= 0x07ffff;
> +    /* read instruction, mask away previous PC_REL19 parameter contents,
> +       set the proper offset, then write back the instruction. */
> +    insn = *(uint32_t *)code_ptr;
> +    insn = (insn & 0xff00001f) | offset << 5; /* lower 5 bits = condition */

The comment is wrong: only the lower 4 bits form the condition.
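
That is, the comment on the quoted line would better read:

    insn = (insn & 0xff00001f) | offset << 5; /* lower 4 bits = condition */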

> +    *(uint32_t *)code_ptr = insn;
> +}
> +
> +static inline void patch_reloc(uint8_t *code_ptr, int type,
> +                               tcg_target_long value, tcg_target_long addend)
> +{
> +    switch (type) {
> +    case R_AARCH64_JUMP26:
> +    case R_AARCH64_CALL26:
> +        reloc_pc26(code_ptr, value);
> +        break;
> +    case R_AARCH64_CONDBR19:
> +        reloc_pc19(code_ptr, value);
> +        break;
> +
> +    default:
> +        tcg_abort();
> +    }
> +}
> +
> +/* parse target specific constraints */
> +static int target_parse_constraint(TCGArgConstraint *ct,
> +                                   const char **pct_str)
> +{
> +    const char *ct_str = *pct_str;
> +
> +    switch (ct_str[0]) {
> +    case 'r':
> +        ct->ct |= TCG_CT_REG;
> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
> +        break;
> +    case 'l': /* qemu_ld / qemu_st address, data_reg */
> +        ct->ct |= TCG_CT_REG;
> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
> +#ifdef CONFIG_SOFTMMU
> +        /* x0 and x1 will be overwritten when reading the tlb entry,
> +           and x2, and x3 for helper args, better to avoid using them. */
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X0);
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X1);
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X2);
> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X3);
> +#endif
> +        break;
> +    default:
> +        return -1;
> +    }
> +
> +    ct_str++;
> +    *pct_str = ct_str;
> +    return 0;
> +}
> +
> +static inline int tcg_target_const_match(tcg_target_long val,
> +                                         const TCGArgConstraint *arg_ct)
> +{
> +    int ct = arg_ct->ct;
> +
> +    if (ct & TCG_CT_CONST) {
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +enum aarch64_cond_code {
> +    COND_EQ = 0x0,
> +    COND_NE = 0x1,
> +    COND_CS = 0x2,     /* Unsigned greater or equal */
> +    COND_HS = COND_CS, /* ALIAS greater or equal */
> +    COND_CC = 0x3,     /* Unsigned less than */
> +    COND_LO = COND_CC, /* ALIAS Lower */
> +    COND_MI = 0x4,     /* Negative */
> +    COND_PL = 0x5,     /* Zero or greater */
> +    COND_VS = 0x6,     /* Overflow */
> +    COND_VC = 0x7,     /* No overflow */
> +    COND_HI = 0x8,     /* Unsigned greater than */
> +    COND_LS = 0x9,     /* Unsigned less or equal */
> +    COND_GE = 0xa,
> +    COND_LT = 0xb,
> +    COND_GT = 0xc,
> +    COND_LE = 0xd,
> +    COND_AL = 0xe,
> +    COND_NV = 0xf, /* behaves like COND_AL here */
> +};
> +
> +static const enum aarch64_cond_code tcg_cond_to_aarch64[] = {
> +    [TCG_COND_EQ] = COND_EQ,
> +    [TCG_COND_NE] = COND_NE,
> +    [TCG_COND_LT] = COND_LT,
> +    [TCG_COND_GE] = COND_GE,
> +    [TCG_COND_LE] = COND_LE,
> +    [TCG_COND_GT] = COND_GT,
> +    /* unsigned */
> +    [TCG_COND_LTU] = COND_LO,
> +    [TCG_COND_GTU] = COND_HI,
> +    [TCG_COND_GEU] = COND_HS,
> +    [TCG_COND_LEU] = COND_LS,
> +};
> +
> +/* opcodes for LDR / STR instructions with base + simm9 addressing */
> +enum aarch64_ldst_op_data { /* size of the data moved */
> +    LDST_8 = 0x38,
> +    LDST_16 = 0x78,
> +    LDST_32 = 0xb8,
> +    LDST_64 = 0xf8,
> +};
> +enum aarch64_ldst_op_type { /* type of operation */
> +    LDST_ST = 0x0,    /* store */
> +    LDST_LD = 0x4,    /* load */
> +    LDST_LD_S_X = 0x8,  /* load and sign-extend into Xt */
> +    LDST_LD_S_W = 0xc,  /* load and sign-extend into Wt */
> +};
> +
> +enum aarch64_arith_opc {
> +    ARITH_ADD = 0x0b,
> +    ARITH_SUB = 0x4b,
> +    ARITH_AND = 0x0a,
> +    ARITH_OR = 0x2a,
> +    ARITH_XOR = 0x4a
> +};
> +
> +enum aarch64_srr_opc {
> +    SRR_SHL = 0x0,
> +    SRR_SHR = 0x4,
> +    SRR_SAR = 0x8,
> +    SRR_ROR = 0xc
> +};
> +
> +static inline enum aarch64_ldst_op_data
> +aarch64_ldst_get_data(TCGOpcode tcg_op)
> +{
> +    switch (tcg_op) {
> +    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
> +    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
> +    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
> +        return LDST_8;
> +
> +    case INDEX_op_ld16u_i32: case INDEX_op_ld16s_i32:
> +    case INDEX_op_ld16u_i64: case INDEX_op_ld16s_i64:
> +    case INDEX_op_st16_i32: case INDEX_op_st16_i64:
> +        return LDST_16;
> +
> +    case INDEX_op_ld_i32: case INDEX_op_st_i32:
> +    case INDEX_op_ld32u_i64: case INDEX_op_ld32s_i64:
> +    case INDEX_op_st32_i64:
> +        return LDST_32;
> +
> +    case INDEX_op_ld_i64: case INDEX_op_st_i64:
> +        return LDST_64;
> +
> +    default:
> +        tcg_abort();
> +    }
> +}
> +
> +static inline enum aarch64_ldst_op_type
> +aarch64_ldst_get_type(TCGOpcode tcg_op)
> +{
> +    switch (tcg_op) {
> +    case INDEX_op_st8_i32: case INDEX_op_st16_i32:
> +    case INDEX_op_st8_i64: case INDEX_op_st16_i64:
> +    case INDEX_op_st_i32:
> +    case INDEX_op_st32_i64:
> +    case INDEX_op_st_i64:
> +        return LDST_ST;
> +
> +    case INDEX_op_ld8u_i32: case INDEX_op_ld16u_i32:
> +    case INDEX_op_ld8u_i64: case INDEX_op_ld16u_i64:
> +    case INDEX_op_ld_i32:
> +    case INDEX_op_ld32u_i64:
> +    case INDEX_op_ld_i64:
> +        return LDST_LD;
> +
> +    case INDEX_op_ld8s_i32: case INDEX_op_ld16s_i32:
> +        return LDST_LD_S_W;
> +
> +    case INDEX_op_ld8s_i64: case INDEX_op_ld16s_i64:
> +    case INDEX_op_ld32s_i64:
> +        return LDST_LD_S_X;
> +
> +    default:
> +        tcg_abort();
> +    }
> +}
> +
> +static inline uint32_t tcg_in32(TCGContext *s)
> +{
> +    uint32_t v = *(uint32_t *)s->code_ptr;
> +    return v;
> +}
> +
> +static inline void tcg_out_ldst_9(TCGContext *s,
> +                                  enum aarch64_ldst_op_data op_data,
> +                                  enum aarch64_ldst_op_type op_type,
> +                                  int rd, int rn, tcg_target_long offset)
> +{
> +    /* use LDUR with BASE register with 9bit signed unscaled offset */
> +    unsigned int mod, off;
> +
> +    if (offset < 0) {
> +        off = (256 + offset);
> +        mod = 0x1;
> +

Extra blank line.

> +    } else {
> +        off = offset;
> +        mod = 0x0;
> +    }
> +
> +    mod |= op_type;
> +    tcg_out32(s, op_data << 24 | mod << 20 | off << 12 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_movr(TCGContext *s, int ext, int rd, int source)
> +{
> +    /* register to register move using MOV (shifted register with no shift) */
> +    /* using MOV 0x2a0003e0 | (shift).. */
> +    unsigned int base = ext ? 0xaa0003e0 : 0x2a0003e0;
> +    tcg_out32(s, base | source << 16 | rd);
> +}

See comment below for tcg_out_movr_sp.

> +static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
> +                                  uint32_t value)
> +{
> +    uint32_t half, base, movk = 0;
> +    if (!value) {
> +        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
> +        return;
> +    }
> +    /* construct halfwords of the immediate with MOVZ with LSL */
> +    /* using MOVZ 0x52800000 | extended reg.. */
> +    base = ext ? 0xd2800000 : 0x52800000;
> +
> +    half = value & 0xffff;
> +    if (half) {
> +        tcg_out32(s, base | half << 5 | rd);
> +        movk = 0x20000000; /* morph next MOVZ into MOVK */
> +    }
> +
> +    half = value >> 16;
> +    if (half) { /* add shift 0x00200000. Op can be MOVZ or MOVK */
> +        tcg_out32(s, base | movk | 0x00200000 | half << 5 | rd);
> +    }
> +}
> +
> +static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
> +{
> +    uint32_t half, base, movk = 0, shift = 0;
> +    if (!value) {
> +        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
> +        return;
> +    }
> +    /* construct halfwords of the immediate with MOVZ with LSL */
> +    /* using MOVZ 0x52800000 | extended reg.. */
> +    base = 0xd2800000;
> +
> +    while (value) {
> +        half = value & 0xffff;
> +        if (half) {
> +            /* Op can be MOVZ or MOVK */
> +            tcg_out32(s, base | movk | shift | half << 5 | rd);
> +            if (!movk) {
> +                movk = 0x20000000; /* morph next MOVZs into MOVKs */
> +            }
> +        }
> +        value >>= 16;
> +        shift += 0x00200000;
> +    }
> +}
> +
> +static inline void tcg_out_ldst_r(TCGContext *s,
> +                                  enum aarch64_ldst_op_data op_data,
> +                                  enum aarch64_ldst_op_type op_type,
> +                                  int rd, int base, int regoff)
> +{
> +    /* load from memory to register using base + 64bit register offset */
> +    /* using f.e. STR Wt, [Xn, Xm] 0xb8600800|(regoff << 16)|(base << 5)|rd */
> +    /* the 0x6000 is for the "no extend field" */
> +    tcg_out32(s, 0x00206800
> +              | op_data << 24 | op_type << 20 | regoff << 16 | base << 5 | rd);
> +}
> +
> +/* solve the whole ldst problem */
> +static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
> +                                enum aarch64_ldst_op_type type,
> +                                int rd, int rn, tcg_target_long offset)
> +{
> +    if (offset > -256 && offset < 256) {

Offset >= -256.
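
i.e. presumably:

    if (offset >= -256 && offset < 256) {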

> +        tcg_out_ldst_9(s, data, type, rd, rn, offset);
> +
> +    } else {
> +        tcg_out_movi64(s, TCG_REG_TMP, offset);
> +        tcg_out_ldst_r(s, data, type, rd, rn, TCG_REG_TMP);
> +    }
> +}
> +
> +static inline void tcg_out_movi(TCGContext *s, TCGType type,
> +                                TCGReg rd, tcg_target_long value)
> +{
> +    if (type == TCG_TYPE_I64) {
> +        tcg_out_movi64(s, rd, value);
> +    } else {
> +        tcg_out_movi32(s, 0, rd, value);
> +    }
> +}
> +
> +/* mov alias implemented with add immediate, useful to move to/from SP */
> +static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
> +{
> +    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
> +    unsigned int base = ext ? 0x91000000 : 0x11000000;
> +    tcg_out32(s, base | rn << 5 | rd);
> +}

Couldn't this function be used for tcg_out_movr too?  That shouldn't
be an issue unless you want to use ZR as source or destination (this
will be true if you change tcg_out_movi according to Richard's
proposal).
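
(A sketch of what that might look like, illustration only and not part of
the patch; the caveat is that register number 31 means SP rather than ZR
in the ADD-immediate encoding:)

    /* Valid only when neither rd nor rn is the zero register. */
    static inline void tcg_out_movr(TCGContext *s, int ext, int rd, int rn)
    {
        tcg_out_movr_sp(s, ext, rd, rn);  /* MOV alias of ADD rd, rn, #0 */
    }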

> +static inline void tcg_out_mov(TCGContext *s,
> +                               TCGType type, TCGReg ret, TCGReg arg)
> +{
> +    if (ret != arg) {
> +        tcg_out_movr(s, type == TCG_TYPE_I64, ret, arg);
> +    }
> +}
> +
> +static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg arg,
> +                              TCGReg arg1, tcg_target_long arg2)
> +{
> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_LD,
> +                 arg, arg1, arg2);
> +}
> +
> +static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> +                              TCGReg arg1, tcg_target_long arg2)
> +{
> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_ST,
> +                 arg, arg1, arg2);
> +}
> +
> +static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
> +                                 int ext, int rd, int rn, int rm)
> +{
> +    /* Using shifted register arithmetic operations */
> +    /* if extended register operation (64bit) just or with 0x80 << 24 */
> +    unsigned int base = ext ? (0x80 | opc) << 24 : opc << 24;
> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_mul(TCGContext *s, int ext, int rd, int rn, int rm)
> +{
> +    /* Using MADD 0x1b000000 with Ra = wzr alias MUL 0x1b007c00 */
> +    unsigned int base = ext ? 0x9b007c00 : 0x1b007c00;
> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_shiftrot_reg(TCGContext *s,
> +                                        enum aarch64_srr_opc opc, int ext,
> +                                        int rd, int rn, int rm)
> +{
> +    /* using 2-source data processing instructions 0x1ac02000 */
> +    unsigned int base = ext ? 0x9ac02000 : 0x1ac02000;
> +    tcg_out32(s, base | rm << 16 | opc << 8 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_ubfm(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int a, unsigned int b)
> +{
> +    /* Using UBFM 0x53000000 Wd, Wn, a, b - ext encoding requires the 0x4 */
> +    unsigned int base = ext ? 0xd3400000 : 0x53000000;
> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_sbfm(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int a, unsigned int b)
> +{
> +    /* Using SBFM 0x13000000 Wd, Wn, a, b - ext encoding requires the 0x4 */
> +    unsigned int base = ext ? 0x93400000 : 0x13000000;
> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_extr(TCGContext *s, int ext,
> +                                int rd, int rn, int rm, unsigned int a)
> +{
> +    /* Using EXTR 0x13800000 Wd, Wn, Wm, a - ext encoding requires the 0x4 */
> +    unsigned int base = ext ? 0x93c00000 : 0x13800000;
> +    tcg_out32(s, base | rm << 16 | a << 10 | rn << 5 | rd);
> +}
> +
> +static inline void tcg_out_shl(TCGContext *s, int ext,
> +                               int rd, int rn, unsigned int m)
> +{
> +    int bits, max;
> +    bits = ext ? 64 : 32;
> +    max = bits - 1;
> +    tcg_out_ubfm(s, ext, rd, rn, bits - (m & max), max - (m & max));
> +}
> +
> +static inline void tcg_out_shr(TCGContext *s, int ext,
> +                               int rd, int rn, unsigned int m)
> +{
> +    int max = ext ? 63 : 31;
> +    tcg_out_ubfm(s, ext, rd, rn, m & max, max);
> +}
> +
> +static inline void tcg_out_sar(TCGContext *s, int ext,
> +                               int rd, int rn, unsigned int m)
> +{
> +    int max = ext ? 63 : 31;
> +    tcg_out_sbfm(s, ext, rd, rn, m & max, max);
> +}
> +
> +static inline void tcg_out_rotr(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int m)
> +{
> +    int max = ext ? 63 : 31;
> +    tcg_out_extr(s, ext, rd, rn, rn, m & max);
> +}
> +
> +static inline void tcg_out_rotl(TCGContext *s, int ext,
> +                                int rd, int rn, unsigned int m)
> +{
> +    int bits, max;
> +    bits = ext ? 64 : 32;
> +    max = bits - 1;
> +    tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
> +}
> +
> +static inline void tcg_out_cmp(TCGContext *s, int ext,
> +                               int rn, int rm)
> +{
> +    /* Using CMP alias SUBS wzr, Wn, Wm */
> +    unsigned int base = ext ? 0xeb00001f : 0x6b00001f;
> +    tcg_out32(s, base | rm << 16 | rn << 5);
> +}
> +
> +static inline void tcg_out_cset(TCGContext *s, int ext,
> +                                int rd, TCGCond c)
> +{
> +    /* Using CSET alias of CSINC 0x1a800400 Xd, XZR, XZR, invert(cond) */
> +    unsigned int base = ext ? 0x9a9f07e0 : 0x1a9f07e0;
> +    tcg_out32(s, base | tcg_cond_to_aarch64[tcg_invert_cond(c)] << 12 | rd);
> +}
> +
> +static inline void tcg_out_goto(TCGContext *s, tcg_target_long target)
> +{
> +    tcg_target_long offset;
> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) {

offset < -0x02000000
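
i.e.:

    if (offset < -0x02000000 || offset >= 0x02000000) {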

> +        /* out of 26bit range */
> +        tcg_abort();
> +    }
> +
> +    tcg_out32(s, 0x14000000 | (offset & 0x03ffffff));
> +}
> +
> +static inline void tcg_out_goto_noaddr(TCGContext *s)
> +{
> +    /* We pay attention here to not modify the branch target by
> +       reading from the buffer. This ensures that caches and memory are
> +       kept coherent during retranslation.
> +       Mask away possible garbage in the high bits for the first translation,
> +       while keeping the offset bits for retranslation. */
> +    uint32_t insn;
> +    insn = (tcg_in32(s) & 0x03ffffff) | 0x14000000;
> +    tcg_out32(s, insn);
> +}
> +
> +static inline void tcg_out_goto_cond_noaddr(TCGContext *s, TCGCond c)
> +{
> +    /* see comments in tcg_out_goto_noaddr */
> +    uint32_t insn;
> +    insn = tcg_in32(s) & (0x07ffff << 5);
> +    insn |= 0x54000000 | tcg_cond_to_aarch64[c];
> +    tcg_out32(s, insn);
> +}
> +
> +static inline void tcg_out_goto_cond(TCGContext *s, TCGCond c,
> +                                     tcg_target_long target)
> +{
> +    tcg_target_long offset;
> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
> +
> +    if (offset <= -0x3ffff || offset >= 0x3ffff) {

offset < -0x40000
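
The full 19-bit signed range is -0x40000 .. 0x3ffff, so presumably:

    if (offset < -0x40000 || offset > 0x3ffff) {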

> +        /* out of 19bit range */
> +        tcg_abort();
> +    }
> +
> +    offset &= 0x7ffff;
> +    tcg_out32(s, 0x54000000 | tcg_cond_to_aarch64[c] | offset << 5);
> +}
> +
> +static inline void tcg_out_callr(TCGContext *s, int reg)
> +{
> +    tcg_out32(s, 0xd63f0000 | reg << 5);
> +}
> +
> +static inline void tcg_out_gotor(TCGContext *s, int reg)
> +{
> +    tcg_out32(s, 0xd61f0000 | reg << 5);
> +}
> +
> +static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
> +{
> +    tcg_target_long offset;
> +
> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */

offset < -0x02000000

> +        tcg_out_movi64(s, TCG_REG_TMP, target);
> +        tcg_out_callr(s, TCG_REG_TMP);
> +

Extra blank line.

> +    } else {
> +        tcg_out32(s, 0x94000000 | (offset & 0x03ffffff));
> +    }
> +}
> +
> +/* test a register against a bit pattern made of pattern_n repeated 1s.
> +   For example, to test against 0111b (0x07), pass pattern_n = 3 */
> +static inline void tcg_out_tst(TCGContext *s, int ext, int rn,
> +                               tcg_target_ulong pattern_n)
> +{
> +    /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f. Ext requires 4. */
> +    unsigned int base = ext ? 0xf240001f : 0x7200001f;
> +    tcg_out32(s, base | (pattern_n - 1) << 10 | rn << 5);
> +}

You probably should protect against pattern_n == 0.
Note this function is currently unused.
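
Regarding the pattern_n == 0 protection, perhaps something along these
lines (sketch only):

    assert(pattern_n > 0 && pattern_n < (ext ? 64 : 32));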

> +static inline void tcg_out_ret(TCGContext *s)
> +{
> +    /* emit RET { LR } */
> +    tcg_out32(s, 0xd65f03c0);
> +}
> +
> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
> +{
> +    tcg_target_long target, offset;
> +    target = (tcg_target_long)addr;
> +    offset = (target - (tcg_target_long)jmp_addr) / 4;
> +
> +    if (offset <= -0x02000000 || offset >= 0x02000000) {

offset < -0x02000000

> +        /* out of 26bit range */
> +        tcg_abort();
> +    }
> +
> +    patch_reloc((uint8_t *)jmp_addr, R_AARCH64_JUMP26, target, 0);
> +    flush_icache_range(jmp_addr, jmp_addr + 4);
> +}
> +
> +static inline void tcg_out_goto_label(TCGContext *s, int label_index)
> +{
> +    TCGLabel *l = &s->labels[label_index];
> +
> +    if (!l->has_value) {
> +        tcg_out_reloc(s, s->code_ptr, R_AARCH64_JUMP26, label_index, 0);
> +        tcg_out_goto_noaddr(s);
> +
> +    } else {
> +        tcg_out_goto(s, l->u.value);
> +    }
> +}
> +
> +static inline void tcg_out_goto_label_cond(TCGContext *s,
> +                                           TCGCond c, int label_index)
> +{
> +    TCGLabel *l = &s->labels[label_index];
> +
> +    if (!l->has_value) {
> +        tcg_out_reloc(s, s->code_ptr, R_AARCH64_CONDBR19, label_index, 0);
> +        tcg_out_goto_cond_noaddr(s, c);
> +
> +    } else {
> +        tcg_out_goto_cond(s, c, l->u.value);
> +    }
> +}
> +
> +#ifdef CONFIG_SOFTMMU
> +#include "exec/softmmu_defs.h"
> +
> +/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
> +   int mmu_idx) */
> +static const void * const qemu_ld_helpers[4] = {
> +    helper_ldb_mmu,
> +    helper_ldw_mmu,
> +    helper_ldl_mmu,
> +    helper_ldq_mmu,
> +};
> +
> +/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
> +   uintxx_t val, int mmu_idx) */
> +static const void * const qemu_st_helpers[4] = {
> +    helper_stb_mmu,
> +    helper_stw_mmu,
> +    helper_stl_mmu,
> +    helper_stq_mmu,
> +};
> +
> +#endif /* CONFIG_SOFTMMU */
> +
> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
> +{
> +    int addr_reg, data_reg;
> +#ifdef CONFIG_SOFTMMU
> +    int mem_index, s_bits;
> +#endif
> +    data_reg = args[0];
> +    addr_reg = args[1];
> +
> +#ifdef CONFIG_SOFTMMU
> +    mem_index = args[2];
> +    s_bits = opc & 3;
> +
> +    /* TODO: insert TLB lookup here */
> +
> +#  if CPU_TLB_BITS > 8
> +#   error "CPU_TLB_BITS too large"
> +#  endif
> +
> +    /* all arguments passed via registers */
> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
> +    tcg_out_movi32(s, 0, TCG_REG_X2, mem_index);
> +
> +    tcg_out_movi64(s, TCG_REG_TMP, (uint64_t)qemu_ld_helpers[s_bits]);
> +    tcg_out_callr(s, TCG_REG_TMP);
> +
> +    if (opc & 0x04) { /* sign extend */
> +        unsigned int bits; bits = 8 * (1 << s_bits) - 1;
> +        tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
> +
> +    } else {
> +        tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
> +    }
> +
> +#else /* !CONFIG_SOFTMMU */
> +    tcg_abort(); /* TODO */
> +#endif
> +}
> +
> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
> +{
> +    int addr_reg, data_reg;
> +#ifdef CONFIG_SOFTMMU
> +    int mem_index, s_bits;
> +#endif
> +    data_reg = args[0];
> +    addr_reg = args[1];
> +
> +#ifdef CONFIG_SOFTMMU
> +    mem_index = args[2];
> +    s_bits = opc & 3;
> +
> +    /* TODO: here we should generate something like the following:
> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
> +     *  test ... XXX
> +     */
> +#  if CPU_TLB_BITS > 8
> +#   error "CPU_TLB_BITS too large"
> +#  endif
> +
> +    /* all arguments passed via registers */
> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
> +    tcg_out_movr(s, 1, TCG_REG_X2, data_reg);
> +    tcg_out_movi32(s, 0, TCG_REG_X3, mem_index);
> +
> +    tcg_out_movi64(s, TCG_REG_TMP, (uint64_t)qemu_st_helpers[s_bits]);
> +    tcg_out_callr(s, TCG_REG_TMP);
> +
> +#else /* !CONFIG_SOFTMMU */
> +    tcg_abort(); /* TODO */
> +#endif
> +}
> +
> +static uint8_t *tb_ret_addr;
> +
> +/* callee stack use example:
> +   stp     x29, x30, [sp,#-32]!
> +   mov     x29, sp
> +   stp     x1, x2, [sp,#16]
> +   ...
> +   ldp     x1, x2, [sp,#16]
> +   ldp     x29, x30, [sp],#32
> +   ret
> +*/
> +
> +/* push r1 and r2, and alloc stack space for a total of
> +   alloc_n elements (1 element = 16 bytes; alloc_n must be between 1 and 31). */
> +static inline void tcg_out_push_pair(TCGContext *s,
> +                                     TCGReg r1, TCGReg r2, int alloc_n)
> +{
> +    /* using indexed scaled simm7 STP 0x28800000 | (ext) | 0x01000000 (pre-idx)
> +       | alloc_n * (-1) << 16 | r2 << 10 | sp(31) << 5 | r1 */
> +    assert(alloc_n > 0 && alloc_n < 0x20);
> +    alloc_n = (-alloc_n) & 0x3f;
> +    tcg_out32(s, 0xa98003e0 | alloc_n << 16 | r2 << 10 | r1);
> +}
> +
> +/* dealloc stack space for a total of alloc_n elements and pop r1, r2.  */
> +static inline void tcg_out_pop_pair(TCGContext *s,
> +                                 TCGReg r1, TCGReg r2, int alloc_n)
> +{
> +    /* using indexed scaled simm7 LDP 0x28c00000 | (ext) | nothing (post-idx)
> +       | alloc_n << 16 | r2 << 10 | sp(31) << 5 | r1 */
> +    assert(alloc_n > 0 && alloc_n < 0x20);
> +    tcg_out32(s, 0xa8c003e0 | alloc_n << 16 | r2 << 10 | r1);
> +}
> +
> +static inline void tcg_out_store_pair(TCGContext *s,
> +                                      TCGReg r1, TCGReg r2, int idx)
> +{
> +    /* using register pair offset simm7 STP 0x29000000 | (ext)
> +       | idx << 16 | r2 << 10 | fp(29) << 5 | r1 */
> +    assert(idx > 0 && idx < 0x20);
> +    tcg_out32(s, 0xa90003a0 | idx << 16 | r2 << 10 | r1);
> +}
> +
> +static inline void tcg_out_load_pair(TCGContext *s,
> +                                     TCGReg r1, TCGReg r2, int idx)
> +{
> +    /* using register pair offset simm7 LDP 0x29400000 | (ext)
> +       | idx << 16 | r2 << 10 | fp(29) << 5 | r1 */
> +    assert(idx > 0 && idx < 0x20);
> +    tcg_out32(s, 0xa94003a0 | idx << 16 | r2 << 10 | r1);
> +}
> +
> +static void tcg_out_op(TCGContext *s, TCGOpcode opc,
> +                       const TCGArg *args, const int *const_args)
> +{
> +    int ext = 0;
> +
> +    switch (opc) {
> +    case INDEX_op_exit_tb:
> +        tcg_out_movi64(s, TCG_REG_X0, args[0]); /* load retval in X0 */
> +        tcg_out_goto(s, (tcg_target_long)tb_ret_addr);
> +        break;
> +
> +    case INDEX_op_goto_tb:
> +#ifndef USE_DIRECT_JUMP
> +#error "USE_DIRECT_JUMP required for aarch64"
> +#endif
> +        assert(s->tb_jmp_offset != NULL); /* consistency for USE_DIRECT_JUMP */
> +        s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
> +        /* actual branch destination will be patched by
> +           aarch64_tb_set_jmp_target later, beware retranslation. */
> +        tcg_out_goto_noaddr(s);
> +        s->tb_next_offset[args[0]] = s->code_ptr - s->code_buf;
> +        break;
> +
> +    case INDEX_op_call:
> +        if (const_args[0]) {
> +            tcg_out_call(s, args[0]);
> +        } else {
> +            tcg_out_callr(s, args[0]);
> +        }
> +        break;
> +
> +    case INDEX_op_br:
> +        tcg_out_goto_label(s, args[0]);
> +        break;
> +
> +    case INDEX_op_ld_i32:
> +    case INDEX_op_ld_i64:
> +    case INDEX_op_st_i32:
> +    case INDEX_op_st_i64:
> +    case INDEX_op_ld8u_i32:
> +    case INDEX_op_ld8s_i32:
> +    case INDEX_op_ld16u_i32:
> +    case INDEX_op_ld16s_i32:
> +    case INDEX_op_ld8u_i64:
> +    case INDEX_op_ld8s_i64:
> +    case INDEX_op_ld16u_i64:
> +    case INDEX_op_ld16s_i64:
> +    case INDEX_op_ld32u_i64:
> +    case INDEX_op_ld32s_i64:
> +    case INDEX_op_st8_i32:
> +    case INDEX_op_st8_i64:
> +    case INDEX_op_st16_i32:
> +    case INDEX_op_st16_i64:
> +    case INDEX_op_st32_i64:
> +        tcg_out_ldst(s, aarch64_ldst_get_data(opc), aarch64_ldst_get_type(opc),
> +                     args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_mov_i64: ext = 1;
> +    case INDEX_op_mov_i32:
> +        tcg_out_movr(s, ext, args[0], args[1]);
> +        break;
> +
> +    case INDEX_op_movi_i64:
> +        tcg_out_movi64(s, args[0], args[1]);
> +        break;
> +
> +    case INDEX_op_movi_i32:
> +        tcg_out_movi32(s, 0, args[0], args[1]);
> +        break;
> +
> +    case INDEX_op_add_i64: ext = 1;
> +    case INDEX_op_add_i32:
> +        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_sub_i64: ext = 1;
> +    case INDEX_op_sub_i32:
> +        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_and_i64: ext = 1;
> +    case INDEX_op_and_i32:
> +        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_or_i64: ext = 1;
> +    case INDEX_op_or_i32:
> +        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_xor_i64: ext = 1;
> +    case INDEX_op_xor_i32:
> +        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_mul_i64: ext = 1;
> +    case INDEX_op_mul_i32:
> +        tcg_out_mul(s, ext, args[0], args[1], args[2]);
> +        break;
> +
> +    case INDEX_op_shl_i64: ext = 1;
> +    case INDEX_op_shl_i32:
> +        if (const_args[2]) {    /* LSL / UBFM Wd, Wn, (32 - m) */
> +            tcg_out_shl(s, ext, args[0], args[1], args[2]);
> +        } else {                /* LSL / LSLV */
> +            tcg_out_shiftrot_reg(s, SRR_SHL, ext, args[0], args[1], args[2]);
> +        }
> +        break;
> +
> +    case INDEX_op_shr_i64: ext = 1;
> +    case INDEX_op_shr_i32:
> +        if (const_args[2]) {    /* LSR / UBFM Wd, Wn, m, 31 */
> +            tcg_out_shr(s, ext, args[0], args[1], args[2]);
> +        } else {                /* LSR / LSRV */
> +            tcg_out_shiftrot_reg(s, SRR_SHR, ext, args[0], args[1], args[2]);
> +        }
> +        break;
> +
> +    case INDEX_op_sar_i64: ext = 1;
> +    case INDEX_op_sar_i32:
> +        if (const_args[2]) {    /* ASR / SBFM Wd, Wn, m, 31 */
> +            tcg_out_sar(s, ext, args[0], args[1], args[2]);
> +        } else {                /* ASR / ASRV */
> +            tcg_out_shiftrot_reg(s, SRR_SAR, ext, args[0], args[1], args[2]);
> +        }
> +        break;
> +
> +    case INDEX_op_rotr_i64: ext = 1;
> +    case INDEX_op_rotr_i32:
> +        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, m */
> +            tcg_out_rotr(s, ext, args[0], args[1], args[2]);
> +        } else {                /* ROR / RORV */
> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
> +        }
> +        break;
> +
> +    case INDEX_op_rotl_i64: ext = 1;
> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
> +        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, 32 - m */
> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
> +        } else {
> +            tcg_out_arith(s, ARITH_SUB, ext, args[2], TCG_REG_XZR, args[2]);
> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
> +        }
> +        break;
> +
> +    case INDEX_op_brcond_i64: ext = 1;
> +    case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
> +        tcg_out_cmp(s, ext, args[0], args[1]);
> +        tcg_out_goto_label_cond(s, args[2], args[3]);
> +        break;
> +
> +    case INDEX_op_setcond_i64: ext = 1;
> +    case INDEX_op_setcond_i32:
> +        tcg_out_cmp(s, ext, args[1], args[2]);
> +        tcg_out_cset(s, ext, args[0], args[3]);
> +        break;
> +
> +    case INDEX_op_qemu_ld8u:
> +        tcg_out_qemu_ld(s, args, 0 | 0);
> +        break;
> +    case INDEX_op_qemu_ld8s:
> +        tcg_out_qemu_ld(s, args, 4 | 0);
> +        break;
> +    case INDEX_op_qemu_ld16u:
> +        tcg_out_qemu_ld(s, args, 0 | 1);
> +        break;
> +    case INDEX_op_qemu_ld16s:
> +        tcg_out_qemu_ld(s, args, 4 | 1);
> +        break;
> +    case INDEX_op_qemu_ld32u:
> +        tcg_out_qemu_ld(s, args, 0 | 2);
> +        break;
> +    case INDEX_op_qemu_ld32s:
> +        tcg_out_qemu_ld(s, args, 4 | 2);
> +        break;
> +    case INDEX_op_qemu_ld32:
> +        tcg_out_qemu_ld(s, args, 0 | 2);
> +        break;
> +    case INDEX_op_qemu_ld64:
> +        tcg_out_qemu_ld(s, args, 0 | 3);
> +        break;
> +    case INDEX_op_qemu_st8:
> +        tcg_out_qemu_st(s, args, 0);
> +        break;
> +    case INDEX_op_qemu_st16:
> +        tcg_out_qemu_st(s, args, 1);
> +        break;
> +    case INDEX_op_qemu_st32:
> +        tcg_out_qemu_st(s, args, 2);
> +        break;
> +    case INDEX_op_qemu_st64:
> +        tcg_out_qemu_st(s, args, 3);
> +        break;
> +
> +    default:
> +        tcg_abort(); /* opcode not implemented */
> +    }
> +}
> +
> +static const TCGTargetOpDef aarch64_op_defs[] = {
> +    { INDEX_op_exit_tb, { } },
> +    { INDEX_op_goto_tb, { } },
> +    { INDEX_op_call, { "ri" } },
> +    { INDEX_op_br, { } },
> +
> +    { INDEX_op_mov_i32, { "r", "r" } },
> +    { INDEX_op_mov_i64, { "r", "r" } },
> +
> +    { INDEX_op_movi_i32, { "r" } },
> +    { INDEX_op_movi_i64, { "r" } },
> +
> +    { INDEX_op_ld8u_i32, { "r", "r" } },
> +    { INDEX_op_ld8s_i32, { "r", "r" } },
> +    { INDEX_op_ld16u_i32, { "r", "r" } },
> +    { INDEX_op_ld16s_i32, { "r", "r" } },
> +    { INDEX_op_ld_i32, { "r", "r" } },
> +    { INDEX_op_ld8u_i64, { "r", "r" } },
> +    { INDEX_op_ld8s_i64, { "r", "r" } },
> +    { INDEX_op_ld16u_i64, { "r", "r" } },
> +    { INDEX_op_ld16s_i64, { "r", "r" } },
> +    { INDEX_op_ld32u_i64, { "r", "r" } },
> +    { INDEX_op_ld32s_i64, { "r", "r" } },
> +    { INDEX_op_ld_i64, { "r", "r" } },
> +
> +    { INDEX_op_st8_i32, { "r", "r" } },
> +    { INDEX_op_st16_i32, { "r", "r" } },
> +    { INDEX_op_st_i32, { "r", "r" } },
> +    { INDEX_op_st8_i64, { "r", "r" } },
> +    { INDEX_op_st16_i64, { "r", "r" } },
> +    { INDEX_op_st32_i64, { "r", "r" } },
> +    { INDEX_op_st_i64, { "r", "r" } },
> +
> +    { INDEX_op_add_i32, { "r", "r", "r" } },
> +    { INDEX_op_add_i64, { "r", "r", "r" } },
> +    { INDEX_op_sub_i32, { "r", "r", "r" } },
> +    { INDEX_op_sub_i64, { "r", "r", "r" } },
> +    { INDEX_op_mul_i32, { "r", "r", "r" } },
> +    { INDEX_op_mul_i64, { "r", "r", "r" } },
> +    { INDEX_op_and_i32, { "r", "r", "r" } },
> +    { INDEX_op_and_i64, { "r", "r", "r" } },
> +    { INDEX_op_or_i32, { "r", "r", "r" } },
> +    { INDEX_op_or_i64, { "r", "r", "r" } },
> +    { INDEX_op_xor_i32, { "r", "r", "r" } },
> +    { INDEX_op_xor_i64, { "r", "r", "r" } },
> +
> +    { INDEX_op_shl_i32, { "r", "r", "ri" } },
> +    { INDEX_op_shr_i32, { "r", "r", "ri" } },
> +    { INDEX_op_sar_i32, { "r", "r", "ri" } },
> +    { INDEX_op_rotl_i32, { "r", "r", "ri" } },
> +    { INDEX_op_rotr_i32, { "r", "r", "ri" } },
> +    { INDEX_op_shl_i64, { "r", "r", "ri" } },
> +    { INDEX_op_shr_i64, { "r", "r", "ri" } },
> +    { INDEX_op_sar_i64, { "r", "r", "ri" } },
> +    { INDEX_op_rotl_i64, { "r", "r", "ri" } },
> +    { INDEX_op_rotr_i64, { "r", "r", "ri" } },
> +
> +    { INDEX_op_brcond_i32, { "r", "r" } },
> +    { INDEX_op_setcond_i32, { "r", "r", "r" } },
> +    { INDEX_op_brcond_i64, { "r", "r" } },
> +    { INDEX_op_setcond_i64, { "r", "r", "r" } },
> +
> +    { INDEX_op_qemu_ld8u, { "r", "l" } },
> +    { INDEX_op_qemu_ld8s, { "r", "l" } },
> +    { INDEX_op_qemu_ld16u, { "r", "l" } },
> +    { INDEX_op_qemu_ld16s, { "r", "l" } },
> +    { INDEX_op_qemu_ld32u, { "r", "l" } },
> +    { INDEX_op_qemu_ld32s, { "r", "l" } },
> +
> +    { INDEX_op_qemu_ld32, { "r", "l" } },
> +    { INDEX_op_qemu_ld64, { "r", "l" } },
> +
> +    { INDEX_op_qemu_st8, { "l", "l" } },
> +    { INDEX_op_qemu_st16, { "l", "l" } },
> +    { INDEX_op_qemu_st32, { "l", "l" } },
> +    { INDEX_op_qemu_st64, { "l", "l" } },
> +    { -1 },
> +};
> +
> +static void tcg_target_init(TCGContext *s)
> +{
> +#if !defined(CONFIG_USER_ONLY)
> +    /* fail safe */
> +    if ((1ULL << CPU_TLB_ENTRY_BITS) != sizeof(CPUTLBEntry)) {
> +        tcg_abort();
> +    }
> +#endif
> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffffffff);
> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffffffff);
> +
> +    tcg_regset_set32(tcg_target_call_clobber_regs, 0,
> +                     (1 << TCG_REG_X0) | (1 << TCG_REG_X1) |
> +                     (1 << TCG_REG_X2) | (1 << TCG_REG_X3) |
> +                     (1 << TCG_REG_X4) | (1 << TCG_REG_X5) |
> +                     (1 << TCG_REG_X6) | (1 << TCG_REG_X7) |
> +                     (1 << TCG_REG_X8) | (1 << TCG_REG_X9) |
> +                     (1 << TCG_REG_X10) | (1 << TCG_REG_X11) |
> +                     (1 << TCG_REG_X12) | (1 << TCG_REG_X13) |
> +                     (1 << TCG_REG_X14) | (1 << TCG_REG_X15) |
> +                     (1 << TCG_REG_X16) | (1 << TCG_REG_X17) |
> +                     (1 << TCG_REG_X18) | (1 << TCG_REG_LR));
> +
> +    tcg_regset_clear(s->reserved_regs);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
> +
> +    tcg_add_target_add_op_defs(aarch64_op_defs);
> +}
> +
> +static inline void tcg_out_addi(TCGContext *s,
> +                                int ext, int rd, int rn, unsigned int aimm)
> +{
> +    /* add immediate aimm unsigned 12bit value (we use LSL 0 - no shift) */
> +    /* using ADD 0x11000000 | (ext) | (aimm << 10) | (rn << 5) | rd */
> +    unsigned int base = ext ? 0x91000000 : 0x11000000;
> +    assert(aimm <= 0xfff);
> +    tcg_out32(s, base | (aimm << 10) | (rn << 5) | rd);
> +}
> +
> +static inline void tcg_out_subi(TCGContext *s,
> +                                int ext, int rd, int rn, unsigned int aimm)
> +{
> +    /* sub immediate aimm unsigned 12bit value (we use LSL 0 - no shift) */
> +    /* using SUB 0x51000000 | (ext) | (aimm << 10) | (rn << 5) | rd */
> +    unsigned int base = ext ? 0xd1000000 : 0x51000000;
> +    assert(aimm <= 0xfff);
> +    tcg_out32(s, base | (aimm << 10) | (rn << 5) | rd);
> +}
> +
> +static void tcg_target_qemu_prologue(TCGContext *s)
> +{
> +    /* NB: frame sizes are in 16 byte stack units! */
> +    int frame_size_callee_saved, frame_size_tcg_locals;
> +    int r;
> +
> +    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
> +    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
> +
> +    /* frame size requirement for TCG local variables */
> +    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
> +        + CPU_TEMP_BUF_NLONGS * sizeof(long)
> +        + (TCG_TARGET_STACK_ALIGN - 1);
> +    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
> +    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
> +
> +    /* push (FP, LR) and update sp */
> +    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
> +
> +    /* FP -> callee_saved */
> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
> +
> +    /* store callee-preserved regs x19..x28 using FP -> callee_saved */
> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
> +        tcg_out_store_pair(s, r, r + 1, idx);
> +    }
> +
> +    /* make stack space for TCG locals */
> +    tcg_out_subi(s, 1, TCG_REG_SP, TCG_REG_SP,
> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
> +    /* inform TCG about how to find TCG locals with register, offset, size */
> +    tcg_set_frame(s, TCG_REG_SP, TCG_STATIC_CALL_ARGS_SIZE,
> +                  CPU_TEMP_BUF_NLONGS * sizeof(long));
> +
> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
> +
> +    tb_ret_addr = s->code_ptr;
> +
> +    /* remove TCG locals stack space */
> +    tcg_out_addi(s, 1, TCG_REG_SP, TCG_REG_SP,
> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
> +
> +    /* restore registers x19..x28.
> +       FP must be preserved, so it still points to callee_saved area */
> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
> +        tcg_out_load_pair(s, r, r + 1, idx);
> +    }
> +
> +    /* pop (FP, LR), restore SP to previous frame, return */
> +    tcg_out_pop_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
> +    tcg_out_ret(s);
> +}


Laurent

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-28 13:14     ` Laurent Desnogues
@ 2013-05-28 14:37       ` Claudio Fontana
  0 siblings, 0 replies; 60+ messages in thread
From: Claudio Fontana @ 2013-05-28 14:37 UTC (permalink / raw)
  To: Laurent Desnogues; +Cc: Peter Maydell, qemu-devel, Richard Henderson

Hi Laurent,

On 28.05.2013 15:14, Laurent Desnogues wrote:
> Hi Claudio,
> 
> here are some minor tweaks and comments.
> 
> Your work is very interesting and will form a good basis for further
> improvements. Thanks!

Thanks, I applied your limit-check fixes.
You are right about the unused function.
It should go into a separate series together with its users.
[no more comments below]

>> diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
>> new file mode 100644
>> index 0000000..da859c7
>> --- /dev/null
>> +++ b/tcg/aarch64/tcg-target.c
>> @@ -0,0 +1,1185 @@
>> +/*
>> + * Initial TCG Implementation for aarch64
>> + *
>> + * Copyright (c) 2013 Huawei Technologies Duesseldorf GmbH
>> + * Written by Claudio Fontana
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * (at your option) any later version.
>> + *
>> + * See the COPYING file in the top-level directory for details.
>> + */
>> +
>> +#ifdef TARGET_WORDS_BIGENDIAN
>> +#error "Sorry, bigendian target not supported yet."
>> +#endif /* TARGET_WORDS_BIGENDIAN */
>> +
>> +#ifndef NDEBUG
>> +static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>> +    "%x0", "%x1", "%x2", "%x3", "%x4", "%x5", "%x6", "%x7",
>> +    "%x8", "%x9", "%x10", "%x11", "%x12", "%x13", "%x14", "%x15",
>> +    "%x16", "%x17", "%x18", "%x19", "%x20", "%x21", "%x22", "%x23",
>> +    "%x24", "%x25", "%x26", "%x27", "%x28",
>> +    "%fp", /* frame pointer */
>> +    "%lr", /* link register */
>> +    "%sp",  /* stack pointer */
>> +};
>> +#endif /* NDEBUG */
>> +
>> +static const int tcg_target_reg_alloc_order[] = {
>> +    TCG_REG_X20, TCG_REG_X21, TCG_REG_X22, TCG_REG_X23,
>> +    TCG_REG_X24, TCG_REG_X25, TCG_REG_X26, TCG_REG_X27,
>> +    TCG_REG_X28,
>> +
>> +    TCG_REG_X9, TCG_REG_X10, TCG_REG_X11, TCG_REG_X12,
>> +    TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
>> +    TCG_REG_X16, TCG_REG_X17,
>> +
>> +    TCG_REG_X18, TCG_REG_X19, /* will not use these, see tcg_target_init */
>> +
>> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
>> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
>> +
>> +    TCG_REG_X8, /* will not use, see tcg_target_init */
>> +};
>> +
>> +static const int tcg_target_call_iarg_regs[8] = {
>> +    TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
>> +    TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7
>> +};
>> +static const int tcg_target_call_oarg_regs[1] = {
>> +    TCG_REG_X0
>> +};
>> +
>> +#define TCG_REG_TMP TCG_REG_X8
>> +
>> +static inline void reloc_pc26(void *code_ptr, tcg_target_long target)
>> +{
>> +    tcg_target_long offset; uint32_t insn;
>> +    offset = (target - (tcg_target_long)code_ptr) / 4;
>> +    offset &= 0x03ffffff;
>> +    /* read instruction, mask away previous PC_REL26 parameter contents,
>> +       set the proper offset, then write back the instruction. */
>> +    insn = *(uint32_t *)code_ptr;
>> +    insn = (insn & 0xfc000000) | offset;
>> +    *(uint32_t *)code_ptr = insn;
>> +}
>> +
>> +static inline void reloc_pc19(void *code_ptr, tcg_target_long target)
>> +{
>> +    tcg_target_long offset; uint32_t insn;
>> +    offset = (target - (tcg_target_long)code_ptr) / 4;
>> +    offset &= 0x07ffff;
>> +    /* read instruction, mask away previous PC_REL19 parameter contents,
>> +       set the proper offset, then write back the instruction. */
>> +    insn = *(uint32_t *)code_ptr;
>> +    insn = (insn & 0xff00001f) | offset << 5; /* lower 5 bits = condition */
> 
> The comment is wrong: only the lower 4 bits form the condition.
> 
>> +    *(uint32_t *)code_ptr = insn;
>> +}
>> +
>> +static inline void patch_reloc(uint8_t *code_ptr, int type,
>> +                               tcg_target_long value, tcg_target_long addend)
>> +{
>> +    switch (type) {
>> +    case R_AARCH64_JUMP26:
>> +    case R_AARCH64_CALL26:
>> +        reloc_pc26(code_ptr, value);
>> +        break;
>> +    case R_AARCH64_CONDBR19:
>> +        reloc_pc19(code_ptr, value);
>> +        break;
>> +
>> +    default:
>> +        tcg_abort();
>> +    }
>> +}
>> +
>> +/* parse target specific constraints */
>> +static int target_parse_constraint(TCGArgConstraint *ct,
>> +                                   const char **pct_str)
>> +{
>> +    const char *ct_str = *pct_str;
>> +
>> +    switch (ct_str[0]) {
>> +    case 'r':
>> +        ct->ct |= TCG_CT_REG;
>> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
>> +        break;
>> +    case 'l': /* qemu_ld / qemu_st address, data_reg */
>> +        ct->ct |= TCG_CT_REG;
>> +        tcg_regset_set32(ct->u.regs, 0, (1ULL << TCG_TARGET_NB_REGS) - 1);
>> +#ifdef CONFIG_SOFTMMU
>> +        /* x0 and x1 will be overwritten when reading the tlb entry,
>> +           and x2, and x3 for helper args, better to avoid using them. */
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X0);
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X1);
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X2);
>> +        tcg_regset_reset_reg(ct->u.regs, TCG_REG_X3);
>> +#endif
>> +        break;
>> +    default:
>> +        return -1;
>> +    }
>> +
>> +    ct_str++;
>> +    *pct_str = ct_str;
>> +    return 0;
>> +}
>> +
>> +static inline int tcg_target_const_match(tcg_target_long val,
>> +                                         const TCGArgConstraint *arg_ct)
>> +{
>> +    int ct = arg_ct->ct;
>> +
>> +    if (ct & TCG_CT_CONST) {
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +enum aarch64_cond_code {
>> +    COND_EQ = 0x0,
>> +    COND_NE = 0x1,
>> +    COND_CS = 0x2,     /* Unsigned greater or equal */
>> +    COND_HS = COND_CS, /* ALIAS greater or equal */
>> +    COND_CC = 0x3,     /* Unsigned less than */
>> +    COND_LO = COND_CC, /* ALIAS Lower */
>> +    COND_MI = 0x4,     /* Negative */
>> +    COND_PL = 0x5,     /* Zero or greater */
>> +    COND_VS = 0x6,     /* Overflow */
>> +    COND_VC = 0x7,     /* No overflow */
>> +    COND_HI = 0x8,     /* Unsigned greater than */
>> +    COND_LS = 0x9,     /* Unsigned less or equal */
>> +    COND_GE = 0xa,
>> +    COND_LT = 0xb,
>> +    COND_GT = 0xc,
>> +    COND_LE = 0xd,
>> +    COND_AL = 0xe,
>> +    COND_NV = 0xf, /* behaves like COND_AL here */
>> +};
>> +
>> +static const enum aarch64_cond_code tcg_cond_to_aarch64[] = {
>> +    [TCG_COND_EQ] = COND_EQ,
>> +    [TCG_COND_NE] = COND_NE,
>> +    [TCG_COND_LT] = COND_LT,
>> +    [TCG_COND_GE] = COND_GE,
>> +    [TCG_COND_LE] = COND_LE,
>> +    [TCG_COND_GT] = COND_GT,
>> +    /* unsigned */
>> +    [TCG_COND_LTU] = COND_LO,
>> +    [TCG_COND_GTU] = COND_HI,
>> +    [TCG_COND_GEU] = COND_HS,
>> +    [TCG_COND_LEU] = COND_LS,
>> +};
>> +
>> +/* opcodes for LDR / STR instructions with base + simm9 addressing */
>> +enum aarch64_ldst_op_data { /* size of the data moved */
>> +    LDST_8 = 0x38,
>> +    LDST_16 = 0x78,
>> +    LDST_32 = 0xb8,
>> +    LDST_64 = 0xf8,
>> +};
>> +enum aarch64_ldst_op_type { /* type of operation */
>> +    LDST_ST = 0x0,    /* store */
>> +    LDST_LD = 0x4,    /* load */
>> +    LDST_LD_S_X = 0x8,  /* load and sign-extend into Xt */
>> +    LDST_LD_S_W = 0xc,  /* load and sign-extend into Wt */
>> +};
>> +
>> +enum aarch64_arith_opc {
>> +    ARITH_ADD = 0x0b,
>> +    ARITH_SUB = 0x4b,
>> +    ARITH_AND = 0x0a,
>> +    ARITH_OR = 0x2a,
>> +    ARITH_XOR = 0x4a
>> +};
>> +
>> +enum aarch64_srr_opc {
>> +    SRR_SHL = 0x0,
>> +    SRR_SHR = 0x4,
>> +    SRR_SAR = 0x8,
>> +    SRR_ROR = 0xc
>> +};
>> +
>> +static inline enum aarch64_ldst_op_data
>> +aarch64_ldst_get_data(TCGOpcode tcg_op)
>> +{
>> +    switch (tcg_op) {
>> +    case INDEX_op_ld8u_i32: case INDEX_op_ld8s_i32:
>> +    case INDEX_op_ld8u_i64: case INDEX_op_ld8s_i64:
>> +    case INDEX_op_st8_i32: case INDEX_op_st8_i64:
>> +        return LDST_8;
>> +
>> +    case INDEX_op_ld16u_i32: case INDEX_op_ld16s_i32:
>> +    case INDEX_op_ld16u_i64: case INDEX_op_ld16s_i64:
>> +    case INDEX_op_st16_i32: case INDEX_op_st16_i64:
>> +        return LDST_16;
>> +
>> +    case INDEX_op_ld_i32: case INDEX_op_st_i32:
>> +    case INDEX_op_ld32u_i64: case INDEX_op_ld32s_i64:
>> +    case INDEX_op_st32_i64:
>> +        return LDST_32;
>> +
>> +    case INDEX_op_ld_i64: case INDEX_op_st_i64:
>> +        return LDST_64;
>> +
>> +    default:
>> +        tcg_abort();
>> +    }
>> +}
>> +
>> +static inline enum aarch64_ldst_op_type
>> +aarch64_ldst_get_type(TCGOpcode tcg_op)
>> +{
>> +    switch (tcg_op) {
>> +    case INDEX_op_st8_i32: case INDEX_op_st16_i32:
>> +    case INDEX_op_st8_i64: case INDEX_op_st16_i64:
>> +    case INDEX_op_st_i32:
>> +    case INDEX_op_st32_i64:
>> +    case INDEX_op_st_i64:
>> +        return LDST_ST;
>> +
>> +    case INDEX_op_ld8u_i32: case INDEX_op_ld16u_i32:
>> +    case INDEX_op_ld8u_i64: case INDEX_op_ld16u_i64:
>> +    case INDEX_op_ld_i32:
>> +    case INDEX_op_ld32u_i64:
>> +    case INDEX_op_ld_i64:
>> +        return LDST_LD;
>> +
>> +    case INDEX_op_ld8s_i32: case INDEX_op_ld16s_i32:
>> +        return LDST_LD_S_W;
>> +
>> +    case INDEX_op_ld8s_i64: case INDEX_op_ld16s_i64:
>> +    case INDEX_op_ld32s_i64:
>> +        return LDST_LD_S_X;
>> +
>> +    default:
>> +        tcg_abort();
>> +    }
>> +}
>> +
>> +static inline uint32_t tcg_in32(TCGContext *s)
>> +{
>> +    uint32_t v = *(uint32_t *)s->code_ptr;
>> +    return v;
>> +}
>> +
>> +static inline void tcg_out_ldst_9(TCGContext *s,
>> +                                  enum aarch64_ldst_op_data op_data,
>> +                                  enum aarch64_ldst_op_type op_type,
>> +                                  int rd, int rn, tcg_target_long offset)
>> +{
>> +    /* use LDUR with BASE register with 9bit signed unscaled offset */
>> +    unsigned int mod, off;
>> +
>> +    if (offset < 0) {
>> +        off = (256 + offset);
>> +        mod = 0x1;
>> +
> 
> Extra blank line.
> 
>> +    } else {
>> +        off = offset;
>> +        mod = 0x0;
>> +    }
>> +
>> +    mod |= op_type;
>> +    tcg_out32(s, op_data << 24 | mod << 20 | off << 12 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_movr(TCGContext *s, int ext, int rd, int source)
>> +{
>> +    /* register to register move using MOV (shifted register with no shift) */
>> +    /* using MOV 0x2a0003e0 | (shift).. */
>> +    unsigned int base = ext ? 0xaa0003e0 : 0x2a0003e0;
>> +    tcg_out32(s, base | source << 16 | rd);
>> +}
> 
> See comment below for tcg_out_movr_sp.
> 
>> +static inline void tcg_out_movi32(TCGContext *s, int ext, int rd,
>> +                                  uint32_t value)
>> +{
>> +    uint32_t half, base, movk = 0;
>> +    if (!value) {
>> +        tcg_out_movr(s, ext, rd, TCG_REG_XZR);
>> +        return;
>> +    }
>> +    /* construct halfwords of the immediate with MOVZ with LSL */
>> +    /* using MOVZ 0x52800000 | extended reg.. */
>> +    base = ext ? 0xd2800000 : 0x52800000;
>> +
>> +    half = value & 0xffff;
>> +    if (half) {
>> +        tcg_out32(s, base | half << 5 | rd);
>> +        movk = 0x20000000; /* morph next MOVZ into MOVK */
>> +    }
>> +
>> +    half = value >> 16;
>> +    if (half) { /* add shift 0x00200000. Op can be MOVZ or MOVK */
>> +        tcg_out32(s, base | movk | 0x00200000 | half << 5 | rd);
>> +    }
>> +}
>> +
>> +static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
>> +{
>> +    uint32_t half, base, movk = 0, shift = 0;
>> +    if (!value) {
>> +        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
>> +        return;
>> +    }
>> +    /* construct halfwords of the immediate with MOVZ with LSL */
>> +    /* using MOVZ 0x52800000 | extended reg.. */
>> +    base = 0xd2800000;
>> +
>> +    while (value) {
>> +        half = value & 0xffff;
>> +        if (half) {
>> +            /* Op can be MOVZ or MOVK */
>> +            tcg_out32(s, base | movk | shift | half << 5 | rd);
>> +            if (!movk) {
>> +                movk = 0x20000000; /* morph next MOVZs into MOVKs */
>> +            }
>> +        }
>> +        value >>= 16;
>> +        shift += 0x00200000;
>> +    }
>> +}
>> +
>> +static inline void tcg_out_ldst_r(TCGContext *s,
>> +                                  enum aarch64_ldst_op_data op_data,
>> +                                  enum aarch64_ldst_op_type op_type,
>> +                                  int rd, int base, int regoff)
>> +{
>> +    /* load from memory to register using base + 64bit register offset */
>> +    /* using f.e. STR Wt, [Xn, Xm] 0xb8600800|(regoff << 16)|(base << 5)|rd */
>> +    /* the 0x6000 is for the "no extend field" */
>> +    tcg_out32(s, 0x00206800
>> +              | op_data << 24 | op_type << 20 | regoff << 16 | base << 5 | rd);
>> +}
>> +
>> +/* solve the whole ldst problem */
>> +static inline void tcg_out_ldst(TCGContext *s, enum aarch64_ldst_op_data data,
>> +                                enum aarch64_ldst_op_type type,
>> +                                int rd, int rn, tcg_target_long offset)
>> +{
>> +    if (offset > -256 && offset < 256) {
> 
> Offset >= -256.
> 
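Right, the lower bound is off by one: a 9-bit signed unscaled offset covers
-256..255. Will change it to something like this (untested):

    if (offset >= -256 && offset < 256) {
        tcg_out_ldst_9(s, data, type, rd, rn, offset);
    } else {
        tcg_out_movi64(s, TCG_REG_TMP, offset);
        tcg_out_ldst_r(s, data, type, rd, rn, TCG_REG_TMP);
    }
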
>> +        tcg_out_ldst_9(s, data, type, rd, rn, offset);
>> +
>> +    } else {
>> +        tcg_out_movi64(s, TCG_REG_TMP, offset);
>> +        tcg_out_ldst_r(s, data, type, rd, rn, TCG_REG_TMP);
>> +    }
>> +}
>> +
>> +static inline void tcg_out_movi(TCGContext *s, TCGType type,
>> +                                TCGReg rd, tcg_target_long value)
>> +{
>> +    if (type == TCG_TYPE_I64) {
>> +        tcg_out_movi64(s, rd, value);
>> +    } else {
>> +        tcg_out_movi32(s, 0, rd, value);
>> +    }
>> +}
>> +
>> +/* mov alias implemented with add immediate, useful to move to/from SP */
>> +static inline void tcg_out_movr_sp(TCGContext *s, int ext, int rd, int rn)
>> +{
>> +    /* using ADD 0x11000000 | (ext) | rn << 5 | rd */
>> +    unsigned int base = ext ? 0x91000000 : 0x11000000;
>> +    tcg_out32(s, base | rn << 5 | rd);
>> +}
> 
> Couldn't this function be used for tcg_out_movr too?  That shouldn't
> be an issue unless you want to use ZR as source or destination (this
> will be true if you change tcg_out_movi according to Richard's
> proposal).
> 
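It could, for everything that never touches register 31: the ADD form reads
31 as SP while the ORR form reads it as XZR, and that is the only difference.
The one current user that relies on the XZR reading is the value == 0 shortcut
in tcg_out_movi32/64, and with Richard's proposal there will be more such
uses, so I'd rather keep both helpers for now. Just to show where the ADD form
is safe, tcg_out_mov only ever sees allocatable registers, so it could become
(untested sketch):

    static inline void tcg_out_mov(TCGContext *s,
                                   TCGType type, TCGReg ret, TCGReg arg)
    {
        if (ret != arg) {
            /* ADD ret, arg, #0: also correct when SP is one of the operands */
            tcg_out_movr_sp(s, type == TCG_TYPE_I64, ret, arg);
        }
    }
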
>> +static inline void tcg_out_mov(TCGContext *s,
>> +                               TCGType type, TCGReg ret, TCGReg arg)
>> +{
>> +    if (ret != arg) {
>> +        tcg_out_movr(s, type == TCG_TYPE_I64, ret, arg);
>> +    }
>> +}
>> +
>> +static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg arg,
>> +                              TCGReg arg1, tcg_target_long arg2)
>> +{
>> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_LD,
>> +                 arg, arg1, arg2);
>> +}
>> +
>> +static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
>> +                              TCGReg arg1, tcg_target_long arg2)
>> +{
>> +    tcg_out_ldst(s, (type == TCG_TYPE_I64) ? LDST_64 : LDST_32, LDST_ST,
>> +                 arg, arg1, arg2);
>> +}
>> +
>> +static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
>> +                                 int ext, int rd, int rn, int rm)
>> +{
>> +    /* Using shifted register arithmetic operations */
>> +    /* if extended register operation (64bit) just or with 0x80 << 24 */
>> +    unsigned int base = ext ? (0x80 | opc) << 24 : opc << 24;
>> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_mul(TCGContext *s, int ext, int rd, int rn, int rm)
>> +{
>> +    /* Using MADD 0x1b000000 with Ra = wzr alias MUL 0x1b007c00 */
>> +    unsigned int base = ext ? 0x9b007c00 : 0x1b007c00;
>> +    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_shiftrot_reg(TCGContext *s,
>> +                                        enum aarch64_srr_opc opc, int ext,
>> +                                        int rd, int rn, int rm)
>> +{
>> +    /* using 2-source data processing instructions 0x1ac02000 */
>> +    unsigned int base = ext ? 0x9ac02000 : 0x1ac02000;
>> +    tcg_out32(s, base | rm << 16 | opc << 8 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_ubfm(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int a, unsigned int b)
>> +{
>> +    /* Using UBFM 0x53000000 Wd, Wn, a, b - ext encoding requires the 0x4 */
>> +    unsigned int base = ext ? 0xd3400000 : 0x53000000;
>> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_sbfm(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int a, unsigned int b)
>> +{
>> +    /* Using SBFM 0x13000000 Wd, Wn, a, b - ext encoding requires the 0x4 */
>> +    unsigned int base = ext ? 0x93400000 : 0x13000000;
>> +    tcg_out32(s, base | a << 16 | b << 10 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_extr(TCGContext *s, int ext,
>> +                                int rd, int rn, int rm, unsigned int a)
>> +{
>> +    /* Using EXTR 0x13800000 Wd, Wn, Wm, a - ext encoding requires the 0x4 */
>> +    unsigned int base = ext ? 0x93c00000 : 0x13800000;
>> +    tcg_out32(s, base | rm << 16 | a << 10 | rn << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_shl(TCGContext *s, int ext,
>> +                               int rd, int rn, unsigned int m)
>> +{
>> +    int bits, max;
>> +    bits = ext ? 64 : 32;
>> +    max = bits - 1;
>> +    tcg_out_ubfm(s, ext, rd, rn, bits - (m & max), max - (m & max));
>> +}
>> +
>> +static inline void tcg_out_shr(TCGContext *s, int ext,
>> +                               int rd, int rn, unsigned int m)
>> +{
>> +    int max = ext ? 63 : 31;
>> +    tcg_out_ubfm(s, ext, rd, rn, m & max, max);
>> +}
>> +
>> +static inline void tcg_out_sar(TCGContext *s, int ext,
>> +                               int rd, int rn, unsigned int m)
>> +{
>> +    int max = ext ? 63 : 31;
>> +    tcg_out_sbfm(s, ext, rd, rn, m & max, max);
>> +}
>> +
>> +static inline void tcg_out_rotr(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int m)
>> +{
>> +    int max = ext ? 63 : 31;
>> +    tcg_out_extr(s, ext, rd, rn, rn, m & max);
>> +}
>> +
>> +static inline void tcg_out_rotl(TCGContext *s, int ext,
>> +                                int rd, int rn, unsigned int m)
>> +{
>> +    int bits, max;
>> +    bits = ext ? 64 : 32;
>> +    max = bits - 1;
>> +    tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
>> +}
>> +
>> +static inline void tcg_out_cmp(TCGContext *s, int ext,
>> +                               int rn, int rm)
>> +{
>> +    /* Using CMP alias SUBS wzr, Wn, Wm */
>> +    unsigned int base = ext ? 0xeb00001f : 0x6b00001f;
>> +    tcg_out32(s, base | rm << 16 | rn << 5);
>> +}
>> +
>> +static inline void tcg_out_cset(TCGContext *s, int ext,
>> +                                int rd, TCGCond c)
>> +{
>> +    /* Using CSET alias of CSINC 0x1a800400 Xd, XZR, XZR, invert(cond) */
>> +    unsigned int base = ext ? 0x9a9f07e0 : 0x1a9f07e0;
>> +    tcg_out32(s, base | tcg_cond_to_aarch64[tcg_invert_cond(c)] << 12 | rd);
>> +}
>> +
>> +static inline void tcg_out_goto(TCGContext *s, tcg_target_long target)
>> +{
>> +    tcg_target_long offset;
>> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
> 
> offset < -0x02000000
> 
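Yes, the valid word offset range is -0x02000000..0x01ffffff. The same check
is repeated below in tcg_out_call and in aarch64_tb_set_jmp_target, so I'll
factor it out, roughly (untested, helper name made up):

    static inline int offset_fits_simm26(tcg_target_long offset)
    {
        return offset >= -0x02000000 && offset < 0x02000000;
    }

and then here and in the other two places:

    if (!offset_fits_simm26(offset)) {
        tcg_abort(); /* out of 26bit range */
    }
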
>> +        /* out of 26bit range */
>> +        tcg_abort();
>> +    }
>> +
>> +    tcg_out32(s, 0x14000000 | (offset & 0x03ffffff));
>> +}
>> +
>> +static inline void tcg_out_goto_noaddr(TCGContext *s)
>> +{
>> +    /* We pay attention here to not modify the branch target by
>> +       reading from the buffer. This ensures that caches and memory are
>> +       kept coherent during retranslation.
>> +       Mask away possible garbage in the high bits for the first translation,
>> +       while keeping the offset bits for retranslation. */
>> +    uint32_t insn;
>> +    insn = (tcg_in32(s) & 0x03ffffff) | 0x14000000;
>> +    tcg_out32(s, insn);
>> +}
>> +
>> +static inline void tcg_out_goto_cond_noaddr(TCGContext *s, TCGCond c)
>> +{
>> +    /* see comments in tcg_out_goto_noaddr */
>> +    uint32_t insn;
>> +    insn = tcg_in32(s) & (0x07ffff << 5);
>> +    insn |= 0x54000000 | tcg_cond_to_aarch64[c];
>> +    tcg_out32(s, insn);
>> +}
>> +
>> +static inline void tcg_out_goto_cond(TCGContext *s, TCGCond c,
>> +                                     tcg_target_long target)
>> +{
>> +    tcg_target_long offset;
>> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
>> +
>> +    if (offset <= -0x3ffff || offset >= 0x3ffff) {
> 
> offset < -0x40000
> 
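Will fix; the 19-bit signed field covers -0x40000..0x3ffff, so the check
should read:

    if (offset < -0x40000 || offset >= 0x40000) {
        /* out of 19bit range */
        tcg_abort();
    }
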
>> +        /* out of 19bit range */
>> +        tcg_abort();
>> +    }
>> +
>> +    offset &= 0x7ffff;
>> +    tcg_out32(s, 0x54000000 | tcg_cond_to_aarch64[c] | offset << 5);
>> +}
>> +
>> +static inline void tcg_out_callr(TCGContext *s, int reg)
>> +{
>> +    tcg_out32(s, 0xd63f0000 | reg << 5);
>> +}
>> +
>> +static inline void tcg_out_gotor(TCGContext *s, int reg)
>> +{
>> +    tcg_out32(s, 0xd61f0000 | reg << 5);
>> +}
>> +
>> +static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
>> +{
>> +    tcg_target_long offset;
>> +
>> +    offset = (target - (tcg_target_long)s->code_ptr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) { /* out of 26bit rng */
> 
> offset < -0x02000000
> 
>> +        tcg_out_movi64(s, TCG_REG_TMP, target);
>> +        tcg_out_callr(s, TCG_REG_TMP);
>> +
> 
> Extra blank line.
> 
>> +    } else {
>> +        tcg_out32(s, 0x94000000 | (offset & 0x03ffffff));
>> +    }
>> +}
>> +
>> +/* test a register against a bit pattern made of pattern_n repeated 1s.
>> +   For example, to test against 0111b (0x07), pass pattern_n = 3 */
>> +static inline void tcg_out_tst(TCGContext *s, int ext, int rn,
>> +                               tcg_target_ulong pattern_n)
>> +{
>> +    /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f. Ext requires 4. */
>> +    unsigned int base = ext ? 0xf240001f : 0x7200001f;
>> +    tcg_out32(s, base | (pattern_n - 1) << 10 | rn << 5);
>> +}
> 
> You probably should protect against pattern_n == 0.
> Note this function is currently unused.
> 
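Agreed. The tst is only there in preparation of the tlb lookup code; until
that lands I'll either drop it or guard it, e.g. with

    assert(pattern_n > 0 && pattern_n < (ext ? 64 : 32));

since the all-ones pattern cannot be encoded as a bitmask immediate anyway.
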
>> +static inline void tcg_out_ret(TCGContext *s)
>> +{
>> +    /* emit RET { LR } */
>> +    tcg_out32(s, 0xd65f03c0);
>> +}
>> +
>> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
>> +{
>> +    tcg_target_long target, offset;
>> +    target = (tcg_target_long)addr;
>> +    offset = (target - (tcg_target_long)jmp_addr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
> 
> offset < -0x02000000
> 
>> +        /* out of 26bit range */
>> +        tcg_abort();
>> +    }
>> +
>> +    patch_reloc((uint8_t *)jmp_addr, R_AARCH64_JUMP26, target, 0);
>> +    flush_icache_range(jmp_addr, jmp_addr + 4);
>> +}
>> +
>> +static inline void tcg_out_goto_label(TCGContext *s, int label_index)
>> +{
>> +    TCGLabel *l = &s->labels[label_index];
>> +
>> +    if (!l->has_value) {
>> +        tcg_out_reloc(s, s->code_ptr, R_AARCH64_JUMP26, label_index, 0);
>> +        tcg_out_goto_noaddr(s);
>> +
>> +    } else {
>> +        tcg_out_goto(s, l->u.value);
>> +    }
>> +}
>> +
>> +static inline void tcg_out_goto_label_cond(TCGContext *s,
>> +                                           TCGCond c, int label_index)
>> +{
>> +    TCGLabel *l = &s->labels[label_index];
>> +
>> +    if (!l->has_value) {
>> +        tcg_out_reloc(s, s->code_ptr, R_AARCH64_CONDBR19, label_index, 0);
>> +        tcg_out_goto_cond_noaddr(s, c);
>> +
>> +    } else {
>> +        tcg_out_goto_cond(s, c, l->u.value);
>> +    }
>> +}
>> +
>> +#ifdef CONFIG_SOFTMMU
>> +#include "exec/softmmu_defs.h"
>> +
>> +/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
>> +   int mmu_idx) */
>> +static const void * const qemu_ld_helpers[4] = {
>> +    helper_ldb_mmu,
>> +    helper_ldw_mmu,
>> +    helper_ldl_mmu,
>> +    helper_ldq_mmu,
>> +};
>> +
>> +/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
>> +   uintxx_t val, int mmu_idx) */
>> +static const void * const qemu_st_helpers[4] = {
>> +    helper_stb_mmu,
>> +    helper_stw_mmu,
>> +    helper_stl_mmu,
>> +    helper_stq_mmu,
>> +};
>> +
>> +#endif /* CONFIG_SOFTMMU */
>> +
>> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
>> +{
>> +    int addr_reg, data_reg;
>> +#ifdef CONFIG_SOFTMMU
>> +    int mem_index, s_bits;
>> +#endif
>> +    data_reg = args[0];
>> +    addr_reg = args[1];
>> +
>> +#ifdef CONFIG_SOFTMMU
>> +    mem_index = args[2];
>> +    s_bits = opc & 3;
>> +
>> +    /* TODO: insert TLB lookup here */
>> +
>> +#  if CPU_TLB_BITS > 8
>> +#   error "CPU_TLB_BITS too large"
>> +#  endif
>> +
>> +    /* all arguments passed via registers */
>> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
>> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
>> +    tcg_out_movi32(s, 0, TCG_REG_X2, mem_index);
>> +
>> +    tcg_out_movi64(s, TCG_REG_TMP, (uint64_t)qemu_ld_helpers[s_bits]);
>> +    tcg_out_callr(s, TCG_REG_TMP);
>> +
>> +    if (opc & 0x04) { /* sign extend */
>> +        unsigned int bits; bits = 8 * (1 << s_bits) - 1;
>> +        tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
>> +
>> +    } else {
>> +        tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
>> +    }
>> +
>> +#else /* !CONFIG_SOFTMMU */
>> +    tcg_abort(); /* TODO */
>> +#endif
>> +}
>> +
>> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
>> +{
>> +    int addr_reg, data_reg;
>> +#ifdef CONFIG_SOFTMMU
>> +    int mem_index, s_bits;
>> +#endif
>> +    data_reg = args[0];
>> +    addr_reg = args[1];
>> +
>> +#ifdef CONFIG_SOFTMMU
>> +    mem_index = args[2];
>> +    s_bits = opc & 3;
>> +
>> +    /* TODO: here we should generate something like the following:
>> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
>> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
>> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
>> +     *  test ... XXX
>> +     */
>> +#  if CPU_TLB_BITS > 8
>> +#   error "CPU_TLB_BITS too large"
>> +#  endif
>> +
>> +    /* all arguments passed via registers */
>> +    tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
>> +    tcg_out_movr(s, 1, TCG_REG_X1, addr_reg);
>> +    tcg_out_movr(s, 1, TCG_REG_X2, data_reg);
>> +    tcg_out_movi32(s, 0, TCG_REG_X3, mem_index);
>> +
>> +    tcg_out_movi64(s, TCG_REG_TMP, (uint64_t)qemu_st_helpers[s_bits]);
>> +    tcg_out_callr(s, TCG_REG_TMP);
>> +
>> +#else /* !CONFIG_SOFTMMU */
>> +    tcg_abort(); /* TODO */
>> +#endif
>> +}
>> +
>> +static uint8_t *tb_ret_addr;
>> +
>> +/* callee stack use example:
>> +   stp     x29, x30, [sp,#-32]!
>> +   mov     x29, sp
>> +   stp     x1, x2, [sp,#16]
>> +   ...
>> +   ldp     x1, x2, [sp,#16]
>> +   ldp     x29, x30, [sp],#32
>> +   ret
>> +*/
>> +
>> +/* push r1 and r2, and alloc stack space for a total of
>> +   alloc_n elements (1 element=16 bytes, must be between 1 and 31). */
>> +static inline void tcg_out_push_pair(TCGContext *s,
>> +                                     TCGReg r1, TCGReg r2, int alloc_n)
>> +{
>> +    /* using indexed scaled simm7 STP 0x28800000 | (ext) | 0x01000000 (pre-idx)
>> +       | alloc_n * (-1) << 16 | r2 << 10 | sp(31) << 5 | r1 */
>> +    assert(alloc_n > 0 && alloc_n < 0x20);
>> +    alloc_n = (-alloc_n) & 0x3f;
>> +    tcg_out32(s, 0xa98003e0 | alloc_n << 16 | r2 << 10 | r1);
>> +}
>> +
>> +/* dealloc stack space for a total of alloc_n elements and pop r1, r2.  */
>> +static inline void tcg_out_pop_pair(TCGContext *s,
>> +                                 TCGReg r1, TCGReg r2, int alloc_n)
>> +{
>> +    /* using indexed scaled simm7 LDP 0x28c00000 | (ext) | nothing (post-idx)
>> +       | alloc_n << 16 | r2 << 10 | sp(31) << 5 | r1 */
>> +    assert(alloc_n > 0 && alloc_n < 0x20);
>> +    tcg_out32(s, 0xa8c003e0 | alloc_n << 16 | r2 << 10 | r1);
>> +}
>> +
>> +static inline void tcg_out_store_pair(TCGContext *s,
>> +                                      TCGReg r1, TCGReg r2, int idx)
>> +{
>> +    /* using register pair offset simm7 STP 0x29000000 | (ext)
>> +       | idx << 16 | r2 << 10 | fp(29) << 5 | r1 */
>> +    assert(idx > 0 && idx < 0x20);
>> +    tcg_out32(s, 0xa90003a0 | idx << 16 | r2 << 10 | r1);
>> +}
>> +
>> +static inline void tcg_out_load_pair(TCGContext *s,
>> +                                     TCGReg r1, TCGReg r2, int idx)
>> +{
>> +    /* using register pair offset simm7 LDP 0x29400000 | (ext)
>> +       | idx << 16 | r2 << 10 | fp(29) << 5 | r1 */
>> +    assert(idx > 0 && idx < 0x20);
>> +    tcg_out32(s, 0xa94003a0 | idx << 16 | r2 << 10 | r1);
>> +}
>> +
>> +static void tcg_out_op(TCGContext *s, TCGOpcode opc,
>> +                       const TCGArg *args, const int *const_args)
>> +{
>> +    int ext = 0;
>> +
>> +    switch (opc) {
>> +    case INDEX_op_exit_tb:
>> +        tcg_out_movi64(s, TCG_REG_X0, args[0]); /* load retval in X0 */
>> +        tcg_out_goto(s, (tcg_target_long)tb_ret_addr);
>> +        break;
>> +
>> +    case INDEX_op_goto_tb:
>> +#ifndef USE_DIRECT_JUMP
>> +#error "USE_DIRECT_JUMP required for aarch64"
>> +#endif
>> +        assert(s->tb_jmp_offset != NULL); /* consistency for USE_DIRECT_JUMP */
>> +        s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
>> +        /* actual branch destination will be patched by
>> +           aarch64_tb_set_jmp_target later, beware retranslation. */
>> +        tcg_out_goto_noaddr(s);
>> +        s->tb_next_offset[args[0]] = s->code_ptr - s->code_buf;
>> +        break;
>> +
>> +    case INDEX_op_call:
>> +        if (const_args[0]) {
>> +            tcg_out_call(s, args[0]);
>> +        } else {
>> +            tcg_out_callr(s, args[0]);
>> +        }
>> +        break;
>> +
>> +    case INDEX_op_br:
>> +        tcg_out_goto_label(s, args[0]);
>> +        break;
>> +
>> +    case INDEX_op_ld_i32:
>> +    case INDEX_op_ld_i64:
>> +    case INDEX_op_st_i32:
>> +    case INDEX_op_st_i64:
>> +    case INDEX_op_ld8u_i32:
>> +    case INDEX_op_ld8s_i32:
>> +    case INDEX_op_ld16u_i32:
>> +    case INDEX_op_ld16s_i32:
>> +    case INDEX_op_ld8u_i64:
>> +    case INDEX_op_ld8s_i64:
>> +    case INDEX_op_ld16u_i64:
>> +    case INDEX_op_ld16s_i64:
>> +    case INDEX_op_ld32u_i64:
>> +    case INDEX_op_ld32s_i64:
>> +    case INDEX_op_st8_i32:
>> +    case INDEX_op_st8_i64:
>> +    case INDEX_op_st16_i32:
>> +    case INDEX_op_st16_i64:
>> +    case INDEX_op_st32_i64:
>> +        tcg_out_ldst(s, aarch64_ldst_get_data(opc), aarch64_ldst_get_type(opc),
>> +                     args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_mov_i64: ext = 1;
>> +    case INDEX_op_mov_i32:
>> +        tcg_out_movr(s, ext, args[0], args[1]);
>> +        break;
>> +
>> +    case INDEX_op_movi_i64:
>> +        tcg_out_movi64(s, args[0], args[1]);
>> +        break;
>> +
>> +    case INDEX_op_movi_i32:
>> +        tcg_out_movi32(s, 0, args[0], args[1]);
>> +        break;
>> +
>> +    case INDEX_op_add_i64: ext = 1;
>> +    case INDEX_op_add_i32:
>> +        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_sub_i64: ext = 1;
>> +    case INDEX_op_sub_i32:
>> +        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_and_i64: ext = 1;
>> +    case INDEX_op_and_i32:
>> +        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_or_i64: ext = 1;
>> +    case INDEX_op_or_i32:
>> +        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_xor_i64: ext = 1;
>> +    case INDEX_op_xor_i32:
>> +        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_mul_i64: ext = 1;
>> +    case INDEX_op_mul_i32:
>> +        tcg_out_mul(s, ext, args[0], args[1], args[2]);
>> +        break;
>> +
>> +    case INDEX_op_shl_i64: ext = 1;
>> +    case INDEX_op_shl_i32:
>> +        if (const_args[2]) {    /* LSL / UBFM Wd, Wn, (32 - m) */
>> +            tcg_out_shl(s, ext, args[0], args[1], args[2]);
>> +        } else {                /* LSL / LSLV */
>> +            tcg_out_shiftrot_reg(s, SRR_SHL, ext, args[0], args[1], args[2]);
>> +        }
>> +        break;
>> +
>> +    case INDEX_op_shr_i64: ext = 1;
>> +    case INDEX_op_shr_i32:
>> +        if (const_args[2]) {    /* LSR / UBFM Wd, Wn, m, 31 */
>> +            tcg_out_shr(s, ext, args[0], args[1], args[2]);
>> +        } else {                /* LSR / LSRV */
>> +            tcg_out_shiftrot_reg(s, SRR_SHR, ext, args[0], args[1], args[2]);
>> +        }
>> +        break;
>> +
>> +    case INDEX_op_sar_i64: ext = 1;
>> +    case INDEX_op_sar_i32:
>> +        if (const_args[2]) {    /* ASR / SBFM Wd, Wn, m, 31 */
>> +            tcg_out_sar(s, ext, args[0], args[1], args[2]);
>> +        } else {                /* ASR / ASRV */
>> +            tcg_out_shiftrot_reg(s, SRR_SAR, ext, args[0], args[1], args[2]);
>> +        }
>> +        break;
>> +
>> +    case INDEX_op_rotr_i64: ext = 1;
>> +    case INDEX_op_rotr_i32:
>> +        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, m */
>> +            tcg_out_rotr(s, ext, args[0], args[1], args[2]);
>> +        } else {                /* ROR / RORV */
>> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
>> +        }
>> +        break;
>> +
>> +    case INDEX_op_rotl_i64: ext = 1;
>> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
>> +        if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, 32 - m */
>> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
>> +        } else {
>> +            tcg_out_arith(s, ARITH_SUB, ext, args[2], TCG_REG_XZR, args[2]);
>> +            tcg_out_shiftrot_reg(s, SRR_ROR, ext, args[0], args[1], args[2]);
>> +        }
>> +        break;
>> +
>> +    case INDEX_op_brcond_i64: ext = 1;
>> +    case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
>> +        tcg_out_cmp(s, ext, args[0], args[1]);
>> +        tcg_out_goto_label_cond(s, args[2], args[3]);
>> +        break;
>> +
>> +    case INDEX_op_setcond_i64: ext = 1;
>> +    case INDEX_op_setcond_i32:
>> +        tcg_out_cmp(s, ext, args[1], args[2]);
>> +        tcg_out_cset(s, ext, args[0], args[3]);
>> +        break;
>> +
>> +    case INDEX_op_qemu_ld8u:
>> +        tcg_out_qemu_ld(s, args, 0 | 0);
>> +        break;
>> +    case INDEX_op_qemu_ld8s:
>> +        tcg_out_qemu_ld(s, args, 4 | 0);
>> +        break;
>> +    case INDEX_op_qemu_ld16u:
>> +        tcg_out_qemu_ld(s, args, 0 | 1);
>> +        break;
>> +    case INDEX_op_qemu_ld16s:
>> +        tcg_out_qemu_ld(s, args, 4 | 1);
>> +        break;
>> +    case INDEX_op_qemu_ld32u:
>> +        tcg_out_qemu_ld(s, args, 0 | 2);
>> +        break;
>> +    case INDEX_op_qemu_ld32s:
>> +        tcg_out_qemu_ld(s, args, 4 | 2);
>> +        break;
>> +    case INDEX_op_qemu_ld32:
>> +        tcg_out_qemu_ld(s, args, 0 | 2);
>> +        break;
>> +    case INDEX_op_qemu_ld64:
>> +        tcg_out_qemu_ld(s, args, 0 | 3);
>> +        break;
>> +    case INDEX_op_qemu_st8:
>> +        tcg_out_qemu_st(s, args, 0);
>> +        break;
>> +    case INDEX_op_qemu_st16:
>> +        tcg_out_qemu_st(s, args, 1);
>> +        break;
>> +    case INDEX_op_qemu_st32:
>> +        tcg_out_qemu_st(s, args, 2);
>> +        break;
>> +    case INDEX_op_qemu_st64:
>> +        tcg_out_qemu_st(s, args, 3);
>> +        break;
>> +
>> +    default:
>> +        tcg_abort(); /* opcode not implemented */
>> +    }
>> +}
>> +
>> +static const TCGTargetOpDef aarch64_op_defs[] = {
>> +    { INDEX_op_exit_tb, { } },
>> +    { INDEX_op_goto_tb, { } },
>> +    { INDEX_op_call, { "ri" } },
>> +    { INDEX_op_br, { } },
>> +
>> +    { INDEX_op_mov_i32, { "r", "r" } },
>> +    { INDEX_op_mov_i64, { "r", "r" } },
>> +
>> +    { INDEX_op_movi_i32, { "r" } },
>> +    { INDEX_op_movi_i64, { "r" } },
>> +
>> +    { INDEX_op_ld8u_i32, { "r", "r" } },
>> +    { INDEX_op_ld8s_i32, { "r", "r" } },
>> +    { INDEX_op_ld16u_i32, { "r", "r" } },
>> +    { INDEX_op_ld16s_i32, { "r", "r" } },
>> +    { INDEX_op_ld_i32, { "r", "r" } },
>> +    { INDEX_op_ld8u_i64, { "r", "r" } },
>> +    { INDEX_op_ld8s_i64, { "r", "r" } },
>> +    { INDEX_op_ld16u_i64, { "r", "r" } },
>> +    { INDEX_op_ld16s_i64, { "r", "r" } },
>> +    { INDEX_op_ld32u_i64, { "r", "r" } },
>> +    { INDEX_op_ld32s_i64, { "r", "r" } },
>> +    { INDEX_op_ld_i64, { "r", "r" } },
>> +
>> +    { INDEX_op_st8_i32, { "r", "r" } },
>> +    { INDEX_op_st16_i32, { "r", "r" } },
>> +    { INDEX_op_st_i32, { "r", "r" } },
>> +    { INDEX_op_st8_i64, { "r", "r" } },
>> +    { INDEX_op_st16_i64, { "r", "r" } },
>> +    { INDEX_op_st32_i64, { "r", "r" } },
>> +    { INDEX_op_st_i64, { "r", "r" } },
>> +
>> +    { INDEX_op_add_i32, { "r", "r", "r" } },
>> +    { INDEX_op_add_i64, { "r", "r", "r" } },
>> +    { INDEX_op_sub_i32, { "r", "r", "r" } },
>> +    { INDEX_op_sub_i64, { "r", "r", "r" } },
>> +    { INDEX_op_mul_i32, { "r", "r", "r" } },
>> +    { INDEX_op_mul_i64, { "r", "r", "r" } },
>> +    { INDEX_op_and_i32, { "r", "r", "r" } },
>> +    { INDEX_op_and_i64, { "r", "r", "r" } },
>> +    { INDEX_op_or_i32, { "r", "r", "r" } },
>> +    { INDEX_op_or_i64, { "r", "r", "r" } },
>> +    { INDEX_op_xor_i32, { "r", "r", "r" } },
>> +    { INDEX_op_xor_i64, { "r", "r", "r" } },
>> +
>> +    { INDEX_op_shl_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_shr_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_sar_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_rotl_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_rotr_i32, { "r", "r", "ri" } },
>> +    { INDEX_op_shl_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_shr_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_sar_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_rotl_i64, { "r", "r", "ri" } },
>> +    { INDEX_op_rotr_i64, { "r", "r", "ri" } },
>> +
>> +    { INDEX_op_brcond_i32, { "r", "r" } },
>> +    { INDEX_op_setcond_i32, { "r", "r", "r" } },
>> +    { INDEX_op_brcond_i64, { "r", "r" } },
>> +    { INDEX_op_setcond_i64, { "r", "r", "r" } },
>> +
>> +    { INDEX_op_qemu_ld8u, { "r", "l" } },
>> +    { INDEX_op_qemu_ld8s, { "r", "l" } },
>> +    { INDEX_op_qemu_ld16u, { "r", "l" } },
>> +    { INDEX_op_qemu_ld16s, { "r", "l" } },
>> +    { INDEX_op_qemu_ld32u, { "r", "l" } },
>> +    { INDEX_op_qemu_ld32s, { "r", "l" } },
>> +
>> +    { INDEX_op_qemu_ld32, { "r", "l" } },
>> +    { INDEX_op_qemu_ld64, { "r", "l" } },
>> +
>> +    { INDEX_op_qemu_st8, { "l", "l" } },
>> +    { INDEX_op_qemu_st16, { "l", "l" } },
>> +    { INDEX_op_qemu_st32, { "l", "l" } },
>> +    { INDEX_op_qemu_st64, { "l", "l" } },
>> +    { -1 },
>> +};
>> +
>> +static void tcg_target_init(TCGContext *s)
>> +{
>> +#if !defined(CONFIG_USER_ONLY)
>> +    /* fail safe */
>> +    if ((1ULL << CPU_TLB_ENTRY_BITS) != sizeof(CPUTLBEntry)) {
>> +        tcg_abort();
>> +    }
>> +#endif
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffffffff);
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffffffff);
>> +
>> +    tcg_regset_set32(tcg_target_call_clobber_regs, 0,
>> +                     (1 << TCG_REG_X0) | (1 << TCG_REG_X1) |
>> +                     (1 << TCG_REG_X2) | (1 << TCG_REG_X3) |
>> +                     (1 << TCG_REG_X4) | (1 << TCG_REG_X5) |
>> +                     (1 << TCG_REG_X6) | (1 << TCG_REG_X7) |
>> +                     (1 << TCG_REG_X8) | (1 << TCG_REG_X9) |
>> +                     (1 << TCG_REG_X10) | (1 << TCG_REG_X11) |
>> +                     (1 << TCG_REG_X12) | (1 << TCG_REG_X13) |
>> +                     (1 << TCG_REG_X14) | (1 << TCG_REG_X15) |
>> +                     (1 << TCG_REG_X16) | (1 << TCG_REG_X17) |
>> +                     (1 << TCG_REG_X18) | (1 << TCG_REG_LR));
>> +
>> +    tcg_regset_clear(s->reserved_regs);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
>> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
>> +
>> +    tcg_add_target_add_op_defs(aarch64_op_defs);
>> +}
>> +
>> +static inline void tcg_out_addi(TCGContext *s,
>> +                                int ext, int rd, int rn, unsigned int aimm)
>> +{
>> +    /* add immediate aimm unsigned 12bit value (we use LSL 0 - no shift) */
>> +    /* using ADD 0x11000000 | (ext) | (aimm << 10) | (rn << 5) | rd */
>> +    unsigned int base = ext ? 0x91000000 : 0x11000000;
>> +    assert(aimm <= 0xfff);
>> +    tcg_out32(s, base | (aimm << 10) | (rn << 5) | rd);
>> +}
>> +
>> +static inline void tcg_out_subi(TCGContext *s,
>> +                                int ext, int rd, int rn, unsigned int aimm)
>> +{
>> +    /* sub immediate aimm unsigned 12bit value (we use LSL 0 - no shift) */
>> +    /* using SUB 0x51000000 | (ext) | (aimm << 10) | (rn << 5) | rd */
>> +    unsigned int base = ext ? 0xd1000000 : 0x51000000;
>> +    assert(aimm <= 0xfff);
>> +    tcg_out32(s, base | (aimm << 10) | (rn << 5) | rd);
>> +}
>> +
>> +static void tcg_target_qemu_prologue(TCGContext *s)
>> +{
>> +    /* NB: frame sizes are in 16 byte stack units! */
>> +    int frame_size_callee_saved, frame_size_tcg_locals;
>> +    int r;
>> +
>> +    /* save pairs             (FP, LR) and (X19, X20) .. (X27, X28) */
>> +    frame_size_callee_saved = (1) + (TCG_REG_X28 - TCG_REG_X19) / 2 + 1;
>> +
>> +    /* frame size requirement for TCG local variables */
>> +    frame_size_tcg_locals = TCG_STATIC_CALL_ARGS_SIZE
>> +        + CPU_TEMP_BUF_NLONGS * sizeof(long)
>> +        + (TCG_TARGET_STACK_ALIGN - 1);
>> +    frame_size_tcg_locals &= ~(TCG_TARGET_STACK_ALIGN - 1);
>> +    frame_size_tcg_locals /= TCG_TARGET_STACK_ALIGN;
>> +
>> +    /* push (FP, LR) and update sp */
>> +    tcg_out_push_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
>> +
>> +    /* FP -> callee_saved */
>> +    tcg_out_movr_sp(s, 1, TCG_REG_FP, TCG_REG_SP);
>> +
>> +    /* store callee-preserved regs x19..x28 using FP -> callee_saved */
>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>> +        tcg_out_store_pair(s, r, r + 1, idx);
>> +    }
>> +
>> +    /* make stack space for TCG locals */
>> +    tcg_out_subi(s, 1, TCG_REG_SP, TCG_REG_SP,
>> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
>> +    /* inform TCG about how to find TCG locals with register, offset, size */
>> +    tcg_set_frame(s, TCG_REG_SP, TCG_STATIC_CALL_ARGS_SIZE,
>> +                  CPU_TEMP_BUF_NLONGS * sizeof(long));
>> +
>> +    tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
>> +    tcg_out_gotor(s, tcg_target_call_iarg_regs[1]);
>> +
>> +    tb_ret_addr = s->code_ptr;
>> +
>> +    /* remove TCG locals stack space */
>> +    tcg_out_addi(s, 1, TCG_REG_SP, TCG_REG_SP,
>> +                 frame_size_tcg_locals * TCG_TARGET_STACK_ALIGN);
>> +
>> +    /* restore registers x19..x28.
>> +       FP must be preserved, so it still points to callee_saved area */
>> +    for (r = TCG_REG_X19; r <= TCG_REG_X27; r += 2) {
>> +        int idx; idx = (r - TCG_REG_X19) / 2 + 1;
>> +        tcg_out_load_pair(s, r, r + 1, idx);
>> +    }
>> +
>> +    /* pop (FP, LR), restore SP to previous frame, return */
>> +    tcg_out_pop_pair(s, TCG_REG_FP, TCG_REG_LR, frame_size_callee_saved);
>> +    tcg_out_ret(s);
>> +}
> 
> 
> Laurent
> 


-- 
Claudio Fontana
Server OS Architect
Huawei Technologies Duesseldorf GmbH
Riesstraße 25 - 80992 München

office: +49 89 158834 4135
mobile: +49 15253060158

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64
  2013-05-28  7:17               ` [Qemu-devel] [PATCH 2/4] " Claudio Fontana
@ 2013-05-28 14:52                 ` Richard Henderson
  0 siblings, 0 replies; 60+ messages in thread
From: Richard Henderson @ 2013-05-28 14:52 UTC (permalink / raw)
  To: Claudio Fontana; +Cc: Peter Maydell, Jani Kokkonen, qemu-devel

On 05/28/2013 12:17 AM, Claudio Fontana wrote:
>> >     if (type == TCG_TYPE_I32) {
>> >         value = (uint32_t)value;
>> >         ext = 0;
>> >     } else if (value <= 0xffffffff) {
>> >         ext = 0;
>> >     } else {
>> >         ext = 0x80000000;
>> >     }
> The check for type is probably unnecessary, since we don't gain anything (we still have to check something once), so I'd rather use a uint64_t parameter and then just check for value < 0xffffffff.
> 

The check for type is necessary, because for TYPE_I32 we make no guarantee
about the high bits of value: they may be zero, sign-extended, or indeed garbage.
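
A made-up example of why: a caller emitting the I32 constant -1 may hand us
value = 0xffffffffffffffff (sign-extended) just as legitimately as
value = 0x00000000ffffffff, e.g.

    /* hypothetical caller; only the low 32 bits matter for an I32 move */
    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_X0, (int32_t)-1);

Without the truncation, a plain "value <= 0xffffffff" test on a uint64_t
parameter would send the first form down the 64-bit path (four MOVZ/MOVK),
even though the 32-bit path emits it in two instructions. So: truncate first,
then classify.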


r~

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2013-05-28 14:52 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-14 15:57 [Qemu-devel] QEMU aarch64 TCG target Claudio Fontana
2013-03-14 16:16 ` Peter Maydell
2013-05-06 12:56   ` [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64 Claudio Fontana
2013-05-06 13:27     ` Paolo Bonzini
2013-05-13 13:22       ` [Qemu-devel] [PATCH 0/3] ARM aarch64 TCG target Claudio Fontana
2013-05-13 13:28         ` [Qemu-devel] [PATCH 1/3] configure: permit compilation on arm aarch64 Claudio Fontana
2013-05-13 18:29           ` Peter Maydell
2013-05-14  8:19             ` Claudio Fontana
2013-05-13 13:31         ` [Qemu-devel] [PATCH 2/3] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
2013-05-13 18:34           ` Peter Maydell
2013-05-14  8:24             ` Claudio Fontana
2013-05-13 13:33         ` [Qemu-devel] [PATCH 3/3] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
2013-05-13 18:28           ` Peter Maydell
2013-05-14 12:01             ` Claudio Fontana
2013-05-14 12:25               ` Peter Maydell
2013-05-14 15:19                 ` Richard Henderson
2013-05-16 14:39                   ` Claudio Fontana
2013-05-14 12:41               ` Laurent Desnogues
2013-05-13 19:49           ` Richard Henderson
2013-05-14 14:05             ` Claudio Fontana
2013-05-14 15:16               ` Richard Henderson
2013-05-14 16:26                 ` Richard Henderson
2013-05-06 13:42     ` [Qemu-devel] QEMU aarch64 TCG target - testing question about x86-64 Peter Maydell
2013-05-23  8:09 ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Claudio Fontana
2013-05-23  8:14   ` [Qemu-devel] [PATCH 1/4] include/elf.h: add aarch64 ELF machine and relocs Claudio Fontana
2013-05-23 13:18     ` Peter Maydell
2013-05-28  8:09     ` Laurent Desnogues
2013-05-23  8:18   ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement new TCG target for aarch64 Claudio Fontana
2013-05-23 16:29     ` Richard Henderson
2013-05-24  8:53       ` Claudio Fontana
2013-05-24 17:02         ` Richard Henderson
2013-05-24 17:08           ` Peter Maydell
2013-05-24 17:17             ` Richard Henderson
2013-05-24 17:28               ` Peter Maydell
2013-05-24 17:54                 ` Richard Henderson
2013-05-27 11:43           ` Claudio Fontana
2013-05-27 18:47             ` Richard Henderson
2013-05-27 21:14               ` [Qemu-devel] [PATCH 3/3] " Laurent Desnogues
2013-05-28 13:01                 ` Claudio Fontana
2013-05-28 13:09                   ` Laurent Desnogues
2013-05-28  7:17               ` [Qemu-devel] [PATCH 2/4] " Claudio Fontana
2013-05-28 14:52                 ` Richard Henderson
2013-05-23 16:39     ` Peter Maydell
2013-05-24  8:51       ` Claudio Fontana
2013-05-27  9:10         ` Claudio Fontana
2013-05-27 10:40           ` Peter Maydell
2013-05-27 17:05           ` Richard Henderson
2013-05-27  9:47     ` Laurent Desnogues
2013-05-27 10:13       ` Claudio Fontana
2013-05-27 10:28         ` Laurent Desnogues
2013-05-28 13:14     ` Laurent Desnogues
2013-05-28 14:37       ` Claudio Fontana
2013-05-23  8:19   ` [Qemu-devel] [PATCH 3/4] configure: permit compilation on arm aarch64 Claudio Fontana
2013-05-23 13:24     ` Peter Maydell
2013-05-23  8:22   ` [Qemu-devel] [PATCH 4/4] tcg/aarch64: more ops in preparation of tlb lookup Claudio Fontana
2013-05-23 12:37   ` [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG target VERSION 2 Andreas Färber
2013-05-23 12:50     ` Peter Maydell
2013-05-23 12:53       ` Andreas Färber
2013-05-23 13:03         ` Peter Maydell
2013-05-23 13:27           ` Claudio Fontana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).