* [Qemu-devel] [PATCH RFC 0/3] TCG: do copy propagation through memory locations
@ 2017-11-09 14:41 Kirill Batuzov
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 1/3] tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator Kirill Batuzov
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Kirill Batuzov @ 2017-11-09 14:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kirill Batuzov, Alex Bennée, Richard Henderson

This patch series is based on native-vector-registers-3:
  git://github.com/rth7680/qemu.git native-vector-registers-3

The particular goal of this change was to retain the values of guest vector
registers in host vector registers between different guest instructions.

The relation between memory locations and variables is many-to-many.
Variables can be copies of each other; multiple variables can hold the same
value as the one stored in a memory location. Any variable can be stored to
multiple memory locations as well. To represent all this, a data structure
that can handle the following operations is needed.

 (0) Allocate and deallocate memory locations. The exact number of possible
     memory locations is unknown, but the algorithm should not have to track
     too many of them simultaneously.
 (1) Find a memory location with a specified offset, size and type among all
     memory locations. Needed to replace LOADs.
 (2) For a memory location, find a variable containing the same value. Also
     needed to replace LOADs.
 (3) Remove memory locations overlapping with a specified range of addresses.
     Needed to remove memory locations affected by STOREs.
 (4) For a variable, find all memory locations containing the same value.
     In case the value of the variable has changed, these memory locations
     should no longer reference this variable.

In the proposed implementation all these cases are handled by multiple lists
of memory locations.
 - List of unused memory location descriptors.
 - List of all known memory locations.
 - For each variable, a list of memory locations containing the same value.
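To make the bookkeeping concrete, here is a minimal, hypothetical sketch of
such list-based tracking (MemLoc, ml_add, ml_find and ml_remove_range are
illustrative names, not the actual TCG code; the real implementation in
tcg/optimize.c below is more involved and also tracks types and copies):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch only: one memory location known to the optimizer,
   holding the value of variable `var` at env offset `offset`. */
typedef struct MemLoc {
    unsigned long offset;
    unsigned long size;
    int var;                 /* variable whose value this location holds */
    struct MemLoc *next;     /* list of all known locations */
} MemLoc;

/* (0) Allocate a new location and prepend it to the list. */
static MemLoc *ml_add(MemLoc *head, unsigned long off, unsigned long sz,
                      int var)
{
    MemLoc *m = malloc(sizeof(*m));
    m->offset = off;
    m->size = sz;
    m->var = var;
    m->next = head;
    return m;
}

/* (1) Find a location with the given offset and size, to replace a LOAD. */
static MemLoc *ml_find(MemLoc *head, unsigned long off, unsigned long sz)
{
    for (MemLoc *m = head; m; m = m->next) {
        if (m->offset == off && m->size == sz) {
            return m;
        }
    }
    return NULL;
}

/* (3) Remove every location overlapping [off, off + sz): a STORE to that
   range clobbers them. */
static MemLoc *ml_remove_range(MemLoc *head, unsigned long off,
                               unsigned long sz)
{
    MemLoc **pp = &head;
    while (*pp) {
        MemLoc *m = *pp;
        if (m->offset < off + sz && off < m->offset + m->size) {
            *pp = m->next;   /* unlink and free the clobbered location */
            free(m);
        } else {
            pp = &m->next;
        }
    }
    return head;
}
```

A STORE to env+off first calls ml_remove_range() for the clobbered range
(case 3) and then records the new location; a LOAD consults ml_find()
(case 1) and, on a hit, can be turned into a MOV from the recorded variable.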

The change was tested on the x264 video encoder compiled for ARM32 and run
with qemu-linux-user on an x86_64 host. Some loads were replaced by MOVs, but
no change in performance was observed.

Unfortunately, the x264 video encoder compiled for ARM64 crashed under
qemu-linux-user.

On an artificial test case a nearly 3x speedup was observed (8s vs 22s).

IN:
<snip>
0x00000000004005c0:  4ea18400      add v0.4s, v0.4s, v1.4s
0x00000000004005c4:  4ea18400      add v0.4s, v0.4s, v1.4s
0x00000000004005c8:  4ea18400      add v0.4s, v0.4s, v1.4s
0x00000000004005cc:  4ea18400      add v0.4s, v0.4s, v1.4s
<snip>

OP:
<snip>
 ---- 00000000004005c0 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c4 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c8 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005cc 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1
<snip>

OP after optimization and liveness analysis:
<snip>
 ---- 00000000004005c0 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c4 0000000000000000 0000000000000000
 mov_vec tmp7,tmp9,$0x1                           dead: 1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c8 0000000000000000 0000000000000000
 mov_vec tmp7,tmp9,$0x1                           dead: 1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005cc 0000000000000000 0000000000000000
 mov_vec tmp7,tmp9,$0x1                           dead: 1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1 2
 st_vec tmp9,env,$0x8a0,$0x1                      dead: 0 1
<snip>

I'm not particularly happy about the current implementation.
 - The data structure seems a bit too complicated for the task at hand.
   Maybe I'm doing something wrong?
 - The current data structure is tightly coupled to struct tcg_temp_info and
   is part of the optimization pass. A very similar data structure will be
   needed in liveness analysis to eliminate redundant STOREs.

Having SSA (or at least single assignment per basic block) would help a lot.
It would remove use case (4) completely, and with it the need for the
per-variable lists of memory locations, leaving only one list. Another result
would be that operations on TCGMemLocation would no longer need to modify the
TCGTemp or tcg_temp_info structures, thus making TCGMemLocation reusable in
liveness analysis or register allocation.

But we do not have SSA (yet?).

Any thoughts or comments?

Kirill Batuzov (3):
  tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator
  tcg/optimize: do copy propagation for memory locations
  tcg/optimize: handle vector loads and stores during copy propagation

 tcg/optimize.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.c      |   2 +
 2 files changed, 290 insertions(+)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Qemu-devel] [PATCH RFC 1/3] tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator
  2017-11-09 14:41 [Qemu-devel] [PATCH RFC 0/3] TCG: do copy propagation through memory locations Kirill Batuzov
@ 2017-11-09 14:41 ` Kirill Batuzov
  2017-11-22  8:01   ` Richard Henderson
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 2/3] tcg/optimize: do copy propagation for memory locations Kirill Batuzov
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 3/3] tcg/optimize: handle vector loads and stores during copy propagation Kirill Batuzov
  2 siblings, 1 reply; 9+ messages in thread
From: Kirill Batuzov @ 2017-11-09 14:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kirill Batuzov, Alex Bennée, Richard Henderson

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index a7854a59a1..6db7dd526a 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -3327,10 +3327,12 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
         switch (opc) {
         case INDEX_op_mov_i32:
         case INDEX_op_mov_i64:
+        case INDEX_op_mov_vec:
             tcg_reg_alloc_mov(s, op);
             break;
         case INDEX_op_movi_i32:
         case INDEX_op_movi_i64:
+        case INDEX_op_movi_vec:
             tcg_reg_alloc_movi(s, op);
             break;
         case INDEX_op_insn_start:
-- 
2.11.0


* [Qemu-devel] [PATCH RFC 2/3] tcg/optimize: do copy propagation for memory locations
  2017-11-09 14:41 [Qemu-devel] [PATCH RFC 0/3] TCG: do copy propagation through memory locations Kirill Batuzov
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 1/3] tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator Kirill Batuzov
@ 2017-11-09 14:41 ` Kirill Batuzov
  2017-11-22  8:44   ` Richard Henderson
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 3/3] tcg/optimize: handle vector loads and stores during copy propagation Kirill Batuzov
  2 siblings, 1 reply; 9+ messages in thread
From: Kirill Batuzov @ 2017-11-09 14:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kirill Batuzov, Alex Bennée, Richard Henderson

During the copy propagation phase, keep track of memory locations that store
the value of a known live variable. Only memory locations that are addressed
relative to ENV are tracked; any other access types are handled
conservatively.

When a load is encountered, the source memory location is checked against the
list of known memory locations. If its content is a copy of some variable,
then a MOV or EXT from this variable is issued instead of the load. This
allows us to keep the values of some CPUState fields that are not represented
by global variables in host registers during computations involving them.

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/optimize.c | 266 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 266 insertions(+)

diff --git a/tcg/optimize.c b/tcg/optimize.c
index 847dfa44c9..da7f069444 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -38,8 +38,28 @@ struct tcg_temp_info {
     TCGTemp *next_copy;
     tcg_target_ulong val;
     tcg_target_ulong mask;
+    struct TCGMemLocation *mem_loc;
 };
 
+typedef struct TCGMemLocation {
+    /* Offset is relative to ENV. Only fields of CPUState are accounted.  */
+    tcg_target_ulong offset;
+    tcg_target_ulong size;
+    TCGType type;
+    /* Pointer to a temp containing a valid copy of this memory location.  */
+    TCGTemp *copy;
+    /* Pointer to the next memory location containing copy of the same
+       content.  */
+    struct TCGMemLocation *next_copy;
+
+    /* Double-linked list of all memory locations.  */
+    struct TCGMemLocation *next;
+    struct TCGMemLocation **prev_ptr;
+} TCGMemLocation;
+
+struct TCGMemLocation *mem_locations;
+struct TCGMemLocation *free_mls;
+
 static inline struct tcg_temp_info *ts_info(TCGTemp *ts)
 {
     return ts->state_ptr;
@@ -70,6 +90,34 @@ static inline bool arg_is_copy(TCGArg arg)
     return ts_is_copy(arg_temp(arg));
 }
 
+/* Reset MEMORY LOCATION state. */
+static void reset_ml(TCGMemLocation *ml)
+{
+    if (!ml) {
+        return;
+    }
+    if (ml->copy) {
+        TCGMemLocation **prev_ptr = &ts_info(ml->copy)->mem_loc;
+        TCGMemLocation *cur_ptr = ts_info(ml->copy)->mem_loc;
+        while (cur_ptr && cur_ptr != ml) {
+            prev_ptr = &cur_ptr->next_copy;
+            cur_ptr = cur_ptr->next_copy;
+        }
+        *prev_ptr = ml->next_copy;
+        if (ts_info(ml->copy)->mem_loc == ml) {
+            ts_info(ml->copy)->mem_loc = ml->next_copy;
+        }
+    }
+
+    *ml->prev_ptr = ml->next;
+    if (ml->next) {
+        ml->next->prev_ptr = ml->prev_ptr;
+    }
+    ml->prev_ptr = NULL;
+    ml->next = free_mls;
+    free_mls = ml;
+}
+
 /* Reset TEMP's state, possibly removing the temp for the list of copies.  */
 static void reset_ts(TCGTemp *ts)
 {
@@ -77,12 +125,27 @@ static void reset_ts(TCGTemp *ts)
     struct tcg_temp_info *pi = ts_info(ti->prev_copy);
     struct tcg_temp_info *ni = ts_info(ti->next_copy);
 
+    if (ti->mem_loc && ts_is_copy(ts) && 0) {
+        TCGMemLocation *ml, *nml;
+        for (ml = ti->mem_loc; ml; ml = nml) {
+            nml = ml->next_copy;
+            ml->copy = ti->next_copy;
+            ml->next_copy = ni->mem_loc;
+            ni->mem_loc = ml;
+        }
+    } else {
+        while (ti->mem_loc) {
+            reset_ml(ti->mem_loc);
+        }
+    }
+
     ni->prev_copy = ti->prev_copy;
     pi->next_copy = ti->next_copy;
     ti->next_copy = ts;
     ti->prev_copy = ts;
     ti->is_const = false;
     ti->mask = -1;
+    ti->mem_loc = NULL;
 }
 
 static void reset_temp(TCGArg arg)
@@ -103,6 +166,7 @@ static void init_ts_info(struct tcg_temp_info *infos,
         ti->prev_copy = ts;
         ti->is_const = false;
         ti->mask = -1;
+        ti->mem_loc = NULL;
         set_bit(idx, temps_used->l);
     }
 }
@@ -119,6 +183,92 @@ static int op_bits(TCGOpcode op)
     return def->flags & TCG_OPF_64BIT ? 64 : 32;
 }
 
+/* Allocate a new MEMORY LOCATION or reuse a free one.  */
+static TCGMemLocation *alloc_ml(void)
+{
+    if (free_mls) {
+        TCGMemLocation *ml = free_mls;
+        free_mls = free_mls->next;
+        return ml;
+    }
+    return tcg_malloc(sizeof(TCGMemLocation));
+}
+
+/* Allocate and initialize MEMORY LOCATION.  */
+static TCGMemLocation *new_ml(tcg_target_ulong off, tcg_target_ulong sz,
+                              TCGTemp *copy)
+{
+    TCGMemLocation *ml = alloc_ml();
+
+    ml->offset = off;
+    ml->size = sz;
+    ml->copy = copy;
+    if (copy) {
+        ml->type = copy->base_type;
+        ml->next_copy = ts_info(copy)->mem_loc;
+        ts_info(copy)->mem_loc = ml;
+    } else {
+        tcg_abort();
+    }
+    ml->next = mem_locations;
+    if (ml->next) {
+        ml->next->prev_ptr = &ml->next;
+    }
+    ml->prev_ptr = &mem_locations;
+    mem_locations = ml;
+    return ml;
+}
+
+static TCGMemLocation *find_ml(tcg_target_ulong off, tcg_target_ulong sz,
+                               TCGType type)
+{
+    TCGMemLocation *mi;
+    for (mi = mem_locations; mi; mi = mi->next) {
+        if (mi->offset == off && mi->size == sz && mi->type == type) {
+            return mi;
+        }
+    }
+    return NULL;
+}
+
+static bool range_intersect(tcg_target_ulong off1, tcg_target_ulong sz1,
+                            tcg_target_ulong off2, tcg_target_ulong sz2)
+{
+    if (off1 + sz1 <= off2) {
+        return false;
+    }
+    if (off2 + sz2 <= off1) {
+        return false;
+    }
+    return true;
+}
+
+static void remove_ml_range(tcg_target_ulong off, tcg_target_ulong sz)
+{
+    TCGMemLocation *mi, *next;
+    for (mi = mem_locations; mi; mi = next) {
+        next = mi->next;
+        if (range_intersect(mi->offset, mi->size, off, sz)) {
+            reset_ml(mi);
+        }
+    }
+}
+
+static void reset_all_ml(void)
+{
+    TCGMemLocation *ml, *nml;
+    for (ml = mem_locations; ml; ml = nml) {
+        nml = ml->next;
+        if (ml->copy) {
+            ts_info(ml->copy)->mem_loc = NULL;
+        }
+        ml->next = free_mls;
+        ml->copy = NULL;
+        free_mls = ml;
+    }
+    mem_locations = NULL;
+}
+
 static TCGOpcode op_to_mov(TCGOpcode op)
 {
     switch (op_bits(op)) {
@@ -147,6 +297,34 @@ static TCGOpcode op_to_movi(TCGOpcode op)
     }
 }
 
+static TCGOpcode ld_to_mov(TCGOpcode op)
+{
+#define LD_TO_EXT(sz, w)                                 \
+    case glue(glue(INDEX_op_ld, sz), glue(_i, w)):       \
+        return glue(glue(INDEX_op_ext, sz), glue(_i, w))
+
+    switch (op) {
+    LD_TO_EXT(8s, 32);
+    LD_TO_EXT(8u, 32);
+    LD_TO_EXT(16s, 32);
+    LD_TO_EXT(16u, 32);
+    LD_TO_EXT(8s, 64);
+    LD_TO_EXT(8u, 64);
+    LD_TO_EXT(16s, 64);
+    LD_TO_EXT(16u, 64);
+    LD_TO_EXT(32s, 64);
+    LD_TO_EXT(32u, 64);
+    case INDEX_op_ld_i32:
+        return INDEX_op_mov_i32;
+    case INDEX_op_ld_i64:
+        return INDEX_op_mov_i64;
+    default:
+        tcg_abort();
+    }
+
+#undef LD_TO_EXT
+}
+
 static TCGTemp *find_better_copy(TCGContext *s, TCGTemp *ts)
 {
     TCGTemp *i;
@@ -604,6 +782,32 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
     return false;
 }
 
+static int ldst_size(const TCGOp *op)
+{
+    switch (op->opc) {
+    CASE_OP_32_64(st8):
+    CASE_OP_32_64(ld8u):
+    CASE_OP_32_64(ld8s):
+        return 1;
+    CASE_OP_32_64(st16):
+    CASE_OP_32_64(ld16u):
+    CASE_OP_32_64(ld16s):
+        return 2;
+    case INDEX_op_st32_i64:
+    case INDEX_op_ld32u_i64:
+    case INDEX_op_ld32s_i64:
+    case INDEX_op_st_i32:
+    case INDEX_op_ld_i32:
+        return 4;
+    case INDEX_op_st_i64:
+    case INDEX_op_ld_i64:
+        return 8;
+    default:
+        /* Some unsupported opcode? */
+        tcg_abort();
+    }
+}
+
 /* Propagate constants and copies, fold constant expressions. */
 void tcg_optimize(TCGContext *s)
 {
@@ -611,6 +815,10 @@ void tcg_optimize(TCGContext *s)
     TCGOp *prev_mb = NULL;
     struct tcg_temp_info *infos;
     TCGTempSet temps_used;
+    TCGMemLocation *ml;
+
+    mem_locations = NULL;
+    free_mls = NULL;
 
     /* Array VALS has an element for each temp.
        If this temp holds a constant then its value is kept in VALS' element.
@@ -644,6 +852,9 @@ void tcg_optimize(TCGContext *s)
                     init_ts_info(infos, &temps_used, ts);
                 }
             }
+            if (!(op->args[nb_oargs + nb_iargs + 1] & TCG_CALL_NO_SE)) {
+                reset_all_ml();
+            }
         } else {
             nb_oargs = def->nb_oargs;
             nb_iargs = def->nb_iargs;
@@ -660,6 +871,10 @@ void tcg_optimize(TCGContext *s)
             }
         }
 
+        if (def->flags & TCG_OPF_BB_END) {
+            reset_all_ml();
+        }
+
         /* For commutative operations make constant second argument */
         switch (opc) {
         CASE_OP_32_64(add):
@@ -1441,6 +1656,57 @@ void tcg_optimize(TCGContext *s)
             }
             break;
 
+        CASE_OP_32_64(st8):
+        CASE_OP_32_64(st16):
+        CASE_OP_32_64(st):
+        case INDEX_op_st32_i64:
+            if (op->args[1] == tcgv_ptr_arg(cpu_env)) {
+                remove_ml_range(op->args[2], ldst_size(op));
+                new_ml(op->args[2], ldst_size(op), arg_temp(op->args[0]));
+            } else {
+                /* Store to an unknown location.  Any of the existing
+                   locations could be affected.  Reset them all.  */
+                reset_all_ml();
+            }
+            goto do_default;
+
+        CASE_OP_32_64(ld8u):
+        CASE_OP_32_64(ld8s):
+        CASE_OP_32_64(ld16u):
+        CASE_OP_32_64(ld16s):
+        CASE_OP_32_64(ld):
+        case INDEX_op_ld32s_i64:
+        case INDEX_op_ld32u_i64:
+            /* Only loads that are relative to ENV can be handled.  */
+            if (op->args[1] == tcgv_ptr_arg(cpu_env)) {
+                ml = find_ml(op->args[2], ldst_size(op),
+                             arg_temp(op->args[0])->base_type);
+                if (ml && ml->copy) {
+                    TCGOpcode re = ld_to_mov(opc);
+                    if (re == INDEX_op_mov_i32 || re == INDEX_op_mov_i64) {
+                        /* No sign/zero extension needed. OP is a move.
+                           Handle this case separately to track copies.  */
+                        TCGTemp *copy = find_better_copy(s, ml->copy);
+                        tcg_opt_gen_mov(s, op, op->args[0], temp_arg(copy));
+                        break;
+                    } else {
+                        if (tcg_op_defs[re].flags & TCG_OPF_NOT_PRESENT) {
+                            /* Required operation is not supported by host.  */
+                            goto do_default;
+                        }
+                        op->opc = re;
+                        op->args[1] = temp_arg(find_better_copy(s, ml->copy));
+                    }
+                } else {
+                    assert(!ml);
+                    reset_temp(op->args[0]);
+                    arg_info(op->args[0])->mask = mask;
+                    new_ml(op->args[2], ldst_size(op), arg_temp(op->args[0]));
+                    break;
+                }
+            }
+            goto do_reset_output;
+
         case INDEX_op_call:
             if (!(op->args[nb_oargs + nb_iargs + 1]
                   & (TCG_CALL_NO_READ_GLOBALS | TCG_CALL_NO_WRITE_GLOBALS))) {
-- 
2.11.0


* [Qemu-devel] [PATCH RFC 3/3] tcg/optimize: handle vector loads and stores during copy propagation
  2017-11-09 14:41 [Qemu-devel] [PATCH RFC 0/3] TCG: do copy propagation through memory locations Kirill Batuzov
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 1/3] tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator Kirill Batuzov
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 2/3] tcg/optimize: do copy propagation for memory locations Kirill Batuzov
@ 2017-11-09 14:41 ` Kirill Batuzov
  2017-11-22  8:06   ` Richard Henderson
  2 siblings, 1 reply; 9+ messages in thread
From: Kirill Batuzov @ 2017-11-09 14:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kirill Batuzov, Alex Bennée, Richard Henderson

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/optimize.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tcg/optimize.c b/tcg/optimize.c
index da7f069444..1b6962c6c5 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -318,6 +318,8 @@ static TCGOpcode ld_to_mov(TCGOpcode op)
         return INDEX_op_mov_i32;
     case INDEX_op_ld_i64:
         return INDEX_op_mov_i64;
+    case INDEX_op_ld_vec:
+        return INDEX_op_mov_vec;
     default:
         tcg_abort();
     }
@@ -782,6 +784,13 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
     return false;
 }
 
+static int tcg_vec_size(const TCGOp *op)
+{
+    TCGArg arg = op->args[0];
+    TCGTemp *tmp = arg_temp(arg);
+    return 1 << (3 + tmp->base_type - TCG_TYPE_V64);
+}
+
 static int ldst_size(const TCGOp *op)
 {
     switch (op->opc) {
@@ -802,6 +811,9 @@ static int ldst_size(const TCGOp *op)
     case INDEX_op_st_i64:
     case INDEX_op_ld_i64:
         return 8;
+    case INDEX_op_ld_vec:
+    case INDEX_op_st_vec:
+        return tcg_vec_size(op);
     default:
         /* Some unsupported opcode? */
         tcg_abort();
@@ -1660,6 +1672,7 @@ void tcg_optimize(TCGContext *s)
         CASE_OP_32_64(st16):
         CASE_OP_32_64(st):
         case INDEX_op_st32_i64:
+        case INDEX_op_st_vec:
             if (op->args[1] == tcgv_ptr_arg(cpu_env)) {
                 remove_ml_range(op->args[2], ldst_size(op));
                 new_ml(op->args[2], ldst_size(op), arg_temp(op->args[0]));
@@ -1677,6 +1690,7 @@ void tcg_optimize(TCGContext *s)
         CASE_OP_32_64(ld):
         case INDEX_op_ld32s_i64:
         case INDEX_op_ld32u_i64:
+        case INDEX_op_ld_vec:
             /* Only loads that are relative to ENV can be handled.  */
             if (op->args[1] == tcgv_ptr_arg(cpu_env)) {
                 ml = find_ml(op->args[2], ldst_size(op),
@@ -1689,6 +1703,14 @@ void tcg_optimize(TCGContext *s)
                         TCGTemp *copy = find_better_copy(s, ml->copy);
                         tcg_opt_gen_mov(s, op, op->args[0], temp_arg(copy));
                         break;
+                    } else if (re == INDEX_op_mov_vec) {
+                        if (ts_are_copies(arg_temp(op->args[0]), ml->copy)) {
+                            tcg_op_remove(s, op);
+                            break;
+                        }
+                        op->opc = re;
+                        op->args[1] = temp_arg(find_better_copy(s, ml->copy));
+                        op->args[2] = op->args[3];
                     } else {
                         if (tcg_op_defs[re].flags & TCG_OPF_NOT_PRESENT) {
                             /* Required operation is not supported by host.  */
-- 
2.11.0


* Re: [Qemu-devel] [PATCH RFC 1/3] tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 1/3] tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator Kirill Batuzov
@ 2017-11-22  8:01   ` Richard Henderson
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Henderson @ 2017-11-22  8:01 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel; +Cc: Alex Bennée

On 11/09/2017 03:41 PM, Kirill Batuzov wrote:
> Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
> ---
>  tcg/tcg.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index a7854a59a1..6db7dd526a 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -3327,10 +3327,12 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
>          switch (opc) {
>          case INDEX_op_mov_i32:
>          case INDEX_op_mov_i64:
> +        case INDEX_op_mov_vec:
>              tcg_reg_alloc_mov(s, op);
>              break;
>          case INDEX_op_movi_i32:
>          case INDEX_op_movi_i64:
> +        case INDEX_op_movi_vec:
>              tcg_reg_alloc_movi(s, op);
>              break;
>          case INDEX_op_insn_start:
> 

Thanks.  I'd actually missed out on this myself until quite late in v6, which
caused a number of interesting problems.


r~


* Re: [Qemu-devel] [PATCH RFC 3/3] tcg/optimize: handle vector loads and stores during copy propagation
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 3/3] tcg/optimize: handle vector loads and stores during copy propagation Kirill Batuzov
@ 2017-11-22  8:06   ` Richard Henderson
  2017-11-22  8:19     ` Richard Henderson
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Henderson @ 2017-11-22  8:06 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel; +Cc: Alex Bennée

On 11/09/2017 03:41 PM, Kirill Batuzov wrote:
> +                    } else if (re == INDEX_op_mov_vec) {
> +                        if (ts_are_copies(arg_temp(op->args[0]), ml->copy)) {
> +                            tcg_op_remove(s, op);
> +                            break;
> +                        }
> +                        op->opc = re;
> +                        op->args[1] = temp_arg(find_better_copy(s, ml->copy));
> +                        op->args[2] = op->args[3];
>                      } else {

Why don't you send this through tcg_opt_gen_mov as with mov_i32 and mov_i64?


r~


* Re: [Qemu-devel] [PATCH RFC 3/3] tcg/optimize: handle vector loads and stores during copy propagation
  2017-11-22  8:06   ` Richard Henderson
@ 2017-11-22  8:19     ` Richard Henderson
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Henderson @ 2017-11-22  8:19 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel; +Cc: Alex Bennée

On 11/22/2017 09:06 AM, Richard Henderson wrote:
> On 11/09/2017 03:41 PM, Kirill Batuzov wrote:
>> +                    } else if (re == INDEX_op_mov_vec) {
>> +                        if (ts_are_copies(arg_temp(op->args[0]), ml->copy)) {
>> +                            tcg_op_remove(s, op);
>> +                            break;
>> +                        }
>> +                        op->opc = re;
>> +                        op->args[1] = temp_arg(find_better_copy(s, ml->copy));
>> +                        op->args[2] = op->args[3];
>>                      } else {
> 
> Why don't you send this through tcg_opt_gen_mov as with mov_i32 and mov_i64?

Oh, never mind, you're handling the extra size operand that was in v3.  Well,
the good news is that you don't need that anymore.  ;-)


r~


* Re: [Qemu-devel] [PATCH RFC 2/3] tcg/optimize: do copy propagation for memory locations
  2017-11-09 14:41 ` [Qemu-devel] [PATCH RFC 2/3] tcg/optimize: do copy propagation for memory locations Kirill Batuzov
@ 2017-11-22  8:44   ` Richard Henderson
  2017-11-23 11:16     ` Kirill Batuzov
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Henderson @ 2017-11-22  8:44 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel; +Cc: Alex Bennée

On 11/09/2017 03:41 PM, Kirill Batuzov wrote:
> +typedef struct TCGMemLocation {
> +    /* Offset is relative to ENV. Only fields of CPUState are accounted.  */
> +    tcg_target_ulong offset;
> +    tcg_target_ulong size;
> +    TCGType type;
> +    /* Pointer to a temp containing a valid copy of this memory location.  */
> +    TCGTemp *copy;
> +    /* Pointer to the next memory location containing copy of the same
> +       content.  */
> +    struct TCGMemLocation *next_copy;

Did you ever find copies of memories that weren't also copies within temps?
I.e. you could have found this through copy->next_copy?

> +    /* Double-linked list of all memory locations.  */
> +    struct TCGMemLocation *next;
> +    struct TCGMemLocation **prev_ptr;

Use QTAILQ_* for common double-linked-list manipulation.

> +struct TCGMemLocation *mem_locations;
> +struct TCGMemLocation *free_mls;

These can't be globals anymore -- we do multi-threaded code generation now.

> @@ -77,12 +125,27 @@ static void reset_ts(TCGTemp *ts)
>      struct tcg_temp_info *pi = ts_info(ti->prev_copy);
>      struct tcg_temp_info *ni = ts_info(ti->next_copy);
>  
> +    if (ti->mem_loc && ts_is_copy(ts) && 0) {
> +        TCGMemLocation *ml, *nml;
> +        for (ml = ti->mem_loc; ml; ml = nml) {
> +            nml = ml->next_copy;
> +            ml->copy = ti->next_copy;
> +            ml->next_copy = ni->mem_loc;
> +            ni->mem_loc = ml;
> +        }
> +    } else {
> +        while (ti->mem_loc) {
> +            reset_ml(ti->mem_loc);
> +        }

Why would a single temp be associated with more than one memory?

> +static TCGOpcode ld_to_mov(TCGOpcode op)
> +{
> +#define LD_TO_EXT(sz, w)                                 \
> +    case glue(glue(INDEX_op_ld, sz), glue(_i, w)):       \
> +        return glue(glue(INDEX_op_ext, sz), glue(_i, w))
> +
> +    switch (op) {
> +    LD_TO_EXT(8s, 32);
> +    LD_TO_EXT(8u, 32);
> +    LD_TO_EXT(16s, 32);
> +    LD_TO_EXT(16u, 32);
> +    LD_TO_EXT(8s, 64);
> +    LD_TO_EXT(8u, 64);
> +    LD_TO_EXT(16s, 64);
> +    LD_TO_EXT(16u, 64);
> +    LD_TO_EXT(32s, 64);
> +    LD_TO_EXT(32u, 64);

How many extensions did you find?  Or is this Just In Case?

Otherwise this looks quite reasonable.


r~


* Re: [Qemu-devel] [PATCH RFC 2/3] tcg/optimize: do copy propagation for memory locations
  2017-11-22  8:44   ` Richard Henderson
@ 2017-11-23 11:16     ` Kirill Batuzov
  0 siblings, 0 replies; 9+ messages in thread
From: Kirill Batuzov @ 2017-11-23 11:16 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, Alex Bennée

On Wed, 22 Nov 2017, Richard Henderson wrote:

> On 11/09/2017 03:41 PM, Kirill Batuzov wrote:
> > +typedef struct TCGMemLocation {
> > +    /* Offset is relative to ENV. Only fields of CPUState are accounted.  */
> > +    tcg_target_ulong offset;
> > +    tcg_target_ulong size;
> > +    TCGType type;
> > +    /* Pointer to a temp containing a valid copy of this memory location.  */
> > +    TCGTemp *copy;
> > +    /* Pointer to the next memory location containing copy of the same
> > +       content.  */
> > +    struct TCGMemLocation *next_copy;
> 
> Did you ever find copies of memories that weren't also copies within temps?
> I.e. you could have found this through copy->next_copy?

Yes. This happens when a temp was stored to multiple memory locations.

> 
> > +    /* Double-linked list of all memory locations.  */
> > +    struct TCGMemLocation *next;
> > +    struct TCGMemLocation **prev_ptr;
> 
> Use QTAILQ_* for common double-linked-list manipulation.
> 
> > +struct TCGMemLocation *mem_locations;
> > +struct TCGMemLocation *free_mls;
> 
> These can't be globals anymore -- we do multi-threaded code generation now.

Then they should be moved to TCGContext, I assume?

> 
> > @@ -77,12 +125,27 @@ static void reset_ts(TCGTemp *ts)
> >      struct tcg_temp_info *pi = ts_info(ti->prev_copy);
> >      struct tcg_temp_info *ni = ts_info(ti->next_copy);
> >  
> > +    if (ti->mem_loc && ts_is_copy(ts) && 0) {
> > +        TCGMemLocation *ml, *nml;
> > +        for (ml = ti->mem_loc; ml; ml = nml) {
> > +            nml = ml->next_copy;
> > +            ml->copy = ti->next_copy;
> > +            ml->next_copy = ni->mem_loc;
> > +            ni->mem_loc = ml;
> > +        }
> > +    } else {
> > +        while (ti->mem_loc) {
> > +            reset_ml(ti->mem_loc);
> > +        }
> 
> Why would a single temp be associated with more than one memory?
>

Because it was stored to multiple memory locations. And when reading from
any of these locations we want to access the temp instead.

For example, this happens when we translate the ARM32 VDUP instruction. One
value is duplicated into all elements of the vector. When elements of the
vector are accessed later, we want to use this value instead of rereading
it from memory.
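For illustration only (the offsets and temp names below are made up, not
taken from the actual translator output), a VDUP that broadcasts one 32-bit
value into a vector could store the same temp to several env offsets,
leaving that one temp associated with several memory locations:

 st_i32 tmp5,env,$0x8a0
 st_i32 tmp5,env,$0x8a4
 st_i32 tmp5,env,$0x8a8
 st_i32 tmp5,env,$0x8ac

A later load from any of these four offsets can then be replaced by a mov
from tmp5.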

> > +static TCGOpcode ld_to_mov(TCGOpcode op)
> > +{
> > +#define LD_TO_EXT(sz, w)                                 \
> > +    case glue(glue(INDEX_op_ld, sz), glue(_i, w)):       \
> > +        return glue(glue(INDEX_op_ext, sz), glue(_i, w))
> > +
> > +    switch (op) {
> > +    LD_TO_EXT(8s, 32);
> > +    LD_TO_EXT(8u, 32);
> > +    LD_TO_EXT(16s, 32);
> > +    LD_TO_EXT(16u, 32);
> > +    LD_TO_EXT(8s, 64);
> > +    LD_TO_EXT(8u, 64);
> > +    LD_TO_EXT(16s, 64);
> > +    LD_TO_EXT(16u, 64);
> > +    LD_TO_EXT(32s, 64);
> > +    LD_TO_EXT(32u, 64);
> 
> How many extensions did you find?  Or is this Just In Case?
> 

One. It was in the AArch64 build of x264. So this is more Just In Case. But
it may become useful if we try to emulate some 8- or 16-bit architectures.

-- 
Kirill

