* [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
@ 2016-08-15 10:46 Alex Bennée
  2016-08-15 11:00 ` Peter Maydell
  2016-08-15 15:46 ` Emilio G. Cota
  0 siblings, 2 replies; 21+ messages in thread
From: Alex Bennée @ 2016-08-15 10:46 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, nikunj
  Cc: mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth,
	peter.maydell, claudio.fontana, Dr. David Alan Gilbert,
	Peter Crosthwaite

Hi,

Numbers!
========

First things first, I ran some more benchmarks on the base patches +
cmpxchg branch over the weekend when I had access to some bigger boxen
which weren't being used. I also added some KVM runs for comparison:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 -smp  on overdrive01 [1]  x -smp 1  on desktop [2]  x -smp 1  on hackbox [3]  x -smp 1
────────────────────────────────────────────────────────────────────────────────────────
    1              36.995     1.000         243.723     1.000         377.035     1.000
    2              21.480     1.722         134.854     1.807         216.337     1.743
    3              16.474     2.246         100.090     2.435         163.316     2.309
    4              13.671     2.706          83.512     2.918         136.180     2.769
    5              12.269     3.015          82.519     2.954         119.261     3.161
    6              11.268     3.283          79.589     3.062         110.393     3.415
    7                 n/a       n/a          78.338     3.111         105.244     3.582
    8                 n/a       n/a          81.091     3.006         103.032     3.659
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Footnotes
─────────

[1] pre-production A57, only 6 cores, KVM with -cpu host,aarch64=off

[2] i7-4770 @ 3.4 GHz; past -smp 5 there is much greater deviation plus
some hangs, so best times were taken

[3] Xeon X5690 @ 3.47 GHz, 24 cores, -smp 7 number manually calculated

Comparing the numbers on the Xeon monster to my desktop seems to show
we still get beneficial scaling when the extra cores are real cores
instead of hyperthread siblings. I only ran up to -smp 8 as that is as
much as the -M virt model will actually accept.

I have noticed some instability in the test though for high -smp
values, which caused the test runner's timeout protection to kick in.
These look like guest hangs and may be barrier-related (store-after-load
re-ordering can happen). I plan to apply the barrier patches and see if
this improves the stability of the tests.

All in all, however, the results are pretty promising. I'm now running
-smp 4 -accel tcg,thread=multi on a fairly regular basis and
appreciating the snappier response on heavy operations.

MTTCG Call
==========

We've missed a number of the MTTCG calls of late, and given the spread
of developers actively working on MTTCG stuff I wonder if we should
just shelve the call and move to regular status updates on the list?
I'm happy to prompt a status thread every couple of weeks if wanted.

As far as I'm aware the following work is still ongoing:

Emilio: cmpxchg atomics
Alvise: LL/SC modelling
Pranith: Memory barrier work (GSoC coming to an end this month)
Nikunj: PPC support for MTTCG

Anyone want to add their status updates? Is anyone else secretly
working on MTTCG-related bits who wants to make themselves known?

KVM Forum
=========

I'll be at KVM Forum in Toronto next week. Feel free to grab me at any
time, but I'm planning to sign up for a BoF slot on Thursday afternoon
to discuss any outstanding issues for MTTCG and the remaining work that
needs to be done to be ready for merging when the 2.8 development cycle
opens.

From my point of view I think we are looking pretty good for merging but
I would like to get input from the TCG maintainers who are the ones that
will need to accept the work into their tree.

The only current issue I'm aware of is thread safety of the GDB stub.
In theory it is not currently MTTCG-safe but it tends to get away with
it because the system is halted when updates are made to the
break/watchpoint lists. I did post a series to RCU-ify these a few
months ago but I dropped it (and the debug asserts) from the base
patches series as it felt a little orthogonal to the main work. My
feeling is this shouldn't be a blocker to MTTCG going in (as it doesn't
get any worse) and we can fix it up in a later series. However I would
like to get the maintainers' opinions on this approach.

Are there any other issues we should be aware of?

Looking forward to meeting up with other QEMU hackers in the flesh next
week!

Cheers,


--
Alex Bennée

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
  2016-08-15 10:46 [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
@ 2016-08-15 11:00 ` Peter Maydell
  2016-08-15 11:16   ` Alex Bennée
  2016-08-15 15:46 ` Emilio G. Cota
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Maydell @ 2016-08-15 11:00 UTC (permalink / raw)
  To: Alex Bennée
  Cc: MTTCG Devel, QEMU Developers, KONRAD Frédéric,
	Alvise Rigo, Emilio G. Cota, pranith kumar, Nikunj A Dadhania,
	Mark Burton, Paolo Bonzini, J. Kiszka, Sergey Fedorov,
	Richard Henderson, Claudio Fontana, Dr. David Alan Gilbert,
	Peter Crosthwaite

On 15 August 2016 at 11:46, Alex Bennée <alex.bennee@linaro.org> wrote:
> I only ran up to -smp 8 as that is as
> much as the -M virt model will actually accept.

FWIW, -machine gic-version=3 should allow you more than 8 cores.

> I have noticed some instability in the test though for high -smp values
> which caused the test runners timeout protection to kick in. These look
> like guest hangs and maybe barrier related (store-after-load re-ordering
> can happen). I plan to apply the barrier patches and see if this
> improves the stability of the tests.

> From my point of view I think we are looking pretty good for merging but
> I would like to get input from the TCG maintainers who are the ones that
> will need to accept the work into their tree.

Your note above about instability and hangs is the main thing that
makes me nervous about merging...

thanks
-- PMM


* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
  2016-08-15 11:00 ` Peter Maydell
@ 2016-08-15 11:16   ` Alex Bennée
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Bennée @ 2016-08-15 11:16 UTC (permalink / raw)
  To: Peter Maydell
  Cc: MTTCG Devel, QEMU Developers, KONRAD Frédéric,
	Alvise Rigo, Emilio G. Cota, pranith kumar, Nikunj A Dadhania,
	Mark Burton, Paolo Bonzini, J. Kiszka, Sergey Fedorov,
	Richard Henderson, Claudio Fontana, Dr. David Alan Gilbert,
	Peter Crosthwaite


Peter Maydell <peter.maydell@linaro.org> writes:

> On 15 August 2016 at 11:46, Alex Bennée <alex.bennee@linaro.org> wrote:
>> I only ran up to -smp 8 as that is as
>> much as the -M virt model will actually accept.
>
> FWIW, -machine gic-version=3 should allow you more than 8 cores.

Good to know. Thanks.

>> I have noticed some instability in the test though for high -smp values
>> which caused the test runners timeout protection to kick in. These look
>> like guest hangs and maybe barrier related (store-after-load re-ordering
>> can happen). I plan to apply the barrier patches and see if this
>> improves the stability of the tests.
>
>> From my point of view I think we are looking pretty good for merging but
>> I would like to get input from the TCG maintainers who are the ones that
>> will need to accept the work into their tree.
>
> Your note above about instability and hangs is the main thing that
> makes me nervous about merging...

Don't worry, I won't be proposing any merge while I can still provoke
hangs in the guest!

My point is that you actually have to work quite hard to trigger these,
and they are subtle emulation failures that trip up the guest rather
than crashes that take down QEMU itself.

I wanted to post the numbers I'd collected so far because I feel we are
shaping up quite well. Just one more hill... ;-)

--
Alex Bennée


* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
  2016-08-15 10:46 [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
  2016-08-15 11:00 ` Peter Maydell
@ 2016-08-15 15:46 ` Emilio G. Cota
  2016-08-15 15:49   ` [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex Emilio G. Cota
  2016-08-16 11:16   ` [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
  1 sibling, 2 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-15 15:46 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj,
	mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth,
	peter.maydell, claudio.fontana, Dr. David Alan Gilbert,
	Peter Crosthwaite

On Mon, Aug 15, 2016 at 11:46:32 +0100, Alex Bennée wrote:
> As far as I'm aware the following work is still ongoing:
> 
> Emilio: cmpxchg atomics
> Alvise: LL/SC modelling

I've been tinkering with an experimental patch to do proper LL/SC. The idea
is to rely on hardware transactional memory, so that stores don't have
to be tracked. The trickiest thing is the fallback path, for which I'm
trying to (ab)use EXCP_ATOMIC to execute exclusively from the ldrex
all the way to the strex.

To test it, I'm using aarch64-linux-user running qht-bench compiled on
an aarch64 machine. I'm running on an Intel Skylake host (Skylake has
no known TSX bugs).

However, I'm finding issues that might not have to do with the
patch itself.

- On the latest MTTCG+cmpxchg tree (45c11751ed7 a.k.a.
  bennee/mttcg/base-patches-v4-with-cmpxchg-atomics-v2), QEMU loops
  forever without making progress in the instruction stream, even
  with taskset -c 0.
- On the cmpxchg tree (rth's atomic-2 branch [1]), it works more
  reliably, although tb_lock is held around tb_find_fast so parallelism isn't
  very high. Still, it sometimes triggers the assert below.
  - Applying the "remove tb_lock around hot path" patch makes it
    easier to trigger this assert in cpu-exec.c:650 (approx.):
            /* Assert that the compiler does not smash local variables. */
            g_assert(cpu == current_cpu)
    I've also seen the assert immediately after that one trigger, as
    well as the rcu_read_unlock depth assert.
  The asserts are usually triggered when all threads exit (by returning
  NULL) at roughly the same time.
  However, they cannot be triggered with taskset -c 0, which makes me
  suspect that somehow start_exclusive isn't working as intended.

Any tips would be appreciated! I'll reply with a patch that uses RTM;
the one below takes the fallback path all the way, and is the best way
to reproduce the above.

Thanks,

		Emilio

[1] https://github.com/rth7680/qemu/commits/atomic-2

From ed6af6eb364e5a36e81d7cc8143c0e9783c50587 Mon Sep 17 00:00:00 2001
From: "Emilio G. Cota" <cota@braap.org>
Date: Mon, 15 Aug 2016 00:27:42 +0200
Subject: [PATCH] aarch64: use TSX for ldrex/strex (fallback path only)

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c          |  5 +++--
 target-arm/helper-a64.c    | 23 +++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/translate-a64.c | 15 +++++++++------
 4 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..6922faa 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
 
     /* Since we got here, we know that parallel_cpus must be true.  */
     parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
+    while (!parallel_cpus) {
+        cpu_exec_step(cpu);
+    }
 
     end_exclusive();
 }
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..a97b631 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -579,3 +579,26 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    uintptr_t ra = GETPC();
+
+    if (parallel_cpus) {
+        cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
+    }
+}
+
+void HELPER(xend)(void)
+{
+    assert(!parallel_cpus);
+    parallel_cpus = true;
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    if (!parallel_cpus) {
+        return 1;
+    }
+    return 0;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cfcf440 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);
-- 
2.7.4


* [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
  2016-08-15 15:46 ` Emilio G. Cota
@ 2016-08-15 15:49   ` Emilio G. Cota
  2016-08-17 17:22     ` Richard Henderson
  2016-08-16 11:16   ` [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
  1 sibling, 1 reply; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-15 15:49 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj,
	mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth,
	peter.maydell, claudio.fontana, Dr. David Alan Gilbert,
	Peter Crosthwaite

Configure with --extra-cflags="-mrtm"

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c          |  5 +++--
 target-arm/helper-a64.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/translate-a64.c | 15 +++++++++------
 4 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..6922faa 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
 
     /* Since we got here, we know that parallel_cpus must be true.  */
     parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
+    while (!parallel_cpus) {
+        cpu_exec_step(cpu);
+    }
 
     end_exclusive();
 }
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..af45694 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -33,6 +33,8 @@
 #include "tcg.h"
 #include <zlib.h> /* For crc32 */
 
+#include <immintrin.h>
+
 /* C2.4.7 Multiply and divide */
 /* special cases for 0 and LLONG_MIN are mandated by the standard */
 uint64_t HELPER(udiv64)(uint64_t num, uint64_t den)
@@ -579,3 +581,43 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    uintptr_t ra = GETPC();
+    int status;
+    int retries = 100;
+
+ retry:
+    status = _xbegin();
+    if (status != _XBEGIN_STARTED) {
+        if (status && retries) {
+            retries--;
+            goto retry;
+        }
+        if (parallel_cpus) {
+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
+        }
+    }
+}
+
+void HELPER(xend)(void)
+{
+    if (_xtest()) {
+        _xend();
+    } else {
+        assert(!parallel_cpus);
+        parallel_cpus = true;
+    }
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    if (_xtest()) {
+        return 1;
+    }
+    if (!parallel_cpus) {
+        return 1;
+    }
+    return 0;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cfcf440 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);
-- 
2.7.4


* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
  2016-08-15 15:46 ` Emilio G. Cota
  2016-08-15 15:49   ` [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex Emilio G. Cota
@ 2016-08-16 11:16   ` Alex Bennée
  2016-08-16 21:51     ` Emilio G. Cota
  1 sibling, 1 reply; 21+ messages in thread
From: Alex Bennée @ 2016-08-16 11:16 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj,
	mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth,
	peter.maydell, claudio.fontana, Dr. David Alan Gilbert,
	Peter Crosthwaite


Emilio G. Cota <cota@braap.org> writes:

> On Mon, Aug 15, 2016 at 11:46:32 +0100, Alex Bennée wrote:
>> As far as I'm aware the following work is still ongoing:
>>
>> Emilio: cmpxchg atomics
>> Alvise: LL/SC modelling
>
> I've been tinkering with an experimental patch to do proper LL/SC. The idea
> is to rely on hardware transactional memory, so that stores don't have
> to be tracked. The trickiest thing is the fallback path, for which I'm
> trying to (ab)use EXCP_ATOMIC to execute exclusively from the ldrex
> all the way to the strex.
>
> To test it, I'm using aarch64-linux-user running qht-bench compiled on
> an aarch64 machine. I'm running on an Intel Skylake host (Skylake has
> no known TSX bugs)
>
> However, I'm finding issues that might not have to do with the
> patch itself.
>
> - On the latest MTTCG+cmpxchg tree (45c11751ed7 a.k.a.
>   bennee/mttcg/base-patches-v4-with-cmpxchg-atomics-v2), QEMU loops
>   forever without making progress in the instruction stream, even
>   with taskset -c 0.

Could this be a store-after-load barrier issue? I have a branch that
adds Pranith's work:

  https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4-with-cmpxchg-atomics-v2-and-barriers-v4

This seems to have eliminated some of the failure modes (usually the
kernel complaining about stalled tasks) but I'm still seeing my test
case fail from time to time when starting the benchmark task. Currently
I'm not seeing much information about why it's failing to start,
though.

> - On the cmpxchg tree (rth's atomic-2 branch [1]), it works more
>   reliably, although tb_lock is held around tb_find_fast so parallelism isn't
>   very high. Still, it sometimes triggers the assert below.
>   - Applying the "remove tb_lock around hot path" patch makes it
>     easier to trigger this assert in cpu-exec.c:650 (approx.):
>             /* Assert that the compiler does not smash local variables. */
>             g_assert(cpu == current_cpu)
>     I've also seen triggered the assert immediately after that one, as well
>     as the rcu_read_unlock depth assert.

Odd - these are remnants of a dodgy compiler.

>   The asserts are usually triggered when all threads exit (by returning
>   NULL) at roughly the same time.
>   However, they cannot be triggered with taskset -c 0, which makes me
>   suspect that somehow start_exclusive isn't working as intended.
>
> Any tips would be appreciated! I'll reply with a patch that uses RTM,
> the one below is fallback path all the way, and the best to reproduce
> the above.

I'll see if I can reproduce the errors you're seeing on my setup.

>
> Thanks,
>
> 		Emilio
>
> [1] https://github.com/rth7680/qemu/commits/atomic-2
>
> From ed6af6eb364e5a36e81d7cc8143c0e9783c50587 Mon Sep 17 00:00:00 2001
> From: "Emilio G. Cota" <cota@braap.org>
> Date: Mon, 15 Aug 2016 00:27:42 +0200
> Subject: [PATCH] aarch64: use TSX for ldrex/strex (fallback path only)
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  linux-user/main.c          |  5 +++--
>  target-arm/helper-a64.c    | 23 +++++++++++++++++++++++
>  target-arm/helper-a64.h    |  4 ++++
>  target-arm/translate-a64.c | 15 +++++++++------
>  4 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/linux-user/main.c b/linux-user/main.c
> index 9880505..6922faa 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
>
>      /* Since we got here, we know that parallel_cpus must be true.  */
>      parallel_cpus = false;
> -    cpu_exec_step(cpu);
> -    parallel_cpus = true;
> +    while (!parallel_cpus) {
> +        cpu_exec_step(cpu);
> +    }
>
>      end_exclusive();
>  }
> diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
> index 8ce518b..a97b631 100644
> --- a/target-arm/helper-a64.c
> +++ b/target-arm/helper-a64.c
> @@ -579,3 +579,26 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
>
>      return !success;
>  }
> +
> +void HELPER(xbegin)(CPUARMState *env)
> +{
> +    uintptr_t ra = GETPC();
> +
> +    if (parallel_cpus) {
> +        cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> +    }
> +}
> +
> +void HELPER(xend)(void)
> +{
> +    assert(!parallel_cpus);
> +    parallel_cpus = true;
> +}
> +
> +uint64_t HELPER(x_ok)(void)
> +{
> +    if (!parallel_cpus) {
> +        return 1;
> +    }
> +    return 0;
> +}
> diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
> index dd32000..e7ede43 100644
> --- a/target-arm/helper-a64.h
> +++ b/target-arm/helper-a64.h
> @@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
>  DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
>  DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
>  DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
> +
> +DEF_HELPER_1(xbegin, void, env)
> +DEF_HELPER_0(x_ok, i64)
> +DEF_HELPER_0(xend, void)
> diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
> index 450c359..cfcf440 100644
> --- a/target-arm/translate-a64.c
> +++ b/target-arm/translate-a64.c
> @@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>      TCGv_i64 tmp = tcg_temp_new_i64();
>      TCGMemOp be = s->be_data;
>
> +    gen_helper_xbegin(cpu_env);
> +
>      g_assert(size <= 3);
>      if (is_pair) {
>          TCGv_i64 hitmp = tcg_temp_new_i64();
> @@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>      tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
>
>      tmp = tcg_temp_new_i64();
> +    gen_helper_x_ok(tmp);
> +    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
> +
>      if (is_pair) {
>          if (size == 2) {
>              TCGv_i64 val = tcg_temp_new_i64();
> @@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>          }
>      } else {
>          TCGv_i64 val = cpu_reg(s, rt);
> -        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
> -                                   get_mem_index(s),
> -                                   size | MO_ALIGN | s->be_data);
> -        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
> +        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
>      }
>
>      tcg_temp_free_i64(addr);
> -
> -    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
>      tcg_temp_free_i64(tmp);
> +
> +    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
> +    gen_helper_xend();
>      tcg_gen_br(done_label);
>
>      gen_set_label(fail_label);


--
Alex Bennée


* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
  2016-08-16 11:16   ` [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
@ 2016-08-16 21:51     ` Emilio G. Cota
  0 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-16 21:51 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj,
	mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth,
	peter.maydell, claudio.fontana, Dr. David Alan Gilbert,
	Peter Crosthwaite

On Tue, Aug 16, 2016 at 12:16:26 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > However, I'm finding issues that might not have to do with the
> > patch itself.

I had some time today to dig deeper -- turns out the issues *have*
to do with my patch, see below. (And sorry for hijacking this thread.)

> >   - Applying the "remove tb_lock around hot path" patch makes it
> >     easier to trigger this assert in cpu-exec.c:650 (approx.):
> >             /* Assert that the compiler does not smash local variables. */
> >             g_assert(cpu == current_cpu)
> >     I've also seen triggered the assert immediately after that one, as well
> >     as the rcu_read_unlock depth assert.
> 
> Odd - these are remnants of a dodgy compiler.

The problem is that by calling cpu_exec_step() in a loop, we don't
know what instructions we might execute. Thus, when one of those
instructions (sandwiched between an ldrex and strex) causes an
exception (e.g. SVC in A64) we take the longjmp that lands in
cpu_exec_loop, which we did *not* come from. That explains those odd
asserts being triggered.

The reason this is only triggered when pthreads are joined is that the
code there is particularly tricky, with branches and SVC between
ldrex/strex pairs.

The good news is that this still allows me to benchmark the TSX code vs
cmpxchg (I just print out the results before joining); for 4 cores
(8 HW threads), qht-bench performs just as well with TSX as with
cmpxchg (but with TSX we get full correctness). For 1 thread,
atomic_add is faster with cmpxchg, but the gap is greatly reduced as
contention increases. This gap is due to the fixed cost of calling
_xbegin/_xend, which is quite a few more instructions than just
emitting an atomic.

		Emilio


* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
  2016-08-15 15:49   ` [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex Emilio G. Cota
@ 2016-08-17 17:22     ` Richard Henderson
  2016-08-17 17:58       ` Emilio G. Cota
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2016-08-17 17:22 UTC (permalink / raw)
  To: Emilio G. Cota, Alex Bennée
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj,
	mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell,
	claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> +void HELPER(xbegin)(CPUARMState *env)
> +{
> +    uintptr_t ra = GETPC();
> +    int status;
> +    int retries = 100;
> +
> + retry:
> +    status = _xbegin();
> +    if (status != _XBEGIN_STARTED) {
> +        if (status && retries) {
> +            retries--;
> +            goto retry;
> +        }
> +        if (parallel_cpus) {
> +            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> +        }
> +    }
> +}
> +
> +void HELPER(xend)(void)
> +{
> +    if (_xtest()) {
> +        _xend();
> +    } else {
> +        assert(!parallel_cpus);
> +        parallel_cpus = true;
> +    }
> +}
> +

Interesting idea.

FWIW, there are two other extant HTM implementations: ppc64 and s390x.  As I
recall, the s390 (but not the ppc64) transactions do not roll back the fp
registers.  Which suggests that we need special support within the TCG
prologue.  Perhaps folding these operations into special TCG opcodes.

I believe that power8 has HTM, and there's one of those in the gcc compile 
farm, so this should be relatively easy to try out.

We increase the chances of success of the transaction if we minimize the amount 
of non-target code that's executed while the transaction is running.  That 
suggests two things:

(1) that it would be doubly helpful to incorporate the transaction start 
directly into TCG code generation rather than as a helper and

(2) that we should start a new TB upon encountering a load-exclusive, so that 
we maximize the chance of the store-exclusive being a part of the same TB and 
thus have *nothing* extra between the beginning and commit of the transaction.



r~


* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
  2016-08-17 17:22     ` Richard Henderson
@ 2016-08-17 17:58       ` Emilio G. Cota
  2016-08-17 18:18         ` Emilio G. Cota
  2016-08-17 18:41         ` Richard Henderson
  0 siblings, 2 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-17 17:58 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo,
	bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka,
	serge.fdrv, peter.maydell, claudio.fontana,
	Dr. David Alan Gilbert, Peter Crosthwaite

On Wed, Aug 17, 2016 at 10:22:05 -0700, Richard Henderson wrote:
> On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> >+void HELPER(xbegin)(CPUARMState *env)
> >+{
> >+    uintptr_t ra = GETPC();
> >+    int status;
> >+    int retries = 100;
> >+
> >+ retry:
> >+    status = _xbegin();
> >+    if (status != _XBEGIN_STARTED) {
> >+        if (status && retries) {
> >+            retries--;
> >+            goto retry;
> >+        }
> >+        if (parallel_cpus) {
> >+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> >+        }
> >+    }
> >+}
> >+
> >+void HELPER(xend)(void)
> >+{
> >+    if (_xtest()) {
> >+        _xend();
> >+    } else {
> >+        assert(!parallel_cpus);
> >+        parallel_cpus = true;
> >+    }
> >+}
> >+
> 
> Interesting idea.
> 
> FWIW, there are two other extant HTM implementations: ppc64 and s390x.  As I
> recall, the s390 (but not the ppc64) transactions do not roll back the fp
> registers.  Which suggests that we need special support within the TCG
> prologue.  Perhaps folding these operations into special TCG opcodes.

I'm not familiar with s390, but as long as the hardware implements 'strong atomicity'
["strong atomicity guarantees atomicity between transactions and non-transactional
code", see http://acg.cis.upenn.edu/papers/cal06_atomic_semantics.pdf ] then
this approach would work, in the sense that stores wouldn't have to
be instrumented.

Of course architecture issues like saving the fp registers as you mention for
s390 would have to be taken into account.

> I believe that power8 has HTM, and there's one of those in the gcc compile
> farm, so this should be relatively easy to try out.

Good point! I had forgotten about power8. So far my tests have been on a
4-core Skylake. I have an account on the gcc compile farm so I will make use
of it. The power8 machine in the farm has a lot of cores, so this is
pretty exciting.

> We increase the chances of success of the transaction if we minimize the
> amount of non-target code that's executed while the transaction is running.
> That suggests two things:
> 
> (1) that it would be doubly helpful to incorporate the transaction start
> directly into TCG code generation rather than as a helper and

This (and leaving the fallback path in a helper) is simple enough that even
I could do it :-)

> (2) that we should start a new TB upon encountering a load-exclusive, so
> that we maximize the chance of the store-exclusive being a part of the same
> TB and thus have *nothing* extra between the beginning and commit of the
> transaction.

I don't know how to do this. If it's easy to do, please let me know how
(for aarch64 at least, since that's the target I'm using).

I've run some more tests on the Intel machine, and noticed that failed
transactions are very common (up to 50% abort rate for some SPEC workloads,
and I count these aborts as "retrying doesn't help" kind of aborts), so
bringing that down should definitely help.

Another thing I found out is that abusing tcg_exec_step (as is right now)
for the fallback path is a bad idea: when there are many failed transactions,
performance drops dramatically (up to 5x overall slowdown). Turns out that
all this overhead comes from re-translating the code between ldrex/strex.
Would it be possible to cache this step-by-step code? If not, then an
alternative would be to have a way to stop the world *without* leaving
the CPU loop for the calling thread. I'm more comfortable doing the latter
due to my glaring lack of TCG competence.

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
  2016-08-17 17:58       ` Emilio G. Cota
@ 2016-08-17 18:18         ` Emilio G. Cota
  2016-08-17 18:41         ` Richard Henderson
  1 sibling, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-17 18:18 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo,
	bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka,
	serge.fdrv, peter.maydell, claudio.fontana,
	Dr. David Alan Gilbert, Peter Crosthwaite

On Wed, Aug 17, 2016 at 13:58:00 -0400, Emilio G. Cota wrote:
> due to my glaring lack of TCG competence.

A related note that might be of interest.

I benchmarked an alternative implementation that *does* instrument
stores. I wrapped every tcg_gen_qemu_st_i64 (those are enough, right?
tcg_gen_st_i64 are stores for the host memory, which I presume are
not "explicit" guest stores and therefore would not go through
the soft TLB) with a pre/post pair of helpers.

These helpers first check a bitmap given a masked subset of the physical
address of the access, and if the bit is set, then check a QHT with the full
physaddr. If an entry exists, they lock/unlock the entry's spinlock around
the store, so that no race is possible with an ongoing atomic (atomics always
take their corresponding lock). Overhead is not too bad over cmpxchg, but
most of it comes from the helpers--see these numbers for SPEC:
(NB. the "QEMU" baseline does *not* include QHT for tb_htable and therefore
takes tb_lock around tb_find_fast, that's why it's so slow)
  http://imgur.com/a/SoSHQ

"QHT only" means a QHT lookup is performed on every guest store. The win of
having the bitmap before hitting the QHT is quite large. I wonder
if things could be sped up further by performing the bitmap check in
TCG code. Would that be worth exploring? If so, any help on that would
be appreciated (i386 host at least)--I tried, but I'm way out of my element.
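
To make the scheme concrete, here is a minimal sketch of that two-level check
in C. Everything below is illustrative only: the bitmap size, the address
hash, and the helper names are made up, and the real QHT lookup is reduced to
a comment--this is not the code I benchmarked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative parameters: a 2^20-bit filter indexed by a hash of the
 * guest physical address.  Not the benchmarked implementation. */
#define FILTER_BITS   20
#define FILTER_SIZE   (1u << FILTER_BITS)
#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long filter_bitmap[FILTER_SIZE / BITS_PER_LONG];

static inline uint32_t filter_hash(uint64_t physaddr)
{
    /* drop low bits so nearby addresses share a filter bit (assumed) */
    return (physaddr >> 4) & (FILTER_SIZE - 1);
}

/* called when an atomic registers its address in the hash table */
static inline void filter_set(uint64_t physaddr)
{
    uint32_t bit = filter_hash(physaddr);

    filter_bitmap[bit / BITS_PER_LONG] |= 1ul << (bit % BITS_PER_LONG);
}

static inline bool filter_test(uint64_t physaddr)
{
    uint32_t bit = filter_hash(physaddr);

    return filter_bitmap[bit / BITS_PER_LONG] & (1ul << (bit % BITS_PER_LONG));
}

/* pre-store helper: most stores return after the bitmap test alone;
 * only on a hit do we pay for the full QHT lookup + entry spinlock. */
static void helper_store_pre(uint64_t physaddr)
{
    if (!filter_test(physaddr)) {
        return;
    }
    /* slow path: qht_lookup(&atomics_ht, physaddr), lock the entry */
}
```

The point of the sketch is the ordering: the bitmap test is a single load and
mask, so the common case--a store to a line no atomic is tracking--never pays
for a hash-table lookup.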

		E.


* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
  2016-08-17 17:58       ` Emilio G. Cota
  2016-08-17 18:18         ` Emilio G. Cota
@ 2016-08-17 18:41         ` Richard Henderson
  2016-08-18 15:38           ` Richard Henderson
  1 sibling, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2016-08-17 18:41 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo,
	bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka,
	serge.fdrv, peter.maydell, claudio.fontana,
	Dr. David Alan Gilbert, Peter Crosthwaite

On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>> (2) that we should start a new TB upon encountering a load-exclusive, so
>> that we maximize the chance of the store-exclusive being a part of the same
>> TB and thus have *nothing* extra between the beginning and commit of the
>> transaction.
>
> I don't know how to do this. If it's easy to do, please let me know how
> (for aarch64 at least, since that's the target I'm using).

It's a simple matter of peeking at the next instruction.

One way is to partially decode the insn before advancing the PC.

  static void disas_a64_insn (CPUARMState *env, DisasContext *s, int num_insns)
  {
     uint32_t insn = arm_ldl_code(env, s->pc, s->sctlr_b);
+
+   if (num_insns > 1 && (insn & xxx) == yyy) {
+       /* Start load-exclusive in a new TB.  */
+       s->is_jmp = DISAS_UPDATE;
+       return;
+   }
     s->insn = insn;
     s->pc += 4;
...


Alternately, store num_insns into DisasContext, and do pc -= 4 in disas_ldst_excl.


r~


* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
  2016-08-17 18:41         ` Richard Henderson
@ 2016-08-18 15:38           ` Richard Henderson
  2016-08-24 21:12             ` Emilio G. Cota
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2016-08-18 15:38 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo,
	bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka,
	serge.fdrv, peter.maydell, claudio.fontana,
	Dr. David Alan Gilbert, Peter Crosthwaite

On 08/17/2016 11:41 AM, Richard Henderson wrote:
> On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>>> (2) that we should start a new TB upon encountering a load-exclusive, so
>>> that we maximize the chance of the store-exclusive being a part of the same
>>> TB and thus have *nothing* extra between the beginning and commit of the
>>> transaction.
>>
>> I don't know how to do this. If it's easy to do, please let me know how
>> (for aarch64 at least, since that's the target I'm using).
>
> It's a simple matter of peeking at the next instruction.
>
> One way is to partially decode the insn before advancing the PC.
>
>  static void disas_a64_insn (CPUARMState *env, DisasContext *s, int num_insns)
>  {
>     uint32_t insn = arm_ldl_code(env, s->pc, s->sctlr_b);
> +
> +   if (num_insns > 1 && (insn & xxx) == yyy) {
> +       /* Start load-exclusive in a new TB.  */
> +       s->is_jmp = DISAS_UPDATE;
> +       return;
> +   }
>     s->insn = insn;
>     s->pc += 4;
> ...
>
>
> Alternately, store num_insns into DisasContext, and do pc -= 4 in disas_ldst_excl.

Actually, the mask check is the only really viable solution, and it needs to 
happen before we do the tcg_gen_insn_start thing.

A couple of other notes, as I've thought about this some more.

If the start and end of the transaction are not in the same TB, the likelihood 
of transaction failure should be very near 100%.  Consider:

   * TB with ldrex ends before the strex.

   * Since the next TB hasn't been built yet, we'll definitely go
     through tb_find_physical, through the translator, and through
     the tcg compiler.

     (a) Which I think we can definitely assume will exhaust any
         resources associated with the transaction.
     (b) Which will abort the transaction,
     (c) Which, with the current code, will retry N times, with
         identical results, failing within the compiler each time,
     (d) Which, with the current code, will single-step through
         to the strex, as you saw.

   * Since we proceed to (d) the first time, we'll never succeed
     to create the next TB, so we'll always iterate compilation N
     times, resulting in the single-step.

This is probably the real slow-down that you see.

Therefore, we must abort any transaction when we exit tcg-generated code, 
whether through cpu_loop_exit or through the tcg epilogue.  We should be able 
to use the software-controlled bits associated with the abort to tell what 
kind of event led to the abort.  However, we must bear in mind that (for both 
x86 and ppc at least) we only have an 8-bit abort code.  So we can't pass 
back a pointer, for instance.
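
Concretely, the abort status word leaves room for a small enum of reasons.  A 
sketch (the bit layout mirrors GCC's rtmintrin.h, reproduced locally so the 
snippet stands alone; the reason codes themselves are invented for 
illustration):

```c
/* Mirrors GCC's rtmintrin.h so the sketch is self-contained. */
#define XABORT_EXPLICIT  (1u << 0)          /* abort came from _xabort(code) */
#define XABORT_CODE(x)   (((x) >> 24) & 0xffu)

/* Invented reason codes; an _xabort immediate is only 8 bits wide. */
enum tcg_abort_reason {
    TCG_ABORT_NONE      = -1,  /* hardware abort: conflict, capacity, ... */
    TCG_ABORT_EXIT_LOOP =  1,  /* left tcg-generated code via cpu_loop_exit */
    TCG_ABORT_EPILOGUE  =  2,  /* fell out through the tcg epilogue */
};

/* Decode the status word _xbegin() returned after an abort.  Only an
 * explicit _xabort(imm8) delivers a software code; everything else is
 * a hardware-initiated abort with no code of ours attached. */
static int classify_abort(unsigned int status)
{
    if (status & XABORT_EXPLICIT) {
        return (int)XABORT_CODE(status);
    }
    return TCG_ABORT_NONE;
}
```

Since _xabort takes an 8-bit compile-time immediate, a small enum of reasons 
is the most we can smuggle out of an aborted transaction--consistent with the 
point above that a pointer won't fit.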

We should think about what kinds of limitations we should accept for handling 
ll/sc via transactions.

   * How do we handle unpaired ldrexd / ldxp?  This is used by the compiler,
     as it's the only way to perform a double-word atomic load.

     This implies that we need some sort of counter, beyond which we stop
     trying to succeed via transaction.

   * In order to make normal cmpxchg patterns work, we have to be able to
     handle a branch within a ll/sc sequence.  Options:

     * Less complex way is to build a TB, including branches, with a max
       of N insns along the branch-not-taken path, searching for the strex.
       But of course this fails to handle legitimate patterns for arm
       (and other ll/sc guests).

       However, gcc code generation will generally annotate the cmpxchg
       failure branch as not-taken, so perhaps this will work well enough
       in practice.

     * More complex way is to build a TB, including branches, with a max
       of N insns along *all* paths, searching for the strex.  This runs
       into problems with, among other things, branches crossing pages.

     * Most complex way is to somehow get all of the TBs built, and
       linked together, preferably before we even try executing
       (and failing the transaction in) the first TB.


r~


* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
  2016-08-18 15:38           ` Richard Henderson
@ 2016-08-24 21:12             ` Emilio G. Cota
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
  0 siblings, 1 reply; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 21:12 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo,
	bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka,
	serge.fdrv, peter.maydell, claudio.fontana,
	Dr. David Alan Gilbert, Peter Crosthwaite

On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.

Thanks for spending time on this.

I have a new patchset (will send as a reply to this e-mail in a few
minutes) that has good performance. Its main ideas:

- Use transactions that start on ldrex and finish on strex. On
  an exception, end (instead of abort) the ongoing transaction,
  if any. There's little point in aborting, since the subsequent
  retries will end up in the same exception anyway. This means
  the translation of the corresponding blocks might happen via
  the fallback path. That's OK, given that subsequent executions
  of the TBs will (likely) complete via HTM.

- For the fallback path, add a stop-the-world primitive that stops
  all other CPUs, without requiring the calling CPU to exit the CPU loop.
  Not breaking from the loop keeps the code simple--we can just
  keep translating/executing normally, with the guarantee that
  no other CPU can run until we're done.

- The fallback path of the transaction stops the world and then
  continues execution (from ldrex) as the only running CPU.

- Only retry when the hardware hints that we may do so. This
  ends up being rare (I can only get dozens of retries under
  heavy contention, for instance with 'atomic_add-bench -r 1')
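
Roughly, the retry decision looks like this (a sketch only: the constant 
values mirror GCC's rtmintrin.h, reproduced locally, and the retry cap is the 
limit discussed below):

```c
/* Values as in GCC's rtmintrin.h, copied so the sketch stands alone. */
#define SKETCH_XBEGIN_STARTED (~0u)     /* _xbegin(): transaction running */
#define SKETCH_XABORT_RETRY   (1u << 1) /* hardware hints a retry may work */

/* Retry only while the hardware says it might help, up to a cap. */
static int should_retry(unsigned int status, int *retries)
{
    if (status == SKETCH_XBEGIN_STARTED) {
        return 0;               /* not an abort: transaction is running */
    }
    if (!(status & SKETCH_XABORT_RETRY)) {
        return 0;               /* hardware says retrying won't help */
    }
    return (*retries)-- > 0;
}
```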

Limitations: for now user-mode only, and I have paid no attention
to paired atomics. Also, I'm making no checks for unusual (undefined?)
guest code, such as stray ldrex/strex thrown in there.

Performance optimizations like you suggest (e.g. starting a TB
on ldrex, or using TCG ops for beginning/ending the transaction)
could be implemented, but at least on Intel TSX (the only one I've
tried so far[*]), the transaction buffer seems big enough to not
make these optimizations a necessity.

[*] I tried running HTM primitives on the gcc compile farm's Power8,
  but I get an illegal instruction fault on tbegin. I've filed
  an issue here to report it: https://gna.org/support/?3369

Some observations:

- The peak number of retries I see is for atomic_add-bench -r 1 -n 16
  (on an 8-thread machine), at about 90 retries. So I set the limit
  to 100.

- The lowest success rate I've seen is ~98%, again for atomic_add-bench
  under high contention.

Some numbers:

- atomic_add's performance is lower for HTM vs cmpxchg, although under
  contention performance gets very similar. The reason for the perf
  gap is that xbegin/xend takes more cycles than cmpxchg, especially
  under little or no contention; this explains the large difference
  for threads=1.
  http://imgur.com/5kiT027
  As a side note, contended transactions seem to scale worse than contended
  cmpxchg when exploiting SMT. But anyway I wouldn't read much into
  that.

- For more realistic workloads that gap goes away, as the relative impact
  of cmpxchg or transaction delays is lower. For QHT, 1000 keys:
  http://imgur.com/l6vcowu
  And for SPEC (note that despite being single-threaded, SPEC executes
  a lot of atomics, e.g. from mutexes and from forking):
  http://imgur.com/W49YMhJ
  Performance is essentially identical to that of cmpxchg, but of course
  with HTM we get correct emulation.

Thanks for reading this far!

		Emilio


* [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST
  2016-08-24 21:12             ` Emilio G. Cota
@ 2016-08-24 22:17               ` Emilio G. Cota
  2016-08-24 22:17                 ` [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path Emilio G. Cota
                                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

This avoids the chance of reading a corrupted list of CPUs in usermode.

Note: this breaks hw/ppc/spapr due to the removal of CPU_FOREACH_REVERSE.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c               |  2 +-
 exec.c               | 18 +++++++++++++++---
 include/qom/cpu.h    | 16 +++++++---------
 linux-user/main.c    |  2 +-
 linux-user/syscall.c |  2 +-
 5 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/cpus.c b/cpus.c
index a01bbbd..bc573be 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1177,7 +1177,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
                 qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
             }
         }
-        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
+        qemu_tcg_wait_io_event(first_cpu);
         CPU_FOREACH(cpu) {
             if (cpu->unplug && !cpu_can_run(cpu)) {
                 remove_cpu = cpu;
diff --git a/exec.c b/exec.c
index 806e2fe..70dd869 100644
--- a/exec.c
+++ b/exec.c
@@ -93,7 +93,7 @@ static MemoryRegion io_mem_unassigned;
 
 #endif
 
-struct CPUTailQ cpus = QTAILQ_HEAD_INITIALIZER(cpus);
+struct CPUTailQ cpus = QLIST_HEAD_INITIALIZER(cpus);
 /* current CPU in the current thread. It is only valid inside
    cpu_exec() */
 __thread CPUState *current_cpu;
@@ -651,7 +651,7 @@ void cpu_exec_exit(CPUState *cpu)
         return;
     }
 
-    QTAILQ_REMOVE(&cpus, cpu, node);
+    QLIST_REMOVE_RCU(cpu, node);
     cpu_release_index(cpu);
     cpu->cpu_index = -1;
 #if defined(CONFIG_USER_ONLY)
@@ -703,7 +703,19 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
 #endif
         return;
     }
-    QTAILQ_INSERT_TAIL(&cpus, cpu, node);
+    /* poor man's QLIST_INSERT_TAIL_RCU */
+    if (QLIST_EMPTY_RCU(&cpus)) {
+        QLIST_INSERT_HEAD_RCU(&cpus, cpu, node);
+    } else {
+        CPUState *some_cpu;
+
+        CPU_FOREACH(some_cpu) {
+            if (QLIST_NEXT_RCU(some_cpu, node) == NULL) {
+                QLIST_INSERT_AFTER_RCU(some_cpu, cpu, node);
+                break;
+            }
+        }
+    }
 #if defined(CONFIG_USER_ONLY)
     (void) cc;
     cpu_list_unlock();
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 32f3af3..eba48ed 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -24,7 +24,7 @@
 #include "disas/bfd.h"
 #include "exec/hwaddr.h"
 #include "exec/memattrs.h"
-#include "qemu/queue.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/thread.h"
 
 typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
@@ -319,7 +319,7 @@ struct CPUState {
     struct GDBRegisterState *gdb_regs;
     int gdb_num_regs;
     int gdb_num_g_regs;
-    QTAILQ_ENTRY(CPUState) node;
+    QLIST_ENTRY(CPUState) node;
 
     /* ice debug support */
     QTAILQ_HEAD(breakpoints_head, CPUBreakpoint) breakpoints;
@@ -362,15 +362,13 @@ struct CPUState {
     uint32_t tcg_exit_req;
 };
 
-QTAILQ_HEAD(CPUTailQ, CPUState);
+QLIST_HEAD(CPUTailQ, CPUState);
 extern struct CPUTailQ cpus;
-#define CPU_NEXT(cpu) QTAILQ_NEXT(cpu, node)
-#define CPU_FOREACH(cpu) QTAILQ_FOREACH(cpu, &cpus, node)
+#define CPU_NEXT(cpu) QLIST_NEXT_RCU(cpu, node)
+#define CPU_FOREACH(cpu) QLIST_FOREACH_RCU(cpu, &cpus, node)
 #define CPU_FOREACH_SAFE(cpu, next_cpu) \
-    QTAILQ_FOREACH_SAFE(cpu, &cpus, node, next_cpu)
-#define CPU_FOREACH_REVERSE(cpu) \
-    QTAILQ_FOREACH_REVERSE(cpu, &cpus, CPUTailQ, node)
-#define first_cpu QTAILQ_FIRST(&cpus)
+    QLIST_FOREACH_SAFE_RCU(cpu, &cpus, node, next_cpu)
+#define first_cpu QLIST_FIRST_RCU(&cpus)
 
 extern __thread CPUState *current_cpu;
 
diff --git a/linux-user/main.c b/linux-user/main.c
index f2f7422..9880505 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -131,7 +131,7 @@ void fork_end(int child)
            Discard information about the parent threads.  */
         CPU_FOREACH_SAFE(cpu, next_cpu) {
             if (cpu != thread_cpu) {
-                QTAILQ_REMOVE(&cpus, cpu, node);
+                QLIST_REMOVE_RCU(cpu, node);
             }
         }
         pending_cpus = 0;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 1c17b74..2911319 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -6710,7 +6710,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 
             cpu_list_lock();
             /* Remove the CPU from the list.  */
-            QTAILQ_REMOVE(&cpus, cpu, node);
+            QLIST_REMOVE_RCU(cpu, node);
             cpu_list_unlock();
             ts = cpu->opaque;
             if (ts->child_tidptr) {
-- 
2.5.0


* [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
@ 2016-08-24 22:17                 ` Emilio G. Cota
  2016-08-24 22:17                 ` [Qemu-devel] [PATCH 3/8] rcu: add rcu_read_lock_held() Emilio G. Cota
                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 041f8b7..63d739a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -309,34 +309,18 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
     TranslationBlock *tb;
 
     tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (tb) {
-        goto found;
-    }
-
-#ifdef CONFIG_USER_ONLY
-    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
-     * taken outside tb_lock.  Since we're momentarily dropping
-     * tb_lock, there's a chance that our desired tb has been
-     * translated.
-     */
-    tb_unlock();
-    mmap_lock();
-    tb_lock();
-    tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (tb) {
+    if (!tb) {
+        mmap_lock();
+        tb_lock();
+        tb = tb_find_physical(cpu, pc, cs_base, flags);
+        if (!tb) {
+            /* if no translated code available, then translate it now */
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+        }
+        tb_unlock();
         mmap_unlock();
-        goto found;
     }
-#endif
-
-    /* if no translated code available, then translate it now */
-    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
-
-#ifdef CONFIG_USER_ONLY
-    mmap_unlock();
-#endif
 
-found:
     /* we add the TB in the virtual pc hash table */
     cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
     return tb;
@@ -355,7 +339,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
        always be the same before a given translated block
        is executed. */
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb_lock();
     tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
@@ -379,9 +362,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
+        tb_lock();
         tb_add_jump(*last_tb, tb_exit, tb);
+        tb_unlock();
     }
-    tb_unlock();
     return tb;
 }
 
-- 
2.5.0


* [Qemu-devel] [PATCH 3/8] rcu: add rcu_read_lock_held()
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
  2016-08-24 22:17                 ` [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path Emilio G. Cota
@ 2016-08-24 22:17                 ` Emilio G. Cota
  2016-08-24 22:17                 ` [Qemu-devel] [PATCH 4/8] target-arm: helper fixup for paired atomics Emilio G. Cota
                                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/rcu.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 83ae280..0f6e467 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -98,6 +98,13 @@ static inline void rcu_read_unlock(void)
     }
 }
 
+static inline bool rcu_read_lock_held(void)
+{
+    struct rcu_reader_data *p_rcu_reader = &rcu_reader;
+
+    return p_rcu_reader->depth > 0;
+}
+
 extern void synchronize_rcu(void);
 
 /*
-- 
2.5.0


* [Qemu-devel] [PATCH 4/8] target-arm: helper fixup for paired atomics
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
  2016-08-24 22:17                 ` [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path Emilio G. Cota
  2016-08-24 22:17                 ` [Qemu-devel] [PATCH 3/8] rcu: add rcu_read_lock_held() Emilio G. Cota
@ 2016-08-24 22:17                 ` Emilio G. Cota
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 5/8] linux-user: add stop-the-world to be called from CPU loop Emilio G. Cota
                                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/helper-a64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..6f3fd17 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -453,7 +453,7 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, uint32_t bytes)
 uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
                                      uint64_t new_lo, uint64_t new_hi)
 {
-#ifndef CONFIG_USER_ONLY
+#if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_ATOMIC128)
     uintptr_t ra = GETPC();
 #endif
     Int128 oldv, cmpv, newv;
@@ -518,7 +518,7 @@ uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
 uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
                                      uint64_t new_lo, uint64_t new_hi)
 {
-#ifndef CONFIG_USER_ONLY
+#if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_ATOMIC128)
     uintptr_t ra = GETPC();
 #endif
     Int128 oldv, cmpv, newv;
-- 
2.5.0


* [Qemu-devel] [PATCH 5/8] linux-user: add stop-the-world to be called from CPU loop
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
                                   ` (2 preceding siblings ...)
  2016-08-24 22:17                 ` [Qemu-devel] [PATCH 4/8] target-arm: helper fixup for paired atomics Emilio G. Cota
@ 2016-08-24 22:18                 ` Emilio G. Cota
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 6/8] htm: add header to abstract Hardware Transactional Memory intrinsics Emilio G. Cota
                                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:18 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              |  1 +
 include/exec/exec-all.h |  5 +++
 linux-user/main.c       | 89 +++++++++++++++++++++++++++++++++++++++++++++++++
 linux-user/syscall.c    |  1 +
 4 files changed, 96 insertions(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index 63d739a..8f1adc4 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -649,6 +649,7 @@ int cpu_exec(CPUState *cpu)
             g_assert(cc == CPU_GET_CLASS(cpu));
 #endif /* buggy compiler */
             cpu->can_do_io = 1;
+            stop_the_world_reset();
             tb_lock_reset();
         }
     } /* for(;;) */
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index ec72c5a..c483d80 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -61,6 +61,11 @@ void QEMU_NORETURN cpu_loop_exit(CPUState *cpu);
 void QEMU_NORETURN cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc);
 void QEMU_NORETURN cpu_loop_exit_atomic(CPUState *cpu, uintptr_t pc);
 
+void stop_the_world_lock(CPUState *cpu);
+void stop_the_world_unlock(void);
+void stop_the_world_reset(void);
+extern __thread bool stw_held;
+
 #if !defined(CONFIG_USER_ONLY)
 void cpu_reloading_memory_map(void);
 /**
diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..94c6625 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -114,11 +114,19 @@ static pthread_cond_t exclusive_cond = PTHREAD_COND_INITIALIZER;
 static pthread_cond_t exclusive_resume = PTHREAD_COND_INITIALIZER;
 static int pending_cpus;
 
+static pthread_cond_t stw_sleep_cond   = PTHREAD_COND_INITIALIZER;
+static pthread_cond_t stw_request_cond = PTHREAD_COND_INITIALIZER;
+static pthread_mutex_t stw_lock = PTHREAD_MUTEX_INITIALIZER;
+static int stw_requests;
+static bool stw_ongoing;
+__thread bool stw_held;
+
 /* Make sure everything is in a consistent state for calling fork().  */
 void fork_start(void)
 {
     qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
     pthread_mutex_lock(&exclusive_lock);
+    pthread_mutex_lock(&stw_lock);
     mmap_fork_start();
 }
 
@@ -137,11 +145,17 @@ void fork_end(int child)
         pending_cpus = 0;
         pthread_mutex_init(&exclusive_lock, NULL);
         pthread_mutex_init(&cpu_list_mutex, NULL);
+        pthread_mutex_init(&stw_lock, NULL);
+        stw_held = false;
+        stw_ongoing = false;
         pthread_cond_init(&exclusive_cond, NULL);
         pthread_cond_init(&exclusive_resume, NULL);
+        pthread_cond_init(&stw_sleep_cond, NULL);
+        pthread_cond_init(&stw_request_cond, NULL);
         qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
         gdbserver_fork(thread_cpu);
     } else {
+        pthread_mutex_unlock(&stw_lock);
         pthread_mutex_unlock(&exclusive_lock);
         qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
     }
@@ -198,6 +212,79 @@ static void step_atomic(CPUState *cpu)
     end_exclusive();
 }
 
+void stop_the_world_lock(CPUState *cpu)
+{
+    CPUState *other;
+
+    if (stw_held) {
+        return;
+    }
+    rcu_read_unlock();
+    assert(!rcu_read_lock_held());
+
+    pthread_mutex_lock(&stw_lock);
+    if (stw_ongoing) {
+        stw_requests++;
+        /* wait for ongoing stops to occur */
+        while (stw_ongoing) {
+            pthread_cond_wait(&stw_request_cond, &stw_lock);
+        }
+        stw_requests--;
+    }
+
+    /* it's our turn! */
+    stw_ongoing = true;
+    stw_held = true;
+    CPU_FOREACH(other) {
+        if (other != cpu) {
+            cpu_exit(other);
+        }
+    }
+    synchronize_rcu();
+}
+
+void stop_the_world_unlock(void)
+{
+    if (!stw_held) {
+        return;
+    }
+    assert(stw_ongoing);
+    assert(!rcu_read_lock_held());
+
+    if (stw_requests) {
+        pthread_cond_signal(&stw_request_cond);
+    } else {
+        pthread_cond_broadcast(&stw_sleep_cond);
+    }
+    /*
+     * Make sure the next STW requester (if any) will perceive that we're
+     * in an RCU read-side critical section.
+     */
+    rcu_read_lock();
+    stw_ongoing = false;
+    stw_held = false;
+    pthread_mutex_unlock(&stw_lock);
+}
+
+void stop_the_world_reset(void)
+{
+    if (likely(!stw_held)) {
+        return;
+    }
+    stop_the_world_unlock();
+}
+
+static inline void stop_the_world_sleep(void)
+{
+    pthread_mutex_lock(&stw_lock);
+    if (unlikely(stw_ongoing)) {
+        while (stw_ongoing) {
+            pthread_cond_wait(&stw_sleep_cond, &stw_lock);
+        }
+    }
+    pthread_mutex_unlock(&stw_lock);
+}
+
 /* Wait for exclusive ops to finish, and begin cpu execution.  */
 static inline void cpu_exec_start(CPUState *cpu)
 {
@@ -205,6 +292,8 @@ static inline void cpu_exec_start(CPUState *cpu)
     exclusive_idle();
     cpu->running = true;
     pthread_mutex_unlock(&exclusive_lock);
+
+    stop_the_world_sleep();
 }
 
 /* Mark cpu as not executing, and release pending exclusive ops.  */
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 2911319..740af23 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -5403,6 +5403,7 @@ static void *clone_func(void *arg)
     /* Wait until the parent has finshed initializing the tls state.  */
     pthread_mutex_lock(&clone_lock);
     pthread_mutex_unlock(&clone_lock);
+    stw_held = false;
     cpu_loop(env);
     /* never exits */
     return NULL;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 6/8] htm: add header to abstract Hardware Transactional Memory intrinsics
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
                                   ` (3 preceding siblings ...)
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 5/8] linux-user: add stop-the-world to be called from CPU loop Emilio G. Cota
@ 2016-08-24 22:18                 ` Emilio G. Cota
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 7/8] htm: add powerpc64 intrinsics Emilio G. Cota
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 8/8] target-arm/a64: use HTM with stop-the-world fall-back path Emilio G. Cota
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:18 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/htm.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 include/qemu/htm.h

diff --git a/include/qemu/htm.h b/include/qemu/htm.h
new file mode 100644
index 0000000..dc84bc1
--- /dev/null
+++ b/include/qemu/htm.h
@@ -0,0 +1,43 @@
+#ifndef HTM_H
+#define HTM_H
+
+enum htm {
+    HTM_OK,
+    HTM_ABORT_RETRY,
+    HTM_ABORT_NORETRY,
+};
+
+#if defined(__x86_64__)
+/* compile with -mrtm */
+#include <immintrin.h>
+
+static inline enum htm htm_begin(void)
+{
+    int status;
+
+    status = _xbegin();
+    if (unlikely(status != _XBEGIN_STARTED)) {
+        if (status & _XABORT_RETRY) {
+            return HTM_ABORT_RETRY;
+        }
+        return HTM_ABORT_NORETRY;
+    }
+    return HTM_OK;
+}
+
+static inline void htm_end(void)
+{
+    _xend();
+}
+
+static inline bool htm_test(void)
+{
+    return _xtest();
+}
+
+static inline void htm_abort(void)
+{
+    _xabort(0);
+}
+#endif /* ISA */
+#endif /* HTM_H */
-- 
2.5.0


* [Qemu-devel] [PATCH 7/8] htm: add powerpc64 intrinsics
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
                                   ` (4 preceding siblings ...)
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 6/8] htm: add header to abstract Hardware Transactional Memory intrinsics Emilio G. Cota
@ 2016-08-24 22:18                 ` Emilio G. Cota
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 8/8] target-arm/a64: use HTM with stop-the-world fall-back path Emilio G. Cota
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:18 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/htm.h | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/include/qemu/htm.h b/include/qemu/htm.h
index dc84bc1..f367ee4 100644
--- a/include/qemu/htm.h
+++ b/include/qemu/htm.h
@@ -39,5 +39,44 @@ static inline void htm_abort(void)
 {
     _xabort(0);
 }
+
+#elif defined(__powerpc64__)
+/* compile with -mhtm */
+#include <htmintrin.h>
+
+static inline enum htm htm_begin(void)
+{
+    unsigned int status;
+
+    status = __builtin_tbegin(0);
+    if (likely(status)) {
+        return HTM_OK;
+    }
+    if (_TEXASRU_FAILURE_PERSISTENT(__builtin_get_texasru())) {
+        return HTM_ABORT_NORETRY;
+    }
+    return HTM_ABORT_RETRY;
+}
+
+static inline void htm_end(void)
+{
+    __builtin_tend(0);
+}
+
+static inline bool htm_test(void)
+{
+    unsigned char state = _HTM_STATE(__builtin_ttest());
+
+    if (likely(state == _HTM_TRANSACTIONAL)) {
+        return 1;
+    }
+    return 0;
+}
+
+static inline void htm_abort(void)
+{
+    __builtin_tabort(0);
+}
+
 #endif /* ISA */
 #endif /* HTM_H */
-- 
2.5.0


* [Qemu-devel] [PATCH 8/8] target-arm/a64: use HTM with stop-the-world fall-back path
  2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
                                   ` (5 preceding siblings ...)
  2016-08-24 22:18                 ` [Qemu-devel] [PATCH 7/8] htm: add powerpc64 intrinsics Emilio G. Cota
@ 2016-08-24 22:18                 ` Emilio G. Cota
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:18 UTC (permalink / raw)
  To: Richard Henderson
  Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
	nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv,
	peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

TODO: convert paired atomics as well.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c                 |  4 ++++
 target-arm/helper-a64.c    | 31 +++++++++++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/op_helper.c     |  4 ++++
 target-arm/translate-a64.c | 16 ++++++++++------
 5 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 8f1adc4..6e2531f 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -26,6 +26,7 @@
 #include "sysemu/qtest.h"
 #include "qemu/timer.h"
 #include "exec/address-spaces.h"
+#include "qemu/htm.h"
 #include "qemu/rcu.h"
 #include "exec/tb-hash.h"
 #include "exec/log.h"
@@ -651,6 +652,9 @@ int cpu_exec(CPUState *cpu)
             cpu->can_do_io = 1;
             stop_the_world_reset();
             tb_lock_reset();
+            if (unlikely(htm_test())) {
+                htm_end();
+            }
         }
     } /* for(;;) */
 
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 6f3fd17..741e6de 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -25,6 +25,7 @@
 #include "qemu/log.h"
 #include "sysemu/sysemu.h"
 #include "qemu/bitops.h"
+#include "qemu/htm.h"
 #include "internals.h"
 #include "qemu/crc32c.h"
 #include "exec/exec-all.h"
@@ -579,3 +580,33 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    int status;
+    int retries = 100;
+
+ retry:
+    status = htm_begin();
+    if (unlikely(status != HTM_OK)) {
+        if (status == HTM_ABORT_RETRY && retries) {
+            retries--;
+            goto retry;
+        }
+        stop_the_world_lock(ENV_GET_CPU(env));
+    }
+}
+
+void HELPER(xend)(void)
+{
+    if (likely(htm_test())) {
+        htm_end();
+    } else {
+        stop_the_world_unlock();
+    }
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    return likely(htm_test()) || stw_held;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 73da759..91b1413 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 #include "qemu/osdep.h"
+#include "qemu/htm.h"
 #include "cpu.h"
 #include "exec/helper-proto.h"
 #include "internals.h"
@@ -31,6 +32,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
 {
     CPUState *cs = CPU(arm_env_get_cpu(env));
 
+    if (unlikely(htm_test())) {
+        htm_end();
+    }
     assert(!excp_is_internal(excp));
     cs->exception_index = excp;
     env->exception.syndrome = syndrome;
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cc3baa0 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,10 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    /* strex without a prior ldrex should just fail */
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1850,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);
-- 
2.5.0


Thread overview: 21+ messages
2016-08-15 10:46 [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
2016-08-15 11:00 ` Peter Maydell
2016-08-15 11:16   ` Alex Bennée
2016-08-15 15:46 ` Emilio G. Cota
2016-08-15 15:49   ` [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex Emilio G. Cota
2016-08-17 17:22     ` Richard Henderson
2016-08-17 17:58       ` Emilio G. Cota
2016-08-17 18:18         ` Emilio G. Cota
2016-08-17 18:41         ` Richard Henderson
2016-08-18 15:38           ` Richard Henderson
2016-08-24 21:12             ` Emilio G. Cota
2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
2016-08-24 22:17                 ` [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path Emilio G. Cota
2016-08-24 22:17                 ` [Qemu-devel] [PATCH 3/8] rcu: add rcu_read_lock_held() Emilio G. Cota
2016-08-24 22:17                 ` [Qemu-devel] [PATCH 4/8] target-arm: helper fixup for paired atomics Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 5/8] linux-user: add stop-the-world to be called from CPU loop Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 6/8] htm: add header to abstract Hardware Transactional Memory intrinsics Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 7/8] htm: add powerpc64 intrinsics Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 8/8] target-arm/a64: use HTM with stop-the-world fall-back path Emilio G. Cota
2016-08-16 11:16   ` [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
2016-08-16 21:51     ` Emilio G. Cota
