* [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path
@ 2016-07-05 16:18 Alex Bennée
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock' Alex Bennée
                   ` (6 more replies)
  0 siblings, 7 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-05 16:18 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Alex Bennée

Hi,

Well, this is the first re-spin of the series posted last week. I've
added a bunch of additional patches to be more aggressive about
avoiding bouncing locks but, to be honest, the numbers don't seem to
make it worth it.

I think the first 3 patches are ready to take if the TCG maintainers
want to:

    tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
    tcg: set up tb->page_addr before insertion
    tcg: cpu-exec: remove tb_lock from the hot-path

The remaining patches are included for discussion.

I've re-spun the benchmarks with a larger tarball to show the
difference more clearly:

Baseline Run
============

retry.py called with ['./arm-linux-user/qemu-arm', './pigz.armhf', '-c', '-9', 'linux-4.6.3.tar']
Source code is @ pull-target-arm-20160627-162-ged7e184 or heads/misc/docker-linux-user-v4
run 1: ret=0 (PASS), time=32.786249 (1/1)
run 2: ret=0 (PASS), time=32.535492 (2/2)
run 3: ret=0 (PASS), time=33.036394 (3/3)
run 4: ret=0 (PASS), time=33.036447 (4/4)
run 5: ret=0 (PASS), time=33.036706 (5/5)
run 6: ret=0 (PASS), time=33.536869 (6/6)
run 7: ret=0 (PASS), time=33.286681 (7/7)
run 8: ret=0 (PASS), time=35.292143 (8/8)
run 9: ret=0 (PASS), time=33.286727 (9/9)
run 10: ret=0 (PASS), time=32.786092 (10/10)
Results summary:
0: 10 times (100.00%), avg time 33.262 (0.59 variance/0.77 deviation)

Up to and including tcg: cpu-exec: remove tb_lock from the hot-path
===================================================================

Ran command 10 times, 10 passes
retry.py called with ['./arm-linux-user/qemu-arm', './pigz.armhf', '-c', '-9', 'linux-4.6.3.tar']
Source code is @ pull-target-arm-20160627-165-ga6c4538 or heads/misc/docker-linux-user-v4-3-ga6c4538
run 1: ret=0 (PASS), time=29.783023 (1/1)
run 2: ret=0 (PASS), time=29.532725 (2/2)
run 3: ret=0 (PASS), time=29.783066 (3/3)
run 4: ret=0 (PASS), time=29.783209 (4/4)
run 5: ret=0 (PASS), time=29.783338 (5/5)
run 6: ret=0 (PASS), time=30.033726 (6/6)
run 7: ret=0 (PASS), time=32.039076 (7/7)
run 8: ret=0 (PASS), time=29.783116 (8/8)
run 9: ret=0 (PASS), time=30.033237 (9/9)
run 10: ret=0 (PASS), time=30.283845 (10/10)
Results summary:
0: 10 times (100.00%), avg time 30.084 (0.51 variance/0.72 deviation)

The whole series
================

Ran command 10 times, 10 passes
retry.py called with ['./arm-linux-user/qemu-arm', './pigz.armhf', '-c', '-9', 'linux-4.6.3.tar']
Source code is @ pull-target-arm-20160627-168-ge9609f6 or heads/tcg/hot-path-and-misc-cleanups-v2
run 1: ret=0 (PASS), time=29.532766 (1/1)
run 2: ret=0 (PASS), time=29.534664 (2/2)
run 3: ret=0 (PASS), time=29.533659 (3/3)
run 4: ret=0 (PASS), time=29.282399 (4/4)
run 5: ret=0 (PASS), time=30.283774 (5/5)
run 6: ret=0 (PASS), time=30.033609 (6/6)
run 7: ret=0 (PASS), time=30.283790 (7/7)
run 8: ret=0 (PASS), time=29.783237 (8/8)
run 9: ret=0 (PASS), time=30.033356 (9/9)
run 10: ret=0 (PASS), time=32.536344 (10/10)
Results summary:
0: 10 times (100.00%), avg time 30.084 (0.86 variance/0.93 deviation)

I think the variance and deviation calculations are correct now. The
benchmark is run with my retry script:

    https://github.com/stsquad/retry

The command line was:

    retry.py -l pigz.bench -g -n 10 -c -- ./arm-linux-user/qemu-arm \
        ./pigz.armhf -c -9 linux-4.6.3.tar > /dev/null
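
For reference, the reported summary figures are consistent with the
usual sample statistics (checked against the baseline run: mean 33.262,
variance ~0.59, deviation ~0.77):

    \bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_i, \qquad
    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (t_i - \bar{t})^2, \qquad
    s = \sqrt{s^2}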

Alex Bennée (5):
  tcg: set up tb->page_addr before insertion
  tcg: cpu-exec: remove tb_lock from the hot-path
  tcg: cpu-exec: factor out TB patching code
  tcg: introduce tb_lock_recursive()
  tcg: cpu-exec: roll-up tb_find_fast/slow

Sergey Fedorov (1):
  tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'

 cpu-exec.c      | 153 +++++++++++++++++++++++++++++++++-----------------------
 tcg/tcg.h       |   1 +
 translate-all.c |  28 ++++++++---
 3 files changed, 114 insertions(+), 68 deletions(-)

-- 
2.7.4

* [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
  2016-07-05 16:18 [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Alex Bennée
@ 2016-07-05 16:18 ` Alex Bennée
  2016-07-07 13:52   ` Sergey Fedorov
  2016-07-08 14:51   ` Sergey Fedorov
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 2/6] tcg: set up tb->page_addr before insertion Alex Bennée
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-05 16:18 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Sergey Fedorov, Alex Bennée,
	Peter Crosthwaite

From: Sergey Fedorov <serge.fdrv@gmail.com>

First, ensure atomicity of the CPU's 'tb_jmp_cache' accesses by:
 * using atomic_read() to look up a TB when not holding 'tb_lock';
 * using atomic_set() to remove a TB from each CPU's local cache on
   TB invalidation.

Second, add some memory barriers to ensure we don't put a TB being
invalidated back into the CPU's 'tb_jmp_cache'. If we fail to look up a
TB in the CPU's local cache because it is being invalidated by some
other thread, then it must not be found in the shared TB hash table
either. Otherwise we'd put it back into the CPU's local cache.

Note that this patch does *not* make CPU's TLB invalidation safe if it
is done from some other thread while the CPU is in its execution loop.
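
As an illustration, here is a minimal sketch of the barrier pairing,
using the GCC/Clang __atomic builtins in place of QEMU's atomic_set(),
atomic_read() and smp_wmb()/smp_rmb() macros. The names and data
structures below are hypothetical stand-ins, not the QEMU API:

    typedef struct TB TB;

    static TB *shared_slot;        /* stand-in for the shared QHT entry */
    static TB *jmp_cache[4096];    /* stand-in for one CPU's tb_jmp_cache */

    /* Invalidation side, cf. tb_phys_invalidate(): remove the TB from
     * the shared table first, then make that removal visible before
     * clearing the per-CPU cache entry. */
    static void invalidate(TB *tb, unsigned h)
    {
        __atomic_store_n(&shared_slot, (TB *)0, __ATOMIC_RELAXED);
        __atomic_thread_fence(__ATOMIC_RELEASE);         /* ~ smp_wmb() */
        if (__atomic_load_n(&jmp_cache[h], __ATOMIC_RELAXED) == tb) {
            __atomic_store_n(&jmp_cache[h], (TB *)0, __ATOMIC_RELAXED);
        }
    }

    /* Lookup side, cf. tb_find_slow() after a jmp_cache miss: order the
     * miss before the shared-table probe, so seeing the cleared cache
     * entry implies seeing the shared-table removal as well. */
    static TB *lookup_slow(void)
    {
        __atomic_thread_fence(__ATOMIC_ACQUIRE);         /* ~ smp_rmb() */
        return __atomic_load_n(&shared_slot, __ATOMIC_RELAXED);
    }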

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
[AJB: fixed missing atomic set, tweak title]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Emilio G. Cota <cota@braap.org>

---
v1 (AJB):
  - tweak title
  - fixed missing set of tb_jmp_cache
v2
  - fix spelling s/con't/can't/
  - add atomic_read while clearing tb_jmp_cache
  - add r-b tags
---
 cpu-exec.c      | 9 +++++++--
 translate-all.c | 9 +++++++--
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index b840e1d..10ce1cb 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -285,6 +285,11 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
 {
     TranslationBlock *tb;
 
+    /* Ensure that we won't find a TB in the shared hash table
+     * if it is being invalidated by some other thread.
+     * Otherwise we'd put it back to CPU's local cache.
+     * Pairs with smp_wmb() in tb_phys_invalidate(). */
+    smp_rmb();
     tb = tb_find_physical(cpu, pc, cs_base, flags);
     if (tb) {
         goto found;
@@ -315,7 +320,7 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
 
 found:
     /* we add the TB in the virtual pc hash table */
-    cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
+    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
     return tb;
 }
 
@@ -333,7 +338,7 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
        is executed. */
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     tb_lock();
-    tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
+    tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
         tb = tb_find_slow(cpu, pc, cs_base, flags);
diff --git a/translate-all.c b/translate-all.c
index eaa95e4..96efe48 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1004,11 +1004,16 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
         invalidate_page_bitmap(p);
     }
 
+    /* Ensure that we won't find the TB in the shared hash table
+     * if we can't see it in CPU's local cache.
+     * Pairs with smp_rmb() in tb_find_slow(). */
+    smp_wmb();
+
     /* remove the TB from the hash list */
     h = tb_jmp_cache_hash_func(tb->pc);
     CPU_FOREACH(cpu) {
-        if (cpu->tb_jmp_cache[h] == tb) {
-            cpu->tb_jmp_cache[h] = NULL;
+        if (atomic_read(&cpu->tb_jmp_cache[h]) == tb) {
+            atomic_set(&cpu->tb_jmp_cache[h], NULL);
         }
     }
 
-- 
2.7.4

* [Qemu-devel] [PATCH v2 2/6] tcg: set up tb->page_addr before insertion
  2016-07-05 16:18 [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Alex Bennée
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock' Alex Bennée
@ 2016-07-05 16:18 ` Alex Bennée
  2016-07-07 14:08   ` Sergey Fedorov
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path Alex Bennée
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-07-05 16:18 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Alex Bennée, Peter Crosthwaite

This ensures that, if we find the TB on the slow path, tb->page_addr
is correctly set before being tested; otherwise a concurrent lookup
could see the TB in the hash table before tb->page_addr[] has been
initialised.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 translate-all.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 96efe48..97e834a 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1126,10 +1126,6 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
 {
     uint32_t h;
 
-    /* add in the hash table */
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
-
     /* add in the page list */
     tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
     if (phys_page2 != -1) {
@@ -1138,6 +1134,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
         tb->page_addr[1] = -1;
     }
 
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
+    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
+
 #ifdef DEBUG_TB_CHECK
     tb_page_check();
 #endif
-- 
2.7.4

* [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path
  2016-07-05 16:18 [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Alex Bennée
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock' Alex Bennée
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 2/6] tcg: set up tb->page_addr before insertion Alex Bennée
@ 2016-07-05 16:18 ` Alex Bennée
  2016-07-07 14:18   ` Sergey Fedorov
  2016-07-08 20:09   ` Sergey Fedorov
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 4/6] tcg: cpu-exec: factor out TB patching code Alex Bennée
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-05 16:18 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Alex Bennée, Peter Crosthwaite

Lock contention in the hot path of moving between existing patched
TranslationBlocks is the main drag in multithreaded performance. This
patch pushes the tb_lock() usage down to the two places that really need
it:

  - code generation (tb_gen_code)
  - jump patching (tb_add_jump)

The rest of the code doesn't really need to hold a lock, as it either
uses per-CPU structures, is atomically updated, or is designed to be
used in concurrent read situations (qht_lookup).

To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
locks become NOPs anyway until the MTTCG work is completed.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>

---
v3 (base-patches)
  - fix merge conflicts with Sergey's patch
v1 (hot path, split from base-patches series)
  - revert name tweaking
  - drop test jmp_list_next outside lock
  - mention lock NOPs in comments
v2 (hot path)
  - Add r-b tags
---
 cpu-exec.c | 48 +++++++++++++++++++++---------------------------
 1 file changed, 21 insertions(+), 27 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 10ce1cb..dd0bd50 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -291,35 +291,29 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
      * Pairs with smp_wmb() in tb_phys_invalidate(). */
     smp_rmb();
     tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (tb) {
-        goto found;
-    }
+    if (!tb) {
 
-#ifdef CONFIG_USER_ONLY
-    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
-     * taken outside tb_lock.  Since we're momentarily dropping
-     * tb_lock, there's a chance that our desired tb has been
-     * translated.
-     */
-    tb_unlock();
-    mmap_lock();
-    tb_lock();
-    tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (tb) {
-        mmap_unlock();
-        goto found;
-    }
-#endif
+        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
+         * taken outside tb_lock. As system emulation is currently
+         * single threaded the locks are NOPs.
+         */
+        mmap_lock();
+        tb_lock();
 
-    /* if no translated code available, then translate it now */
-    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+        /* There's a chance that our desired tb has been translated while
+         * taking the locks so we check again inside the lock.
+         */
+        tb = tb_find_physical(cpu, pc, cs_base, flags);
+        if (!tb) {
+            /* if no translated code available, then translate it now */
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+        }
 
-#ifdef CONFIG_USER_ONLY
-    mmap_unlock();
-#endif
+        tb_unlock();
+        mmap_unlock();
+    }
 
-found:
-    /* we add the TB in the virtual pc hash table */
+    /* We add the TB in the virtual pc hash table for the fast lookup */
     atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
     return tb;
 }
@@ -337,7 +331,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
        always be the same before a given translated block
        is executed. */
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb_lock();
     tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
@@ -361,9 +354,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
+        tb_lock();
         tb_add_jump(*last_tb, tb_exit, tb);
+        tb_unlock();
     }
-    tb_unlock();
     return tb;
 }
 
-- 
2.7.4

* [Qemu-devel] [PATCH v2 4/6] tcg: cpu-exec: factor out TB patching code
  2016-07-05 16:18 [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Alex Bennée
                   ` (2 preceding siblings ...)
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path Alex Bennée
@ 2016-07-05 16:18 ` Alex Bennée
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 5/6] tcg: introduce tb_lock_recursive() Alex Bennée
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-05 16:18 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Alex Bennée, Peter Crosthwaite

This is going to be useful in the next patch, when the locking on the
slow path is rolled up into the TB patching.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 cpu-exec.c | 59 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index dd0bd50..59c2ec5 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -278,6 +278,40 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
     return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
+/*
+ * Patch the last TB with a jump to the current TB.
+ *
+ * Modification of the TB has to be protected with tb_lock.
+ */
+static inline void maybe_patch_last_tb(CPUState *cpu,
+                                       TranslationBlock *tb,
+                                       TranslationBlock **last_tb,
+                                       int tb_exit)
+{
+    if (cpu->tb_flushed) {
+        /* Ensure that no TB jump will be modified as the
+         * translation buffer has been flushed.
+         */
+        *last_tb = NULL;
+        cpu->tb_flushed = false;
+    }
+#ifndef CONFIG_USER_ONLY
+    /* We don't take care of direct jumps when address mapping changes in
+     * system emulation. So it's not safe to make a direct jump to a TB
+     * spanning two pages because the mapping for the second page can change.
+     */
+    if (tb->page_addr[1] != -1) {
+        *last_tb = NULL;
+    }
+#endif
+    /* See if we can patch the calling TB. */
+    if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
+        tb_lock();
+        tb_add_jump(*last_tb, tb_exit, tb);
+        tb_unlock();
+    }
+}
+
 static TranslationBlock *tb_find_slow(CPUState *cpu,
                                       target_ulong pc,
                                       target_ulong cs_base,
@@ -336,28 +370,9 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
                  tb->flags != flags)) {
         tb = tb_find_slow(cpu, pc, cs_base, flags);
     }
-    if (cpu->tb_flushed) {
-        /* Ensure that no TB jump will be modified as the
-         * translation buffer has been flushed.
-         */
-        *last_tb = NULL;
-        cpu->tb_flushed = false;
-    }
-#ifndef CONFIG_USER_ONLY
-    /* We don't take care of direct jumps when address mapping changes in
-     * system emulation. So it's not safe to make a direct jump to a TB
-     * spanning two pages because the mapping for the second page can change.
-     */
-    if (tb->page_addr[1] != -1) {
-        *last_tb = NULL;
-    }
-#endif
-    /* See if we can patch the calling TB. */
-    if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        tb_lock();
-        tb_add_jump(*last_tb, tb_exit, tb);
-        tb_unlock();
-    }
+
+    maybe_patch_last_tb(cpu, tb, last_tb, tb_exit);
+
     return tb;
 }
 
-- 
2.7.4

* [Qemu-devel] [PATCH v2 5/6] tcg: introduce tb_lock_recursive()
  2016-07-05 16:18 [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Alex Bennée
                   ` (3 preceding siblings ...)
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 4/6] tcg: cpu-exec: factor out TB patching code Alex Bennée
@ 2016-07-05 16:18 ` Alex Bennée
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 6/6] tcg: cpu-exec: roll-up tb_find_fast/slow Alex Bennée
  2016-07-07 16:04 ` [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Emilio G. Cota
  6 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-05 16:18 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Alex Bennée, Peter Crosthwaite

In some cases it is useful to have a recursive lock that can be safely
taken when nested inside other critical regions. Rather than making all
tb_lock() calls recursive, a separate function serves to notify the
reader that recursion is explicitly a possibility in this case.
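
The intended call pattern is roughly as follows (a sketch; the real
user appears in patch 6): record whether this call actually took the
lock and only unlock if it did:

    #include <stdbool.h>
    #include "tcg/tcg.h"   /* tb_lock_recursive()/tb_unlock() */

    static void patch_tb_example(void)
    {
        bool locked = tb_lock_recursive(); /* false => lock already held */

        /* ... modify TB jump lists under tb_lock ... */

        if (locked) {
            tb_unlock();
        }
    }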

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 tcg/tcg.h       |  1 +
 translate-all.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 66d7fc0..62e4ced 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -647,6 +647,7 @@ void tcg_pool_reset(TCGContext *s);
 void tcg_pool_delete(TCGContext *s);
 
 void tb_lock(void);
+bool tb_lock_recursive(void);
 void tb_unlock(void);
 void tb_lock_reset(void);
 
diff --git a/translate-all.c b/translate-all.c
index 97e834a..414bc7e 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -134,6 +134,17 @@ void tb_lock(void)
 #endif
 }
 
+bool tb_lock_recursive(void)
+{
+#ifdef CONFIG_USER_ONLY
+    if (have_tb_lock) {
+        return false;
+    }
+    tb_lock();
+#endif
+    return true;
+}
+
 void tb_unlock(void)
 {
 #ifdef CONFIG_USER_ONLY
-- 
2.7.4

* [Qemu-devel] [PATCH v2 6/6] tcg: cpu-exec: roll-up tb_find_fast/slow
  2016-07-05 16:18 [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Alex Bennée
                   ` (4 preceding siblings ...)
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 5/6] tcg: introduce tb_lock_recursive() Alex Bennée
@ 2016-07-05 16:18 ` Alex Bennée
  2016-07-07 16:44   ` Sergey Fedorov
  2016-07-07 16:04 ` [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Emilio G. Cota
  6 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-07-05 16:18 UTC (permalink / raw)
  To: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Alex Bennée, Peter Crosthwaite

In an effort not to re-take locks, I've refactored the code so that
everything is done inside the new tb_find() function. The fast path
will only take a lock if the previous TB needs patching. The generation
path will do the patching inside the generation critical section.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 cpu-exec.c | 117 ++++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 66 insertions(+), 51 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 59c2ec5..529eae1 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -281,7 +281,8 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
 /*
  * Patch the last TB with a jump to the current TB.
  *
- * Modification of the TB has to be protected with tb_lock.
+ * Modification of the TB has to be protected with tb_lock which can
+ * either be already held or taken here.
  */
 static inline void maybe_patch_last_tb(CPUState *cpu,
                                        TranslationBlock *tb,
@@ -306,72 +307,86 @@ static inline void maybe_patch_last_tb(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        tb_lock();
+        bool locked = tb_lock_recursive();
         tb_add_jump(*last_tb, tb_exit, tb);
-        tb_unlock();
-    }
-}
-
-static TranslationBlock *tb_find_slow(CPUState *cpu,
-                                      target_ulong pc,
-                                      target_ulong cs_base,
-                                      uint32_t flags)
-{
-    TranslationBlock *tb;
-
-    /* Ensure that we won't find a TB in the shared hash table
-     * if it is being invalidated by some other thread.
-     * Otherwise we'd put it back to CPU's local cache.
-     * Pairs with smp_wmb() in tb_phys_invalidate(). */
-    smp_rmb();
-    tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (!tb) {
-
-        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
-         * taken outside tb_lock. As system emulation is currently
-         * single threaded the locks are NOPs.
-         */
-        mmap_lock();
-        tb_lock();
-
-        /* There's a chance that our desired tb has been translated while
-         * taking the locks so we check again inside the lock.
-         */
-        tb = tb_find_physical(cpu, pc, cs_base, flags);
-        if (!tb) {
-            /* if no translated code available, then translate it now */
-            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+        if (locked) {
+            tb_unlock();
         }
-
-        tb_unlock();
-        mmap_unlock();
     }
-
-    /* We add the TB in the virtual pc hash table for the fast lookup */
-    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
-    return tb;
 }
 
-static inline TranslationBlock *tb_find_fast(CPUState *cpu,
-                                             TranslationBlock **last_tb,
-                                             int tb_exit)
+/*
+ * tb_find - find next TB, possibly generating it
+ *
+ * There is a multi-level lookup for finding the next TB which avoids
+ * locks unless generation is required.
+ *
+ * 1. Lookup via the per-vcpu tb_jmp_cache
+ * 2. Lookup via tb_find_physical (using QHT)
+ *
+ * If both of those fail then we need to grab the mmap_lock and
+ * tb_lock and do code generation.
+ *
+ * As the jump patching of code also needs to be protected by locks we
+ * have multiple paths into maybe_patch_last_tb taking advantage of
+ * the fact we may already have locks held for code generation.
+ */
+static TranslationBlock *tb_find(CPUState *cpu,
+                                 TranslationBlock **last_tb,
+                                 int tb_exit)
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     target_ulong cs_base, pc;
+    unsigned int h;
     uint32_t flags;
 
     /* we record a subset of the CPU state. It will
        always be the same before a given translated block
        is executed. */
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
+    h = tb_jmp_cache_hash_func(pc);
+    tb = atomic_read(&cpu->tb_jmp_cache[h]);
+
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
-        tb = tb_find_slow(cpu, pc, cs_base, flags);
-    }
 
-    maybe_patch_last_tb(cpu, tb, last_tb, tb_exit);
+        /* Ensure that we won't find a TB in the shared hash table
+         * if it is being invalidated by some other thread.
+         * Otherwise we'd put it back to CPU's local cache.
+         * Pairs with smp_wmb() in tb_phys_invalidate(). */
+        smp_rmb();
+        tb = tb_find_physical(cpu, pc, cs_base, flags);
+
+        if (!tb) {
+            /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
+             * taken outside tb_lock. As system emulation is currently
+             * single threaded the locks are NOPs.
+             */
+            mmap_lock();
+            tb_lock();
+
+            /* There's a chance that our desired tb has been translated while
+             * taking the locks so we check again inside the lock.
+             */
+            tb = tb_find_physical(cpu, pc, cs_base, flags);
+            if (!tb) {
+                /* if no translated code available, then translate it now */
+                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+            }
+            maybe_patch_last_tb(cpu, tb, last_tb, tb_exit);
+
+            tb_unlock();
+            mmap_unlock();
+        } else {
+            maybe_patch_last_tb(cpu, tb, last_tb, tb_exit);
+        }
+
+        /* We update the TB in the virtual pc hash table for the fast lookup */
+        atomic_set(&cpu->tb_jmp_cache[h], tb);
+    } else {
+        maybe_patch_last_tb(cpu, tb, last_tb, tb_exit);
+    }
 
     return tb;
 }
@@ -452,7 +467,7 @@ static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
                && cpu->icount_decr.u16.low + cpu->icount_extra == 0) {
         /* try to cause an exception pending in the log */
         TranslationBlock *last_tb = NULL; /* Avoid chaining TBs */
-        cpu_exec_nocache(cpu, 1, tb_find_fast(cpu, &last_tb, 0), true);
+        cpu_exec_nocache(cpu, 1, tb_find(cpu, &last_tb, 0), true);
         *ret = -1;
         return true;
 #endif
@@ -636,7 +651,7 @@ int cpu_exec(CPUState *cpu)
             cpu->tb_flushed = false; /* reset before first TB lookup */
             for(;;) {
                 cpu_handle_interrupt(cpu, &last_tb);
-                tb = tb_find_fast(cpu, &last_tb, tb_exit);
+                tb = tb_find(cpu, &last_tb, tb_exit);
                 cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit, &sc);
                 /* Try to align the host and virtual clocks
                    if the guest is in advance */
-- 
2.7.4

* Re: [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock' Alex Bennée
@ 2016-07-07 13:52   ` Sergey Fedorov
  2016-07-08 14:51   ` Sergey Fedorov
  1 sibling, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 13:52 UTC (permalink / raw)
  To: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Sergey Fedorov, Peter Crosthwaite

I was not sure if the language I used in the source code comments is
100% correct, so it would be great if someone could check that it is
easy to understand ;)

Thanks,
Sergey

On 05/07/16 19:18, Alex Bennée wrote:
> From: Sergey Fedorov <serge.fdrv@gmail.com>
>
> First, ensure atomicity of the CPU's 'tb_jmp_cache' accesses by:
>  * using atomic_read() to look up a TB when not holding 'tb_lock';
>  * using atomic_set() to remove a TB from each CPU's local cache on
>    TB invalidation.
>
> Second, add some memory barriers to ensure we don't put a TB being
> invalidated back into the CPU's 'tb_jmp_cache'. If we fail to look up a
> TB in the CPU's local cache because it is being invalidated by some
> other thread, then it must not be found in the shared TB hash table
> either. Otherwise we'd put it back into the CPU's local cache.
>
> Note that this patch does *not* make CPU's TLB invalidation safe if it
> is done from some other thread while the CPU is in its execution loop.
>
> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
> [AJB: fixed missing atomic set, tweak title]
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Emilio G. Cota <cota@braap.org>
>
> ---
> v1 (AJB):
>   - tweak title
>   - fixed missing set of tb_jmp_cache
> v2
>   - fix spelling s/con't/can't/
>   - add atomic_read while clearing tb_jmp_cache
>   - add r-b tags
> ---
>  cpu-exec.c      | 9 +++++++--
>  translate-all.c | 9 +++++++--
>  2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index b840e1d..10ce1cb 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -285,6 +285,11 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>  {
>      TranslationBlock *tb;
>  
> +    /* Ensure that we won't find a TB in the shared hash table
> +     * if it is being invalidated by some other thread.
> +     * Otherwise we'd put it back to CPU's local cache.
> +     * Pairs with smp_wmb() in tb_phys_invalidate(). */
> +    smp_rmb();
>      tb = tb_find_physical(cpu, pc, cs_base, flags);
>      if (tb) {
>          goto found;
> @@ -315,7 +320,7 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>  
>  found:
>      /* we add the TB in the virtual pc hash table */
> -    cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
> +    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>      return tb;
>  }
>  
> @@ -333,7 +338,7 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>         is executed. */
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>      tb_lock();
> -    tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
> +    tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
>          tb = tb_find_slow(cpu, pc, cs_base, flags);
> diff --git a/translate-all.c b/translate-all.c
> index eaa95e4..96efe48 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1004,11 +1004,16 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>          invalidate_page_bitmap(p);
>      }
>  
> +    /* Ensure that we won't find the TB in the shared hash table
> +     * if we can't see it in CPU's local cache.
> +     * Pairs with smp_rmb() in tb_find_slow(). */
> +    smp_wmb();
> +
>      /* remove the TB from the hash list */
>      h = tb_jmp_cache_hash_func(tb->pc);
>      CPU_FOREACH(cpu) {
> -        if (cpu->tb_jmp_cache[h] == tb) {
> -            cpu->tb_jmp_cache[h] = NULL;
> +        if (atomic_read(&cpu->tb_jmp_cache[h]) == tb) {
> +            atomic_set(&cpu->tb_jmp_cache[h], NULL);
>          }
>      }
>  

* Re: [Qemu-devel] [PATCH v2 2/6] tcg: set up tb->page_addr before insertion
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 2/6] tcg: set up tb->page_addr before insertion Alex Bennée
@ 2016-07-07 14:08   ` Sergey Fedorov
  2016-07-08  9:40     ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 14:08 UTC (permalink / raw)
  To: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite

On 05/07/16 19:18, Alex Bennée wrote:
> This ensures that, if we find the TB on the slow path, tb->page_addr
> is correctly set before being tested; otherwise a concurrent lookup
> could see the TB in the hash table before tb->page_addr[] has been
> initialised.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  translate-all.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/translate-all.c b/translate-all.c
> index 96efe48..97e834a 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1126,10 +1126,6 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>  {
>      uint32_t h;
>  
> -    /* add in the hash table */
> -    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
> -
>      /* add in the page list */
>      tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
>      if (phys_page2 != -1) {
> @@ -1138,6 +1134,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>          tb->page_addr[1] = -1;
>      }
>  
> +    /* add in the hash table */
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> +    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
> +
>  #ifdef DEBUG_TB_CHECK
>      tb_page_check();
>  #endif

* Re: [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path Alex Bennée
@ 2016-07-07 14:18   ` Sergey Fedorov
  2016-07-08 15:50     ` Sergey Fedorov
  2016-07-08 17:34     ` Sergey Fedorov
  2016-07-08 20:09   ` Sergey Fedorov
  1 sibling, 2 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 14:18 UTC (permalink / raw)
  To: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite

On 05/07/16 19:18, Alex Bennée wrote:
> Lock contention in the hot path of moving between existing patched
> TranslationBlocks is the main drag in multithreaded performance. This
> patch pushes the tb_lock() usage down to the two places that really need
> it:
>
>   - code generation (tb_gen_code)
>   - jump patching (tb_add_jump)
>
> The rest of the code doesn't really need to hold a lock, as it either
> uses per-CPU structures, is atomically updated, or is designed to be
> used in concurrent read situations (qht_lookup).
>
> To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
> locks become NOPs anyway until the MTTCG work is completed.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

>
> ---
> v3 (base-patches)
>   - fix merge conflicts with Sergey's patch
> v1 (hot path, split from base-patches series)
>   - revert name tweaking
>   - drop test jmp_list_next outside lock
>   - mention lock NOPs in comments
> v2 (hot path)
>   - Add r-b tags
> ---
>  cpu-exec.c | 48 +++++++++++++++++++++---------------------------
>  1 file changed, 21 insertions(+), 27 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 10ce1cb..dd0bd50 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -291,35 +291,29 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>       * Pairs with smp_wmb() in tb_phys_invalidate(). */
>      smp_rmb();
>      tb = tb_find_physical(cpu, pc, cs_base, flags);
> -    if (tb) {
> -        goto found;
> -    }
> +    if (!tb) {
>  
> -#ifdef CONFIG_USER_ONLY
> -    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
> -     * taken outside tb_lock.  Since we're momentarily dropping
> -     * tb_lock, there's a chance that our desired tb has been
> -     * translated.
> -     */
> -    tb_unlock();
> -    mmap_lock();
> -    tb_lock();
> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
> -    if (tb) {
> -        mmap_unlock();
> -        goto found;
> -    }
> -#endif
> +        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
> +         * taken outside tb_lock. As system emulation is currently
> +         * single threaded the locks are NOPs.
> +         */
> +        mmap_lock();
> +        tb_lock();
>  
> -    /* if no translated code available, then translate it now */
> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> +        /* There's a chance that our desired tb has been translated while
> +         * taking the locks so we check again inside the lock.
> +         */
> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
> +        if (!tb) {
> +            /* if no translated code available, then translate it now */
> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> +        }
>  
> -#ifdef CONFIG_USER_ONLY
> -    mmap_unlock();
> -#endif
> +        tb_unlock();
> +        mmap_unlock();
> +    }
>  
> -found:
> -    /* we add the TB in the virtual pc hash table */
> +    /* We add the TB in the virtual pc hash table for the fast lookup */
>      atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>      return tb;
>  }
> @@ -337,7 +331,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>         always be the same before a given translated block
>         is executed. */
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> -    tb_lock();
>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
> @@ -361,9 +354,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> +        tb_lock();
>          tb_add_jump(*last_tb, tb_exit, tb);
> +        tb_unlock();
>      }
> -    tb_unlock();
>      return tb;
>  }
>  

* Re: [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path
  2016-07-05 16:18 [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Alex Bennée
                   ` (5 preceding siblings ...)
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 6/6] tcg: cpu-exec: roll-up tb_find_fast/slow Alex Bennée
@ 2016-07-07 16:04 ` Emilio G. Cota
  2016-07-07 16:13   ` Paolo Bonzini
  2016-07-07 19:38   ` Alex Bennée
  6 siblings, 2 replies; 41+ messages in thread
From: Emilio G. Cota @ 2016-07-07 16:04 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, bobby.prani,
	rth, mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana

On Tue, Jul 05, 2016 at 17:18:10 +0100, Alex Bennée wrote:
> Well, this is the first re-spin of the series posted last week. I've
> added a bunch of additional patches to be more aggressive about
> avoiding bouncing locks but, to be honest, the numbers don't seem to
> make it worth it.

How many threads are you using? With just a few threads I wouldn't
expect a measurable difference.

> I think the first 3 patches are ready to take if the TCG maintainers
> want to:
> 
>     tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
>     tcg: set up tb->page_addr before insertion
>     tcg: cpu-exec: remove tb_lock from the hot-path

I think it would be simpler to use tb_lock_recursive and
tb_lock_reset, as pointed out in v1 of this series.

Thanks,

		Emilio

* Re: [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path
  2016-07-07 16:04 ` [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Emilio G. Cota
@ 2016-07-07 16:13   ` Paolo Bonzini
  2016-07-07 19:33     ` Alex Bennée
  2016-07-07 19:38   ` Alex Bennée
  1 sibling, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2016-07-07 16:13 UTC (permalink / raw)
  To: Emilio G. Cota, Alex Bennée
  Cc: mttcg, peter.maydell, claudio.fontana, jan.kiszka, mark.burton,
	a.rigo, qemu-devel, serge.fdrv, bobby.prani, rth, fred.konrad



On 07/07/2016 18:04, Emilio G. Cota wrote:
>> > I think the first 3 patches are ready to take if the TCG maintainers
>> > want to:
>> > 
>> >     tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
>> >     tcg: set up tb->page_addr before insertion
>> >     tcg: cpu-exec: remove tb_lock from the hot-path
> I think it would be simpler to use tb_lock_recursive and
> tb_lock_reset, as pointed out in v1 of this series.

I agree, this series is doing a lot more restructuring than necessary.

Paolo

* Re: [Qemu-devel] [PATCH v2 6/6] tcg: cpu-exec: roll-up tb_find_fast/slow
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 6/6] tcg: cpu-exec: roll-up tb_find_fast/slow Alex Bennée
@ 2016-07-07 16:44   ` Sergey Fedorov
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 1/3] tcg: Introduce mmap_lock_reset() Sergey Fedorov
                       ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 16:44 UTC (permalink / raw)
  To: qemu-devel, Alex Bennée
  Cc: patches, Sergey Fedorov, mttcg, fred.konrad, a.rigo, cota,
	bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite

From: Sergey Fedorov <serge.fdrv@gmail.com>

Here is my attempt ;-)

Kind regards,
Sergey


Sergey Fedorov (3):
  tcg: Introduce mmap_lock_reset()
  tcg: Introduce tb_lock_locked()
  tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()

 bsd-user/mmap.c         |  7 +++++++
 cpu-exec.c              | 15 +++++++++------
 include/exec/exec-all.h |  2 ++
 linux-user/mmap.c       |  7 +++++++
 tcg/tcg.h               |  1 +
 translate-all.c         |  9 +++++++++
 6 files changed, 35 insertions(+), 6 deletions(-)

--
1.9.1

* [Qemu-devel] [PATCH 1/3] tcg: Introduce mmap_lock_reset()
  2016-07-07 16:44   ` Sergey Fedorov
@ 2016-07-07 16:44     ` Sergey Fedorov
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 2/3] tcg: Introduce tb_lock_locked() Sergey Fedorov
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump() Sergey Fedorov
  2 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 16:44 UTC (permalink / raw)
  To: qemu-devel, Alex Bennée
  Cc: patches, Sergey Fedorov, mttcg, fred.konrad, a.rigo, cota,
	bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite,
	Sergey Fedorov, Riku Voipio

From: Sergey Fedorov <serge.fdrv@gmail.com>

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
---
 bsd-user/mmap.c         | 7 +++++++
 include/exec/exec-all.h | 2 ++
 linux-user/mmap.c       | 7 +++++++
 3 files changed, 16 insertions(+)

diff --git a/bsd-user/mmap.c b/bsd-user/mmap.c
index 610f91b28584..b2e5909b03c4 100644
--- a/bsd-user/mmap.c
+++ b/bsd-user/mmap.c
@@ -42,6 +42,13 @@ void mmap_unlock(void)
     }
 }
 
+void mmap_lock_reset(void)
+{
+    if (mmap_lock_count) {
+        pthread_mutex_unlock(&mmap_mutex);
+    }
+}
+
 /* Grab lock to make sure things are in a consistent state after fork().  */
 void mmap_fork_start(void)
 {
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index c1f59fa59d2c..0375acb5ab40 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -369,6 +369,7 @@ void tlb_fill(CPUState *cpu, target_ulong addr, int is_write, int mmu_idx,
 #if defined(CONFIG_USER_ONLY)
 void mmap_lock(void);
 void mmap_unlock(void);
+void mmap_lock_reset(void);
 
 static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
 {
@@ -377,6 +378,7 @@ static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong
 #else
 static inline void mmap_lock(void) {}
 static inline void mmap_unlock(void) {}
+static inline void mmap_lock_reset(void) {}
 
 /* cputlb.c */
 tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr);
diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index c4371d943a85..4cd998f96766 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -43,6 +43,13 @@ void mmap_unlock(void)
     }
 }
 
+void mmap_lock_reset(void)
+{
+    if (mmap_lock_count) {
+        pthread_mutex_unlock(&mmap_mutex);
+    }
+}
+
 /* Grab lock to make sure things are in a consistent state after fork().  */
 void mmap_fork_start(void)
 {
-- 
1.9.1

* [Qemu-devel] [PATCH 2/3] tcg: Introduce tb_lock_locked()
  2016-07-07 16:44   ` Sergey Fedorov
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 1/3] tcg: Introduce mmap_lock_reset() Sergey Fedorov
@ 2016-07-07 16:44     ` Sergey Fedorov
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump() Sergey Fedorov
  2 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 16:44 UTC (permalink / raw)
  To: qemu-devel, Alex Bennée
  Cc: patches, Sergey Fedorov, mttcg, fred.konrad, a.rigo, cota,
	bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite,
	Sergey Fedorov

From: Sergey Fedorov <serge.fdrv@gmail.com>

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
---
 tcg/tcg.h       | 1 +
 translate-all.c | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 66d7fc01c53c..ca9329d99f90 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -648,6 +648,7 @@ void tcg_pool_delete(TCGContext *s);
 
 void tb_lock(void);
 void tb_unlock(void);
+bool tb_lock_locked(void);
 void tb_lock_reset(void);
 
 static inline void *tcg_malloc(int size)
diff --git a/translate-all.c b/translate-all.c
index 97e834a84503..0dfcf9bc3430 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -143,6 +143,15 @@ void tb_unlock(void)
 #endif
 }
 
+bool tb_lock_locked(void)
+{
+#ifdef CONFIG_USER_ONLY
+    return have_tb_lock;
+#else
+    return true;
+#endif
+}
+
 void tb_lock_reset(void)
 {
 #ifdef CONFIG_USER_ONLY
-- 
1.9.1

* [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-07 16:44   ` Sergey Fedorov
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 1/3] tcg: Introduce mmap_lock_reset() Sergey Fedorov
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 2/3] tcg: Introduce tb_lock_locked() Sergey Fedorov
@ 2016-07-07 16:44     ` Sergey Fedorov
  2016-07-07 19:36       ` Alex Bennée
  2016-07-08  8:40       ` Paolo Bonzini
  2 siblings, 2 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 16:44 UTC (permalink / raw)
  To: qemu-devel, Alex Bennée
  Cc: patches, Sergey Fedorov, mttcg, fred.konrad, a.rigo, cota,
	bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite,
	Sergey Fedorov

From: Sergey Fedorov <serge.fdrv@gmail.com>

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
---
 cpu-exec.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index dd0bd5007701..54c935039592 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
 
         /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
          * taken outside tb_lock. As system emulation is currently
-         * single threaded the locks are NOPs.
+         * single threaded the locks are NOPs. Both locks are to be
+         * released at the end of tb_find_fast().
          */
         mmap_lock();
         tb_lock();
@@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
             /* if no translated code available, then translate it now */
             tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
         }
-
-        tb_unlock();
-        mmap_unlock();
     }
 
     /* We add the TB in the virtual pc hash table for the fast lookup */
@@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        tb_lock();
+        if (!tb_lock_locked()) {
+            tb_lock();
+        }
         tb_add_jump(*last_tb, tb_exit, tb);
-        tb_unlock();
     }
+
+    tb_lock_reset();
+    mmap_lock_reset();
+
     return tb;
 }
 
-- 
1.9.1

* Re: [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path
  2016-07-07 16:13   ` Paolo Bonzini
@ 2016-07-07 19:33     ` Alex Bennée
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-07 19:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Emilio G. Cota, mttcg, peter.maydell, claudio.fontana,
	jan.kiszka, mark.burton, a.rigo, qemu-devel, serge.fdrv,
	bobby.prani, rth, fred.konrad


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 07/07/2016 18:04, Emilio G. Cota wrote:
>>> > I think the first 3 patches are ready to take if the TCG maintainers
>>> > want to:
>>> >
>>> >     tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
>>> >     tcg: set up tb->page_addr before insertion
>>> >     tcg: cpu-exec: remove tb_lock from the hot-path
>> I think it would be simpler to use tb_lock_recursive and
>> tb_lock_reset, as pointed out in v1 of this series.
>
> I agree, this series is doing a lot more restructuring than necessary.

Really I thought the first 3 patches were simplifying things - that's
why I moved the more aggressive stuff into follow-on patches. Wouldn't
using tb_lock_recursive inside the mmap_lock() run the risk of a deadlock?

>
> Paolo


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump() Sergey Fedorov
@ 2016-07-07 19:36       ` Alex Bennée
  2016-07-07 19:46         ` Sergey Fedorov
  2016-07-08  8:40       ` Paolo Bonzini
  1 sibling, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-07-07 19:36 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: qemu-devel, patches, Sergey Fedorov, mttcg, fred.konrad, a.rigo,
	cota, bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite


Sergey Fedorov <sergey.fedorov@linaro.org> writes:

> From: Sergey Fedorov <serge.fdrv@gmail.com>
>
> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
> ---
>  cpu-exec.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index dd0bd5007701..54c935039592 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>
>          /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>           * taken outside tb_lock. As system emulation is currently
> -         * single threaded the locks are NOPs.
> +         * single threaded the locks are NOPs. Both locks are to be
> +         * released at the end of tb_find_fast().
>           */
>          mmap_lock();
>          tb_lock();
> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>              /* if no translated code available, then translate it now */
>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>          }
> -
> -        tb_unlock();
> -        mmap_unlock();

Hmm, pushing these outside of tb_find_slow() makes me uncomfortable. I
guess tb_find_fast/slow are intimately tied together, but the idea of
taking locks which it is the calling function's responsibility to
release seems ugly to me.

>      }
>
>      /* We add the TB in the virtual pc hash table for the fast lookup */
> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> -        tb_lock();
> +        if (!tb_lock_locked()) {
> +            tb_lock();
> +        }
>          tb_add_jump(*last_tb, tb_exit, tb);
> -        tb_unlock();
>      }
> +
> +    tb_lock_reset();
> +    mmap_lock_reset();
> +
>      return tb;
>  }


--
Alex Bennée

* Re: [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path
  2016-07-07 16:04 ` [Qemu-devel] [PATCH v2 0/6] Reduce lock contention on TCG hot-path Emilio G. Cota
  2016-07-07 16:13   ` Paolo Bonzini
@ 2016-07-07 19:38   ` Alex Bennée
  1 sibling, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-07 19:38 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, serge.fdrv, bobby.prani,
	rth, mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana


Emilio G. Cota <cota@braap.org> writes:

> On Tue, Jul 05, 2016 at 17:18:10 +0100, Alex Bennée wrote:
>> Well, this is the first re-spin of the series posted last week. I've
>> added a bunch of additional patches to be more aggressive about
>> avoiding bouncing locks but, to be honest, the numbers don't seem to
>> make it worth it.
>
> How many threads are you using? With just a few threads I wouldn't
> expect a measurable difference.

My setup is 8 "cores" which pigz expands to use but I know you have
beefier machines on hand ;-)

>
>> I think the first 3 patches are ready to take if the TCG maintainers
>> want to:
>>
>>     tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
>>     tcg: set up tb->page_addr before insertion
>>     tcg: cpu-exec: remove tb_lock from the hot-path
>
> I think it would be simpler to use tb_lock_recursive and
> tb_lock_reset, as pointed out in v1 of this series.

I didn't realise people were suggesting asymmetric lock taking/resetting.
It seems ugly IMHO.

>
> Thanks,
>
> 		Emilio


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-07 19:36       ` Alex Bennée
@ 2016-07-07 19:46         ` Sergey Fedorov
  2016-07-07 20:36           ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 19:46 UTC (permalink / raw)
  To: Alex Bennée, Sergey Fedorov
  Cc: qemu-devel, patches, mttcg, fred.konrad, a.rigo, cota,
	bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite

On 07/07/16 22:36, Alex Bennée wrote:
> Sergey Fedorov <sergey.fedorov@linaro.org> writes:
>
>> From: Sergey Fedorov <serge.fdrv@gmail.com>
>>
>> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
>> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
>> ---
>>  cpu-exec.c | 15 +++++++++------
>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index dd0bd5007701..54c935039592 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>
>>          /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>>           * taken outside tb_lock. As system emulation is currently
>> -         * single threaded the locks are NOPs.
>> +         * single threaded the locks are NOPs. Both locks are to be
>> +         * released at the end of tb_find_fast().
>>           */
>>          mmap_lock();
>>          tb_lock();
>> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>              /* if no translated code available, then translate it now */
>>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>>          }
>> -
>> -        tb_unlock();
>> -        mmap_unlock();
> Hmm, pushing these outside of tb_find_slow() makes me uncomfortable. I
> guess tb_find_fast/slow are intimately tied together, but the idea of
> taking locks which the calling function is then responsible for
> releasing seems ugly to me.

Okay, what if we also:

diff --git a/cpu-exec.c b/cpu-exec.c
index 54c935039592..ff8f92bc1dc1 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -278,45 +278,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
     return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
-static TranslationBlock *tb_find_slow(CPUState *cpu,
-                                      target_ulong pc,
-                                      target_ulong cs_base,
-                                      uint32_t flags)
-{
-    TranslationBlock *tb;
-
-    /* Ensure that we won't find a TB in the shared hash table
-     * if it is being invalidated by some other thread.
-     * Otherwise we'd put it back to CPU's local cache.
-     * Pairs with smp_wmb() in tb_phys_invalidate(). */
-    smp_rmb();
-    tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (!tb) {
-
-        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
-         * taken outside tb_lock. As system emulation is currently
-         * single threaded the locks are NOPs. Both locks are to be
-         * released at the end of tb_find_fast().
-         */
-        mmap_lock();
-        tb_lock();
-
-        /* There's a chance that our desired tb has been translated while
-         * taking the locks so we check again inside the lock.
-         */
-        tb = tb_find_physical(cpu, pc, cs_base, flags);
-        if (!tb) {
-            /* if no translated code available, then translate it now */
-            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
-        }
-    }
-
-    /* We add the TB in the virtual pc hash table for the fast lookup */
-    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
-    return tb;
-}
-
-static inline TranslationBlock *tb_find_fast(CPUState *cpu,
+static inline TranslationBlock *tb_find(CPUState *cpu,
                                              TranslationBlock **last_tb,
                                              int tb_exit)
 {
@@ -332,7 +294,34 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
     tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
-        tb = tb_find_slow(cpu, pc, cs_base, flags);
+        /* Ensure that we won't find a TB in the shared hash table
+         * if it is being invalidated by some other thread.
+         * Otherwise we'd put it back to CPU's local cache.
+         * Pairs with smp_wmb() in tb_phys_invalidate(). */
+        smp_rmb();
+        tb = tb_find_physical(cpu, pc, cs_base, flags);
+        if (!tb) {
+
+            /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
+             * taken outside tb_lock. As system emulation is currently
+             * single threaded the locks are NOPs. Both locks are to be
+             * released at the end of the function.
+             */
+            mmap_lock();
+            tb_lock();
+
+            /* There's a chance that our desired tb has been translated while
+             * taking the locks so we check again inside the lock.
+             */
+            tb = tb_find_physical(cpu, pc, cs_base, flags);
+            if (!tb) {
+                /* if no translated code available, then translate it now */
+                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+            }
+        }
+
+        /* We add the TB in the virtual pc hash table for the fast lookup */
+        atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
     }
     if (cpu->tb_flushed) {
         /* Ensure that no TB jump will be modified as the
@@ -440,7 +429,7 @@ static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
                && cpu->icount_decr.u16.low + cpu->icount_extra == 0) {
         /* try to cause an exception pending in the log */
         TranslationBlock *last_tb = NULL; /* Avoid chaining TBs */
-        cpu_exec_nocache(cpu, 1, tb_find_fast(cpu, &last_tb, 0), true);
+        cpu_exec_nocache(cpu, 1, tb_find(cpu, &last_tb, 0), true);
         *ret = -1;
         return true;
 #endif
@@ -624,7 +613,7 @@ int cpu_exec(CPUState *cpu)
             cpu->tb_flushed = false; /* reset before first TB lookup */
             for(;;) {
                 cpu_handle_interrupt(cpu, &last_tb);
-                tb = tb_find_fast(cpu, &last_tb, tb_exit);
+                tb = tb_find(cpu, &last_tb, tb_exit);
                 cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit, &sc);
                 /* Try to align the host and virtual clocks
                    if the guest is in advance */



Kind regards,
Sergey


>
>>      }
>>
>>      /* We add the TB in the virtual pc hash table for the fast lookup */
>> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>  #endif
>>      /* See if we can patch the calling TB. */
>>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
>> -        tb_lock();
>> +        if (!tb_lock_locked()) {
>> +            tb_lock();
>> +        }
>>          tb_add_jump(*last_tb, tb_exit, tb);
>> -        tb_unlock();
>>      }
>> +
>> +    tb_lock_reset();
>> +    mmap_lock_reset();
>> +
>>      return tb;
>>  }
>
> --
> Alex Bennée


* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-07 19:46         ` Sergey Fedorov
@ 2016-07-07 20:36           ` Sergey Fedorov
  2016-07-07 21:40             ` Alex Bennée
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-07 20:36 UTC (permalink / raw)
  To: Alex Bennée, Sergey Fedorov
  Cc: qemu-devel, patches, mttcg, fred.konrad, a.rigo, cota,
	bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite

On 07/07/16 22:46, Sergey Fedorov wrote:
> On 07/07/16 22:36, Alex Bennée wrote:
>> Sergey Fedorov <sergey.fedorov@linaro.org> writes:
>>
>>> From: Sergey Fedorov <serge.fdrv@gmail.com>
>>>
>>> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
>>> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
>>> ---
>>>  cpu-exec.c | 15 +++++++++------
>>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/cpu-exec.c b/cpu-exec.c
>>> index dd0bd5007701..54c935039592 100644
>>> --- a/cpu-exec.c
>>> +++ b/cpu-exec.c
>>> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>>
>>>          /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>>>           * taken outside tb_lock. As system emulation is currently
>>> -         * single threaded the locks are NOPs.
>>> +         * single threaded the locks are NOPs. Both locks are to be
>>> +         * released at the end of tb_find_fast().
>>>           */
>>>          mmap_lock();
>>>          tb_lock();
>>> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>>              /* if no translated code available, then translate it now */
>>>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>>>          }
>>> -
>>> -        tb_unlock();
>>> -        mmap_unlock();
>> Hmm, pushing these outside of tb_find_slow() makes me uncomfortable. I
>> guess tb_find_fast/slow are intimately tied together, but the idea of
>> taking locks which the calling function is then responsible for
>> releasing seems ugly to me.
> Okay, what if we also:
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 54c935039592..ff8f92bc1dc1 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -278,45 +278,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
>      return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
>  }
>  
> -static TranslationBlock *tb_find_slow(CPUState *cpu,
> -                                      target_ulong pc,
> -                                      target_ulong cs_base,
> -                                      uint32_t flags)
> -{
> -    TranslationBlock *tb;
> -
> -    /* Ensure that we won't find a TB in the shared hash table
> -     * if it is being invalidated by some other thread.
> -     * Otherwise we'd put it back to CPU's local cache.
> -     * Pairs with smp_wmb() in tb_phys_invalidate(). */
> -    smp_rmb();
> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
> -    if (!tb) {
> -
> -        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
> -         * taken outside tb_lock. As system emulation is currently
> -         * single threaded the locks are NOPs. Both locks are to be
> -         * released at the end of tb_find_fast().
> -         */
> -        mmap_lock();
> -        tb_lock();
> -
> -        /* There's a chance that our desired tb has been translated while
> -         * taking the locks so we check again inside the lock.
> -         */
> -        tb = tb_find_physical(cpu, pc, cs_base, flags);
> -        if (!tb) {
> -            /* if no translated code available, then translate it now */
> -            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> -        }
> -    }
> -
> -    /* We add the TB in the virtual pc hash table for the fast lookup */
> -    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
> -    return tb;
> -}
> -
> -static inline TranslationBlock *tb_find_fast(CPUState *cpu,
> +static inline TranslationBlock *tb_find(CPUState *cpu,
>                                               TranslationBlock **last_tb,
>                                               int tb_exit)
>  {
> @@ -332,7 +294,34 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
> -        tb = tb_find_slow(cpu, pc, cs_base, flags);
> +        /* Ensure that we won't find a TB in the shared hash table
> +         * if it is being invalidated by some other thread.
> +         * Otherwise we'd put it back to CPU's local cache.
> +         * Pairs with smp_wmb() in tb_phys_invalidate(). */
> +        smp_rmb();
> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
> +        if (!tb) {
> +
> +            /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
> +             * taken outside tb_lock. As system emulation is currently
> +             * single threaded the locks are NOPs. Both locks are to be
> +             * released at the end of the function.
> +             */
> +            mmap_lock();
> +            tb_lock();
> +
> +            /* There's a chance that our desired tb has been translated while
> +             * taking the locks so we check again inside the lock.
> +             */
> +            tb = tb_find_physical(cpu, pc, cs_base, flags);
> +            if (!tb) {
> +                /* if no translated code available, then translate it now */
> +                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> +            }
> +        }
> +
> +        /* We add the TB in the virtual pc hash table for the fast lookup */
> +        atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>      }
>      if (cpu->tb_flushed) {
>          /* Ensure that no TB jump will be modified as the
> @@ -440,7 +429,7 @@ static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
>                 && cpu->icount_decr.u16.low + cpu->icount_extra == 0) {
>          /* try to cause an exception pending in the log */
>          TranslationBlock *last_tb = NULL; /* Avoid chaining TBs */
> -        cpu_exec_nocache(cpu, 1, tb_find_fast(cpu, &last_tb, 0), true);
> +        cpu_exec_nocache(cpu, 1, tb_find(cpu, &last_tb, 0), true);
>          *ret = -1;
>          return true;
>  #endif
> @@ -624,7 +613,7 @@ int cpu_exec(CPUState *cpu)
>              cpu->tb_flushed = false; /* reset before first TB lookup */
>              for(;;) {
>                  cpu_handle_interrupt(cpu, &last_tb);
> -                tb = tb_find_fast(cpu, &last_tb, tb_exit);
> +                tb = tb_find(cpu, &last_tb, tb_exit);
>                  cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit, &sc);
>                  /* Try to align the host and virtual clocks
>                     if the guest is in advance */
>
>

Using goto statements, the lines from tb_find_slow() which take the
locks can even be on the same indentation level as the lines resetting
them. I could prepare a patch series based on your first 3 patches and
this idea.
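
For concreteness, a rough sketch of how the goto version might look
(untested, and leaving out the cpu->tb_flushed handling):

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock **last_tb,
                                        int tb_exit)
{
    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
    TranslationBlock *tb;
    target_ulong cs_base, pc;
    uint32_t flags;

    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
    tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
    if (likely(tb && tb->pc == pc && tb->cs_base == cs_base &&
               tb->flags == flags)) {
        goto found;
    }

    /* Pairs with smp_wmb() in tb_phys_invalidate(). */
    smp_rmb();
    tb = tb_find_physical(cpu, pc, cs_base, flags);
    if (tb) {
        goto cache;
    }

    mmap_lock();
    tb_lock();
    tb = tb_find_physical(cpu, pc, cs_base, flags);
    if (!tb) {
        /* if no translated code available, then translate it now */
        tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
    }

cache:
    /* We add the TB in the virtual pc hash table for the fast lookup */
    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
found:
    /* See if we can patch the calling TB. */
    if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
        if (!tb_lock_locked()) {
            tb_lock();
        }
        tb_add_jump(*last_tb, tb_exit, tb);
    }
    /* The resets below sit at the same indentation level as the
     * mmap_lock()/tb_lock() calls above; they are NOPs if the locks
     * were never taken on this path. */
    tb_lock_reset();
    mmap_lock_reset();
    return tb;
}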

Regards,
Sergey

>
>
>>>      }
>>>
>>>      /* We add the TB in the virtual pc hash table for the fast lookup */
>>> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>>  #endif
>>>      /* See if we can patch the calling TB. */
>>>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
>>> -        tb_lock();
>>> +        if (!tb_lock_locked()) {
>>> +            tb_lock();
>>> +        }
>>>          tb_add_jump(*last_tb, tb_exit, tb);
>>> -        tb_unlock();
>>>      }
>>> +
>>> +    tb_lock_reset();
>>> +    mmap_lock_reset();
>>> +
>>>      return tb;
>>>  }
>> --
>> Alex Bennée


* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-07 20:36           ` Sergey Fedorov
@ 2016-07-07 21:40             ` Alex Bennée
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-07-07 21:40 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Sergey Fedorov, qemu-devel, patches, mttcg, fred.konrad, a.rigo,
	cota, bobby.prani, rth, mark.burton, pbonzini, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 07/07/16 22:46, Sergey Fedorov wrote:
>> On 07/07/16 22:36, Alex Bennée wrote:
>>> Sergey Fedorov <sergey.fedorov@linaro.org> writes:
>>>
>>>> From: Sergey Fedorov <serge.fdrv@gmail.com>
>>>>
>>>> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
>>>> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
>>>> ---
>>>>  cpu-exec.c | 15 +++++++++------
>>>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/cpu-exec.c b/cpu-exec.c
>>>> index dd0bd5007701..54c935039592 100644
>>>> --- a/cpu-exec.c
>>>> +++ b/cpu-exec.c
>>>> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>>>
>>>>          /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>>>>           * taken outside tb_lock. As system emulation is currently
>>>> -         * single threaded the locks are NOPs.
>>>> +         * single threaded the locks are NOPs. Both locks are to be
>>>> +         * released at the end of tb_find_fast().
>>>>           */
>>>>          mmap_lock();
>>>>          tb_lock();
>>>> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>>>              /* if no translated code available, then translate it now */
>>>>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>>>>          }
>>>> -
>>>> -        tb_unlock();
>>>> -        mmap_unlock();
>>> Hmm, pushing these outside of tb_find_slow() makes me uncomfortable. I
>>> guess tb_find_fast/slow are intimately tied together, but the idea of
>>> taking locks which the calling function is then responsible for
>>> releasing seems ugly to me.
>> Okay, what if we also:
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index 54c935039592..ff8f92bc1dc1 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -278,45 +278,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
>>      return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
>>  }
>>
>> -static TranslationBlock *tb_find_slow(CPUState *cpu,
>> -                                      target_ulong pc,
>> -                                      target_ulong cs_base,
>> -                                      uint32_t flags)
>> -{
>> -    TranslationBlock *tb;
>> -
>> -    /* Ensure that we won't find a TB in the shared hash table
>> -     * if it is being invalidated by some other thread.
>> -     * Otherwise we'd put it back to CPU's local cache.
>> -     * Pairs with smp_wmb() in tb_phys_invalidate(). */
>> -    smp_rmb();
>> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
>> -    if (!tb) {
>> -
>> -        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>> -         * taken outside tb_lock. As system emulation is currently
>> -         * single threaded the locks are NOPs. Both locks are to be
>> -         * released at the end of tb_find_fast().
>> -         */
>> -        mmap_lock();
>> -        tb_lock();
>> -
>> -        /* There's a chance that our desired tb has been translated while
>> -         * taking the locks so we check again inside the lock.
>> -         */
>> -        tb = tb_find_physical(cpu, pc, cs_base, flags);
>> -        if (!tb) {
>> -            /* if no translated code available, then translate it now */
>> -            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> -        }
>> -    }
>> -
>> -    /* We add the TB in the virtual pc hash table for the fast lookup */
>> -    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>> -    return tb;
>> -}
>> -
>> -static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>> +static inline TranslationBlock *tb_find(CPUState *cpu,
>>                                               TranslationBlock **last_tb,
>>                                               int tb_exit)
>>  {
>> @@ -332,7 +294,34 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>>                   tb->flags != flags)) {
>> -        tb = tb_find_slow(cpu, pc, cs_base, flags);
>> +        /* Ensure that we won't find a TB in the shared hash table
>> +         * if it is being invalidated by some other thread.
>> +         * Otherwise we'd put it back to CPU's local cache.
>> +         * Pairs with smp_wmb() in tb_phys_invalidate(). */
>> +        smp_rmb();
>> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
>> +        if (!tb) {
>> +
>> +            /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>> +             * taken outside tb_lock. As system emulation is currently
>> +             * single threaded the locks are NOPs. Both locks are to be
>> +             * released at the end of the function.
>> +             */
>> +            mmap_lock();
>> +            tb_lock();
>> +
>> +            /* There's a chance that our desired tb has been translated while
>> +             * taking the locks so we check again inside the lock.
>> +             */
>> +            tb = tb_find_physical(cpu, pc, cs_base, flags);
>> +            if (!tb) {
>> +                /* if no translated code available, then translate it now */
>> +                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> +            }
>> +        }
>> +
>> +        /* We add the TB in the virtual pc hash table for the fast lookup */
>> +        atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>>      }
>>      if (cpu->tb_flushed) {
>>          /* Ensure that no TB jump will be modified as the
>> @@ -440,7 +429,7 @@ static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
>>                 && cpu->icount_decr.u16.low + cpu->icount_extra == 0) {
>>          /* try to cause an exception pending in the log */
>>          TranslationBlock *last_tb = NULL; /* Avoid chaining TBs */
>> -        cpu_exec_nocache(cpu, 1, tb_find_fast(cpu, &last_tb, 0), true);
>> +        cpu_exec_nocache(cpu, 1, tb_find(cpu, &last_tb, 0), true);
>>          *ret = -1;
>>          return true;
>>  #endif
>> @@ -624,7 +613,7 @@ int cpu_exec(CPUState *cpu)
>>              cpu->tb_flushed = false; /* reset before first TB lookup */
>>              for(;;) {
>>                  cpu_handle_interrupt(cpu, &last_tb);
>> -                tb = tb_find_fast(cpu, &last_tb, tb_exit);
>> +                tb = tb_find(cpu, &last_tb, tb_exit);
>>                  cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit, &sc);
>>                  /* Try to align the host and virtual clocks
>>                     if the guest is in advance */
>>
>>
>
> Using goto statements, the lines from tb_find_slow() which take the
> locks can even be on the same indentation level as the lines resetting
> them. I could prepare a patch series based on your first 3 patches and
> this idea.

Sure, let's see what it looks like ;-)

>
> Regards,
> Sergey
>
>>
>>
>>>>      }
>>>>
>>>>      /* We add the TB in the virtual pc hash table for the fast lookup */
>>>> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>>>  #endif
>>>>      /* See if we can patch the calling TB. */
>>>>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
>>>> -        tb_lock();
>>>> +        if (!tb_lock_locked()) {
>>>> +            tb_lock();
>>>> +        }
>>>>          tb_add_jump(*last_tb, tb_exit, tb);
>>>> -        tb_unlock();
>>>>      }
>>>> +
>>>> +    tb_lock_reset();
>>>> +    mmap_lock_reset();
>>>> +
>>>>      return tb;
>>>>  }
>>> --
>>> Alex Bennée


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-07 16:44     ` [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump() Sergey Fedorov
  2016-07-07 19:36       ` Alex Bennée
@ 2016-07-08  8:40       ` Paolo Bonzini
  2016-07-08 10:25         ` Sergey Fedorov
  1 sibling, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2016-07-08  8:40 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: qemu-devel, Alex Bennée, patches, Sergey Fedorov, mttcg,
	fred.konrad, a.rigo, cota, bobby.prani, rth, mark.burton,
	jan.kiszka, peter.maydell, claudio.fontana, Peter Crosthwaite


> diff --git a/cpu-exec.c b/cpu-exec.c
> index dd0bd5007701..54c935039592 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>  
>          /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>           * taken outside tb_lock. As system emulation is currently
> -         * single threaded the locks are NOPs.
> +         * single threaded the locks are NOPs. Both locks are to be
> +         * released at the end of tb_find_fast().
>           */

Even better: add a "bool *tb_locked" argument to tb_find_slow, and
don't move the mmap_lock release.  Then tb_find_fast knows directly
whether tb_lock is taken, and you don't need any of tb_lock_reset
or mmap_lock_reset.
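
Something like this, presumably (an untested sketch of the calling
convention; the tb_find_fast() tail is shown separately):

static TranslationBlock *tb_find_slow(CPUState *cpu, target_ulong pc,
                                      target_ulong cs_base, uint32_t flags,
                                      bool *tb_locked)
{
    TranslationBlock *tb;

    *tb_locked = false;

    /* Pairs with smp_wmb() in tb_phys_invalidate(). */
    smp_rmb();
    tb = tb_find_physical(cpu, pc, cs_base, flags);
    if (!tb) {
        mmap_lock();
        tb_lock();
        *tb_locked = true;   /* the caller now owns tb_lock */

        tb = tb_find_physical(cpu, pc, cs_base, flags);
        if (!tb) {
            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
        }
        mmap_unlock();       /* the mmap_lock release is not moved */
    }

    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
    return tb;
}

/* ...and in tb_find_fast(), with 'bool tb_locked = false;' declared up
 * top and passed down on the slow path: */
if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
    if (!tb_locked) {
        tb_lock();
        tb_locked = true;
    }
    tb_add_jump(*last_tb, tb_exit, tb);
}
if (tb_locked) {
    tb_unlock();
}
return tb;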

Thanks,

Paolo

>          mmap_lock();
>          tb_lock();
> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>              /* if no translated code available, then translate it now */
>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>          }
> -
> -        tb_unlock();
> -        mmap_unlock();
>      }
>  
>      /* We add the TB in the virtual pc hash table for the fast lookup */
> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState
> *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> -        tb_lock();
> +        if (!tb_lock_locked()) {
> +            tb_lock();
> +        }
>          tb_add_jump(*last_tb, tb_exit, tb);
> -        tb_unlock();
>      }
> +
> +    tb_lock_reset();
> +    mmap_lock_reset();
> +
>      return tb;
>  }
>  
> --
> 1.9.1
> 
> 


* Re: [Qemu-devel] [PATCH v2 2/6] tcg: set up tb->page_addr before insertion
  2016-07-07 14:08   ` Sergey Fedorov
@ 2016-07-08  9:40     ` Sergey Fedorov
  0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08  9:40 UTC (permalink / raw)
  To: Sergey Fedorov, Alex Bennée, mttcg, qemu-devel, fred.konrad,
	a.rigo, cota, bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite

On 07/07/16 17:08, Sergey Fedorov wrote:
> On 05/07/16 19:18, Alex Bennée wrote:
>> This ensures that if we find the TB on the slow path, tb->page_addr
>> is correctly set before being tested.
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

However, we may need to use smp_wmb() here, paired with smp_rmb() in
tb_find_fast().
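
The pairing in question would look roughly like this (an illustrative
sketch only, against the hunk below):

/* writer, in tb_link_page() -- set up the payload, then publish: */
tb->page_addr[0] = phys_pc & TARGET_PAGE_MASK;
smp_wmb();                  /* order payload stores before the insert */
h = tb_hash_func(phys_pc, tb->pc, tb->flags);
qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);

/* reader, in the tb_find path -- look up, then read the payload: */
tb = tb_find_physical(cpu, pc, cs_base, flags);
smp_rmb();                  /* order the lookup before payload loads */
/* tb->page_addr[] is now guaranteed to be fully set up */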

Alternatively, QHT might provide some explicit guarantee about memory
ordering around qht_insert() and qht_lookup(). Actually, the ordering
is already right thanks to the seqlock operations in the QHT
implementation. This means we could elide the explicit memory barrier
from the first patch as well.

Kind regards,
Sergey

>
>> ---
>>  translate-all.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/translate-all.c b/translate-all.c
>> index 96efe48..97e834a 100644
>> --- a/translate-all.c
>> +++ b/translate-all.c
>> @@ -1126,10 +1126,6 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>>  {
>>      uint32_t h;
>>  
>> -    /* add in the hash table */
>> -    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>> -    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>> -
>>      /* add in the page list */
>>      tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
>>      if (phys_page2 != -1) {
>> @@ -1138,6 +1134,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>>          tb->page_addr[1] = -1;
>>      }
>>  
>> +    /* add in the hash table */
>> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>> +    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>> +
>>  #ifdef DEBUG_TB_CHECK
>>      tb_page_check();
>>  #endif


* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08  8:40       ` Paolo Bonzini
@ 2016-07-08 10:25         ` Sergey Fedorov
  2016-07-08 11:02           ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 10:25 UTC (permalink / raw)
  To: Paolo Bonzini, Sergey Fedorov
  Cc: qemu-devel, Alex Bennée, patches, mttcg, fred.konrad,
	a.rigo, cota, bobby.prani, rth, mark.burton, jan.kiszka,
	peter.maydell, claudio.fontana, Peter Crosthwaite

On 08/07/16 11:40, Paolo Bonzini wrote:
> Even better: add a "bool *tb_locked" argument to tb_find_slow, and
> don't move the mmap_lock release.  Then tb_find_fast knows directly
> whether tb_lock is taken, and you don't need any of tb_lock_reset
> or mmap_lock_reset.

I think we can do even better. One option is using a separate tiny lock
to protect direct jump set/reset instead of tb_lock.

Another option which I've had in my mind for some time is to make direct
jump set/reset thread-safe. We already have thread-safe TB patching. The
only question is the right order of operations and handling
jmp_list_next/jmp_list_first safely. I think that could be done by
removing tb_remove_from_jmp_list() and manipulating
jmp_list_next/jmp_list_first in an RCU-like fashion. What do you think?

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 10:25         ` Sergey Fedorov
@ 2016-07-08 11:02           ` Paolo Bonzini
  2016-07-08 12:32             ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2016-07-08 11:02 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Sergey Fedorov, qemu-devel, Alex Bennée, patches, mttcg,
	fred.konrad, a.rigo, cota, bobby.prani, rth, mark.burton,
	jan.kiszka, peter.maydell, claudio.fontana, Peter Crosthwaite

> On 08/07/16 11:40, Paolo Bonzini wrote:
> > Even better: add a "bool *tb_locked" argument to tb_find_slow, and
> > don't move the mmap_lock release.  Then tb_find_fast knows directly
> > whether tb_lock is taken, and you don't need any of tb_lock_reset
> > or mmap_lock_reset.
> 
> I think we can do even better. One option is using a separate tiny lock
> to protect direct jump set/reset instead of tb_lock.

If you have to use a separate tiny lock, you don't gain anything compared
to the two critical sections, do you?

In any case, this seems to be more complex than necessary.  The "bool *"
convention is pretty common in Linux for example---it works well.

The one below is even more complicated.  I'm all for simple lock-free
stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
lists are too much, especially with the complicated first/next mechanism
of TCG's chained block lists.

Paolo

> Another option which I've had in my mind for some time is to make direct
> jump set/reset thread-safe. We already have thread-safe TB patching. The
> only question is the right order of operations and handling
> jmp_list_next/jmp_list_first safely. I think that could be done by
> removing tb_remove_from_jmp_list() and manipulating
> jmp_list_next/jmp_list_first in an RCU-like fashion. What do you think?
> 
> Kind regards,
> Sergey
> 


* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 11:02           ` Paolo Bonzini
@ 2016-07-08 12:32             ` Sergey Fedorov
  2016-07-08 14:07               ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 12:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sergey Fedorov, qemu-devel, Alex Bennée, patches, mttcg,
	fred.konrad, a.rigo, cota, bobby.prani, rth, mark.burton,
	jan.kiszka, peter.maydell, claudio.fontana, Peter Crosthwaite

On 08/07/16 14:02, Paolo Bonzini wrote:
>> On 08/07/16 11:40, Paolo Bonzini wrote:
>>> Even better: add a "bool *tb_locked" argument to tb_find_slow, and
>>> don't move the mmap_lock release.  Then tb_find_fast knows directly
>>> whether tb_lock is taken, and you don't need any of tb_lock_reset
>>> or mmap_lock_reset.
>> I think we can do even better. One option is using a separate tiny lock
>> to protect direct jump set/reset instead of tb_lock.
> If you have to use a separate tiny lock, you don't gain anything compared
> to the two critical sections, do you?

If we have a separate lock for direct jump set/reset, then we can do the
fast TB lookup + direct jump patching without taking tb_lock at all. How
much this would reduce lock contention largely depends on the workload.

>
> In any case, this seems to be more complex than necessary.  The "bool *"
> convention is pretty common in Linux for example---it works well.

It could work, no doubts.

>
> The one below is even more complicated.  I'm all for simple lock-free
> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
> lists are too much, especially with the complicated first/next mechanism
> of TCG's chained block lists.

The direct jump handling code is pretty isolated and self-contained. It
would require backing out tb_remove_from_jmp_list() and sprinkling a
couple of atomic_rcu_read()/atomic_rcu_set() calls with some comments, I
think. Maybe it would be easier to justify by looking at actual patches.
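
As a very rough illustration of the shape I have in mind (hypothetical
node type and helper names; atomic_rcu_read()/atomic_rcu_set() are the
existing QEMU helpers):

struct jmp_node {
    struct jmp_node *next;
};

/* Publish a fully initialized node: readers either see the old head or
 * the new node with its 'next' already valid, never a half-built entry.
 * atomic_rcu_set() provides the required store-release semantics. */
static void jmp_list_insert(struct jmp_node **head, struct jmp_node *node)
{
    node->next = atomic_rcu_read(head);
    atomic_rcu_set(head, node);
}

/* Readers traverse the list with consume-ordered loads: */
static void jmp_list_walk(struct jmp_node **head)
{
    struct jmp_node *n;

    for (n = atomic_rcu_read(head); n; n = atomic_rcu_read(&n->next)) {
        /* patch or unpatch the direct jump recorded by 'n' here */
    }
}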

Thanks,
Sergey

>
> Paolo
>
>> Another option which I've had in my mind for some time is to make direct
>> jump set/reset thread-safe. We already have thread-safe TB patching. The
>> only question is the right order of operations and handling
>> jmp_list_next/jmp_list_first safely. I think that could be done by
>> removing tb_remove_from_jmp_list() and manipulating
>> jmp_list_next/jmp_list_first in an RCU-like fashion. What do you think?
>>
>> Kind regards,
>> Sergey
>>


* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 12:32             ` Sergey Fedorov
@ 2016-07-08 14:07               ` Paolo Bonzini
  2016-07-08 19:55                 ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2016-07-08 14:07 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Sergey Fedorov, qemu-devel, Alex Bennée, patches, mttcg,
	fred.konrad, a.rigo, cota, bobby.prani, rth, mark.burton,
	jan.kiszka, peter.maydell, claudio.fontana, Peter Crosthwaite



On 08/07/2016 14:32, Sergey Fedorov wrote:
>>> >> I think we can do even better. One option is using a separate tiny lock
>>> >> to protect direct jump set/reset instead of tb_lock.
>> > If you have to use a separate tiny lock, you don't gain anything compared
>> > to the two critical sections, do you?
> If we have a separate lock for direct jump set/reset, then we can do the
> fast TB lookup + direct jump patching without taking tb_lock at all. How
> much this would reduce lock contention largely depends on the workload.

Yeah, it probably would be easy enough that it's hard to object to it
(unlike the other idea below, which I'm not very comfortable with, at
least without seeing patches).

The main advantage would be that this tiny lock could be a spinlock
rather than a mutex.
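
Assuming the QemuSpin primitives, the patching path could then look
something like this (hypothetical lock name, sketch only):

/* A dedicated spinlock just for direct-jump patching; the critical
 * section is tiny, so spinning beats sleeping on a mutex here. */
static QemuSpin tb_jmp_lock;    /* qemu_spin_init(&tb_jmp_lock) at setup */

if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
    qemu_spin_lock(&tb_jmp_lock);
    tb_add_jump(*last_tb, tb_exit, tb);
    qemu_spin_unlock(&tb_jmp_lock);
}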

Paolo

>> The one below is even more complicated.  I'm all for simple lock-free
>> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
>> lists are too much, especially with the complicated first/next mechanism
>> of TCG's chained block lists.
> 
> The direct jump handling code is pretty isolated and self-contained. It
> would require backing out tb_remove_from_jmp_list() and sprinkling a
> couple of atomic_rcu_read()/atomic_rcu_set() calls with some comments, I
> think. Maybe it would be easier to justify by looking at actual patches.


* Re: [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock'
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 1/6] tcg: Ensure safe tb_jmp_cache lookup out of 'tb_lock' Alex Bennée
  2016-07-07 13:52   ` Sergey Fedorov
@ 2016-07-08 14:51   ` Sergey Fedorov
  1 sibling, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 14:51 UTC (permalink / raw)
  To: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Sergey Fedorov, Peter Crosthwaite

On 05/07/16 19:18, Alex Bennée wrote:
> From: Sergey Fedorov <serge.fdrv@gmail.com>
>
> First, ensure atomicity of CPU's 'tb_jmp_cache' access by:
>  * using atomic_read() to look up a TB when not holding 'tb_lock';
>  * using atomic_write() to remove a TB from each CPU's local cache on
>    TB invalidation.
>
> Second, add some memory barriers to ensure we don't put the TB being
> invalidated back to CPU's 'tb_jmp_cache'. If we fail to look up a TB in
> CPU's local cache because it is being invalidated by some other thread
> then it must not be found in the shared TB hash table. Otherwise we'd
> put it back to CPU's local cache.
>
> Note that this patch does *not* make CPU's TLB invalidation safe if it
> is done from some other thread while the CPU is in its execution loop.
>
> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
> [AJB: fixed missing atomic set, tweak title]
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Emilio G. Cota <cota@braap.org>
>
> ---
> v1 (AJB):
>   - tweak title
>   - fixed missing set of tb_jmp_cache
> v2
>   - fix spelling s/con't/can't/
>   - add atomic_read while clearing tb_jmp_cache
>   - add r-b tags
> ---
>  cpu-exec.c      | 9 +++++++--
>  translate-all.c | 9 +++++++--
>  2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index b840e1d..10ce1cb 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -285,6 +285,11 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>  {
>      TranslationBlock *tb;
>  
> +    /* Ensure that we won't find a TB in the shared hash table
> +     * if it is being invalidated by some other thread.
> +     * Otherwise we'd put it back to CPU's local cache.
> +     * Pairs with smp_wmb() in tb_phys_invalidate(). */
> +    smp_rmb();
>      tb = tb_find_physical(cpu, pc, cs_base, flags);
>      if (tb) {
>          goto found;
> @@ -315,7 +320,7 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>  
>  found:
>      /* we add the TB in the virtual pc hash table */
> -    cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
> +    atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>      return tb;
>  }
>  
> @@ -333,7 +338,7 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>         is executed. */
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>      tb_lock();
> -    tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
> +    tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
>          tb = tb_find_slow(cpu, pc, cs_base, flags);
> diff --git a/translate-all.c b/translate-all.c
> index eaa95e4..96efe48 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1004,11 +1004,16 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>          invalidate_page_bitmap(p);
>      }
>  
> +    /* Ensure that we won't find the TB in the shared hash table
> +     * if we can't see it in CPU's local cache.
> +     * Pairs with smp_rmb() in tb_find_slow(). */
> +    smp_wmb();
> +
>      /* remove the TB from the hash list */
>      h = tb_jmp_cache_hash_func(tb->pc);
>      CPU_FOREACH(cpu) {
> -        if (cpu->tb_jmp_cache[h] == tb) {
> -            cpu->tb_jmp_cache[h] = NULL;
> +        if (atomic_read(&cpu->tb_jmp_cache[h]) == tb) {

I think this can be a simple read because we can't race here with
setting 'tb_jmp_cache' in tb_find_slow().

Kind regards,
Sergey

> +            atomic_set(&cpu->tb_jmp_cache[h], NULL);
>          }
>      }
>  


* Re: [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path
  2016-07-07 14:18   ` Sergey Fedorov
@ 2016-07-08 15:50     ` Sergey Fedorov
  2016-07-08 17:34     ` Sergey Fedorov
  1 sibling, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 15:50 UTC (permalink / raw)
  To: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite

On 07/07/16 17:18, Sergey Fedorov wrote:
> On 05/07/16 19:18, Alex Bennée wrote:
>> Lock contention in the hot path of moving between existing patched
>> TranslationBlocks is the main drag in multithreaded performance. This
>> patch pushes the tb_lock() usage down to the two places that really need
>> it:
>>
>>   - code generation (tb_gen_code)
>>   - jump patching (tb_add_jump)
>>
>> The rest of the code doesn't really need to hold a lock as it is either
>> using per-CPU structures, atomically updated or designed to be used in
>> concurrent read situations (qht_lookup).
>>
>> To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
>> locks become NOPs anyway until the MTTCG work is completed.
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

Revoked: CPUState::tb_flushed must also be protected by tb_lock.

>
>> ---
>> v3 (base-patches)
>>   - fix merge conflicts with Sergey's patch
>> v1 (hot path, split from base-patches series)
>>   - revert name tweaking
>>   - drop test jmp_list_next outside lock
>>   - mention lock NOPs in comments
>> v2 (hot path)
>>   - Add r-b tags
>> ---
>>  cpu-exec.c | 48 +++++++++++++++++++++---------------------------
>>  1 file changed, 21 insertions(+), 27 deletions(-)
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index 10ce1cb..dd0bd50 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -291,35 +291,29 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>       * Pairs with smp_wmb() in tb_phys_invalidate(). */
>>      smp_rmb();
>>      tb = tb_find_physical(cpu, pc, cs_base, flags);
>> -    if (tb) {
>> -        goto found;
>> -    }
>> +    if (!tb) {
>>  
>> -#ifdef CONFIG_USER_ONLY
>> -    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>> -     * taken outside tb_lock.  Since we're momentarily dropping
>> -     * tb_lock, there's a chance that our desired tb has been
>> -     * translated.
>> -     */
>> -    tb_unlock();
>> -    mmap_lock();
>> -    tb_lock();
>> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
>> -    if (tb) {
>> -        mmap_unlock();
>> -        goto found;
>> -    }
>> -#endif
>> +        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>> +         * taken outside tb_lock. As system emulation is currently
>> +         * single threaded the locks are NOPs.
>> +         */
>> +        mmap_lock();
>> +        tb_lock();
>>  
>> -    /* if no translated code available, then translate it now */
>> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> +        /* There's a chance that our desired tb has been translated while
>> +         * taking the locks so we check again inside the lock.
>> +         */
>> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
>> +        if (!tb) {
>> +            /* if no translated code available, then translate it now */
>> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> +        }
>>  
>> -#ifdef CONFIG_USER_ONLY
>> -    mmap_unlock();
>> -#endif
>> +        tb_unlock();
>> +        mmap_unlock();
>> +    }
>>  
>> -found:
>> -    /* we add the TB in the virtual pc hash table */
>> +    /* We add the TB in the virtual pc hash table for the fast lookup */
>>      atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>>      return tb;
>>  }
>> @@ -337,7 +331,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>         always be the same before a given translated block
>>         is executed. */
>>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>> -    tb_lock();
>>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>>                   tb->flags != flags)) {
>> @@ -361,9 +354,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>  #endif
>>      /* See if we can patch the calling TB. */
>>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
>> +        tb_lock();
>>          tb_add_jump(*last_tb, tb_exit, tb);
>> +        tb_unlock();
>>      }
>> -    tb_unlock();
>>      return tb;
>>  }
>>  


* Re: [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path
  2016-07-07 14:18   ` Sergey Fedorov
  2016-07-08 15:50     ` Sergey Fedorov
@ 2016-07-08 17:34     ` Sergey Fedorov
  2016-07-08 18:03       ` Alex Bennée
  1 sibling, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 17:34 UTC (permalink / raw)
  To: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite

On 07/07/16 17:18, Sergey Fedorov wrote:
> On 05/07/16 19:18, Alex Bennée wrote:
>> Lock contention in the hot path of moving between existing patched
>> TranslationBlocks is the main drag in multithreaded performance. This
>> patch pushes the tb_lock() usage down to the two places that really need
>> it:
>>
>>   - code generation (tb_gen_code)
>>   - jump patching (tb_add_jump)
>>
>> The rest of the code doesn't really need to hold a lock as it is either
>> using per-CPU structures, atomically updated or designed to be used in
>> concurrent read situations (qht_lookup).
>>
>> To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
>> locks become NOPs anyway until the MTTCG work is completed.
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

Revoked:
(1) Access to 'cpu->tb_flushed' must be safe (atomic_xchg() can help)
(2) A read from 'cpu->tb_jmp_cache' in tb_find_fast() can see a torn
write by memset() from tb_flush()

So I'm afraid this series is dependent on a safe tb_flush() implementation.
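
For (1), what I mean by atomic_xchg() helping is roughly this (sketch
only):

/* Consume-and-clear the flag in a single atomic step, so a concurrent
 * tb_flush() setting it can't be lost between the test and the clear. */
if (atomic_xchg(&cpu->tb_flushed, false)) {
    /* A flush happened since the last TB lookup: don't chain to a TB
     * that may no longer exist. */
    *last_tb = NULL;
}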

Kind regards,
Sergey

>
>> ---
>> v3 (base-patches)
>>   - fix merge conflicts with Sergey's patch
>> v1 (hot path, split from base-patches series)
>>   - revert name tweaking
>>   - drop test jmp_list_next outside lock
>>   - mention lock NOPs in comments
>> v2 (hot path)
>>   - Add r-b tags
>> ---
>>  cpu-exec.c | 48 +++++++++++++++++++++---------------------------
>>  1 file changed, 21 insertions(+), 27 deletions(-)
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index 10ce1cb..dd0bd50 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -291,35 +291,29 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>       * Pairs with smp_wmb() in tb_phys_invalidate(). */
>>      smp_rmb();
>>      tb = tb_find_physical(cpu, pc, cs_base, flags);
>> -    if (tb) {
>> -        goto found;
>> -    }
>> +    if (!tb) {
>>  
>> -#ifdef CONFIG_USER_ONLY
>> -    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>> -     * taken outside tb_lock.  Since we're momentarily dropping
>> -     * tb_lock, there's a chance that our desired tb has been
>> -     * translated.
>> -     */
>> -    tb_unlock();
>> -    mmap_lock();
>> -    tb_lock();
>> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
>> -    if (tb) {
>> -        mmap_unlock();
>> -        goto found;
>> -    }
>> -#endif
>> +        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>> +         * taken outside tb_lock. As system emulation is currently
>> +         * single threaded the locks are NOPs.
>> +         */
>> +        mmap_lock();
>> +        tb_lock();
>>  
>> -    /* if no translated code available, then translate it now */
>> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> +        /* There's a chance that our desired tb has been translated while
>> +         * taking the locks so we check again inside the lock.
>> +         */
>> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
>> +        if (!tb) {
>> +            /* if no translated code available, then translate it now */
>> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> +        }
>>  
>> -#ifdef CONFIG_USER_ONLY
>> -    mmap_unlock();
>> -#endif
>> +        tb_unlock();
>> +        mmap_unlock();
>> +    }
>>  
>> -found:
>> -    /* we add the TB in the virtual pc hash table */
>> +    /* We add the TB in the virtual pc hash table for the fast lookup */
>>      atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>>      return tb;
>>  }
>> @@ -337,7 +331,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>         always be the same before a given translated block
>>         is executed. */
>>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>> -    tb_lock();
>>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>>                   tb->flags != flags)) {
>> @@ -361,9 +354,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>  #endif
>>      /* See if we can patch the calling TB. */
>>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
>> +        tb_lock();
>>          tb_add_jump(*last_tb, tb_exit, tb);
>> +        tb_unlock();
>>      }
>> -    tb_unlock();
>>      return tb;
>>  }
>>  


* Re: [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path
  2016-07-08 17:34     ` Sergey Fedorov
@ 2016-07-08 18:03       ` Alex Bennée
  2016-07-08 18:20         ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-07-08 18:03 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, rth,
	mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 07/07/16 17:18, Sergey Fedorov wrote:
>> On 05/07/16 19:18, Alex Bennée wrote:
>>> Lock contention in the hot path of moving between existing patched
>>> TranslationBlocks is the main drag in multithreaded performance. This
>>> patch pushes the tb_lock() usage down to the two places that really need
>>> it:
>>>
>>>   - code generation (tb_gen_code)
>>>   - jump patching (tb_add_jump)
>>>
>>> The rest of the code doesn't really need to hold a lock as it is either
>>> using per-CPU structures, atomically updated or designed to be used in
>>> concurrent read situations (qht_lookup).
>>>
>>> To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
>>> locks become NOPs anyway until the MTTCG work is completed.
>>>
>>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>>> Reviewed-by: Richard Henderson <rth@twiddle.net>
>> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
>
> Revoked:
> (1) Access to 'cpu->tb_flushed' must be safe (atomic_xchg() can help)
> (2) A read from 'cpu->tb_jmp_cache' in tb_find_fast() can see a torn
> write by memset() from tb_flush()

But that is already the case, isn't it? While system emulation mode is
still single-threaded, is it still safe?

>
> So I'm afraid this series is dependent on a safe tb_flush()
> implementation.

I'd argue that if it doesn't worsen the situation for either mode it's
not dependent.


>
> Kind regards,
> Sergey
>
>>
>>> ---
>>> v3 (base-patches)
>>>   - fix merge conflicts with Sergey's patch
>>> v1 (hot path, split from base-patches series)
>>>   - revert name tweaking
>>>   - drop test jmp_list_next outside lock
>>>   - mention lock NOPs in comments
>>> v2 (hot path)
>>>   - Add r-b tags
>>> ---
>>>  cpu-exec.c | 48 +++++++++++++++++++++---------------------------
>>>  1 file changed, 21 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/cpu-exec.c b/cpu-exec.c
>>> index 10ce1cb..dd0bd50 100644
>>> --- a/cpu-exec.c
>>> +++ b/cpu-exec.c
>>> @@ -291,35 +291,29 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>>       * Pairs with smp_wmb() in tb_phys_invalidate(). */
>>>      smp_rmb();
>>>      tb = tb_find_physical(cpu, pc, cs_base, flags);
>>> -    if (tb) {
>>> -        goto found;
>>> -    }
>>> +    if (!tb) {
>>>  
>>> -#ifdef CONFIG_USER_ONLY
>>> -    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>>> -     * taken outside tb_lock.  Since we're momentarily dropping
>>> -     * tb_lock, there's a chance that our desired tb has been
>>> -     * translated.
>>> -     */
>>> -    tb_unlock();
>>> -    mmap_lock();
>>> -    tb_lock();
>>> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
>>> -    if (tb) {
>>> -        mmap_unlock();
>>> -        goto found;
>>> -    }
>>> -#endif
>>> +        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>>> +         * taken outside tb_lock. As system emulation is currently
>>> +         * single threaded the locks are NOPs.
>>> +         */
>>> +        mmap_lock();
>>> +        tb_lock();
>>>  
>>> -    /* if no translated code available, then translate it now */
>>> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>>> +        /* There's a chance that our desired tb has been translated while
>>> +         * taking the locks so we check again inside the lock.
>>> +         */
>>> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
>>> +        if (!tb) {
>>> +            /* if no translated code available, then translate it now */
>>> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>>> +        }
>>>  
>>> -#ifdef CONFIG_USER_ONLY
>>> -    mmap_unlock();
>>> -#endif
>>> +        tb_unlock();
>>> +        mmap_unlock();
>>> +    }
>>>  
>>> -found:
>>> -    /* we add the TB in the virtual pc hash table */
>>> +    /* We add the TB in the virtual pc hash table for the fast lookup */
>>>      atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>>>      return tb;
>>>  }
>>> @@ -337,7 +331,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>>         always be the same before a given translated block
>>>         is executed. */
>>>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>>> -    tb_lock();
>>>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>>>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>>>                   tb->flags != flags)) {
>>> @@ -361,9 +354,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>>  #endif
>>>      /* See if we can patch the calling TB. */
>>>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
>>> +        tb_lock();
>>>          tb_add_jump(*last_tb, tb_exit, tb);
>>> +        tb_unlock();
>>>      }
>>> -    tb_unlock();
>>>      return tb;
>>>  }
>>>  


-- 
Alex Bennée


* Re: [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path
  2016-07-08 18:03       ` Alex Bennée
@ 2016-07-08 18:20         ` Sergey Fedorov
  0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 18:20 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, rth,
	mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite

On 08/07/16 21:03, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 07/07/16 17:18, Sergey Fedorov wrote:
>>> On 05/07/16 19:18, Alex Bennée wrote:
>>>> Lock contention in the hot path of moving between existing patched
>>>> TranslationBlocks is the main drag in multithreaded performance. This
>>>> patch pushes the tb_lock() usage down to the two places that really need
>>>> it:
>>>>
>>>>   - code generation (tb_gen_code)
>>>>   - jump patching (tb_add_jump)
>>>>
>>>> The rest of the code doesn't really need to hold a lock as it is either
>>>> using per-CPU structures, atomically updated or designed to be used in
>>>> concurrent read situations (qht_lookup).
>>>>
>>>> To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
>>>> locks become NOPs anyway until the MTTCG work is completed.
>>>>
>>>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>>>> Reviewed-by: Richard Henderson <rth@twiddle.net>
>>> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
>> Revoked:
>> (1) Access to 'cpu->tb_flushed' must be safe (atomic_xchg() can help)
>> (2) A read from 'cpu->tb_jmp_cache' in tb_find_fast() can see a torn
>> write by memset() from tb_flush()
> But that is already the case, isn't it? While system emulation mode is
> still single-threaded, is it still safe?

Before this patch, tb_flush(), the 'cpu->tb_jmp_cache' lookup, and
'cpu->tb_flushed' are all protected by tb_lock. The only problem for
multi-threaded user mode I know of is that we can flush the translation
buffer and start a new translation while some other thread may still be
executing old translated code from it. After this patch we have two
additional possible races.
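
For the second race, the problem and the obvious fix look roughly like
this (a sketch; TB_JMP_CACHE_SIZE as in qom/cpu.h):

/* tb_flush() currently clears the whole array with memset(), which
 * guarantees no per-element atomicity, so a concurrent atomic_read()
 * in tb_find_fast() may observe a torn pointer. Clearing element-wise
 * with atomic_set() would avoid that: */
unsigned int i;

for (i = 0; i < TB_JMP_CACHE_SIZE; i++) {
    atomic_set(&cpu->tb_jmp_cache[i], NULL);
}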

>
>> So I'm afraid this series is dependent on a safe tb_flush()
>> implementation.
> I'd argue that if it doesn't worsen the situation for either mode it's
> not dependent.

It does for user mode; see the explanation above.

Kind regards,
Sergey

>
>> Kind regards,
>> Sergey
>>
>>>> ---
>>>> v3 (base-patches)
>>>>   - fix merge conflicts with Sergey's patch
>>>> v1 (hot path, split from base-patches series)
>>>>   - revert name tweaking
>>>>   - drop test jmp_list_next outside lock
>>>>   - mention lock NOPs in comments
>>>> v2 (hot path)
>>>>   - Add r-b tags
>>>> ---
>>>>  cpu-exec.c | 48 +++++++++++++++++++++---------------------------
>>>>  1 file changed, 21 insertions(+), 27 deletions(-)
>>>>
>>>> diff --git a/cpu-exec.c b/cpu-exec.c
>>>> index 10ce1cb..dd0bd50 100644
>>>> --- a/cpu-exec.c
>>>> +++ b/cpu-exec.c
>>>> @@ -291,35 +291,29 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>>>>       * Pairs with smp_wmb() in tb_phys_invalidate(). */
>>>>      smp_rmb();
>>>>      tb = tb_find_physical(cpu, pc, cs_base, flags);
>>>> -    if (tb) {
>>>> -        goto found;
>>>> -    }
>>>> +    if (!tb) {
>>>>  
>>>> -#ifdef CONFIG_USER_ONLY
>>>> -    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>>>> -     * taken outside tb_lock.  Since we're momentarily dropping
>>>> -     * tb_lock, there's a chance that our desired tb has been
>>>> -     * translated.
>>>> -     */
>>>> -    tb_unlock();
>>>> -    mmap_lock();
>>>> -    tb_lock();
>>>> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
>>>> -    if (tb) {
>>>> -        mmap_unlock();
>>>> -        goto found;
>>>> -    }
>>>> -#endif
>>>> +        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>>>> +         * taken outside tb_lock. As system emulation is currently
>>>> +         * single threaded the locks are NOPs.
>>>> +         */
>>>> +        mmap_lock();
>>>> +        tb_lock();
>>>>  
>>>> -    /* if no translated code available, then translate it now */
>>>> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>>>> +        /* There's a chance that our desired tb has been translated while
>>>> +         * taking the locks so we check again inside the lock.
>>>> +         */
>>>> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
>>>> +        if (!tb) {
>>>> +            /* if no translated code available, then translate it now */
>>>> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>>>> +        }
>>>>  
>>>> -#ifdef CONFIG_USER_ONLY
>>>> -    mmap_unlock();
>>>> -#endif
>>>> +        tb_unlock();
>>>> +        mmap_unlock();
>>>> +    }
>>>>  
>>>> -found:
>>>> -    /* we add the TB in the virtual pc hash table */
>>>> +    /* We add the TB in the virtual pc hash table for the fast lookup */
>>>>      atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
>>>>      return tb;
>>>>  }
>>>> @@ -337,7 +331,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>>>         always be the same before a given translated block
>>>>         is executed. */
>>>>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>>>> -    tb_lock();
>>>>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>>>>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>>>>                   tb->flags != flags)) {
>>>> @@ -361,9 +354,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>>>>  #endif
>>>>      /* See if we can patch the calling TB. */
>>>>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
>>>> +        tb_lock();
>>>>          tb_add_jump(*last_tb, tb_exit, tb);
>>>> +        tb_unlock();
>>>>      }
>>>> -    tb_unlock();
>>>>      return tb;
>>>>  }
>>>>  
>

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 14:07               ` Paolo Bonzini
@ 2016-07-08 19:55                 ` Sergey Fedorov
  2016-07-08 20:18                   ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 19:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sergey Fedorov, qemu-devel, Alex Bennée, patches, mttcg,
	fred konrad, a rigo, cota, bobby prani, rth, mark burton,
	jan kiszka, peter maydell, claudio fontana, Peter Crosthwaite

On 08/07/16 17:07, Paolo Bonzini wrote:
>
> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>> to the two critical sections, do you?
>> If we have a separate lock for direct jump set/reset then we can do fast
>> TB lookup + direct jump patching without taking tb_lock at all. How much
>> this would reduce lock contention largely depends on the workload we use.
> Yeah, it probably would be easy enough that it's hard to object to it
> (unlike the other idea below, which I'm not very comfortable with, at
> least without seeing patches).
>
> The main advantage would be that this tiny lock could be a spinlock
> rather than a mutex.

Well, the problem is more subtle than we thought: tb_find_fast() can
race with tb_phys_invalidate(). The first tb_find_phys() out of the
lock can return a TB which is being invalidated, and then a direct jump
can be set up to this TB. That can happen after the concurrent
tb_phys_invalidate() has already reset all the direct jumps to the TB,
so we can end up with a direct jump to an invalidated TB. Even
extending the tb_lock critical section wouldn't help if at least one
tb_find_phys() is performed out of the lock.
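
Even a hypothetical per-TB spinlock around the patching would look
something like this sketch (the 'jmp_lock' field does not exist today;
the QemuSpin primitives are assumed):

    qemu_spin_lock(&tb->jmp_lock);
    /* tb may already have been invalidated by a concurrent
     * tb_phys_invalidate(), and nothing here can detect that
     * without some validity check on the TB itself.
     */
    tb_add_jump(*last_tb, tb_exit, tb);
    qemu_spin_unlock(&tb->jmp_lock);

So the tiny lock alone cannot stop the invalidation from sneaking in
between the lookup and the patching.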

Kind regards,
Sergey

>
>>> The one below is even more complicated.  I'm all for simple lock-free
>>> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
>>> lists are too much, especially with the complicated first/next mechanism
>>> of TCG's chained block lists.
>> Direct jump handling code is pretty isolated and self-contained. It
>> would require backing out of tb_remove_from_jmp_list() and sprinkling a
>> couple of atomic_rcu_read()/atomic_rcu_set() with some comments, I
>> think. Maybe it could be easier to justify looking at actual patches.

* Re: [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path
  2016-07-05 16:18 ` [Qemu-devel] [PATCH v2 3/6] tcg: cpu-exec: remove tb_lock from the hot-path Alex Bennée
  2016-07-07 14:18   ` Sergey Fedorov
@ 2016-07-08 20:09   ` Sergey Fedorov
  1 sibling, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 20:09 UTC (permalink / raw)
  To: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, cota,
	bobby.prani, rth
  Cc: mark.burton, pbonzini, jan.kiszka, peter.maydell,
	claudio.fontana, Peter Crosthwaite

On 05/07/16 19:18, Alex Bennée wrote:
> Lock contention in the hot path of moving between existing patched
> TranslationBlocks is the main drag in multithreaded performance. This
> patch pushes the tb_lock() usage down to the two places that really need
> it:
>
>   - code generation (tb_gen_code)
>   - jump patching (tb_add_jump)
>
> The rest of the code doesn't really need to hold a lock as it is either
> using per-CPU structures, atomically updated or designed to be used in
> concurrent read situations (qht_lookup).
>
> To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
> locks become NOPs anyway until the MTTCG work is completed.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
>
> ---
> v3 (base-patches)
>   - fix merge conflicts with Sergey's patch
> v1 (hot path, split from base-patches series)
>   - revert name tweaking
>   - drop test jmp_list_next outside lock
>   - mention lock NOPs in comments
> v2 (hot path)
>   - Add r-b tags
> ---
>  cpu-exec.c | 48 +++++++++++++++++++++---------------------------
>  1 file changed, 21 insertions(+), 27 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 10ce1cb..dd0bd50 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -291,35 +291,29 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>       * Pairs with smp_wmb() in tb_phys_invalidate(). */
>      smp_rmb();
>      tb = tb_find_physical(cpu, pc, cs_base, flags);
> -    if (tb) {
> -        goto found;
> -    }
> +    if (!tb) {
>  
> -#ifdef CONFIG_USER_ONLY
> -    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
> -     * taken outside tb_lock.  Since we're momentarily dropping
> -     * tb_lock, there's a chance that our desired tb has been
> -     * translated.
> -     */
> -    tb_unlock();
> -    mmap_lock();
> -    tb_lock();
> -    tb = tb_find_physical(cpu, pc, cs_base, flags);
> -    if (tb) {
> -        mmap_unlock();
> -        goto found;
> -    }
> -#endif
> +        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
> +         * taken outside tb_lock. As system emulation is currently
> +         * single threaded the locks are NOPs.
> +         */
> +        mmap_lock();
> +        tb_lock();
>  
> -    /* if no translated code available, then translate it now */
> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> +        /* There's a chance that our desired tb has been translated while
> +         * taking the locks so we check again inside the lock.
> +         */
> +        tb = tb_find_physical(cpu, pc, cs_base, flags);
> +        if (!tb) {
> +            /* if no translated code available, then translate it now */
> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> +        }
>  
> -#ifdef CONFIG_USER_ONLY
> -    mmap_unlock();
> -#endif
> +        tb_unlock();
> +        mmap_unlock();
> +    }
>  
> -found:
> -    /* we add the TB in the virtual pc hash table */
> +    /* We add the TB in the virtual pc hash table for the fast lookup */
>      atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);

There can be a race with tb_phys_invalidate() if we do this out of
tb_lock. Suppose:
(1) Thread 1: The first tb_find_phys() out of the lock is successful
(2) Thread 2: Remove the TB from the global hash table with qht_remove()
(3) Thread 2: Reset the 'tb_jmp_cache' entry
(4) Thread 1: Set the 'tb_jmp_cache' entry back (this line)

So it is only safe to do the 'tb_jmp_cache' lookup out of the lock, not
to set it.
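
A sketch of one possible way out, assuming a hypothetical 'invalid'
flag that tb_phys_invalidate() would set under tb_lock before step (3):

    /* Publish to tb_jmp_cache only under tb_lock and only if the TB
     * has not been invalidated meanwhile, so this atomic_set() cannot
     * resurrect an entry that tb_phys_invalidate() has just reset.
     */
    tb_lock();
    if (!tb->invalid) {
        atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
    }
    tb_unlock();

This costs taking tb_lock in tb_find_slow() even when the first lookup
succeeds, though.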

Kind regards,
Sergey

>      return tb;
>  }
> @@ -337,7 +331,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>         always be the same before a given translated block
>         is executed. */
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> -    tb_lock();
>      tb = atomic_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
> @@ -361,9 +354,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> +        tb_lock();
>          tb_add_jump(*last_tb, tb_exit, tb);
> +        tb_unlock();
>      }
> -    tb_unlock();
>      return tb;
>  }
>  

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 19:55                 ` Sergey Fedorov
@ 2016-07-08 20:18                   ` Paolo Bonzini
  2016-07-08 20:24                     ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2016-07-08 20:18 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Sergey Fedorov, qemu-devel, Alex Bennée, patches, mttcg,
	fred konrad, a rigo, cota, bobby prani, rth, mark burton,
	jan kiszka, peter maydell, claudio fontana, Peter Crosthwaite



On 08/07/2016 21:55, Sergey Fedorov wrote:
> On 08/07/16 17:07, Paolo Bonzini wrote:
>>
>> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>>> to the two critical sections, do you?
>>> If we have a separate lock for direct jump set/reset then we can do fast
>>> TB lookup + direct jump patching without taking tb_lock at all. How much
>>> this would reduce lock contention largely depends on the workload we use.
>> Yeah, it probably would be easy enough that it's hard to object to it
>> (unlike the other idea below, which I'm not very comfortable with, at
>> least without seeing patches).
>>
>> The main advantage would be that this tiny lock could be a spinlock
>> rather than a mutex.
> 
> Well, the problem is more subtle than we thought: tb_find_fast() can
> race with tb_phys_invalidate(). The first tb_find_phys() out of the lock
> can return a TB which is being invalidated. Then a direct jump can be
> set up to this TB. It can happen after concurrent tb_phys_invalidate()
> resets all the direct jumps to the TB. Thus we can end up with a direct
> jump to an invalidated TB. Even extending tb_lock critical section
> wouldn't help if at least one tb_find_phys() is performed out of the lock.

Ahem, isn't this exactly why tb_phys_invalidate() was invalidating the
PC in my patches, as the very first step?...  (The smp_wmb after
invalidating the PC paired with an atomic_rcu_read in tb_find_fast; now
we could do it after computing the hash and before calling qht_remove).

It turned out that invalidating the PC wasn't as easy as writing -1 to
the pc, but it's possible to do one of these:

1) set cs_base to an invalid value (all-ones works for everything
except x86; in fact anything nonzero is enough, except on x86 and
SPARC)

2) set the flags to an invalid combination (x86 can use all ones or
rename the useless HF_SOFTMMU_MASK to HF_INVALID_MASK).
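
Option 1 as a sketch (the helper name is made up, and x86 would need
the flags trick instead):

    static void tb_mark_invalid(TranslationBlock *tb)
    {
        /* No guest context can ever match an all-ones cs_base, so a
         * concurrent lock-free lookup that compares cs_base now fails.
         */
        tb->cs_base = -1;
        /* Publish the invalidation before the TB is unlinked; pairs
         * with the read barrier on the lookup side.
         */
        smp_wmb();
    }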

Paolo

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 20:18                   ` Paolo Bonzini
@ 2016-07-08 20:24                     ` Sergey Fedorov
  2016-07-08 20:52                       ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sergey Fedorov, qemu-devel, Alex Bennée, patches, mttcg,
	fred konrad, a rigo, cota, bobby prani, rth, mark burton,
	jan kiszka, peter maydell, claudio fontana, Peter Crosthwaite

On 08/07/16 23:18, Paolo Bonzini wrote:
>
> On 08/07/2016 21:55, Sergey Fedorov wrote:
>> On 08/07/16 17:07, Paolo Bonzini wrote:
>>> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>>>> to the two critical sections, do you?
>>>> If we have a separate lock for direct jump set/reset then we can do fast
>>>> TB lookup + direct jump patching without taking tb_lock at all. How much
>>>> this would reduce lock contention largely depends on the workload we use.
>>> Yeah, it probably would be easy enough that it's hard to object to it
>>> (unlike the other idea below, which I'm not very comfortable with, at
>>> least without seeing patches).
>>>
>>> The main advantage would be that this tiny lock could be a spinlock
>>> rather than a mutex.
>> Well, the problem is more subtle than we thought: tb_find_fast() can
>> race with tb_phys_invalidate(). The first tb_find_phys() out of the lock
>> can return a TB which is being invalidated. Then a direct jump can be
>> set up to this TB. It can happen after concurrent tb_phys_invalidate()
>> resets all the direct jumps to the TB. Thus we can end up with a direct
>> jump to an invalidated TB. Even extending tb_lock critical section
>> wouldn't help if at least one tb_find_phys() is performed out of the lock.
> Ahem, isn't this exactly why tb_phys_invalidate() was invalidating the
> PC in my patches, as the very first step?...  (The smp_wmb after
> invalidating the PC paired with an atomic_rcu_read in tb_find_fast; now
> we could do it after computing the hash and before calling qht_remove).
>
> It turned out that invalidating the PC wasn't as easy as writing -1 to
> the pc, but it's possible to do one of these:
>
> 1) set cs_base to an invalid value (all-ones works for everything
> except x86; in fact anything nonzero is enough, except on x86 and
> SPARC)
>
> 2) set the flags to an invalid combination (x86 can use all ones or
> rename the useless HF_SOFTMMU_MASK to HF_INVALID_MASK).

I remember, I've just found that we discussed it in this thread:

http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852

I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
out of the lock, too. This changes things.
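
For reference, after the QHT conversion tb_find_physical() is
essentially the following (quoting from memory), and QHT is designed
for concurrent readers, so no tb_lock is needed on the read side:

    tb_page_addr_t phys_pc;
    struct tb_desc desc;
    uint32_t h;

    desc.env = (CPUArchState *)cpu->env_ptr;
    desc.cs_base = cs_base;
    desc.flags = flags;
    desc.pc = pc;
    phys_pc = get_page_addr_code(desc.env, pc);
    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
    h = tb_hash_func(phys_pc, pc, flags);
    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);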

Kind regards,
Sergey

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 20:24                     ` Sergey Fedorov
@ 2016-07-08 20:52                       ` Paolo Bonzini
  2016-07-11 13:06                         ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2016-07-08 20:52 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: mttcg, peter maydell, claudio fontana, Sergey Fedorov, patches,
	jan kiszka, Peter Crosthwaite, mark burton, qemu-devel, a rigo,
	cota, bobby prani, rth, Alex Bennée, fred konrad



On 08/07/2016 22:24, Sergey Fedorov wrote:
> I remember, I've just found that we discussed it in this thread:
> 
> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
> 
> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
> out of the lock, too. This changes things.

But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
section", which originally was written by Fred---most of my contribution
was getting the invalidation right, not the lock-free lookup)
tb_find_physical was also done out of the lock.  It was then retried
inside the lock, if it failed.

This is why I needed to fail all concurrent lookups as the first step in
the invalidation.

Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
concepts are the same.
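
In other words, the invalidation side went roughly like this sketch
('h' is the hash of the TB being removed, and the marking helper is
hypothetical):

    /* Make the TB un-findable first, then unlink it, so that a
     * lock-free tb_find_physical() racing with us either misses it
     * or retries under tb_lock.
     */
    tb_mark_invalid(tb);        /* e.g. invalid cs_base + smp_wmb() */
    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
    /* ... then reset tb_jmp_cache entries and unchain direct jumps */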

Paolo

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-08 20:52                       ` Paolo Bonzini
@ 2016-07-11 13:06                         ` Sergey Fedorov
  2016-07-11 14:03                           ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-11 13:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, peter maydell, claudio fontana, Sergey Fedorov, patches,
	jan kiszka, Peter Crosthwaite, mark burton, qemu-devel, a rigo,
	cota, bobby prani, rth, Alex Bennée, fred konrad

On 08/07/16 23:52, Paolo Bonzini wrote:
>
> On 08/07/2016 22:24, Sergey Fedorov wrote:
>> I remember, I've just found that we discussed it in this thread:
>>
>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>
>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>> out of the lock, too. This changes things.
> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
> section", which originally was written by Fred---most of my contribution
> was getting the invalidation right, not the lock-free lookup)
> tb_find_physical was also done out of the lock.  It was then retried
> inside the lock, if it failed.
>
> This is why I needed to fail all concurrent lookups as the first step in
> the invalidation.
>
> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
> concepts are the same.

That could work, I think, if we re-check under tb_lock whether the TB is
still valid before adding a direct jump to it.

Kind regards,
Sergey

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-11 13:06                         ` Sergey Fedorov
@ 2016-07-11 14:03                           ` Paolo Bonzini
  2016-07-11 14:27                             ` Sergey Fedorov
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2016-07-11 14:03 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: mttcg, peter maydell, claudio fontana, Sergey Fedorov, patches,
	jan kiszka, Peter Crosthwaite, mark burton, qemu-devel, a rigo,
	cota, bobby prani, rth, Alex Bennée, fred konrad



On 11/07/2016 15:06, Sergey Fedorov wrote:
> On 08/07/16 23:52, Paolo Bonzini wrote:
>>
>> On 08/07/2016 22:24, Sergey Fedorov wrote:
>>> I remember, I've just found that we discussed it in this thread:
>>>
>>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>>
>>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>>> out of the lock, too. This changes things.
>> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
>> section", which originally was written by Fred---most of my contribution
>> was getting the invalidation right, not the lock-free lookup)
>> tb_find_physical was also done out of the lock.  It was then retried
>> inside the lock, if it failed.
>>
>> This is why I needed to fail all concurrent lookups as the first step in
>> the invalidation.
>>
>> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
>> concepts are the same.
> 
> That could work, I think, if we re-check under tb_lock whether the TB is
> still valid before adding a direct jump to it.

Right, this can still happen:

	tb_find_fast			tb_phys_invalidate
					 tb_lock
	 jmp_cache miss
	 -> tb_find_slow
	  -> tb_find_physical
	   QHT hit
	 tb_lock
					 invalidate tb->pc
					 remove from lists
					 tb_unlock
	 tb_add_jump
	 tb_unlock

I seem to recall that Emilio added a seqlock for this purpose, but
adding a tb_check_invalidated(TranslationBlock *tb) inline function will
also do.
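
Something along these lines (a sketch; the constant is made up and the
invalid-cs_base scheme from earlier in the thread is assumed):

    static inline bool tb_check_invalidated(TranslationBlock *tb)
    {
        /* cs_base is set to an impossible value, published with
         * smp_wmb(), as the first step of tb_phys_invalidate().
         */
        return tb->cs_base == TB_INVALID_CS_BASE;
    }

Then tb_find_fast() would patch the jump only if the TB is still valid
once tb_lock is held:

    tb_lock();
    if (!tb_check_invalidated(tb)) {
        tb_add_jump(*last_tb, tb_exit, tb);
    }
    tb_unlock();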

Paolo

* Re: [Qemu-devel] [PATCH 3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
  2016-07-11 14:03                           ` Paolo Bonzini
@ 2016-07-11 14:27                             ` Sergey Fedorov
  0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-07-11 14:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, peter maydell, claudio fontana, Sergey Fedorov, patches,
	jan kiszka, Peter Crosthwaite, mark burton, qemu-devel, a rigo,
	cota, bobby prani, rth, Alex Bennée, fred konrad

On 11/07/16 17:03, Paolo Bonzini wrote:
>
> On 11/07/2016 15:06, Sergey Fedorov wrote:
>> On 08/07/16 23:52, Paolo Bonzini wrote:
>>> On 08/07/2016 22:24, Sergey Fedorov wrote:
>>>> I remember, I've just found that we discussed it in this thread:
>>>>
>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>>>
>>>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>>>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>>>> out of the lock, too. This changes things.
>>> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
>>> section", which originally was written by Fred---most of my contribution
>>> was getting the invalidation right, not the lock-free lookup)
>>> tb_find_physical was also done out of the lock.  It was then retried
>>> inside the lock, if it failed.
>>>
>>> This is why I needed to fail all concurrent lookups as the first step in
>>> the invalidation.
>>>
>>> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
>>> concepts are the same.
>> That could work, I think, if we re-check under tb_lock whether the TB is
>> still valid before adding a direct jump to it.
> Right, this can still happen:
>
> 	tb_find_fast			tb_phys_invalidate
> 					 tb_lock
> 	 jmp_cache miss
> 	 -> tb_find_slow
> 	  -> tb_find_physical
> 	   QHT hit
> 	 tb_lock
> 					 invalidate tb->pc
> 					 remove from lists
> 					 tb_unlock
> 	 tb_add_jump
> 	 tb_unlock
>
> I seem to recall that Emilio added a seqlock for this purpose, but
> adding a tb_check_invalidated(TranslationBlock *tb) inline function will
> also do.

He used seqlock for 'tb_jmp_cache' only:
http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356774

He also added a dedicated field to the TranslationBlock structure to
mark it invalid:
http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356785

Kind regards,
Sergey
