* [Qemu-devel] [PATCH 0/3] per-TLB lock
@ 2018-10-02 21:29 Emilio G. Cota
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 1/3] exec: introduce tlb_init Emilio G. Cota
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Emilio G. Cota @ 2018-10-02 21:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Richard Henderson, Alex Bennée

This series introduces a per-TLB lock. This removes existing UB
(e.g. memset racing with cmpxchg on another thread while flushing),
and in my opinion makes the TLB code simpler to understand.

I had a bit of trouble finding the best place to initialize the
mutex, since it has to be initialized before tlb_flush, and tlb_flush
is called quite early during cpu initialization. I settled on
cpu_exec_realizefn, since at that point cpu->env_ptr has already been
set but tlb_flush hasn't yet been called.

Perf-wise this change does have a small impact (~2% slowdown for
the aarch64 bootup+shutdown test; 1.2% comes from using atomic_read
consistently), but I think this is a fair price for avoiding UB.
Numbers below.

Initially I tried using atomics instead of memset for flushing (i.e.
no mutex), and the slowdown was close to 2X due to the repeated
(full) memory barriers. That's when I turned to using a lock.
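
For reference, here is a minimal standalone sketch of the two approaches
(illustrative only: the names and sizes are made up, this is not the QEMU
code):

/*
 * flush_sketch.c - illustrative only, not the QEMU code.
 * Contrasts (a) one sequentially consistent store per TLB entry with
 * (b) a single lock acquisition around a plain memset.
 * gcc -O2 -o flush_sketch flush_sketch.c -lpthread
 */
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 4096

static uintptr_t tlb[TLB_ENTRIES];
static pthread_mutex_t tlb_lock = PTHREAD_MUTEX_INITIALIZER;

/* (a) per-entry atomics: a full memory barrier per entry */
static void flush_with_atomics(void)
{
    size_t i;

    for (i = 0; i < TLB_ENTRIES; i++) {
        __atomic_store_n(&tlb[i], (uintptr_t)-1, __ATOMIC_SEQ_CST);
    }
}

/* (b) per-TLB lock: one (uncontended) lock/unlock, then a plain memset */
static void flush_with_lock(void)
{
    pthread_mutex_lock(&tlb_lock);
    memset(tlb, -1, sizeof(tlb));
    pthread_mutex_unlock(&tlb_lock);
}

int main(void)
{
    flush_with_atomics();
    flush_with_lock();
    return 0;
}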

Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

- Before this series:
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7464.797838      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.14% )
    31,473,652,436      cycles                    #    4.216 GHz                      ( +-  0.14% )
    57,032,288,549      instructions              #    1.81  insns per cycle          ( +-  0.08% )
    10,239,975,873      branches                  # 1371.769 M/sec                    ( +-  0.07% )
       172,150,358      branch-misses             #    1.68% of all branches          ( +-  0.12% )

       7.482009203 seconds time elapsed                                          ( +-  0.18% )

- After:
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7621.625434      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.10% )
    32,149,898,976      cycles                    #    4.218 GHz                      ( +-  0.10% )
    58,168,454,452      instructions              #    1.81  insns per cycle          ( +-  0.10% )
    10,486,183,612      branches                  # 1375.846 M/sec                    ( +-  0.10% )
       173,900,633      branch-misses             #    1.66% of all branches          ( +-  0.11% )

       7.632067213 seconds time elapsed                                          ( +-  0.10% )

This series is checkpatch-clean. You can fetch the code from:
  https://github.com/cota/qemu/tree/tlb-lock

Thanks,

		Emilio


* [Qemu-devel] [PATCH 1/3] exec: introduce tlb_init
  2018-10-02 21:29 [Qemu-devel] [PATCH 0/3] per-TLB lock Emilio G. Cota
@ 2018-10-02 21:29 ` Emilio G. Cota
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock Emilio G. Cota
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Emilio G. Cota @ 2018-10-02 21:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Richard Henderson, Alex Bennée

Paves the way for the addition of a per-TLB lock.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h | 8 ++++++++
 accel/tcg/cputlb.c      | 4 ++++
 exec.c                  | 1 +
 3 files changed, 13 insertions(+)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 5f78125582..815e5b1e83 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -99,6 +99,11 @@ void cpu_address_space_init(CPUState *cpu, int asidx,
 
 #if !defined(CONFIG_USER_ONLY) && defined(CONFIG_TCG)
 /* cputlb.c */
+/**
+ * tlb_init - initialize a CPU's TLB
+ * @cpu: CPU whose TLB should be initialized
+ */
+void tlb_init(CPUState *cpu);
 /**
  * tlb_flush_page:
  * @cpu: CPU whose TLB should be flushed
@@ -258,6 +263,9 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
 void probe_write(CPUArchState *env, target_ulong addr, int size, int mmu_idx,
                  uintptr_t retaddr);
 #else
+static inline void tlb_init(CPUState *cpu)
+{
+}
 static inline void tlb_flush_page(CPUState *cpu, target_ulong addr)
 {
 }
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index f4702ce91f..502eea2850 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -73,6 +73,10 @@ QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(run_on_cpu_data));
 QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16);
 #define ALL_MMUIDX_BITS ((1 << NB_MMU_MODES) - 1)
 
+void tlb_init(CPUState *cpu)
+{
+}
+
 /* flush_all_helper: run fn across all cpus
  *
  * If the wait flag is set then the src cpu's helper will be queued as
diff --git a/exec.c b/exec.c
index 6826c8337d..595deae346 100644
--- a/exec.c
+++ b/exec.c
@@ -965,6 +965,7 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
         tcg_target_initialized = true;
         cc->tcg_initialize();
     }
+    tlb_init(cpu);
 
 #ifndef CONFIG_USER_ONLY
     if (qdev_get_vmsd(DEVICE(cpu)) == NULL) {
-- 
2.17.1


* [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-02 21:29 [Qemu-devel] [PATCH 0/3] per-TLB lock Emilio G. Cota
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 1/3] exec: introduce tlb_init Emilio G. Cota
@ 2018-10-02 21:29 ` Emilio G. Cota
  2018-10-03  9:19   ` Alex Bennée
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 3/3] cputlb: read CPUTLBEntry.addr_write atomically Emilio G. Cota
  2018-10-03  7:56 ` [Qemu-devel] [PATCH 0/3] per-TLB lock Paolo Bonzini
  3 siblings, 1 reply; 12+ messages in thread
From: Emilio G. Cota @ 2018-10-02 21:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Richard Henderson, Alex Bennée

Currently we rely on atomic operations for cross-CPU invalidations.
There are two cases that these atomics miss: cross-CPU invalidations
can race with either (1) vCPU threads flushing their TLB, which
happens via memset, or (2) vCPUs calling tlb_reset_dirty on their TLB,
which updates .addr_write with a regular store. This results in
undefined behaviour, since we're mixing regular and atomic ops
on concurrent accesses.

Fix it by using tlb_lock, a per-vCPU lock. All updaters of tlb_table
and the corresponding victim cache now hold the lock.
The readers that do not hold tlb_lock must use atomic reads when
reading .addr_write, since this field can be updated by other threads;
the conversion to atomic reads is done in the next patch.

Note that an alternative fix would be to expand the use of atomic ops.
However, in the case of TLB flushes this would have a huge performance
impact, since (1) TLB flushes can happen very frequently and (2) we
currently use a full memory barrier to flush each TLB entry, and a TLB
has many entries. Instead, acquiring the lock is barely slower than a
full memory barrier since it is uncontended, and with a single lock
acquisition we can flush the entire TLB.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h |   2 +
 accel/tcg/cputlb.c      | 108 ++++++++++++++++++----------------------
 2 files changed, 50 insertions(+), 60 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index a171ffc1a4..bcc40c8ef5 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -142,6 +142,8 @@ typedef struct CPUIOTLBEntry {
 
 #define CPU_COMMON_TLB \
     /* The meaning of the MMU modes is defined in the target code. */   \
+    /* tlb_lock serializes updates to tlb_table and tlb_v_table */      \
+    QemuMutex tlb_lock;                                                 \
     CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
     CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
     CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                    \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 502eea2850..01b33f9d28 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -75,6 +75,9 @@ QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16);
 
 void tlb_init(CPUState *cpu)
 {
+    CPUArchState *env = cpu->env_ptr;
+
+    qemu_mutex_init(&env->tlb_lock);
 }
 
 /* flush_all_helper: run fn across all cpus
@@ -129,8 +132,11 @@ static void tlb_flush_nocheck(CPUState *cpu)
     atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
     tlb_debug("(count: %zu)\n", tlb_flush_count());
 
+    qemu_mutex_lock(&env->tlb_lock);
     memset(env->tlb_table, -1, sizeof(env->tlb_table));
     memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
+    qemu_mutex_unlock(&env->tlb_lock);
+
     cpu_tb_jmp_cache_clear(cpu);
 
     env->vtlb_index = 0;
@@ -454,72 +460,48 @@ void tlb_unprotect_code(ram_addr_t ram_addr)
  * most usual is detecting writes to code regions which may invalidate
  * generated code.
  *
- * Because we want other vCPUs to respond to changes straight away we
- * update the te->addr_write field atomically. If the TLB entry has
- * been changed by the vCPU in the mean time we skip the update.
+ * Other vCPUs might be reading their TLBs during guest execution, so we update
+ * te->addr_write with atomic_set. We don't need to worry about this for
+ * oversized guests as MTTCG is disabled for them.
  *
- * As this function uses atomic accesses we also need to ensure
- * updates to tlb_entries follow the same access rules. We don't need
- * to worry about this for oversized guests as MTTCG is disabled for
- * them.
+ * Called with tlb_lock held.
  */
-
-static void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry, uintptr_t start,
-                           uintptr_t length)
+static void tlb_reset_dirty_range_locked(CPUTLBEntry *tlb_entry,
+                                         uintptr_t start, uintptr_t length)
 {
-#if TCG_OVERSIZED_GUEST
     uintptr_t addr = tlb_entry->addr_write;
 
     if ((addr & (TLB_INVALID_MASK | TLB_MMIO | TLB_NOTDIRTY)) == 0) {
         addr &= TARGET_PAGE_MASK;
         addr += tlb_entry->addend;
         if ((addr - start) < length) {
+#if TCG_OVERSIZED_GUEST
             tlb_entry->addr_write |= TLB_NOTDIRTY;
-        }
-    }
 #else
-    /* paired with atomic_mb_set in tlb_set_page_with_attrs */
-    uintptr_t orig_addr = atomic_mb_read(&tlb_entry->addr_write);
-    uintptr_t addr = orig_addr;
-
-    if ((addr & (TLB_INVALID_MASK | TLB_MMIO | TLB_NOTDIRTY)) == 0) {
-        addr &= TARGET_PAGE_MASK;
-        addr += atomic_read(&tlb_entry->addend);
-        if ((addr - start) < length) {
-            uintptr_t notdirty_addr = orig_addr | TLB_NOTDIRTY;
-            atomic_cmpxchg(&tlb_entry->addr_write, orig_addr, notdirty_addr);
+            atomic_set(&tlb_entry->addr_write,
+                       tlb_entry->addr_write | TLB_NOTDIRTY);
+#endif
         }
     }
-#endif
 }
 
-/* For atomic correctness when running MTTCG we need to use the right
- * primitives when copying entries */
-static inline void copy_tlb_helper(CPUTLBEntry *d, CPUTLBEntry *s,
-                                   bool atomic_set)
+/* Called with tlb_lock held */
+static void copy_tlb_helper_locked(CPUTLBEntry *d, const CPUTLBEntry *s)
 {
-#if TCG_OVERSIZED_GUEST
     *d = *s;
-#else
-    if (atomic_set) {
-        d->addr_read = s->addr_read;
-        d->addr_code = s->addr_code;
-        atomic_set(&d->addend, atomic_read(&s->addend));
-        /* Pairs with flag setting in tlb_reset_dirty_range */
-        atomic_mb_set(&d->addr_write, atomic_read(&s->addr_write));
-    } else {
-        d->addr_read = s->addr_read;
-        d->addr_write = atomic_read(&s->addr_write);
-        d->addr_code = s->addr_code;
-        d->addend = atomic_read(&s->addend);
-    }
-#endif
+}
+
+static void copy_tlb_helper(CPUArchState *env, CPUTLBEntry *d, CPUTLBEntry *s)
+{
+    qemu_mutex_lock(&env->tlb_lock);
+    copy_tlb_helper_locked(d, s);
+    qemu_mutex_unlock(&env->tlb_lock);
 }
 
 /* This is a cross vCPU call (i.e. another vCPU resetting the flags of
- * the target vCPU). As such care needs to be taken that we don't
- * dangerously race with another vCPU update. The only thing actually
- * updated is the target TLB entry ->addr_write flags.
+ * the target vCPU).
+ * We must take tlb_lock to avoid racing with another vCPU update. The only
+ * thing actually updated is the target TLB entry ->addr_write flags.
  */
 void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length)
 {
@@ -528,22 +510,26 @@ void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length)
     int mmu_idx;
 
     env = cpu->env_ptr;
+    qemu_mutex_lock(&env->tlb_lock);
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         unsigned int i;
 
         for (i = 0; i < CPU_TLB_SIZE; i++) {
-            tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
-                                  start1, length);
+            tlb_reset_dirty_range_locked(&env->tlb_table[mmu_idx][i], start1,
+                                         length);
         }
 
         for (i = 0; i < CPU_VTLB_SIZE; i++) {
-            tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
-                                  start1, length);
+            tlb_reset_dirty_range_locked(&env->tlb_v_table[mmu_idx][i], start1,
+                                         length);
         }
     }
+    qemu_mutex_unlock(&env->tlb_lock);
 }
 
-static inline void tlb_set_dirty1(CPUTLBEntry *tlb_entry, target_ulong vaddr)
+/* Called with tlb_lock held */
+static inline void tlb_set_dirty1_locked(CPUTLBEntry *tlb_entry,
+                                         target_ulong vaddr)
 {
     if (tlb_entry->addr_write == (vaddr | TLB_NOTDIRTY)) {
         tlb_entry->addr_write = vaddr;
@@ -562,16 +548,18 @@ void tlb_set_dirty(CPUState *cpu, target_ulong vaddr)
 
     vaddr &= TARGET_PAGE_MASK;
     i = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    qemu_mutex_lock(&env->tlb_lock);
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
-        tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
+        tlb_set_dirty1_locked(&env->tlb_table[mmu_idx][i], vaddr);
     }
 
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         int k;
         for (k = 0; k < CPU_VTLB_SIZE; k++) {
-            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
+            tlb_set_dirty1_locked(&env->tlb_v_table[mmu_idx][k], vaddr);
         }
     }
+    qemu_mutex_unlock(&env->tlb_lock);
 }
 
 /* Our TLB does not support large pages, so remember the area covered by
@@ -677,7 +665,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
         CPUTLBEntry *tv = &env->tlb_v_table[mmu_idx][vidx];
 
         /* Evict the old entry into the victim tlb.  */
-        copy_tlb_helper(tv, te, true);
+        copy_tlb_helper(env, tv, te);
         env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
     }
 
@@ -729,9 +717,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
         }
     }
 
-    /* Pairs with flag setting in tlb_reset_dirty_range */
-    copy_tlb_helper(te, &tn, true);
-    /* atomic_mb_set(&te->addr_write, write_address); */
+    copy_tlb_helper(env, te, &tn);
 }
 
 /* Add a new TLB entry, but without specifying the memory
@@ -903,9 +889,11 @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
             /* Found entry in victim tlb, swap tlb and iotlb.  */
             CPUTLBEntry tmptlb, *tlb = &env->tlb_table[mmu_idx][index];
 
-            copy_tlb_helper(&tmptlb, tlb, false);
-            copy_tlb_helper(tlb, vtlb, true);
-            copy_tlb_helper(vtlb, &tmptlb, true);
+            qemu_mutex_lock(&env->tlb_lock);
+            copy_tlb_helper_locked(&tmptlb, tlb);
+            copy_tlb_helper_locked(tlb, vtlb);
+            copy_tlb_helper_locked(vtlb, &tmptlb);
+            qemu_mutex_unlock(&env->tlb_lock);
 
             CPUIOTLBEntry tmpio, *io = &env->iotlb[mmu_idx][index];
             CPUIOTLBEntry *vio = &env->iotlb_v[mmu_idx][vidx];
-- 
2.17.1


* [Qemu-devel] [PATCH 3/3] cputlb: read CPUTLBEntry.addr_write atomically
  2018-10-02 21:29 [Qemu-devel] [PATCH 0/3] per-TLB lock Emilio G. Cota
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 1/3] exec: introduce tlb_init Emilio G. Cota
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock Emilio G. Cota
@ 2018-10-02 21:29 ` Emilio G. Cota
  2018-10-03  7:56 ` [Qemu-devel] [PATCH 0/3] per-TLB lock Paolo Bonzini
  3 siblings, 0 replies; 12+ messages in thread
From: Emilio G. Cota @ 2018-10-02 21:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Richard Henderson, Alex Bennée

Updates can come from other threads, so readers that do not
take tlb_lock must use atomic_read to avoid undefined behaviour.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/softmmu_template.h     | 16 ++++++++++------
 include/exec/cpu_ldst.h          |  2 +-
 include/exec/cpu_ldst_template.h |  2 +-
 accel/tcg/cputlb.c               | 15 +++++++++------
 4 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/accel/tcg/softmmu_template.h b/accel/tcg/softmmu_template.h
index f060a693d4..1e50263871 100644
--- a/accel/tcg/softmmu_template.h
+++ b/accel/tcg/softmmu_template.h
@@ -277,7 +277,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 {
     unsigned mmu_idx = get_mmuidx(oi);
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    target_ulong tlb_addr =
+        atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
     unsigned a_bits = get_alignment_bits(get_memop(oi));
     uintptr_t haddr;
 
@@ -292,7 +293,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, MMU_DATA_STORE,
                      mmu_idx, retaddr);
         }
-        tlb_addr = env->tlb_table[mmu_idx][index].addr_write & ~TLB_INVALID_MASK;
+        tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write) &
+            ~TLB_INVALID_MASK;
     }
 
     /* Handle an IO access.  */
@@ -321,7 +323,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
            cannot evict the first.  */
         page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK;
         index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-        tlb_addr2 = env->tlb_table[mmu_idx][index2].addr_write;
+        tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write);
         if (!tlb_hit_page(tlb_addr2, page2)
             && !VICTIM_TLB_HIT(addr_write, page2)) {
             tlb_fill(ENV_GET_CPU(env), page2, DATA_SIZE, MMU_DATA_STORE,
@@ -354,7 +356,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 {
     unsigned mmu_idx = get_mmuidx(oi);
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    target_ulong tlb_addr =
+        atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
     unsigned a_bits = get_alignment_bits(get_memop(oi));
     uintptr_t haddr;
 
@@ -369,7 +372,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, MMU_DATA_STORE,
                      mmu_idx, retaddr);
         }
-        tlb_addr = env->tlb_table[mmu_idx][index].addr_write & ~TLB_INVALID_MASK;
+        tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write) &
+            ~TLB_INVALID_MASK;
     }
 
     /* Handle an IO access.  */
@@ -398,7 +402,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
            cannot evict the first.  */
         page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK;
         index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-        tlb_addr2 = env->tlb_table[mmu_idx][index2].addr_write;
+        tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write);
         if (!tlb_hit_page(tlb_addr2, page2)
             && !VICTIM_TLB_HIT(addr_write, page2)) {
             tlb_fill(ENV_GET_CPU(env), page2, DATA_SIZE, MMU_DATA_STORE,
diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
index 41ed0526e2..9581587ce1 100644
--- a/include/exec/cpu_ldst.h
+++ b/include/exec/cpu_ldst.h
@@ -426,7 +426,7 @@ static inline void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr,
         tlb_addr = tlbentry->addr_read;
         break;
     case 1:
-        tlb_addr = tlbentry->addr_write;
+        tlb_addr = atomic_read(&tlbentry->addr_write);
         break;
     case 2:
         tlb_addr = tlbentry->addr_code;
diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h
index 4db2302962..ba7a11123c 100644
--- a/include/exec/cpu_ldst_template.h
+++ b/include/exec/cpu_ldst_template.h
@@ -176,7 +176,7 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
     addr = ptr;
     page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     mmu_idx = CPU_MMU_INDEX;
-    if (unlikely(env->tlb_table[mmu_idx][page_index].addr_write !=
+    if (unlikely(atomic_read(&env->tlb_table[mmu_idx][page_index].addr_write) !=
                  (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
         oi = make_memop_idx(SHIFT, mmu_idx);
         glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, v, oi,
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 01b33f9d28..77091474ad 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -249,7 +249,7 @@ static inline bool tlb_hit_page_anyprot(CPUTLBEntry *tlb_entry,
                                         target_ulong page)
 {
     return tlb_hit_page(tlb_entry->addr_read, page) ||
-           tlb_hit_page(tlb_entry->addr_write, page) ||
+           tlb_hit_page(atomic_read(&tlb_entry->addr_write), page) ||
            tlb_hit_page(tlb_entry->addr_code, page);
 }
 
@@ -836,7 +836,7 @@ static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
         tlb_fill(cpu, addr, size, MMU_DATA_STORE, mmu_idx, retaddr);
 
         index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+        tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
         if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) {
             /* RAM access */
             uintptr_t haddr = addr + env->tlb_table[mmu_idx][index].addend;
@@ -883,7 +883,9 @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
     size_t vidx;
     for (vidx = 0; vidx < CPU_VTLB_SIZE; ++vidx) {
         CPUTLBEntry *vtlb = &env->tlb_v_table[mmu_idx][vidx];
-        target_ulong cmp = *(target_ulong *)((uintptr_t)vtlb + elt_ofs);
+        /* elt_ofs might correspond to .addr_write, so use atomic_read */
+        target_ulong cmp =
+            atomic_read((target_ulong *)((uintptr_t)vtlb + elt_ofs));
 
         if (cmp == page) {
             /* Found entry in victim tlb, swap tlb and iotlb.  */
@@ -955,7 +957,8 @@ void probe_write(CPUArchState *env, target_ulong addr, int size, int mmu_idx,
                  uintptr_t retaddr)
 {
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    target_ulong tlb_addr =
+        atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
 
     if (!tlb_hit(tlb_addr, addr)) {
         /* TLB entry is for a different page */
@@ -975,7 +978,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
     size_t mmu_idx = get_mmuidx(oi);
     size_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     CPUTLBEntry *tlbe = &env->tlb_table[mmu_idx][index];
-    target_ulong tlb_addr = tlbe->addr_write;
+    target_ulong tlb_addr = atomic_read(&tlbe->addr_write);
     TCGMemOp mop = get_memop(oi);
     int a_bits = get_alignment_bits(mop);
     int s_bits = mop & MO_SIZE;
@@ -1006,7 +1009,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
             tlb_fill(ENV_GET_CPU(env), addr, 1 << s_bits, MMU_DATA_STORE,
                      mmu_idx, retaddr);
         }
-        tlb_addr = tlbe->addr_write & ~TLB_INVALID_MASK;
+        tlb_addr = atomic_read(&tlbe->addr_write) & ~TLB_INVALID_MASK;
     }
 
     /* Notice an IO access or a needs-MMU-lookup access */
-- 
2.17.1


* Re: [Qemu-devel] [PATCH 0/3] per-TLB lock
  2018-10-02 21:29 [Qemu-devel] [PATCH 0/3] per-TLB lock Emilio G. Cota
                   ` (2 preceding siblings ...)
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 3/3] cputlb: read CPUTLBEntry.addr_write atomically Emilio G. Cota
@ 2018-10-03  7:56 ` Paolo Bonzini
  3 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2018-10-03  7:56 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Richard Henderson, Alex Bennée

On 02/10/2018 23:29, Emilio G. Cota wrote:
> This series introduces a per-TLB lock. This removes existing UB
> (e.g. memset racing with cmpxchg on another thread while flushing),
> and in my opinion makes the TLB code simpler to understand.
> 
> I had a bit of trouble finding the best place to initialize the
> mutex, since it has to be called before tlb_flush, and tlb_flush
> is called quite early during cpu initialization. I settled on
> cpu_exec_realizefn, since then cpu->env_ptr has been set
> but tlb_flush hasn't yet been called.
> 
> Perf-wise this change does have a small impact (~2% slowdown for
> the aarch64 bootup+shutdown test; 1.2% comes from using atomic_read
> consistently), but I think this is a fair price for avoiding UB.

The UB is unlikely to be an issue in practice, but I like the
simplicity.  In retrospect it was premature optimization.

Paolo


* Re: [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-02 21:29 ` [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock Emilio G. Cota
@ 2018-10-03  9:19   ` Alex Bennée
  2018-10-03 10:02     ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Bennée @ 2018-10-03  9:19 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Currently we rely on atomic operations for cross-CPU invalidations.
> There are two cases that these atomics miss: cross-CPU invalidations
> can race with either (1) vCPU threads flushing their TLB, which
> happens via memset, or (2) vCPUs calling tlb_reset_dirty on their TLB,
> which updates .addr_write with a regular store. This results in
> undefined behaviour, since we're mixing regular and atomic ops
> on concurrent accesses.
>
> Fix it by using tlb_lock, a per-vCPU lock. All updaters of tlb_table
> and the corresponding victim cache now hold the lock.
> The readers that do not hold tlb_lock must use atomic reads when
> reading .addr_write, since this field can be updated by other threads;
> the conversion to atomic reads is done in the next patch.

What about the inline TLB lookup code? The original purpose of the
cmpxchg was to ensure the inline code would either see a valid entry or
an invalid one, not a potentially torn read.

> Note that an alternative fix would be to expand the use of atomic ops.
> However, in the case of TLB flushes this would have a huge performance
> impact, since (1) TLB flushes can happen very frequently and (2) we
> currently use a full memory barrier to flush each TLB entry, and a TLB
> has many entries. Instead, acquiring the lock is barely slower than a
> full memory barrier since it is uncontended, and with a single lock
> acquisition we can flush the entire TLB.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/exec/cpu-defs.h |   2 +
>  accel/tcg/cputlb.c      | 108 ++++++++++++++++++----------------------
>  2 files changed, 50 insertions(+), 60 deletions(-)
>
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index a171ffc1a4..bcc40c8ef5 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -142,6 +142,8 @@ typedef struct CPUIOTLBEntry {
>
>  #define CPU_COMMON_TLB \
>      /* The meaning of the MMU modes is defined in the target code. */   \
> +    /* tlb_lock serializes updates to tlb_table and tlb_v_table */      \
> +    QemuMutex tlb_lock;                                                 \
>      CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
>      CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
>      CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                    \
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 502eea2850..01b33f9d28 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -75,6 +75,9 @@ QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16);
>
>  void tlb_init(CPUState *cpu)
>  {
> +    CPUArchState *env = cpu->env_ptr;
> +
> +    qemu_mutex_init(&env->tlb_lock);
>  }
>
>  /* flush_all_helper: run fn across all cpus
> @@ -129,8 +132,11 @@ static void tlb_flush_nocheck(CPUState *cpu)
>      atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
>      tlb_debug("(count: %zu)\n", tlb_flush_count());
>
> +    qemu_mutex_lock(&env->tlb_lock);
>      memset(env->tlb_table, -1, sizeof(env->tlb_table));
>      memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
> +    qemu_mutex_unlock(&env->tlb_lock);
> +
>      cpu_tb_jmp_cache_clear(cpu);
>
>      env->vtlb_index = 0;
> @@ -454,72 +460,48 @@ void tlb_unprotect_code(ram_addr_t ram_addr)
>   * most usual is detecting writes to code regions which may invalidate
>   * generated code.
>   *
> - * Because we want other vCPUs to respond to changes straight away we
> - * update the te->addr_write field atomically. If the TLB entry has
> - * been changed by the vCPU in the mean time we skip the update.
> + * Other vCPUs might be reading their TLBs during guest execution, so we update
> + * te->addr_write with atomic_set. We don't need to worry about this for
> + * oversized guests as MTTCG is disabled for them.
>   *
> - * As this function uses atomic accesses we also need to ensure
> - * updates to tlb_entries follow the same access rules. We don't need
> - * to worry about this for oversized guests as MTTCG is disabled for
> - * them.
> + * Called with tlb_lock held.
>   */
> -
> -static void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry, uintptr_t start,
> -                           uintptr_t length)
> +static void tlb_reset_dirty_range_locked(CPUTLBEntry *tlb_entry,
> +                                         uintptr_t start, uintptr_t length)
>  {
> -#if TCG_OVERSIZED_GUEST
>      uintptr_t addr = tlb_entry->addr_write;
>
>      if ((addr & (TLB_INVALID_MASK | TLB_MMIO | TLB_NOTDIRTY)) == 0) {
>          addr &= TARGET_PAGE_MASK;
>          addr += tlb_entry->addend;
>          if ((addr - start) < length) {
> +#if TCG_OVERSIZED_GUEST
>              tlb_entry->addr_write |= TLB_NOTDIRTY;
> -        }
> -    }
>  #else
> -    /* paired with atomic_mb_set in tlb_set_page_with_attrs */
> -    uintptr_t orig_addr = atomic_mb_read(&tlb_entry->addr_write);
> -    uintptr_t addr = orig_addr;
> -
> -    if ((addr & (TLB_INVALID_MASK | TLB_MMIO | TLB_NOTDIRTY)) == 0) {
> -        addr &= TARGET_PAGE_MASK;
> -        addr += atomic_read(&tlb_entry->addend);
> -        if ((addr - start) < length) {
> -            uintptr_t notdirty_addr = orig_addr | TLB_NOTDIRTY;
> -            atomic_cmpxchg(&tlb_entry->addr_write, orig_addr, notdirty_addr);
> +            atomic_set(&tlb_entry->addr_write,
> +                       tlb_entry->addr_write | TLB_NOTDIRTY);
> +#endif
>          }
>      }
> -#endif
>  }
>
> -/* For atomic correctness when running MTTCG we need to use the right
> - * primitives when copying entries */
> -static inline void copy_tlb_helper(CPUTLBEntry *d, CPUTLBEntry *s,
> -                                   bool atomic_set)
> +/* Called with tlb_lock held */
> +static void copy_tlb_helper_locked(CPUTLBEntry *d, const CPUTLBEntry *s)
>  {
> -#if TCG_OVERSIZED_GUEST
>      *d = *s;
> -#else
> -    if (atomic_set) {
> -        d->addr_read = s->addr_read;
> -        d->addr_code = s->addr_code;
> -        atomic_set(&d->addend, atomic_read(&s->addend));
> -        /* Pairs with flag setting in tlb_reset_dirty_range */
> -        atomic_mb_set(&d->addr_write, atomic_read(&s->addr_write));
> -    } else {
> -        d->addr_read = s->addr_read;
> -        d->addr_write = atomic_read(&s->addr_write);
> -        d->addr_code = s->addr_code;
> -        d->addend = atomic_read(&s->addend);
> -    }
> -#endif
> +}
> +
> +static void copy_tlb_helper(CPUArchState *env, CPUTLBEntry *d, CPUTLBEntry *s)
> +{
> +    qemu_mutex_lock(&env->tlb_lock);
> +    copy_tlb_helper_locked(d, s);
> +    qemu_mutex_unlock(&env->tlb_lock);
>  }
>
>  /* This is a cross vCPU call (i.e. another vCPU resetting the flags of
> - * the target vCPU). As such care needs to be taken that we don't
> - * dangerously race with another vCPU update. The only thing actually
> - * updated is the target TLB entry ->addr_write flags.
> + * the target vCPU).
> + * We must take tlb_lock to avoid racing with another vCPU update. The only
> + * thing actually updated is the target TLB entry ->addr_write flags.
>   */
>  void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length)
>  {
> @@ -528,22 +510,26 @@ void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length)
>      int mmu_idx;
>
>      env = cpu->env_ptr;
> +    qemu_mutex_lock(&env->tlb_lock);
>      for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
>          unsigned int i;
>
>          for (i = 0; i < CPU_TLB_SIZE; i++) {
> -            tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
> -                                  start1, length);
> +            tlb_reset_dirty_range_locked(&env->tlb_table[mmu_idx][i], start1,
> +                                         length);
>          }
>
>          for (i = 0; i < CPU_VTLB_SIZE; i++) {
> -            tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
> -                                  start1, length);
> +            tlb_reset_dirty_range_locked(&env->tlb_v_table[mmu_idx][i], start1,
> +                                         length);
>          }
>      }
> +    qemu_mutex_unlock(&env->tlb_lock);
>  }
>
> -static inline void tlb_set_dirty1(CPUTLBEntry *tlb_entry, target_ulong vaddr)
> +/* Called with tlb_lock held */
> +static inline void tlb_set_dirty1_locked(CPUTLBEntry *tlb_entry,
> +                                         target_ulong vaddr)
>  {
>      if (tlb_entry->addr_write == (vaddr | TLB_NOTDIRTY)) {
>          tlb_entry->addr_write = vaddr;
> @@ -562,16 +548,18 @@ void tlb_set_dirty(CPUState *cpu, target_ulong vaddr)
>
>      vaddr &= TARGET_PAGE_MASK;
>      i = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +    qemu_mutex_lock(&env->tlb_lock);
>      for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
> -        tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
> +        tlb_set_dirty1_locked(&env->tlb_table[mmu_idx][i], vaddr);
>      }
>
>      for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
>          int k;
>          for (k = 0; k < CPU_VTLB_SIZE; k++) {
> -            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
> +            tlb_set_dirty1_locked(&env->tlb_v_table[mmu_idx][k], vaddr);
>          }
>      }
> +    qemu_mutex_unlock(&env->tlb_lock);
>  }
>
>  /* Our TLB does not support large pages, so remember the area covered by
> @@ -677,7 +665,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>          CPUTLBEntry *tv = &env->tlb_v_table[mmu_idx][vidx];
>
>          /* Evict the old entry into the victim tlb.  */
> -        copy_tlb_helper(tv, te, true);
> +        copy_tlb_helper(env, tv, te);
>          env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
>      }
>
> @@ -729,9 +717,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>          }
>      }
>
> -    /* Pairs with flag setting in tlb_reset_dirty_range */
> -    copy_tlb_helper(te, &tn, true);
> -    /* atomic_mb_set(&te->addr_write, write_address); */
> +    copy_tlb_helper(env, te, &tn);
>  }
>
>  /* Add a new TLB entry, but without specifying the memory
> @@ -903,9 +889,11 @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
>              /* Found entry in victim tlb, swap tlb and iotlb.  */
>              CPUTLBEntry tmptlb, *tlb = &env->tlb_table[mmu_idx][index];
>
> -            copy_tlb_helper(&tmptlb, tlb, false);
> -            copy_tlb_helper(tlb, vtlb, true);
> -            copy_tlb_helper(vtlb, &tmptlb, true);
> +            qemu_mutex_lock(&env->tlb_lock);
> +            copy_tlb_helper_locked(&tmptlb, tlb);
> +            copy_tlb_helper_locked(tlb, vtlb);
> +            copy_tlb_helper_locked(vtlb, &tmptlb);
> +            qemu_mutex_unlock(&env->tlb_lock);
>
>              CPUIOTLBEntry tmpio, *io = &env->iotlb[mmu_idx][index];
>              CPUIOTLBEntry *vio = &env->iotlb_v[mmu_idx][vidx];


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-03  9:19   ` Alex Bennée
@ 2018-10-03 10:02     ` Paolo Bonzini
  2018-10-03 15:48       ` Emilio G. Cota
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2018-10-03 10:02 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota; +Cc: qemu-devel, Richard Henderson

On 03/10/2018 11:19, Alex Bennée wrote:
>> Fix it by using tlb_lock, a per-vCPU lock. All updaters of tlb_table
>> and the corresponding victim cache now hold the lock.
>> The readers that do not hold tlb_lock must use atomic reads when
>> reading .addr_write, since this field can be updated by other threads;
>> the conversion to atomic reads is done in the next patch.
> What about the inline TLB lookup code? The original purpose of the
> cmpxchg was to ensure the inline code would either see a valid entry or
> an invalid one, not a potentially torn read.
> 

atomic_set also ensures that there are no torn reads.  However, here:

static void copy_tlb_helper_locked(CPUTLBEntry *d, const CPUTLBEntry *s)
{
#if TCG_OVERSIZED_GUEST
    *d = *s;
#else
    if (atomic_set) {
        d->addr_read = s->addr_read;
        d->addr_code = s->addr_code;
        atomic_set(&d->addend, atomic_read(&s->addend));
        /* Pairs with flag setting in tlb_reset_dirty_range */
        atomic_mb_set(&d->addr_write, atomic_read(&s->addr_write));
    } else {
        d->addr_read = s->addr_read;
        d->addr_write = atomic_read(&s->addr_write);
        d->addr_code = s->addr_code;
        d->addend = atomic_read(&s->addend);
    }
#endif
}

it's probably best to do all atomic_set instead of just the memberwise copy.
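
Something along these lines (just a sketch, untested):

static void copy_tlb_helper_locked(CPUTLBEntry *d, const CPUTLBEntry *s)
{
#if TCG_OVERSIZED_GUEST
    *d = *s;
#else
    atomic_set(&d->addr_read, s->addr_read);
    atomic_set(&d->addr_write, s->addr_write);
    atomic_set(&d->addr_code, s->addr_code);
    atomic_set(&d->addend, s->addend);
#endif
}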

Paolo


* Re: [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-03 10:02     ` Paolo Bonzini
@ 2018-10-03 15:48       ` Emilio G. Cota
  2018-10-03 15:52         ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Emilio G. Cota @ 2018-10-03 15:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alex Bennée, qemu-devel, Richard Henderson

On Wed, Oct 03, 2018 at 12:02:19 +0200, Paolo Bonzini wrote:
> On 03/10/2018 11:19, Alex Bennée wrote:
> >> Fix it by using tlb_lock, a per-vCPU lock. All updaters of tlb_table
> >> and the corresponding victim cache now hold the lock.
> >> The readers that do not hold tlb_lock must use atomic reads when
> >> reading .addr_write, since this field can be updated by other threads;
> >> the conversion to atomic reads is done in the next patch.
> > What about the inline TLB lookup code? The original purpose of the
> > cmpxchg was to ensure the inline code would either see a valid entry or
> > an invalid one, not a potentially torn read.
> > 
> 
> atomic_set also ensures that there are no torn reads.

Yes. On the reader side for inline TLB reads, we're emitting
appropriately sized loads that are guaranteed to be atomic
by the ISA. For oversized guests that isn't possible, so we
disable MTTCG.

>  However, here:
> 
> static void copy_tlb_helper_locked(CPUTLBEntry *d, const CPUTLBEntry *s)
> {
> #if TCG_OVERSIZED_GUEST
>     *d = *s;
> #else
>     if (atomic_set) {
>         d->addr_read = s->addr_read;
>         d->addr_code = s->addr_code;
>         atomic_set(&d->addend, atomic_read(&s->addend));
>         /* Pairs with flag setting in tlb_reset_dirty_range */
>         atomic_mb_set(&d->addr_write, atomic_read(&s->addr_write));
>     } else {
>         d->addr_read = s->addr_read;
>         d->addr_write = atomic_read(&s->addr_write);
>         d->addr_code = s->addr_code;
>         d->addend = atomic_read(&s->addend);
>     }
> #endif
> }
> 
> it's probably best to do all atomic_set instead of just the memberwise copy.

Atomics aren't necessary here, as long as the copy is protected by the
lock. This allows other vCPUs to see a consistent view of the data (since
they always acquire the TLB lock), and since copy_tlb is only called
by the vCPU that owns the TLB, regular reads from this vCPU will always
see consistent data.

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-03 15:48       ` Emilio G. Cota
@ 2018-10-03 15:52         ` Paolo Bonzini
  2018-10-03 17:02           ` Emilio G. Cota
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2018-10-03 15:52 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: Alex Bennée, qemu-devel, Richard Henderson

On 03/10/2018 17:48, Emilio G. Cota wrote:
>> it's probably best to do all atomic_set instead of just the memberwise copy.
> Atomics aren't necessary here, as long as the copy is protected by the
> lock. This allows other vCPUs to see a consistent view of the data (since
> they always acquire the TLB lock), and since copy_tlb is only called
> by the vCPU that owns the TLB, regular reads from this vCPU will always
> see consistent data.

For reads I agree, but you may actually get a torn read if the writer
doesn't use atomic_set.

That's because in order to avoid UB all reads or writes that are
concurrent with another write must be atomic, and the write must be
atomic too.  The lock does prevent write-write concurrent accesses, but
not read-write, so the write must be atomic here.

Paolo


* Re: [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-03 15:52         ` Paolo Bonzini
@ 2018-10-03 17:02           ` Emilio G. Cota
  2018-10-03 17:05             ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Emilio G. Cota @ 2018-10-03 17:02 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alex Bennée, qemu-devel, Richard Henderson

On Wed, Oct 03, 2018 at 17:52:32 +0200, Paolo Bonzini wrote:
> On 03/10/2018 17:48, Emilio G. Cota wrote:
> >> it's probably best to do all atomic_set instead of just the memberwise copy.
> > Atomics aren't necessary here, as long as the copy is protected by the
> > lock. This allows other vCPUs to see a consistent view of the data (since
> > they always acquire the TLB lock), and since copy_tlb is only called
> > by the vCPU that owns the TLB, regular reads from this vCPU will always
> > see consistent data.
> 
> For reads I agree, but you may actually get a torn read if the writer
> doesn't use atomic_set.

But you cannot get a torn read if all reads that don't hold the lock
are coming from the same thread that performed the write.

> That's because in order to avoid UB all reads or writes that are
> concurrent with another write must be atomic, and the write must be
> atomic too.

Agreed, and that's why there's a patch adding atomic_read to all
.addr_write reads not protected by the lock.

> The lock does prevent write-write concurrent accesses, but
> not read-write, so the write must be atomic here.

But here we have no concurrent read-write, because the non-owner
thread always holds the lock, and the owner thread doesn't need
to synchronize with itself.

I'm appending a small example that might help make my point;
.addr_write corresponds to .written_by_all, and the other
fields correspond to .written_by_owner.

Thanks,

		E.

---
/*
 * baz.c
 * gcc -O3 -fsanitize=thread -o baz baz.c
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define atomic_read(ptr)   __atomic_load_n(ptr, __ATOMIC_RELAXED)
#define atomic_set(ptr, i) __atomic_store_n(ptr, i, __ATOMIC_RELAXED)

#define N_ITER 1000000

struct entry {
	int written_by_all;
	int written_by_owner;
};

static struct entry entry;
static pthread_mutex_t lock;

/* Owner thread: reads both fields without the lock, writes them under it. */
void *owner_thread(void *arg)
{
	int i;

	for (i = 0; i < N_ITER; i++) {
		if (i % 2) {
			volatile int a __attribute__((unused));
			volatile int b __attribute__((unused));

			a = atomic_read(&entry.written_by_all);
			b = entry.written_by_owner;
		} else {
			pthread_mutex_lock(&lock);
			entry.written_by_all++;
			entry.written_by_owner++;
			pthread_mutex_unlock(&lock);
		}
	}
	return NULL;
}

/* Non-owner thread: only updates written_by_all, always under the lock,
 * using atomic_set since the owner reads that field lock-free. */
void *other_thread(void *arg)
{
	int i;

	for (i = 0; i < N_ITER; i++) {
		pthread_mutex_lock(&lock);
		atomic_set(&entry.written_by_all, entry.written_by_all + 1);
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t threads[2];
	int i;

	pthread_mutex_init(&lock, NULL);
	if (pthread_create(&threads[0], NULL, owner_thread, NULL)) {
		exit(1);
	}
	if (pthread_create(&threads[1], NULL, other_thread, NULL)) {
		exit(1);
	}
	for (i = 0; i < 2; i++) {
		pthread_join(threads[i], NULL);
	}
	return 0;
}


* Re: [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-03 17:02           ` Emilio G. Cota
@ 2018-10-03 17:05             ` Paolo Bonzini
  2018-10-03 18:07               ` Emilio G. Cota
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2018-10-03 17:05 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: Alex Bennée, qemu-devel, Richard Henderson

On 03/10/2018 19:02, Emilio G. Cota wrote:
>> For reads I agree, but you may actually get a torn read if the writer
>> doesn't use atomic_set.
>
> But you cannot get a torn read if all reads that don't hold the lock
> are coming from the same thread that performed the write.

Ah, so you are relying on copy_tlb_helper(_locked) being invoked only
from the vCPU thread (as opposed to someone else doing tlb_flush)?
Maybe it's worth adding a comment if that's what I missed.

Paolo


* Re: [Qemu-devel] [PATCH 2/3] cputlb: serialize tlb updates with env->tlb_lock
  2018-10-03 17:05             ` Paolo Bonzini
@ 2018-10-03 18:07               ` Emilio G. Cota
  0 siblings, 0 replies; 12+ messages in thread
From: Emilio G. Cota @ 2018-10-03 18:07 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alex Bennée, qemu-devel, Richard Henderson

On Wed, Oct 03, 2018 at 19:05:51 +0200, Paolo Bonzini wrote:
> On 03/10/2018 19:02, Emilio G. Cota wrote:
> >> For reads I agree, but you may actually get a torn read if the writer
> >> doesn't use atomic_set.
> >
> > But you cannot get a torn read if all reads that don't hold the lock
> > are coming from the same thread that performed the write.
> 
> Ah, so you are relying on copy_tlb_helper(_locked) being invoked only
> from the vCPU thread (as opposed to someone else doing tlb_flush)?

Yes. tlb_flush_nocheck is always run by the owner thread--tlb_flush
checks for this, and if !qemu_cpu_is_self(cpu) then it schedules
async work on the owner thread.
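
Roughly this shape (paraphrasing from memory, so the helper names here,
e.g. async_run_on_cpu and tlb_flush_global_async_work, may not match
exactly; see accel/tcg/cputlb.c for the real code):

/* Paraphrased sketch; not the exact QEMU source. */
void tlb_flush(CPUState *cpu)
{
    if (cpu->created && !qemu_cpu_is_self(cpu)) {
        /* cross-vCPU request: defer the flush to the owner thread */
        async_run_on_cpu(cpu, tlb_flush_global_async_work, RUN_ON_CPU_NULL);
    } else {
        tlb_flush_nocheck(cpu);
    }
}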

> Maybe it's worth adding a comment if that's what I missed.

I'll send a v2 with an updated comment and a debug-only assert
in the copy helper.

Thanks,

		E.
