* [Qemu-devel] [RFC 0/6] Dynamic TLB sizing
@ 2018-10-06 21:45 Emilio G. Cota
  2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée

After reading this paper [1], I wondered how far one
could push the idea of dynamic TLB resizing. We discussed
it briefly in this thread:

 https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02340.html

Since then, (1) rth helped me (thanks!) with TCG backend code,
and (2) I've abandoned the idea of substituting malloc
for memset, and instead focused on dynamically resizing the
TLBs. The rationale is that if a process touches a lot of
memory, having a large TLB will pay off, since the perf
gains will dwarf the increased cost of flushing via memset.
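
Concretely, the indirection just means that the fast path masks
the address with a per-MMU-index size loaded at run time instead
of a compile-time constant. A minimal sketch of the change (the
real code is in patch 1):

  /* before */
  index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

  /* after */
  index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);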

This series shows that the indirection necessary to do this
does not cause a perf decrease, at least for x86_64 hosts.

This series is incomplete, since it only implements changes
to the i386 backend, and it probably only works on x86_64.
But the whole point is to (1) see whether the performance gains
are worth it, and (2) discuss how crazy this approach is. I've
been looking for things that might break badly, but so far I've
found no obvious issues. There might, however, be assumptions
about the TLB size baked into the code that I have missed, so
please point those out if they exist.

Performance numbers are in the last patch.

You can fetch this series from:
  https://github.com/cota/qemu/tree/tlb-dyn

Note that it applies on top of my tlb-lock-v3 series:
  https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01087.html

Thanks,

		Emilio

[1] "Optimizing Memory Translation Emulation in Full System Emulators",
Tong et al, TACO'15 https://dl.acm.org/citation.cfm?id=2686034


* [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing
  2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
  2018-10-08  1:47   ` Richard Henderson
  2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée

No dynamic sizing yet, but the indirection is there.

XXX:
- convert other TCG backends

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/softmmu_template.h     | 14 +++++----
 include/exec/cpu-defs.h          | 14 ++++++---
 include/exec/cpu_ldst.h          |  2 +-
 include/exec/cpu_ldst_template.h |  6 ++--
 accel/tcg/cputlb.c               | 49 +++++++++++++++++++++-----------
 tcg/i386/tcg-target.inc.c        | 26 ++++++++---------
 6 files changed, 68 insertions(+), 43 deletions(-)

diff --git a/accel/tcg/softmmu_template.h b/accel/tcg/softmmu_template.h
index 1e50263871..3f5a0d4017 100644
--- a/accel/tcg/softmmu_template.h
+++ b/accel/tcg/softmmu_template.h
@@ -112,7 +112,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr,
                             TCGMemOpIdx oi, uintptr_t retaddr)
 {
     unsigned mmu_idx = get_mmuidx(oi);
-    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     unsigned a_bits = get_alignment_bits(get_memop(oi));
     uintptr_t haddr;
@@ -180,7 +180,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr,
                             TCGMemOpIdx oi, uintptr_t retaddr)
 {
     unsigned mmu_idx = get_mmuidx(oi);
-    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     unsigned a_bits = get_alignment_bits(get_memop(oi));
     uintptr_t haddr;
@@ -276,7 +276,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
     unsigned mmu_idx = get_mmuidx(oi);
-    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     target_ulong tlb_addr =
         atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
     unsigned a_bits = get_alignment_bits(get_memop(oi));
@@ -322,7 +322,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
            is already guaranteed to be filled, and that the second page
            cannot evict the first.  */
         page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK;
-        index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+        index2 = (page2 >> TARGET_PAGE_BITS) &
+            (env->tlb_desc[mmu_idx].size - 1);
         tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write);
         if (!tlb_hit_page(tlb_addr2, page2)
             && !VICTIM_TLB_HIT(addr_write, page2)) {
@@ -355,7 +356,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
     unsigned mmu_idx = get_mmuidx(oi);
-    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     target_ulong tlb_addr =
         atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
     unsigned a_bits = get_alignment_bits(get_memop(oi));
@@ -401,7 +402,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
            is already guaranteed to be filled, and that the second page
            cannot evict the first.  */
         page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK;
-        index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+        index2 = (page2 >> TARGET_PAGE_BITS) &
+            (env->tlb_desc[mmu_idx].size - 1);
         tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write);
         if (!tlb_hit_page(tlb_addr2, page2)
             && !VICTIM_TLB_HIT(addr_write, page2)) {
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 4ff62f32bf..fa95a4257e 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -141,13 +141,19 @@ typedef struct CPUIOTLBEntry {
     MemTxAttrs attrs;
 } CPUIOTLBEntry;
 
-#define CPU_COMMON_TLB \
+typedef struct CPUTLBDesc {
+    size_t size;
+    size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
+} CPUTLBDesc;
+
+#define CPU_COMMON_TLB  \
     /* The meaning of the MMU modes is defined in the target code. */   \
-    /* tlb_lock serializes updates to tlb_table and tlb_v_table */      \
+    /* tlb_lock serializes updates to tlb_desc, tlb_table and tlb_v_table */ \
     QemuSpin tlb_lock;                                                  \
-    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
+    CPUTLBDesc tlb_desc[NB_MMU_MODES];                                  \
+    CPUTLBEntry *tlb_table[NB_MMU_MODES];                               \
     CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
-    CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                    \
+    CPUIOTLBEntry *iotlb[NB_MMU_MODES];                                 \
     CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                 \
     size_t tlb_flush_count;                                             \
     target_ulong tlb_flush_addr;                                        \
diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
index 9581587ce1..df452f5977 100644
--- a/include/exec/cpu_ldst.h
+++ b/include/exec/cpu_ldst.h
@@ -416,7 +416,7 @@ static inline void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr,
 #if defined(CONFIG_USER_ONLY)
     return g2h(addr);
 #else
-    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     CPUTLBEntry *tlbentry = &env->tlb_table[mmu_idx][index];
     abi_ptr tlb_addr;
     uintptr_t haddr;
diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h
index ba7a11123c..6ab9909f46 100644
--- a/include/exec/cpu_ldst_template.h
+++ b/include/exec/cpu_ldst_template.h
@@ -94,8 +94,8 @@ glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
 #endif
 
     addr = ptr;
-    page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     mmu_idx = CPU_MMU_INDEX;
+    page_index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ !=
                  (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
         oi = make_memop_idx(SHIFT, mmu_idx);
@@ -132,8 +132,8 @@ glue(glue(glue(cpu_lds, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
 #endif
 
     addr = ptr;
-    page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     mmu_idx = CPU_MMU_INDEX;
+    page_index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ !=
                  (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
         oi = make_memop_idx(SHIFT, mmu_idx);
@@ -174,8 +174,8 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
 #endif
 
     addr = ptr;
-    page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     mmu_idx = CPU_MMU_INDEX;
+    page_index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     if (unlikely(atomic_read(&env->tlb_table[mmu_idx][page_index].addr_write) !=
                  (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
         oi = make_memop_idx(SHIFT, mmu_idx);
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index c7608ccdf8..0b51efc374 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -76,8 +76,17 @@ QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16);
 void tlb_init(CPUState *cpu)
 {
     CPUArchState *env = cpu->env_ptr;
+    int i;
 
     qemu_spin_init(&env->tlb_lock);
+    for (i = 0; i < NB_MMU_MODES; i++) {
+        CPUTLBDesc *desc = &env->tlb_desc[i];
+
+        desc->size = CPU_TLB_SIZE;
+        desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+        env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
+        env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
+    }
 }
 
 /* flush_all_helper: run fn across all cpus
@@ -120,6 +129,7 @@ size_t tlb_flush_count(void)
 static void tlb_flush_nocheck(CPUState *cpu)
 {
     CPUArchState *env = cpu->env_ptr;
+    int i;
 
     /* The QOM tests will trigger tlb_flushes without setting up TCG
      * so we bug out here in that case.
@@ -139,7 +149,10 @@ static void tlb_flush_nocheck(CPUState *cpu)
      * that do not hold the lock are performed by the same owner thread.
      */
     qemu_spin_lock(&env->tlb_lock);
-    memset(env->tlb_table, -1, sizeof(env->tlb_table));
+    for (i = 0; i < NB_MMU_MODES; i++) {
+        memset(env->tlb_table[i], -1,
+               env->tlb_desc[i].size * sizeof(CPUTLBEntry));
+    }
     memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     qemu_spin_unlock(&env->tlb_lock);
 
@@ -200,7 +213,8 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
         if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
             tlb_debug("%d\n", mmu_idx);
 
-            memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
+            memset(env->tlb_table[mmu_idx], -1,
+                   env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
             memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
         }
     }
@@ -286,7 +300,6 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data)
 {
     CPUArchState *env = cpu->env_ptr;
     target_ulong addr = (target_ulong) data.target_ptr;
-    int i;
     int mmu_idx;
 
     assert_cpu_is_self(cpu);
@@ -304,9 +317,10 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data)
     }
 
     addr &= TARGET_PAGE_MASK;
-    i = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     qemu_spin_lock(&env->tlb_lock);
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        int i = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
+
         tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr);
         tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
     }
@@ -339,16 +353,17 @@ static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu,
     target_ulong addr_and_mmuidx = (target_ulong) data.target_ptr;
     target_ulong addr = addr_and_mmuidx & TARGET_PAGE_MASK;
     unsigned long mmu_idx_bitmap = addr_and_mmuidx & ALL_MMUIDX_BITS;
-    int page = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     int mmu_idx;
 
     assert_cpu_is_self(cpu);
 
-    tlb_debug("page:%d addr:"TARGET_FMT_lx" mmu_idx:0x%lx\n",
-              page, addr, mmu_idx_bitmap);
+    tlb_debug("addr: "TARGET_FMT_lx" mmu_idx:0x%lx\n", addr, mmu_idx_bitmap);
 
     qemu_spin_lock(&env->tlb_lock);
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        int page;
+
+        page = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
         if (test_bit(mmu_idx, &mmu_idx_bitmap)) {
             tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr);
             tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
@@ -524,7 +539,7 @@ void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length)
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         unsigned int i;
 
-        for (i = 0; i < CPU_TLB_SIZE; i++) {
+        for (i = 0; i < env->tlb_desc[mmu_idx].size; i++) {
             tlb_reset_dirty_range_locked(&env->tlb_table[mmu_idx][i], start1,
                                          length);
         }
@@ -551,15 +566,15 @@ static inline void tlb_set_dirty1_locked(CPUTLBEntry *tlb_entry,
 void tlb_set_dirty(CPUState *cpu, target_ulong vaddr)
 {
     CPUArchState *env = cpu->env_ptr;
-    int i;
     int mmu_idx;
 
     assert_cpu_is_self(cpu);
 
     vaddr &= TARGET_PAGE_MASK;
-    i = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     qemu_spin_lock(&env->tlb_lock);
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        int i = (vaddr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
+
         tlb_set_dirty1_locked(&env->tlb_table[mmu_idx][i], vaddr);
     }
 
@@ -660,7 +675,8 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     iotlb = memory_region_section_get_iotlb(cpu, section, vaddr_page,
                                             paddr_page, xlat, prot, &address);
 
-    index = (vaddr_page >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    index = (vaddr_page >> TARGET_PAGE_BITS) &
+        (env->tlb_desc[mmu_idx].size - 1);
     te = &env->tlb_table[mmu_idx][index];
 
     /*
@@ -788,7 +804,7 @@ static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
 
         tlb_fill(cpu, addr, size, MMU_DATA_LOAD, mmu_idx, retaddr);
 
-        index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+        index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
         tlb_addr = env->tlb_table[mmu_idx][index].addr_read;
         if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) {
             /* RAM access */
@@ -855,7 +871,7 @@ static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
 
         tlb_fill(cpu, addr, size, MMU_DATA_STORE, mmu_idx, retaddr);
 
-        index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+        index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
         tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
         if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) {
             /* RAM access */
@@ -943,8 +959,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr)
     int mmu_idx, index;
     void *p;
 
-    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     mmu_idx = cpu_mmu_index(env, true);
+    index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     if (unlikely(!tlb_hit(env->tlb_table[mmu_idx][index].addr_code, addr))) {
         if (!VICTIM_TLB_HIT(addr_code, addr)) {
             tlb_fill(ENV_GET_CPU(env), addr, 0, MMU_INST_FETCH, mmu_idx, 0);
@@ -978,7 +994,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr)
 void probe_write(CPUArchState *env, target_ulong addr, int size, int mmu_idx,
                  uintptr_t retaddr)
 {
-    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
     target_ulong tlb_addr =
         atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
 
@@ -998,7 +1014,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
                                NotDirtyInfo *ndi)
 {
     size_t mmu_idx = get_mmuidx(oi);
-    size_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    size_t index = (addr >> TARGET_PAGE_BITS) &
+        (env->tlb_desc[mmu_idx].size - 1);
     CPUTLBEntry *tlbe = &env->tlb_table[mmu_idx][index];
     target_ulong tlb_addr = atomic_read(&tlbe->addr_write);
     TCGMemOp mop = get_memop(oi);
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 436195894b..fce6a94e22 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -330,6 +330,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_ARITH_GvEv	(0x03)		/* ... plus (ARITH_FOO << 3) */
 #define OPC_ANDN        (0xf2 | P_EXT38)
 #define OPC_ADD_GvEv	(OPC_ARITH_GvEv | (ARITH_ADD << 3))
+#define OPC_AND_GvEv    (OPC_ARITH_GvEv | (ARITH_AND << 3))
 #define OPC_BLENDPS     (0x0c | P_EXT3A | P_DATA16)
 #define OPC_BSF         (0xbc | P_EXT)
 #define OPC_BSR         (0xbd | P_EXT)
@@ -1633,6 +1634,15 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     }
 
     tcg_out_mov(s, tlbtype, r0, addrlo);
+    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
+                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
+
+    tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_desc[mem_index].mask));
+
+    tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_table[mem_index]));
+
     /* If the required alignment is at least as large as the access, simply
        copy the address and mask.  For lesser alignments, check that we don't
        cross pages for the complete access.  */
@@ -1642,20 +1652,10 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
         tcg_out_modrm_offset(s, OPC_LEA + trexw, r1, addrlo, s_mask - a_mask);
     }
     tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
-
-    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
-                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
-
     tgen_arithi(s, ARITH_AND + trexw, r1, tlb_mask, 0);
-    tgen_arithi(s, ARITH_AND + tlbrexw, r0,
-                (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0);
-
-    tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0,
-                             offsetof(CPUArchState, tlb_table[mem_index][0])
-                             + which);
 
     /* cmp 0(r0), r1 */
-    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, 0);
+    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which);
 
     /* Prepare for both the fast path add of the tlb addend, and the slow
        path function argument setup.  There are two cases worth note:
@@ -1672,7 +1672,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
         /* cmp 4(r0), addrhi */
-        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, 4);
+        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, which + 4);
 
         /* jne slow_path */
         tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
@@ -1684,7 +1684,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     /* add addend(r0), r1 */
     tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0,
-                         offsetof(CPUTLBEntry, addend) - which);
+                         offsetof(CPUTLBEntry, addend));
 }
 
 /*
-- 
2.17.1


* [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
  2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
  2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
  2018-10-08  2:09   ` Richard Henderson
  2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée

Currently we evict an entry to the victim TLB when it doesn't match
the current address. But it could be that there's no match because
the current entry is invalid. Do not evict the entry to the vtlb
in that case.

This change will help us keep track of the TLB's use rate.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-all.h | 14 ++++++++++++++
 accel/tcg/cputlb.c     |  2 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index 117d2fbbca..d938dedafc 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
     return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
 }
 
+/**
+ * tlb_is_valid - return true if at least one of the addresses is valid
+ * @te: pointer to CPUTLBEntry
+ *
+ * This is useful when we don't have a particular address to compare against,
+ * and we just want to know whether any entry holds valid data.
+ */
+static inline bool tlb_is_valid(const CPUTLBEntry *te)
+{
+    return !(te->addr_read & TLB_INVALID_MASK) ||
+           !(te->addr_write & TLB_INVALID_MASK) ||
+           !(te->addr_code & TLB_INVALID_MASK);
+}
+
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
 #endif /* !CONFIG_USER_ONLY */
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 0b51efc374..0e2c149d6b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -695,7 +695,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
      * Only evict the old entry to the victim tlb if it's for a
      * different page; otherwise just overwrite the stale data.
      */
-    if (!tlb_hit_page_anyprot(te, vaddr_page)) {
+    if (!tlb_hit_page_anyprot(te, vaddr_page) && tlb_is_valid(te)) {
         unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;
         CPUTLBEntry *tv = &env->tlb_v_table[mmu_idx][vidx];
 
-- 
2.17.1


* [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates
  2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
  2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
  2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
  2018-10-08  2:54   ` Richard Henderson
  2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée

This paves the way for implementing a dynamically-sized softmmu.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h |  1 +
 accel/tcg/cputlb.c      | 17 ++++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index fa95a4257e..af9fe04b0b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -144,6 +144,7 @@ typedef struct CPUIOTLBEntry {
 typedef struct CPUTLBDesc {
     size_t size;
     size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
+    size_t used;
 } CPUTLBDesc;
 
 #define CPU_COMMON_TLB  \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 0e2c149d6b..ed19ac0e40 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -84,6 +84,7 @@ void tlb_init(CPUState *cpu)
 
         desc->size = CPU_TLB_SIZE;
         desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+        desc->used = 0;
         env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
         env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
     }
@@ -152,6 +153,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
     for (i = 0; i < NB_MMU_MODES; i++) {
         memset(env->tlb_table[i], -1,
                env->tlb_desc[i].size * sizeof(CPUTLBEntry));
+        env->tlb_desc[i].used = 0;
     }
     memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     qemu_spin_unlock(&env->tlb_lock);
@@ -216,6 +218,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
             memset(env->tlb_table[mmu_idx], -1,
                    env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
             memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+            env->tlb_desc[mmu_idx].used = 0;
         }
     }
     qemu_spin_unlock(&env->tlb_lock);
@@ -276,12 +279,14 @@ static inline bool tlb_hit_page_anyprot(CPUTLBEntry *tlb_entry,
 }
 
 /* Called with tlb_lock held */
-static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry,
+static inline bool tlb_flush_entry_locked(CPUTLBEntry *tlb_entry,
                                           target_ulong page)
 {
     if (tlb_hit_page_anyprot(tlb_entry, page)) {
         memset(tlb_entry, -1, sizeof(*tlb_entry));
+        return true;
     }
+    return false;
 }
 
 /* Called with tlb_lock held */
@@ -321,7 +326,9 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data)
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         int i = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
 
-        tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr);
+        if (tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr)) {
+            env->tlb_desc[mmu_idx].used--;
+        }
         tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
     }
     qemu_spin_unlock(&env->tlb_lock);
@@ -365,7 +372,9 @@ static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu,
 
         page = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
         if (test_bit(mmu_idx, &mmu_idx_bitmap)) {
-            tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr);
+            if (tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr)) {
+                env->tlb_desc[mmu_idx].used--;
+            }
             tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
         }
     }
@@ -702,6 +711,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
         /* Evict the old entry into the victim tlb.  */
         copy_tlb_helper_locked(tv, te);
         env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
+        env->tlb_desc[mmu_idx].used--;
     }
 
     /* refill the tlb */
@@ -753,6 +763,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     }
 
     copy_tlb_helper_locked(te, &tn);
+    env->tlb_desc[mmu_idx].used++;
     qemu_spin_unlock(&env->tlb_lock);
 }
 
-- 
2.17.1


* [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS
  2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
                   ` (2 preceding siblings ...)
  2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
  2018-10-08  2:56   ` Richard Henderson
  2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
  2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
  5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée

From: Pranith Kumar <bobby.prani@gmail.com>

This paves the way for implementing a dynamically-sized softmmu.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/aarch64/tcg-target.h | 1 +
 tcg/arm/tcg-target.h     | 1 +
 tcg/i386/tcg-target.h    | 2 ++
 tcg/mips/tcg-target.h    | 2 ++
 tcg/ppc/tcg-target.h     | 1 +
 tcg/s390/tcg-target.h    | 1 +
 tcg/sparc/tcg-target.h   | 1 +
 tcg/tci/tcg-target.h     | 1 +
 8 files changed, 10 insertions(+)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 9aea1d1771..55af43d55f 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 #undef TCG_TARGET_STACK_GROWSUP
 
 typedef enum {
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 94b3578c55..0cd07906b3 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -60,6 +60,7 @@ extern int arm_arch;
 #undef TCG_TARGET_STACK_GROWSUP
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 8
 
 typedef enum {
     TCG_REG_R0 = 0,
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 9fdf37f23c..4e79e0a550 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -200,6 +200,8 @@ extern bool have_avx2;
 # define TCG_AREG0 TCG_REG_EBP
 #endif
 
+#define TCG_TARGET_TLB_MAX_INDEX_BITS (32 - CPU_TLB_ENTRY_BITS)
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 }
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index a8222476f0..b791e2b4cd 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -39,6 +39,8 @@
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
 #define TCG_TARGET_NB_REGS 32
 
+#define TCG_TARGET_TLB_MAX_INDEX_BITS (16 - CPU_TLB_ENTRY_BITS)
+
 typedef enum {
     TCG_REG_ZERO = 0,
     TCG_REG_AT,
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index be52ad1d2e..e0ad7c122d 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -34,6 +34,7 @@
 #define TCG_TARGET_NB_REGS 32
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 
 typedef enum {
     TCG_REG_R0,  TCG_REG_R1,  TCG_REG_R2,  TCG_REG_R3,
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index 6f2b06a7d1..a1e25e13b3 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -27,6 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 2
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 
 typedef enum TCGReg {
     TCG_REG_R0 = 0,
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index d8339bf010..72ace760d5 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -29,6 +29,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
index 26140d78cb..3f28219afc 100644
--- a/tcg/tci/tcg-target.h
+++ b/tcg/tci/tcg-target.h
@@ -43,6 +43,7 @@
 #define TCG_TARGET_INTERPRETER 1
 #define TCG_TARGET_INSN_UNIT_SIZE 1
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 
 #if UINTPTR_MAX == UINT32_MAX
 # define TCG_TARGET_REG_BITS 32
-- 
2.17.1


* [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE
  2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
                   ` (3 preceding siblings ...)
  2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
  2018-10-08  3:01   ` Richard Henderson
  2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
  5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h   | 6 +++---
 accel/tcg/cputlb.c        | 2 +-
 tcg/i386/tcg-target.inc.c | 3 ++-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index af9fe04b0b..27b9433976 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -67,7 +67,7 @@ typedef uint64_t target_ulong;
 #define CPU_TLB_ENTRY_BITS 5
 #endif
 
-/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
+/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in MIN_CPU_TLB_BITS to ensure that
  * the TLB is not unnecessarily small, but still small enough for the
  * TLB lookup instruction sequence used by the TCG target.
  *
@@ -89,7 +89,7 @@ typedef uint64_t target_ulong;
  * 0x18 (the offset of the addend field in each TLB entry) plus the offset
  * of tlb_table inside env (which is non-trivial but not huge).
  */
-#define CPU_TLB_BITS                                             \
+#define MIN_CPU_TLB_BITS                                         \
     MIN(8,                                                       \
         TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
         (NB_MMU_MODES <= 1 ? 0 :                                 \
@@ -97,7 +97,7 @@ typedef uint64_t target_ulong;
          NB_MMU_MODES <= 4 ? 2 :                                 \
          NB_MMU_MODES <= 8 ? 3 : 4))
 
-#define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
+#define MIN_CPU_TLB_SIZE (1 << MIN_CPU_TLB_BITS)
 
 typedef struct CPUTLBEntry {
     /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index ed19ac0e40..1ca71ecfc4 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -82,7 +82,7 @@ void tlb_init(CPUState *cpu)
     for (i = 0; i < NB_MMU_MODES; i++) {
         CPUTLBDesc *desc = &env->tlb_desc[i];
 
-        desc->size = CPU_TLB_SIZE;
+        desc->size = MIN_CPU_TLB_SIZE;
         desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
         desc->used = 0;
         env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index fce6a94e22..60d8ed5264 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1626,7 +1626,8 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
         }
         if (TCG_TYPE_PTR == TCG_TYPE_I64) {
             hrexw = P_REXW;
-            if (TARGET_PAGE_BITS + CPU_TLB_BITS > 32) {
+            /* XXX the size here is variable */
+            if (TARGET_PAGE_BITS + MIN_CPU_TLB_BITS > 32) {
                 tlbtype = TCG_TYPE_I64;
                 tlbrexw = P_REXW;
             }
-- 
2.17.1


* [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
  2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
                   ` (4 preceding siblings ...)
  2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
  2018-10-07 17:37   ` Philippe Mathieu-Daudé
  2018-10-08  3:21   ` Richard Henderson
  5 siblings, 2 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée

Perform the resizing only on flushes, otherwise we'd
have to take a perf hit by either rehashing the array
or unnecessarily flushing it.

We grow the array aggressively, and reduce the size more
slowly. This accommodates mixed workloads, where some
processes might be memory-heavy while others are not.
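
In short, the policy applied at flush time is the following (this
just condenses tlb_mmu_resize_locked() from the patch below;
max_size is the cap imposed by the TCG backend via
TCG_TARGET_TLB_MAX_INDEX_BITS):

  /* use_rate: % of entries used since the last flush */
  if (use_rate == 100) {
      new_size = MIN(size * 4, max_size);          /* grow aggressively */
  } else if (use_rate > 70) {
      new_size = MIN(size * 2, max_size);
  } else if (use_rate < 30 && ++low_rate_flushes == 100) {
      new_size = MAX(size / 2, MIN_CPU_TLB_SIZE);  /* shrink slowly */
      low_rate_flushes = 0;
  }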

As the following experiments show, this is a net perf gain,
particularly for memory-heavy workloads. Experiments
are run on an Intel i7-6700K CPU @ 4.00GHz.

1. System boot + shutdown, debian aarch64:

- Before (tb-lock-v3):
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7469.363393      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.07% )
    31,507,707,190      cycles                    #    4.218 GHz                      ( +-  0.07% )
    57,101,577,452      instructions              #    1.81  insns per cycle          ( +-  0.08% )
    10,265,531,804      branches                  # 1374.352 M/sec                    ( +-  0.07% )
       173,020,681      branch-misses             #    1.69% of all branches          ( +-  0.10% )

       7.483359063 seconds time elapsed                                          ( +-  0.08% )

- After:
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7185.036730      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
    30,303,501,143      cycles                    #    4.218 GHz                      ( +-  0.11% )
    54,198,386,487      instructions              #    1.79  insns per cycle          ( +-  0.08% )
     9,726,518,945      branches                  # 1353.719 M/sec                    ( +-  0.08% )
       167,082,307      branch-misses             #    1.72% of all branches          ( +-  0.08% )

       7.195597842 seconds time elapsed                                          ( +-  0.11% )

That is, a 3.8% improvement.

2. System boot + shutdown, ubuntu 18.04 x86_64:

- Before (tb-lock-v3):
Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      49971.036482      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.62% )
   210,766,077,140      cycles                    #    4.218 GHz                      ( +-  1.63% )
   428,829,830,790      instructions              #    2.03  insns per cycle          ( +-  0.75% )
    77,313,384,038      branches                  # 1547.164 M/sec                    ( +-  0.54% )
       835,610,706      branch-misses             #    1.08% of all branches          ( +-  2.97% )

      50.003855102 seconds time elapsed                                          ( +-  1.61% )

- After:
 Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      50118.124477      task-clock (msec)         #    0.999 CPUs utilized            ( +-  4.30% )
           132,396      context-switches          #    0.003 M/sec                    ( +-  1.20% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
           167,754      page-faults               #    0.003 M/sec                    ( +-  0.06% )
   211,414,701,601      cycles                    #    4.218 GHz                      ( +-  4.30% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   431,618,818,597      instructions              #    2.04  insns per cycle          ( +-  6.40% )
    80,197,256,524      branches                  # 1600.165 M/sec                    ( +-  8.59% )
       794,830,352      branch-misses             #    0.99% of all branches          ( +-  2.05% )

      50.177077175 seconds time elapsed                                          ( +-  4.23% )

No improvement (within noise range).

3. x86_64 SPEC06int:
                              SPEC06int (test set)
                         [ Y axis: speedup over master ]
  8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
    |                                                                       |
    |                                                   tlb-lock-v3         |
  7 +-+..................$$$...........................+indirection       +-+
    |                    $ $                              +resizing         |
    |                    $ $                                                |
  6 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |                    $ $                                                |
  5 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |                    $ $                                                |
  4 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |          +++       $ $                                                |
  3 +-+........$$+.......$.$..............................................+-+
    |          $$        $ $                                                |
    |          $$        $ $                                 $$$            |
  2 +-+........$$........$.$.................................$.$..........+-+
    |          $$        $ $                                 $ $       +$$  |
    |          $$   $$+  $ $  $$$       +$$                  $ $  $$$   $$  |
  1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
    |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
    |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
  0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
     401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean
png: https://imgur.com/a/b1wn3wc

That is, a 1.53x average speedup over master, with a max speedup of 7.13x.

Note that "indirection" (i.e. the first patch in this series) incurs
no overhead, on average.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h |  1 +
 accel/tcg/cputlb.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 27b9433976..4d1d6b2b8b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
     size_t size;
     size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
     size_t used;
+    size_t n_flushes_low_rate;
 } CPUTLBDesc;
 
 #define CPU_COMMON_TLB  \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 1ca71ecfc4..afb61e9c2b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
         desc->size = MIN_CPU_TLB_SIZE;
         desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
         desc->used = 0;
+        desc->n_flushes_low_rate = 0;
         env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
         env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
     }
@@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
     return count;
 }
 
+/* Call with tlb_lock held */
+static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
+{
+    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
+    size_t rate = desc->used * 100 / desc->size;
+    size_t new_size = desc->size;
+
+    if (rate == 100) {
+        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+    } else if (rate > 70) {
+        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+    } else if (rate < 30) {
+        desc->n_flushes_low_rate++;
+        if (desc->n_flushes_low_rate == 100) {
+            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
+            desc->n_flushes_low_rate = 0;
+        }
+    }
+
+    if (new_size == desc->size) {
+        return;
+    }
+
+    g_free(env->tlb_table[mmu_idx]);
+    g_free(env->iotlb[mmu_idx]);
+
+    desc->size = new_size;
+    desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+    desc->n_flushes_low_rate = 0;
+    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
+    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
+}
+
 /* This is OK because CPU architectures generally permit an
  * implementation to drop entries from the TLB at any time, so
  * flushing more entries than required is only an efficiency issue,
@@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
      */
     qemu_spin_lock(&env->tlb_lock);
     for (i = 0; i < NB_MMU_MODES; i++) {
+        tlb_mmu_resize_locked(env, i);
         memset(env->tlb_table[i], -1,
                env->tlb_desc[i].size * sizeof(CPUTLBEntry));
         env->tlb_desc[i].used = 0;
@@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
         if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
             tlb_debug("%d\n", mmu_idx);
 
+            tlb_mmu_resize_locked(env, mmu_idx);
             memset(env->tlb_table[mmu_idx], -1,
                    env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
             memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
-- 
2.17.1


* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
  2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
@ 2018-10-07 17:37   ` Philippe Mathieu-Daudé
  2018-10-08  1:48     ` Emilio G. Cota
  2018-10-08  3:21   ` Richard Henderson
  1 sibling, 1 reply; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2018-10-07 17:37 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Alex Bennée, Pranith Kumar, Richard Henderson

Hi Emilio,

On 10/6/18 11:45 PM, Emilio G. Cota wrote:
> Perform the resizing only on flushes, otherwise we'd
> have to take a perf hit by either rehashing the array
> or unnecessarily flushing it.
> 
> We grow the array aggressively, and reduce the size more
> slowly. This accommodates mixed workloads, where some
> processes might be memory-heavy while others are not.
> 
> As the following experiments show, this is a net perf gain,
> particularly for memory-heavy workloads. Experiments
> are run on an Intel i7-6700K CPU @ 4.00GHz.
> 
> 1. System boot + shutdown, debian aarch64:
> 
> - Before (tb-lock-v3):
>  Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
> 
>        7469.363393      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.07% )
>     31,507,707,190      cycles                    #    4.218 GHz                      ( +-  0.07% )
>     57,101,577,452      instructions              #    1.81  insns per cycle          ( +-  0.08% )
>     10,265,531,804      branches                  # 1374.352 M/sec                    ( +-  0.07% )
>        173,020,681      branch-misses             #    1.69% of all branches          ( +-  0.10% )
> 
>        7.483359063 seconds time elapsed                                          ( +-  0.08% )
> 
> - After:
>  Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
> 
>        7185.036730      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
>     30,303,501,143      cycles                    #    4.218 GHz                      ( +-  0.11% )
>     54,198,386,487      instructions              #    1.79  insns per cycle          ( +-  0.08% )
>      9,726,518,945      branches                  # 1353.719 M/sec                    ( +-  0.08% )
>        167,082,307      branch-misses             #    1.72% of all branches          ( +-  0.08% )
> 
>        7.195597842 seconds time elapsed                                          ( +-  0.11% )
> 
> That is, a 3.8% improvement.
> 
> 2. System boot + shutdown, ubuntu 18.04 x86_64:

You can also run the VM tests to build QEMU:

$ make vm-test
vm-test: Test QEMU in preconfigured virtual machines

  vm-build-ubuntu.i386            - Build QEMU in ubuntu i386 VM
  vm-build-freebsd                - Build QEMU in FreeBSD VM
  vm-build-netbsd                 - Build QEMU in NetBSD VM
  vm-build-openbsd                - Build QEMU in OpenBSD VM
  vm-build-centos                 - Build QEMU in CentOS VM, with Docker

> 
> - Before (tb-lock-v3):
> Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
> 
>       49971.036482      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.62% )
>    210,766,077,140      cycles                    #    4.218 GHz                      ( +-  1.63% )
>    428,829,830,790      instructions              #    2.03  insns per cycle          ( +-  0.75% )
>     77,313,384,038      branches                  # 1547.164 M/sec                    ( +-  0.54% )
>        835,610,706      branch-misses             #    1.08% of all branches          ( +-  2.97% )
> 
>       50.003855102 seconds time elapsed                                          ( +-  1.61% )
> 
> - After:
>  Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
> 
>       50118.124477      task-clock (msec)         #    0.999 CPUs utilized            ( +-  4.30% )
>            132,396      context-switches          #    0.003 M/sec                    ( +-  1.20% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>            167,754      page-faults               #    0.003 M/sec                    ( +-  0.06% )
>    211,414,701,601      cycles                    #    4.218 GHz                      ( +-  4.30% )
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>    431,618,818,597      instructions              #    2.04  insns per cycle          ( +-  6.40% )
>     80,197,256,524      branches                  # 1600.165 M/sec                    ( +-  8.59% )
>        794,830,352      branch-misses             #    0.99% of all branches          ( +-  2.05% )
> 
>       50.177077175 seconds time elapsed                                          ( +-  4.23% )
> 
> No improvement (within noise range).
> 
> 3. x86_64 SPEC06int:
>                               SPEC06int (test set)
>                          [ Y axis: speedup over master ]
>   8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
>     |                                                                       |
>     |                                                   tlb-lock-v3         |
>   7 +-+..................$$$...........................+indirection       +-+
>     |                    $ $                              +resizing         |
>     |                    $ $                                                |
>   6 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |                    $ $                                                |
>   5 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |                    $ $                                                |
>   4 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |          +++       $ $                                                |
>   3 +-+........$$+.......$.$..............................................+-+
>     |          $$        $ $                                                |
>     |          $$        $ $                                 $$$            |
>   2 +-+........$$........$.$.................................$.$..........+-+
>     |          $$        $ $                                 $ $       +$$  |
>     |          $$   $$+  $ $  $$$       +$$                  $ $  $$$   $$  |
>   1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
>     |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
>     |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
>   0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
>      401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean

This description line is hard to read ;)

> png: https://imgur.com/a/b1wn3wc
> 
> That is, a 1.53x average speedup over master, with a max speedup of 7.13x.
> 
> Note that "indirection" (i.e. the first patch in this series) incurs
> no overhead, on average.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/exec/cpu-defs.h |  1 +
>  accel/tcg/cputlb.c      | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 27b9433976..4d1d6b2b8b 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
>      size_t size;
>      size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
>      size_t used;
> +    size_t n_flushes_low_rate;
>  } CPUTLBDesc;
>  
>  #define CPU_COMMON_TLB  \
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 1ca71ecfc4..afb61e9c2b 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
>          desc->size = MIN_CPU_TLB_SIZE;
>          desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
>          desc->used = 0;
> +        desc->n_flushes_low_rate = 0;
>          env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
>          env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
>      }
> @@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
>      return count;
>  }
>  
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> +    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> +    size_t rate = desc->used * 100 / desc->size;
> +    size_t new_size = desc->size;
> +
> +    if (rate == 100) {
> +        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate > 70) {
> +        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate < 30) {

I wonder if those thresholds might be per TCG_TARGET.

Btw the paper used 40% here, did you try it too?

Regards,

Phil.

> +        desc->n_flushes_low_rate++;
> +        if (desc->n_flushes_low_rate == 100) {
> +            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
> +            desc->n_flushes_low_rate = 0;
> +        }
> +    }
> +
> +    if (new_size == desc->size) {
> +        return;
> +    }
> +
> +    g_free(env->tlb_table[mmu_idx]);
> +    g_free(env->iotlb[mmu_idx]);
> +
> +    desc->size = new_size;
> +    desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
> +    desc->n_flushes_low_rate = 0;
> +    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
> +    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
> +}
> +
>  /* This is OK because CPU architectures generally permit an
>   * implementation to drop entries from the TLB at any time, so
>   * flushing more entries than required is only an efficiency issue,
> @@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
>       */
>      qemu_spin_lock(&env->tlb_lock);
>      for (i = 0; i < NB_MMU_MODES; i++) {
> +        tlb_mmu_resize_locked(env, i);
>          memset(env->tlb_table[i], -1,
>                 env->tlb_desc[i].size * sizeof(CPUTLBEntry));
>          env->tlb_desc[i].used = 0;
> @@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>          if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
>              tlb_debug("%d\n", mmu_idx);
>  
> +            tlb_mmu_resize_locked(env, mmu_idx);
>              memset(env->tlb_table[mmu_idx], -1,
>                     env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
>              memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing
  2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
@ 2018-10-08  1:47   ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08  1:47 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée

On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> -#define CPU_COMMON_TLB \
> +typedef struct CPUTLBDesc {
> +    size_t size;
> +    size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
> +} CPUTLBDesc;

I think you don't need both size and mask.  Size is (or ought to be) used
infrequently enough that I think it can be computed on demand.  But on a
related note, if you remember the out-of-line tlb changes, it'll be easier for
x86 if you index an array of scalars.


> -    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +    int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);

I just posted a patch to functionalize this.  But I imagine

static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx,
                                  target_ulong addr)
{
    uintptr_t ofs = (addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
                    & env->tlb_mask[mmu_idx];
    return ofs >> CPU_TLB_ENTRY_BITS;
}

static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx,
                                     target_ulong addr)
{
    uintptr_t ofs = (addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
                    & env->tlb_mask[mmu_idx];
    return (void *)&env->tlb_table[mmu_idx][0] + ofs;
}

In the few places where we use both of these functions, the compiler ought to
be able to pull out the common sub-expression.
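
As a sanity check on the arithmetic, here is a throwaway standalone
model of the mask convention assumed above (plain C with made-up
constants, not QEMU code):

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS  12            /* illustrative target page size */
#define ENTRY_BITS  5            /* illustrative log2(sizeof(entry)) */

/* mask = (n_entries - 1) << ENTRY_BITS, as in the CPUTLBDesc comment */
static uintptr_t index_from_mask(uint64_t addr, uintptr_t mask)
{
    uintptr_t ofs = (addr >> (PAGE_BITS - ENTRY_BITS)) & mask;
    return ofs >> ENTRY_BITS;    /* byte offset -> entry index */
}

int main(void)
{
    uintptr_t n_entries = 256;
    uintptr_t mask = (n_entries - 1) << ENTRY_BITS;
    uint64_t addr = 0x7f1234567000ULL;

    /* matches (addr >> PAGE_BITS) & (n_entries - 1) */
    printf("index       = %zu\n", (size_t)index_from_mask(addr, mask));
    /* the table size in bytes falls out of the same mask */
    printf("table bytes = %zu\n", (size_t)(mask + (1 << ENTRY_BITS)));
    return 0;
}

That is, the single pre-shifted mask yields the fast-path byte offset,
the entry index and the table size, which is why keeping just one mask
scalar per mmu_idx (and not the size) is enough.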


> @@ -139,7 +149,10 @@ static void tlb_flush_nocheck(CPUState *cpu)
>       * that do not hold the lock are performed by the same owner thread.
>       */
>      qemu_spin_lock(&env->tlb_lock);
> -    memset(env->tlb_table, -1, sizeof(env->tlb_table));
> +    for (i = 0; i < NB_MMU_MODES; i++) {
> +        memset(env->tlb_table[i], -1,
> +               env->tlb_desc[i].size * sizeof(CPUTLBEntry));

Here we can use something like

static inline uintptr_t sizeof_tlb(CPUArchState *env, uintptr_t mmu_idx)
{
    return env->tlb_mask[mmu_idx] + (1 << CPU_TLB_ENTRY_BITS);
}

memset(env->tlb_table[i], -1, sizeof_tlb(env, i));

> -        for (i = 0; i < CPU_TLB_SIZE; i++) {
> +        for (i = 0; i < env->tlb_desc[mmu_idx].size; i++) {

This seems to be one of the few places where you need the number of entries
rather than the size.  Perhaps

   for (i = sizeof_tlb(env, mmu_idx) >> CPU_TLB_ENTRY_BITS; i-- > 0; ) {

Or if you can think of a name for the function of the same effect...
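
For instance (only a sketch, still deriving everything from the mask):

static inline uintptr_t tlb_n_entries(CPUArchState *env, uintptr_t mmu_idx)
{
    return (env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS) + 1;
}

    for (i = 0; i < tlb_n_entries(env, mmu_idx); i++) {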


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
  2018-10-07 17:37   ` Philippe Mathieu-Daudé
@ 2018-10-08  1:48     ` Emilio G. Cota
  2018-10-08 13:46       ` Emilio G. Cota
  0 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08  1:48 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel, Alex Bennée, Pranith Kumar, Richard Henderson

On Sun, Oct 07, 2018 at 19:37:50 +0200, Philippe Mathieu-Daudé wrote:
> On 10/6/18 11:45 PM, Emilio G. Cota wrote:
> > 2. System boot + shutdown, ubuntu 18.04 x86_64:
> 
> You can also run the VM tests to build QEMU:
> 
> $ make vm-test

Thanks, will give that a look.

> > +    if (rate == 100) {
> > +        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> > +    } else if (rate > 70) {
> > +        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> > +    } else if (rate < 30) {
> 
> I wonder if those thresholds might be per TCG_TARGET.

Do you mean to tune the growth rate based on each TCG target?
(max and min are already determined by the TCG target).
The optimal growth rate is mostly dependent on the guest
workload, so I wouldn't expect the TCG target to matter
much.

That said, we could spend quite some time tweaking the
TLB sizing algorithm. But with this RFC I wanted to (a) see
whether this approach is a good idea at all, and (b) show
what 'easy' speedups might look like (because converting
all TCG targets is a pain, so it had better be justified).

> Btw the paper used 40% here, did you try it too?

Yes, I tried several alternatives including what the
paper describes, i.e. (skipping the min/max checks
for simplicity):

	if (rate > 70) {
		new_size = 2 * old_size;
	} else if (rate < 40) {
		new_size = old_size / 2;
	}

But that didn't give great speedups (see "resizing-paper"
set):
  https://imgur.com/a/w3AqHP7

A few points stand out to me:

- We get very different speedups even if we implement
  the algorithm they describe (not sure that's exactly
  what they implemented though). But there are many
  variables that could explain that, e.g. different guest
  images (and therefore different TLB flush rates) and
  different QEMU baselines (ours is faster than the paper's,
  so getting speedups is harder).

- 70/40% use rate for growing/shrinking the TLB does not
  seem a great choice, if one wants to avoid a pathological
  case that can induce constant resizing. Imagine we got
  exactly 70% use rate, and all TLB misses were compulsory
  (i.e. a direct-mapped TLB would have not prevented a
  single miss). We'd then double the TLB size:
    size_new = 2*size_old
  But then the use rate will halve:
    use_new = 0.7/2 = 0.35
  So we'd then end up in a grow-shrink loop!
  Picking a "shrink threshold" below 0.70/2=0.35 avoids this
  (a toy model right after this list illustrates the loop).

- Aggressively increasing the TLB size when usage is high
  makes sense. However, reducing the size at the same
  rate does not make much sense. Imagine the following
  scenario with two processes being scheduled: one process
  uses a lot of memory, and the other one uses little, but
  both are CPU-intensive and are therefore assigned similar
  time slices by the scheduler. Ideally you'd resize the TLB
  to meet each process' memory demands. However, at flush
  time we don't even know what process is running or about
  to run, so we have to size the TLB exclusively based on
  recent use rates. In this scenario you're probably close
  to optimal if you size the TLB to meet the demands of the
  most memory-hungry process. You'll lose some extra time
  flushing the (now larger) TLB, but your net gain is likely
  to be positive given the TLB fills you won't have to do
  when the memory-heavy process is scheduled in.
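
To make the grow-shrink point above concrete, here is a throwaway
model (plain C, not QEMU code; it ignores the 4x growth step and the
"100 consecutive low-rate flushes" damping from the patch, and assumes
a fixed working set of pages between flushes):

#include <stdio.h>

/* one flush interval: grow/shrink decision based on the use rate */
static size_t next_size(size_t size, size_t used, size_t shrink_pct)
{
    size_t rate = used * 100 / size;

    if (rate > 70) {
        return size * 2;
    } else if (rate < shrink_pct) {
        return size / 2;
    }
    return size;
}

int main(void)
{
    size_t working_set = 182;          /* pages touched between flushes */
    size_t shrink[2] = { 40, 30 };     /* paper-like vs. this series */

    for (int t = 0; t < 2; t++) {
        size_t size = 256;

        printf("shrink threshold %zu%%:", shrink[t]);
        for (int i = 0; i < 8; i++) {
            size_t used = working_set < size ? working_set : size;

            printf(" %zu", size);
            size = next_size(size, used, shrink[t]);
        }
        printf("\n");
    }
    return 0;
}

With a 40% shrink threshold the size ping-pongs between 256 and 512
on every flush; with 30% it grows once and then settles at 512.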

So to me it's quite likely that in the paper they
could have gotten even better results by reducing the
shrink rate, like we did.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
  2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
@ 2018-10-08  2:09   ` Richard Henderson
  2018-10-08 14:42     ` Emilio G. Cota
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2018-10-08  2:09 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée

On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> Currently we evict an entry to the victim TLB when it doesn't match
> the current address. But it could be that there's no match because
> the current entry is invalid. Do not evict the entry to the vtlb
> in that case.
> 
> This change will help us keep track of the TLB's use rate.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/exec/cpu-all.h | 14 ++++++++++++++
>  accel/tcg/cputlb.c     |  2 +-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> index 117d2fbbca..d938dedafc 100644
> --- a/include/exec/cpu-all.h
> +++ b/include/exec/cpu-all.h
> @@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
>      return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
>  }
>  
> +/**
> + * tlb_is_valid - return true if at least one of the addresses is valid
> + * @te: pointer to CPUTLBEntry
> + *
> + * This is useful when we don't have a particular address to compare against,
> + * and we just want to know whether any entry holds valid data.
> + */
> +static inline bool tlb_is_valid(const CPUTLBEntry *te)
> +{
> +    return !(te->addr_read & TLB_INVALID_MASK) ||
> +           !(te->addr_write & TLB_INVALID_MASK) ||
> +           !(te->addr_code & TLB_INVALID_MASK);
> +}

No, I think you misunderstand.

First, TLB_INVALID_MASK is only ever set for addr_write, in response to
PAGE_WRITE_INV.  Second, an entry that is invalid for write is still valid for
read+exec.  So there is benefit to swapping it out to the victim cache.

This is used by the s390x target to make the "lowpage" read-only, which is a
special architected 512 byte range within pages 0 and 1.  This is done by
forcing writes, but not reads, back through tlb_fill.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates
  2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
@ 2018-10-08  2:54   ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08  2:54 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée

On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> @@ -753,6 +763,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>      }
>  
>      copy_tlb_helper_locked(te, &tn);
> +    env->tlb_desc[mmu_idx].used++;
>      qemu_spin_unlock(&env->tlb_lock);
>  }

So.. you're computing what?  Total TLB occupancy?
Perhaps a better name than "used"?

Otherwise the patch looks plausible.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS
  2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
@ 2018-10-08  2:56   ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08  2:56 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée

On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> From: Pranith Kumar <bobby.prani@gmail.com>
> 
> This paves the way for implementing a dynamically-sized softmmu.
> 
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---

There's no point in this, since the original constraint was due to encoding
immediates in the and instruction.  Now we're using tlb_mask.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE
  2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
@ 2018-10-08  3:01   ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08  3:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée

On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> @@ -89,7 +89,7 @@ typedef uint64_t target_ulong;
>   * 0x18 (the offset of the addend field in each TLB entry) plus the offset
>   * of tlb_table inside env (which is non-trivial but not huge).
>   */
> -#define CPU_TLB_BITS                                             \
> +#define MIN_CPU_TLB_BITS                                         \
>      MIN(8,                                                       \
>          TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
>          (NB_MMU_MODES <= 1 ? 0 :                                 \

There's no point in this either, since the original constraint was due to the
immediate offset into an add instruction.  Now we're loading the base address
from an array.  The actual size of the tlb is immaterial now, since the size of
the tlb does not affect the size of CPUArchState.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
  2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
  2018-10-07 17:37   ` Philippe Mathieu-Daudé
@ 2018-10-08  3:21   ` Richard Henderson
  1 sibling, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08  3:21 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée

On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> @@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
>      return count;
>  }
>  
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> +    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> +    size_t rate = desc->used * 100 / desc->size;
> +    size_t new_size = desc->size;
> +
> +    if (rate == 100) {
> +        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate > 70) {
> +        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate < 30) {
> +        desc->n_flushes_low_rate++;
> +        if (desc->n_flushes_low_rate == 100) {
> +            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
> +            desc->n_flushes_low_rate = 0;
> +        }
> +    }
> +
> +    if (new_size == desc->size) {

s/desc->size/old_size/g
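
I.e. (just sketching the shape; the stores at the end still go
through desc):

    size_t old_size = desc->size;
    size_t rate = desc->used * 100 / old_size;
    size_t new_size = old_size;
    ...
    if (new_size == old_size) {
        return;
    }
    ...
    desc->size = new_size;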
Otherwise it looks plausible as a first cut.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
  2018-10-08  1:48     ` Emilio G. Cota
@ 2018-10-08 13:46       ` Emilio G. Cota
  0 siblings, 0 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08 13:46 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel, Alex Bennée, Pranith Kumar, Richard Henderson

On Sun, Oct 07, 2018 at 21:48:34 -0400, Emilio G. Cota wrote:
> - 70/40% use rate for growing/shrinking the TLB does not
>   seem a great choice, if one wants to avoid a pathological
>   case that can induce constant resizing. Imagine we got
>   exactly 70% use rate, and all TLB misses were compulsory
>   (i.e. a direct-mapped TLB would have not prevented a
            ^^^
>   single miss). We'd then double the TLB size:

I meant fully associative.

		E.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
  2018-10-08  2:09   ` Richard Henderson
@ 2018-10-08 14:42     ` Emilio G. Cota
  2018-10-08 19:46       ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08 14:42 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, Pranith Kumar, Alex Bennée

On Sun, Oct 07, 2018 at 19:09:01 -0700, Richard Henderson wrote:
> On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> > Currently we evict an entry to the victim TLB when it doesn't match
> > the current address. But it could be that there's no match because
> > the current entry is invalid. Do not evict the entry to the vtlb
> > in that case.
> > 
> > This change will help us keep track of the TLB's use rate.
> > 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  include/exec/cpu-all.h | 14 ++++++++++++++
> >  accel/tcg/cputlb.c     |  2 +-
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> > index 117d2fbbca..d938dedafc 100644
> > --- a/include/exec/cpu-all.h
> > +++ b/include/exec/cpu-all.h
> > @@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
> >      return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
> >  }
> >  
> > +/**
> > + * tlb_is_valid - return true if at least one of the addresses is valid
> > + * @te: pointer to CPUTLBEntry
> > + *
> > + * This is useful when we don't have a particular address to compare against,
> > + * and we just want to know whether any entry holds valid data.
> > + */
> > +static inline bool tlb_is_valid(const CPUTLBEntry *te)
> > +{
> > +    return !(te->addr_read & TLB_INVALID_MASK) ||
> > +           !(te->addr_write & TLB_INVALID_MASK) ||
> > +           !(te->addr_code & TLB_INVALID_MASK);
> > +}
> 
> No, I think you misunderstand.
> 
> First, TLB_INVALID_MASK is only ever set for addr_write, in response to
> PAGE_WRITE_INV.  Second, an entry that is invalid for write is still valid for
> read+exec.  So there is benefit to swapping it out to the victim cache.
> 
> This is used by the s390x target to make the "lowpage" read-only, which is a
> special architected 512 byte range within pages 0 and 1.  This is done by
> forcing writes, but not reads, back through tlb_fill.

Aah I see. The point is to avoid pushing to the victim cache
an entry that is all invalid, not just partially invalid.

Thanks for the clarification!

		Emilio

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
  2018-10-08 14:42     ` Emilio G. Cota
@ 2018-10-08 19:46       ` Richard Henderson
  2018-10-08 20:23         ` Emilio G. Cota
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 19:46 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Pranith Kumar, Alex Bennée

On 10/8/18 7:42 AM, Emilio G. Cota wrote:
> On Sun, Oct 07, 2018 at 19:09:01 -0700, Richard Henderson wrote:
>> On 10/6/18 2:45 PM, Emilio G. Cota wrote:
>>> Currently we evict an entry to the victim TLB when it doesn't match
>>> the current address. But it could be that there's no match because
>>> the current entry is invalid. Do not evict the entry to the vtlb
>>> in that case.
>>>
>>> This change will help us keep track of the TLB's use rate.
>>>
>>> Signed-off-by: Emilio G. Cota <cota@braap.org>
>>> ---
>>>  include/exec/cpu-all.h | 14 ++++++++++++++
>>>  accel/tcg/cputlb.c     |  2 +-
>>>  2 files changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
>>> index 117d2fbbca..d938dedafc 100644
>>> --- a/include/exec/cpu-all.h
>>> +++ b/include/exec/cpu-all.h
>>> @@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
>>>      return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
>>>  }
>>>  
>>> +/**
>>> + * tlb_is_valid - return true if at least one of the addresses is valid
>>> + * @te: pointer to CPUTLBEntry
>>> + *
>>> + * This is useful when we don't have a particular address to compare against,
>>> + * and we just want to know whether any entry holds valid data.
>>> + */
>>> +static inline bool tlb_is_valid(const CPUTLBEntry *te)
>>> +{
>>> +    return !(te->addr_read & TLB_INVALID_MASK) ||
>>> +           !(te->addr_write & TLB_INVALID_MASK) ||
>>> +           !(te->addr_code & TLB_INVALID_MASK);
>>> +}
>>
>> No, I think you misunderstand.
>>
>> First, TLB_INVALID_MASK is only ever set for addr_write, in response to
>> PAGE_WRITE_INV.  Second, an entry that is invalid for write is still valid for
>> read+exec.  So there is benefit to swapping it out to the victim cache.
>>
>> This is used by the s390x target to make the "lowpage" read-only, which is a
>> special architected 512 byte range within pages 0 and 1.  This is done by
>> forcing writes, but not reads, back through tlb_fill.
> 
> Aah I see. The point is to avoid pushing to the victim cache
> an entry that is all invalid, not just partially invalid.

I've slightly misspoken here.  It is (ab)used for the s390 thing, but that bit
is also set by memset -1.  Perhaps you might check

static inline bool tlb_is_invalid(const CPUTLBEntry *te)
{
    return te->addr_read & te->addr_write & te->addr_code & TLB_INVALID_MASK;
}

That is, all forms of access have the INVALID bit set.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
  2018-10-08 19:46       ` Richard Henderson
@ 2018-10-08 20:23         ` Emilio G. Cota
  0 siblings, 0 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08 20:23 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, Pranith Kumar, Alex Bennée

On Mon, Oct 08, 2018 at 12:46:26 -0700, Richard Henderson wrote:
> On 10/8/18 7:42 AM, Emilio G. Cota wrote:
> > On Sun, Oct 07, 2018 at 19:09:01 -0700, Richard Henderson wrote:
> >> On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> >>> Currently we evict an entry to the victim TLB when it doesn't match
> >>> the current address. But it could be that there's no match because
> >>> the current entry is invalid. Do not evict the entry to the vtlb
> >>> in that case.
> >>>
> >>> This change will help us keep track of the TLB's use rate.
> >>>
> >>> Signed-off-by: Emilio G. Cota <cota@braap.org>
> >>> ---
> >>>  include/exec/cpu-all.h | 14 ++++++++++++++
> >>>  accel/tcg/cputlb.c     |  2 +-
> >>>  2 files changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> >>> index 117d2fbbca..d938dedafc 100644
> >>> --- a/include/exec/cpu-all.h
> >>> +++ b/include/exec/cpu-all.h
> >>> @@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
> >>>      return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
> >>>  }
> >>>  
> >>> +/**
> >>> + * tlb_is_valid - return true if at least one of the addresses is valid
> >>> + * @te: pointer to CPUTLBEntry
> >>> + *
> >>> + * This is useful when we don't have a particular address to compare against,
> >>> + * and we just want to know whether any entry holds valid data.
> >>> + */
> >>> +static inline bool tlb_is_valid(const CPUTLBEntry *te)
> >>> +{
> >>> +    return !(te->addr_read & TLB_INVALID_MASK) ||
> >>> +           !(te->addr_write & TLB_INVALID_MASK) ||
> >>> +           !(te->addr_code & TLB_INVALID_MASK);
> >>> +}
> >>
> >> No, I think you misunderstand.
> >>
> >> First, TLB_INVALID_MASK is only ever set for addr_write, in response to
> >> PAGE_WRITE_INV.  Second, an entry that is invalid for write is still valid for
> >> read+exec.  So there is benefit to swapping it out to the victim cache.
> >>
> >> This is used by the s390x target to make the "lowpage" read-only, which is a
> >> special architected 512 byte range within pages 0 and 1.  This is done by
> >> forcing writes, but not reads, back through tlb_fill.
> > 
> > Aah I see. The point is to avoid pushing to the victim cache
> > an entry that is all invalid, not just partially invalid.
> 
> I've slightly misspoken here.  It is (ab)used for the s390 thing, but that bit
> is also set by memset -1.  Perhaps you might check
> 
> static inline bool tlb_is_invalid(const CPUTLBEntry *te)
> {
>     return te->addr_read & te->addr_write & te->addr_code & TLB_INVALID_MASK;
> }
> 
> That is, all forms of access have the INVALID bit set.

I see. I just ran a few tests and the below gives the best
performance, so I'll go with it:

static inline bool tlb_is_empty(const CPUTLBEntry *te)
{
    return te->addr_read == -1 && te->addr_write == -1 && te->addr_code == -1;
}

I renamed it from "invalid" to "empty" to avoid even thinking about
the invalid flag.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-10-08 20:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
2018-10-08  1:47   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
2018-10-08  2:09   ` Richard Henderson
2018-10-08 14:42     ` Emilio G. Cota
2018-10-08 19:46       ` Richard Henderson
2018-10-08 20:23         ` Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
2018-10-08  2:54   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
2018-10-08  2:56   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
2018-10-08  3:01   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
2018-10-07 17:37   ` Philippe Mathieu-Daudé
2018-10-08  1:48     ` Emilio G. Cota
2018-10-08 13:46       ` Emilio G. Cota
2018-10-08  3:21   ` Richard Henderson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.