* [Qemu-devel] [RFC 0/6] Dynamic TLB sizing
@ 2018-10-06 21:45 Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
` (5 more replies)
0 siblings, 6 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée
After reading this paper [1], I wondered how far one
could push the idea of dynamic TLB resizing. We discussed
it briefly in this thread:
https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02340.html
Since then, (1) rth helped me (thanks!) with TCG backend code,
and (2) I've abandoned the idea of substituting malloc
for memset, and instead focused on dynamically resizing the
TLBs. The rationale is that if a process touches a lot of
memory, having a large TLB will pay off, since the perf
gains will dwarf the increased cost of flushing via memset.
This series shows that the indirection necessary to do this
does not cause a perf decrease, at least for x86_64 hosts.
This series is incomplete, since it only implements changes
to the i386 backend, and it probably only works on x86_64.
But the whole point is to (1) see whether the performance gains
are worth it, and (2) discuss how crazy this approach is. I was
looking for things to break badly, but so far I've found no obvious
issues. There might, however, be assumptions about the TLB size
baked into the code that I have missed, so please point those
out if they exist.
Performance numbers are in the last patch.
You can fetch this series from:
https://github.com/cota/qemu/tree/tlb-dyn
Note that it applies on top of my tlb-lock-v3 series:
https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01087.html
Thanks,
Emilio
[1] "Optimizing Memory Translation Emulation in Full System Emulators",
Tong et al, TACO'15 https://dl.acm.org/citation.cfm?id=2686034
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
2018-10-08 1:47 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
` (4 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée
No dynamic sizing yet, but the indirection is there.
XXX:
- convert other TCG backends
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
accel/tcg/softmmu_template.h | 14 +++++----
include/exec/cpu-defs.h | 14 ++++++---
include/exec/cpu_ldst.h | 2 +-
include/exec/cpu_ldst_template.h | 6 ++--
accel/tcg/cputlb.c | 49 +++++++++++++++++++++-----------
tcg/i386/tcg-target.inc.c | 26 ++++++++---------
6 files changed, 68 insertions(+), 43 deletions(-)
diff --git a/accel/tcg/softmmu_template.h b/accel/tcg/softmmu_template.h
index 1e50263871..3f5a0d4017 100644
--- a/accel/tcg/softmmu_template.h
+++ b/accel/tcg/softmmu_template.h
@@ -112,7 +112,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr)
{
unsigned mmu_idx = get_mmuidx(oi);
- int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
unsigned a_bits = get_alignment_bits(get_memop(oi));
uintptr_t haddr;
@@ -180,7 +180,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr)
{
unsigned mmu_idx = get_mmuidx(oi);
- int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
unsigned a_bits = get_alignment_bits(get_memop(oi));
uintptr_t haddr;
@@ -276,7 +276,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
{
unsigned mmu_idx = get_mmuidx(oi);
- int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
target_ulong tlb_addr =
atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
unsigned a_bits = get_alignment_bits(get_memop(oi));
@@ -322,7 +322,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
is already guaranteed to be filled, and that the second page
cannot evict the first. */
page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK;
- index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ index2 = (page2 >> TARGET_PAGE_BITS) &
+ (env->tlb_desc[mmu_idx].size - 1);
tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write);
if (!tlb_hit_page(tlb_addr2, page2)
&& !VICTIM_TLB_HIT(addr_write, page2)) {
@@ -355,7 +356,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
{
unsigned mmu_idx = get_mmuidx(oi);
- int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
target_ulong tlb_addr =
atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
unsigned a_bits = get_alignment_bits(get_memop(oi));
@@ -401,7 +402,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
is already guaranteed to be filled, and that the second page
cannot evict the first. */
page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK;
- index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ index2 = (page2 >> TARGET_PAGE_BITS) &
+ (env->tlb_desc[mmu_idx].size - 1);
tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write);
if (!tlb_hit_page(tlb_addr2, page2)
&& !VICTIM_TLB_HIT(addr_write, page2)) {
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 4ff62f32bf..fa95a4257e 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -141,13 +141,19 @@ typedef struct CPUIOTLBEntry {
MemTxAttrs attrs;
} CPUIOTLBEntry;
-#define CPU_COMMON_TLB \
+typedef struct CPUTLBDesc {
+ size_t size;
+ size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
+} CPUTLBDesc;
+
+#define CPU_COMMON_TLB \
/* The meaning of the MMU modes is defined in the target code. */ \
- /* tlb_lock serializes updates to tlb_table and tlb_v_table */ \
+ /* tlb_lock serializes updates to tlb_desc, tlb_table and tlb_v_table */ \
QemuSpin tlb_lock; \
- CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE]; \
+ CPUTLBDesc tlb_desc[NB_MMU_MODES]; \
+ CPUTLBEntry *tlb_table[NB_MMU_MODES]; \
CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE]; \
- CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE]; \
+ CPUIOTLBEntry *iotlb[NB_MMU_MODES]; \
CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE]; \
size_t tlb_flush_count; \
target_ulong tlb_flush_addr; \
diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
index 9581587ce1..df452f5977 100644
--- a/include/exec/cpu_ldst.h
+++ b/include/exec/cpu_ldst.h
@@ -416,7 +416,7 @@ static inline void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr,
#if defined(CONFIG_USER_ONLY)
return g2h(addr);
#else
- int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
CPUTLBEntry *tlbentry = &env->tlb_table[mmu_idx][index];
abi_ptr tlb_addr;
uintptr_t haddr;
diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h
index ba7a11123c..6ab9909f46 100644
--- a/include/exec/cpu_ldst_template.h
+++ b/include/exec/cpu_ldst_template.h
@@ -94,8 +94,8 @@ glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
#endif
addr = ptr;
- page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
mmu_idx = CPU_MMU_INDEX;
+ page_index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ !=
(addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
oi = make_memop_idx(SHIFT, mmu_idx);
@@ -132,8 +132,8 @@ glue(glue(glue(cpu_lds, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
#endif
addr = ptr;
- page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
mmu_idx = CPU_MMU_INDEX;
+ page_index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ !=
(addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
oi = make_memop_idx(SHIFT, mmu_idx);
@@ -174,8 +174,8 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
#endif
addr = ptr;
- page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
mmu_idx = CPU_MMU_INDEX;
+ page_index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
if (unlikely(atomic_read(&env->tlb_table[mmu_idx][page_index].addr_write) !=
(addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
oi = make_memop_idx(SHIFT, mmu_idx);
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index c7608ccdf8..0b51efc374 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -76,8 +76,17 @@ QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16);
void tlb_init(CPUState *cpu)
{
CPUArchState *env = cpu->env_ptr;
+ int i;
qemu_spin_init(&env->tlb_lock);
+ for (i = 0; i < NB_MMU_MODES; i++) {
+ CPUTLBDesc *desc = &env->tlb_desc[i];
+
+ desc->size = CPU_TLB_SIZE;
+ desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+ env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
+ env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
+ }
}
/* flush_all_helper: run fn across all cpus
@@ -120,6 +129,7 @@ size_t tlb_flush_count(void)
static void tlb_flush_nocheck(CPUState *cpu)
{
CPUArchState *env = cpu->env_ptr;
+ int i;
/* The QOM tests will trigger tlb_flushes without setting up TCG
* so we bug out here in that case.
@@ -139,7 +149,10 @@ static void tlb_flush_nocheck(CPUState *cpu)
* that do not hold the lock are performed by the same owner thread.
*/
qemu_spin_lock(&env->tlb_lock);
- memset(env->tlb_table, -1, sizeof(env->tlb_table));
+ for (i = 0; i < NB_MMU_MODES; i++) {
+ memset(env->tlb_table[i], -1,
+ env->tlb_desc[i].size * sizeof(CPUTLBEntry));
+ }
memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
qemu_spin_unlock(&env->tlb_lock);
@@ -200,7 +213,8 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
tlb_debug("%d\n", mmu_idx);
- memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
+ memset(env->tlb_table[mmu_idx], -1,
+ env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
}
}
@@ -286,7 +300,6 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data)
{
CPUArchState *env = cpu->env_ptr;
target_ulong addr = (target_ulong) data.target_ptr;
- int i;
int mmu_idx;
assert_cpu_is_self(cpu);
@@ -304,9 +317,10 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data)
}
addr &= TARGET_PAGE_MASK;
- i = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
qemu_spin_lock(&env->tlb_lock);
for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+ int i = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
+
tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr);
tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
}
@@ -339,16 +353,17 @@ static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu,
target_ulong addr_and_mmuidx = (target_ulong) data.target_ptr;
target_ulong addr = addr_and_mmuidx & TARGET_PAGE_MASK;
unsigned long mmu_idx_bitmap = addr_and_mmuidx & ALL_MMUIDX_BITS;
- int page = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
int mmu_idx;
assert_cpu_is_self(cpu);
- tlb_debug("page:%d addr:"TARGET_FMT_lx" mmu_idx:0x%lx\n",
- page, addr, mmu_idx_bitmap);
+ tlb_debug("addr: "TARGET_FMT_lx" mmu_idx:0x%lx\n", addr, mmu_idx_bitmap);
qemu_spin_lock(&env->tlb_lock);
for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+ int page;
+
+ page = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
if (test_bit(mmu_idx, &mmu_idx_bitmap)) {
tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr);
tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
@@ -524,7 +539,7 @@ void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length)
for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
unsigned int i;
- for (i = 0; i < CPU_TLB_SIZE; i++) {
+ for (i = 0; i < env->tlb_desc[mmu_idx].size; i++) {
tlb_reset_dirty_range_locked(&env->tlb_table[mmu_idx][i], start1,
length);
}
@@ -551,15 +566,15 @@ static inline void tlb_set_dirty1_locked(CPUTLBEntry *tlb_entry,
void tlb_set_dirty(CPUState *cpu, target_ulong vaddr)
{
CPUArchState *env = cpu->env_ptr;
- int i;
int mmu_idx;
assert_cpu_is_self(cpu);
vaddr &= TARGET_PAGE_MASK;
- i = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
qemu_spin_lock(&env->tlb_lock);
for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+ int i = (vaddr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
+
tlb_set_dirty1_locked(&env->tlb_table[mmu_idx][i], vaddr);
}
@@ -660,7 +675,8 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
iotlb = memory_region_section_get_iotlb(cpu, section, vaddr_page,
paddr_page, xlat, prot, &address);
- index = (vaddr_page >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ index = (vaddr_page >> TARGET_PAGE_BITS) &
+ (env->tlb_desc[mmu_idx].size - 1);
te = &env->tlb_table[mmu_idx][index];
/*
@@ -788,7 +804,7 @@ static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
tlb_fill(cpu, addr, size, MMU_DATA_LOAD, mmu_idx, retaddr);
- index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
tlb_addr = env->tlb_table[mmu_idx][index].addr_read;
if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) {
/* RAM access */
@@ -855,7 +871,7 @@ static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
tlb_fill(cpu, addr, size, MMU_DATA_STORE, mmu_idx, retaddr);
- index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) {
/* RAM access */
@@ -943,8 +959,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr)
int mmu_idx, index;
void *p;
- index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
mmu_idx = cpu_mmu_index(env, true);
+ index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
if (unlikely(!tlb_hit(env->tlb_table[mmu_idx][index].addr_code, addr))) {
if (!VICTIM_TLB_HIT(addr_code, addr)) {
tlb_fill(ENV_GET_CPU(env), addr, 0, MMU_INST_FETCH, mmu_idx, 0);
@@ -978,7 +994,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr)
void probe_write(CPUArchState *env, target_ulong addr, int size, int mmu_idx,
uintptr_t retaddr)
{
- int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
target_ulong tlb_addr =
atomic_read(&env->tlb_table[mmu_idx][index].addr_write);
@@ -998,7 +1014,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
NotDirtyInfo *ndi)
{
size_t mmu_idx = get_mmuidx(oi);
- size_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+ size_t index = (addr >> TARGET_PAGE_BITS) &
+ (env->tlb_desc[mmu_idx].size - 1);
CPUTLBEntry *tlbe = &env->tlb_table[mmu_idx][index];
target_ulong tlb_addr = atomic_read(&tlbe->addr_write);
TCGMemOp mop = get_memop(oi);
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 436195894b..fce6a94e22 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -330,6 +330,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
#define OPC_ARITH_GvEv (0x03) /* ... plus (ARITH_FOO << 3) */
#define OPC_ANDN (0xf2 | P_EXT38)
#define OPC_ADD_GvEv (OPC_ARITH_GvEv | (ARITH_ADD << 3))
+#define OPC_AND_GvEv (OPC_ARITH_GvEv | (ARITH_AND << 3))
#define OPC_BLENDPS (0x0c | P_EXT3A | P_DATA16)
#define OPC_BSF (0xbc | P_EXT)
#define OPC_BSR (0xbd | P_EXT)
@@ -1633,6 +1634,15 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
}
tcg_out_mov(s, tlbtype, r0, addrlo);
+ tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
+
+ tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, r0, TCG_AREG0,
+ offsetof(CPUArchState, tlb_desc[mem_index].mask));
+
+ tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r0, TCG_AREG0,
+ offsetof(CPUArchState, tlb_table[mem_index]));
+
/* If the required alignment is at least as large as the access, simply
copy the address and mask. For lesser alignments, check that we don't
cross pages for the complete access. */
@@ -1642,20 +1652,10 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
tcg_out_modrm_offset(s, OPC_LEA + trexw, r1, addrlo, s_mask - a_mask);
}
tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
-
- tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
-
tgen_arithi(s, ARITH_AND + trexw, r1, tlb_mask, 0);
- tgen_arithi(s, ARITH_AND + tlbrexw, r0,
- (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0);
-
- tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0,
- offsetof(CPUArchState, tlb_table[mem_index][0])
- + which);
/* cmp 0(r0), r1 */
- tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, 0);
+ tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which);
/* Prepare for both the fast path add of the tlb addend, and the slow
path function argument setup. There are two cases worth note:
@@ -1672,7 +1672,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
/* cmp 4(r0), addrhi */
- tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, 4);
+ tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, which + 4);
/* jne slow_path */
tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
@@ -1684,7 +1684,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
/* add addend(r0), r1 */
tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0,
- offsetof(CPUTLBEntry, addend) - which);
+ offsetof(CPUTLBEntry, addend));
}
/*
--
2.17.1
* [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
2018-10-08 2:09 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
` (3 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée
Currently we evict an entry to the victim TLB when it doesn't match
the incoming address. But it could be that there is no match simply
because the existing entry is invalid. Do not evict the entry to the
vtlb in that case.
This change will help us keep track of the TLB's use rate.
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/exec/cpu-all.h | 14 ++++++++++++++
accel/tcg/cputlb.c | 2 +-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index 117d2fbbca..d938dedafc 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
}
+/**
+ * tlb_is_valid - return true if at least one of the addresses is valid
+ * @te: pointer to CPUTLBEntry
+ *
+ * This is useful when we don't have a particular address to compare against,
+ * and we just want to know whether any entry holds valid data.
+ */
+static inline bool tlb_is_valid(const CPUTLBEntry *te)
+{
+ return !(te->addr_read & TLB_INVALID_MASK) ||
+ !(te->addr_write & TLB_INVALID_MASK) ||
+ !(te->addr_code & TLB_INVALID_MASK);
+}
+
void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
#endif /* !CONFIG_USER_ONLY */
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 0b51efc374..0e2c149d6b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -695,7 +695,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
* Only evict the old entry to the victim tlb if it's for a
* different page; otherwise just overwrite the stale data.
*/
- if (!tlb_hit_page_anyprot(te, vaddr_page)) {
+ if (!tlb_hit_page_anyprot(te, vaddr_page) && tlb_is_valid(te)) {
unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;
CPUTLBEntry *tv = &env->tlb_v_table[mmu_idx][vidx];
--
2.17.1
* [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
2018-10-08 2:54 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
` (2 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée
This paves the way for implementing a dynamically-sized softmmu.
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/exec/cpu-defs.h | 1 +
accel/tcg/cputlb.c | 17 ++++++++++++++---
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index fa95a4257e..af9fe04b0b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -144,6 +144,7 @@ typedef struct CPUIOTLBEntry {
typedef struct CPUTLBDesc {
size_t size;
size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
+ size_t used;
} CPUTLBDesc;
#define CPU_COMMON_TLB \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 0e2c149d6b..ed19ac0e40 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -84,6 +84,7 @@ void tlb_init(CPUState *cpu)
desc->size = CPU_TLB_SIZE;
desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+ desc->used = 0;
env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
}
@@ -152,6 +153,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
for (i = 0; i < NB_MMU_MODES; i++) {
memset(env->tlb_table[i], -1,
env->tlb_desc[i].size * sizeof(CPUTLBEntry));
+ env->tlb_desc[i].used = 0;
}
memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
qemu_spin_unlock(&env->tlb_lock);
@@ -216,6 +218,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
memset(env->tlb_table[mmu_idx], -1,
env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+ env->tlb_desc[mmu_idx].used = 0;
}
}
qemu_spin_unlock(&env->tlb_lock);
@@ -276,12 +279,14 @@ static inline bool tlb_hit_page_anyprot(CPUTLBEntry *tlb_entry,
}
/* Called with tlb_lock held */
-static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry,
+static inline bool tlb_flush_entry_locked(CPUTLBEntry *tlb_entry,
target_ulong page)
{
if (tlb_hit_page_anyprot(tlb_entry, page)) {
memset(tlb_entry, -1, sizeof(*tlb_entry));
+ return true;
}
+ return false;
}
/* Called with tlb_lock held */
@@ -321,7 +326,9 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data)
for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
int i = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
- tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr);
+ if (tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr)) {
+ env->tlb_desc[mmu_idx].used--;
+ }
tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
}
qemu_spin_unlock(&env->tlb_lock);
@@ -365,7 +372,9 @@ static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu,
page = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
if (test_bit(mmu_idx, &mmu_idx_bitmap)) {
- tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr);
+ if (tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr)) {
+ env->tlb_desc[mmu_idx].used--;
+ }
tlb_flush_vtlb_page_locked(env, mmu_idx, addr);
}
}
@@ -702,6 +711,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
/* Evict the old entry into the victim tlb. */
copy_tlb_helper_locked(tv, te);
env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
+ env->tlb_desc[mmu_idx].used--;
}
/* refill the tlb */
@@ -753,6 +763,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
}
copy_tlb_helper_locked(te, &tn);
+ env->tlb_desc[mmu_idx].used++;
qemu_spin_unlock(&env->tlb_lock);
}
--
2.17.1
* [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
` (2 preceding siblings ...)
2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
2018-10-08 2:56 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée
From: Pranith Kumar <bobby.prani@gmail.com>
This paves the way for implementing a dynamically-sized softmmu.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
tcg/aarch64/tcg-target.h | 1 +
tcg/arm/tcg-target.h | 1 +
tcg/i386/tcg-target.h | 2 ++
tcg/mips/tcg-target.h | 2 ++
tcg/ppc/tcg-target.h | 1 +
tcg/s390/tcg-target.h | 1 +
tcg/sparc/tcg-target.h | 1 +
tcg/tci/tcg-target.h | 1 +
8 files changed, 10 insertions(+)
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 9aea1d1771..55af43d55f 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
#define TCG_TARGET_INSN_UNIT_SIZE 4
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
#undef TCG_TARGET_STACK_GROWSUP
typedef enum {
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 94b3578c55..0cd07906b3 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -60,6 +60,7 @@ extern int arm_arch;
#undef TCG_TARGET_STACK_GROWSUP
#define TCG_TARGET_INSN_UNIT_SIZE 4
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 8
typedef enum {
TCG_REG_R0 = 0,
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 9fdf37f23c..4e79e0a550 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -200,6 +200,8 @@ extern bool have_avx2;
# define TCG_AREG0 TCG_REG_EBP
#endif
+#define TCG_TARGET_TLB_MAX_INDEX_BITS (32 - CPU_TLB_ENTRY_BITS)
+
static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
{
}
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index a8222476f0..b791e2b4cd 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -39,6 +39,8 @@
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
#define TCG_TARGET_NB_REGS 32
+#define TCG_TARGET_TLB_MAX_INDEX_BITS (16 - CPU_TLB_ENTRY_BITS)
+
typedef enum {
TCG_REG_ZERO = 0,
TCG_REG_AT,
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index be52ad1d2e..e0ad7c122d 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -34,6 +34,7 @@
#define TCG_TARGET_NB_REGS 32
#define TCG_TARGET_INSN_UNIT_SIZE 4
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
typedef enum {
TCG_REG_R0, TCG_REG_R1, TCG_REG_R2, TCG_REG_R3,
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index 6f2b06a7d1..a1e25e13b3 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -27,6 +27,7 @@
#define TCG_TARGET_INSN_UNIT_SIZE 2
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
typedef enum TCGReg {
TCG_REG_R0 = 0,
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index d8339bf010..72ace760d5 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -29,6 +29,7 @@
#define TCG_TARGET_INSN_UNIT_SIZE 4
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 12
#define TCG_TARGET_NB_REGS 32
typedef enum {
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
index 26140d78cb..3f28219afc 100644
--- a/tcg/tci/tcg-target.h
+++ b/tcg/tci/tcg-target.h
@@ -43,6 +43,7 @@
#define TCG_TARGET_INTERPRETER 1
#define TCG_TARGET_INSN_UNIT_SIZE 1
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
#if UINTPTR_MAX == UINT32_MAX
# define TCG_TARGET_REG_BITS 32
--
2.17.1
* [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
` (3 preceding siblings ...)
2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
2018-10-08 3:01 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
5 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/exec/cpu-defs.h | 6 +++---
accel/tcg/cputlb.c | 2 +-
tcg/i386/tcg-target.inc.c | 3 ++-
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index af9fe04b0b..27b9433976 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -67,7 +67,7 @@ typedef uint64_t target_ulong;
#define CPU_TLB_ENTRY_BITS 5
#endif
-/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
+/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in MIN_CPU_TLB_BITS to ensure that
* the TLB is not unnecessarily small, but still small enough for the
* TLB lookup instruction sequence used by the TCG target.
*
@@ -89,7 +89,7 @@ typedef uint64_t target_ulong;
* 0x18 (the offset of the addend field in each TLB entry) plus the offset
* of tlb_table inside env (which is non-trivial but not huge).
*/
-#define CPU_TLB_BITS \
+#define MIN_CPU_TLB_BITS \
MIN(8, \
TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \
(NB_MMU_MODES <= 1 ? 0 : \
@@ -97,7 +97,7 @@ typedef uint64_t target_ulong;
NB_MMU_MODES <= 4 ? 2 : \
NB_MMU_MODES <= 8 ? 3 : 4))
-#define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
+#define MIN_CPU_TLB_SIZE (1 << MIN_CPU_TLB_BITS)
typedef struct CPUTLBEntry {
/* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index ed19ac0e40..1ca71ecfc4 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -82,7 +82,7 @@ void tlb_init(CPUState *cpu)
for (i = 0; i < NB_MMU_MODES; i++) {
CPUTLBDesc *desc = &env->tlb_desc[i];
- desc->size = CPU_TLB_SIZE;
+ desc->size = MIN_CPU_TLB_SIZE;
desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
desc->used = 0;
env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index fce6a94e22..60d8ed5264 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1626,7 +1626,8 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
}
if (TCG_TYPE_PTR == TCG_TYPE_I64) {
hrexw = P_REXW;
- if (TARGET_PAGE_BITS + CPU_TLB_BITS > 32) {
+ /* XXX the size here is variable */
+ if (TARGET_PAGE_BITS + MIN_CPU_TLB_BITS > 32) {
tlbtype = TCG_TYPE_I64;
tlbrexw = P_REXW;
}
--
2.17.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
` (4 preceding siblings ...)
2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
@ 2018-10-06 21:45 ` Emilio G. Cota
2018-10-07 17:37 ` Philippe Mathieu-Daudé
2018-10-08 3:21 ` Richard Henderson
5 siblings, 2 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-06 21:45 UTC (permalink / raw)
To: qemu-devel; +Cc: Pranith Kumar, Richard Henderson, Alex Bennée
Perform the resizing only on flushes; otherwise we'd
have to take a perf hit by either rehashing the array
or unnecessarily flushing it.
We grow the array aggressively, and reduce the size more
slowly. This accommodates mixed workloads, where some
processes might be memory-heavy while others are not.
As the following experiments show, this is a net perf gain,
particularly for memory-heavy workloads. Experiments
are run on an Intel i7-6700K CPU @ 4.00GHz.
1. System boot + shutdown, debian aarch64:
- Before (tb-lock-v3):
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
7469.363393 task-clock (msec) # 0.998 CPUs utilized ( +- 0.07% )
31,507,707,190 cycles # 4.218 GHz ( +- 0.07% )
57,101,577,452 instructions # 1.81 insns per cycle ( +- 0.08% )
10,265,531,804 branches # 1374.352 M/sec ( +- 0.07% )
173,020,681 branch-misses # 1.69% of all branches ( +- 0.10% )
7.483359063 seconds time elapsed ( +- 0.08% )
- After:
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
7185.036730 task-clock (msec) # 0.999 CPUs utilized ( +- 0.11% )
30,303,501,143 cycles # 4.218 GHz ( +- 0.11% )
54,198,386,487 instructions # 1.79 insns per cycle ( +- 0.08% )
9,726,518,945 branches # 1353.719 M/sec ( +- 0.08% )
167,082,307 branch-misses # 1.72% of all branches ( +- 0.08% )
7.195597842 seconds time elapsed ( +- 0.11% )
That is, a 3.8% improvement.
2. System boot + shutdown, ubuntu 18.04 x86_64:
- Before (tb-lock-v3):
Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
49971.036482 task-clock (msec) # 0.999 CPUs utilized ( +- 1.62% )
210,766,077,140 cycles # 4.218 GHz ( +- 1.63% )
428,829,830,790 instructions # 2.03 insns per cycle ( +- 0.75% )
77,313,384,038 branches # 1547.164 M/sec ( +- 0.54% )
835,610,706 branch-misses # 1.08% of all branches ( +- 2.97% )
50.003855102 seconds time elapsed ( +- 1.61% )
- After:
Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
50118.124477 task-clock (msec) # 0.999 CPUs utilized ( +- 4.30% )
132,396 context-switches # 0.003 M/sec ( +- 1.20% )
0 cpu-migrations # 0.000 K/sec ( +-100.00% )
167,754 page-faults # 0.003 M/sec ( +- 0.06% )
211,414,701,601 cycles # 4.218 GHz ( +- 4.30% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
431,618,818,597 instructions # 2.04 insns per cycle ( +- 6.40% )
80,197,256,524 branches # 1600.165 M/sec ( +- 8.59% )
794,830,352 branch-misses # 0.99% of all branches ( +- 2.05% )
50.177077175 seconds time elapsed ( +- 4.23% )
No improvement (within noise range).
3. x86_64 SPEC06int:
SPEC06int (test set)
[ Y axis: speedup over master ]
8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
| |
| tlb-lock-v3 |
7 +-+..................$$$...........................+indirection +-+
| $ $ +resizing |
| $ $ |
6 +-+..................$.$..............................................+-+
| $ $ |
| $ $ |
5 +-+..................$.$..............................................+-+
| $ $ |
| $ $ |
4 +-+..................$.$..............................................+-+
| $ $ |
| +++ $ $ |
3 +-+........$$+.......$.$..............................................+-+
| $$ $ $ |
| $$ $ $ $$$ |
2 +-+........$$........$.$.................................$.$..........+-+
| $$ $ $ $ $ +$$ |
| $$ $$+ $ $ $$$ +$$ $ $ $$$ $$ |
1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
| * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ |
| * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ |
0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean
png: https://imgur.com/a/b1wn3wc
That is, a 1.53x average speedup over master, with a max speedup of 7.13x.
Note that "indirection" (i.e. the first patch in this series) incurs
no overhead, on average.
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/exec/cpu-defs.h | 1 +
accel/tcg/cputlb.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 27b9433976..4d1d6b2b8b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
size_t size;
size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
size_t used;
+ size_t n_flushes_low_rate;
} CPUTLBDesc;
#define CPU_COMMON_TLB \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 1ca71ecfc4..afb61e9c2b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
desc->size = MIN_CPU_TLB_SIZE;
desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
desc->used = 0;
+ desc->n_flushes_low_rate = 0;
env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
}
@@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
return count;
}
+/* Call with tlb_lock held */
+static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
+{
+ CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
+ size_t rate = desc->used * 100 / desc->size;
+ size_t new_size = desc->size;
+
+ if (rate == 100) {
+ new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+ } else if (rate > 70) {
+ new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+ } else if (rate < 30) {
+ desc->n_flushes_low_rate++;
+ if (desc->n_flushes_low_rate == 100) {
+ new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
+ desc->n_flushes_low_rate = 0;
+ }
+ }
+
+ if (new_size == desc->size) {
+ return;
+ }
+
+ g_free(env->tlb_table[mmu_idx]);
+ g_free(env->iotlb[mmu_idx]);
+
+ desc->size = new_size;
+ desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+ desc->n_flushes_low_rate = 0;
+ env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
+ env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
+}
+
/* This is OK because CPU architectures generally permit an
* implementation to drop entries from the TLB at any time, so
* flushing more entries than required is only an efficiency issue,
@@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
*/
qemu_spin_lock(&env->tlb_lock);
for (i = 0; i < NB_MMU_MODES; i++) {
+ tlb_mmu_resize_locked(env, i);
memset(env->tlb_table[i], -1,
env->tlb_desc[i].size * sizeof(CPUTLBEntry));
env->tlb_desc[i].used = 0;
@@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
tlb_debug("%d\n", mmu_idx);
+ tlb_mmu_resize_locked(env, mmu_idx);
memset(env->tlb_table[mmu_idx], -1,
env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
--
2.17.1
* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
@ 2018-10-07 17:37 ` Philippe Mathieu-Daudé
2018-10-08 1:48 ` Emilio G. Cota
2018-10-08 3:21 ` Richard Henderson
1 sibling, 1 reply; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2018-10-07 17:37 UTC (permalink / raw)
To: Emilio G. Cota, qemu-devel
Cc: Alex Bennée, Pranith Kumar, Richard Henderson
Hi Emilio,
On 10/6/18 11:45 PM, Emilio G. Cota wrote:
> Perform the resizing only on flushes; otherwise we'd
> have to take a perf hit by either rehashing the array
> or unnecessarily flushing it.
>
> We grow the array aggressively, and reduce the size more
> slowly. This accommodates mixed workloads, where some
> processes might be memory-heavy while others are not.
>
> As the following experiments show, this is a net perf gain,
> particularly for memory-heavy workloads. Experiments
> are run on an Intel i7-6700K CPU @ 4.00GHz.
>
> 1. System boot + shutdown, debian aarch64:
>
> - Before (tb-lock-v3):
> Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
>
> 7469.363393 task-clock (msec) # 0.998 CPUs utilized ( +- 0.07% )
> 31,507,707,190 cycles # 4.218 GHz ( +- 0.07% )
> 57,101,577,452 instructions # 1.81 insns per cycle ( +- 0.08% )
> 10,265,531,804 branches # 1374.352 M/sec ( +- 0.07% )
> 173,020,681 branch-misses # 1.69% of all branches ( +- 0.10% )
>
> 7.483359063 seconds time elapsed ( +- 0.08% )
>
> - After:
> Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
>
> 7185.036730 task-clock (msec) # 0.999 CPUs utilized ( +- 0.11% )
> 30,303,501,143 cycles # 4.218 GHz ( +- 0.11% )
> 54,198,386,487 instructions # 1.79 insns per cycle ( +- 0.08% )
> 9,726,518,945 branches # 1353.719 M/sec ( +- 0.08% )
> 167,082,307 branch-misses # 1.72% of all branches ( +- 0.08% )
>
> 7.195597842 seconds time elapsed ( +- 0.11% )
>
> That is, a 3.8% improvement.
>
> 2. System boot + shutdown, ubuntu 18.04 x86_64:
You can also run the VM tests to build QEMU:
$ make vm-test
vm-test: Test QEMU in preconfigured virtual machines
vm-build-ubuntu.i386 - Build QEMU in ubuntu i386 VM
vm-build-freebsd - Build QEMU in FreeBSD VM
vm-build-netbsd - Build QEMU in NetBSD VM
vm-build-openbsd - Build QEMU in OpenBSD VM
vm-build-centos - Build QEMU in CentOS VM, with Docker
>
> - Before (tb-lock-v3):
> Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
>
> 49971.036482 task-clock (msec) # 0.999 CPUs utilized ( +- 1.62% )
> 210,766,077,140 cycles # 4.218 GHz ( +- 1.63% )
> 428,829,830,790 instructions # 2.03 insns per cycle ( +- 0.75% )
> 77,313,384,038 branches # 1547.164 M/sec ( +- 0.54% )
> 835,610,706 branch-misses # 1.08% of all branches ( +- 2.97% )
>
> 50.003855102 seconds time elapsed ( +- 1.61% )
>
> - After:
> Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
>
> 50118.124477 task-clock (msec) # 0.999 CPUs utilized ( +- 4.30% )
> 132,396 context-switches # 0.003 M/sec ( +- 1.20% )
> 0 cpu-migrations # 0.000 K/sec ( +-100.00% )
> 167,754 page-faults # 0.003 M/sec ( +- 0.06% )
> 211,414,701,601 cycles # 4.218 GHz ( +- 4.30% )
> <not supported> stalled-cycles-frontend
> <not supported> stalled-cycles-backend
> 431,618,818,597 instructions # 2.04 insns per cycle ( +- 6.40% )
> 80,197,256,524 branches # 1600.165 M/sec ( +- 8.59% )
> 794,830,352 branch-misses # 0.99% of all branches ( +- 2.05% )
>
> 50.177077175 seconds time elapsed ( +- 4.23% )
>
> No improvement (within noise range).
>
> 3. x86_64 SPEC06int:
> SPEC06int (test set)
> [ Y axis: speedup over master ]
> 8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
> | |
> | tlb-lock-v3 |
> 7 +-+..................$$$...........................+indirection +-+
> | $ $ +resizing |
> | $ $ |
> 6 +-+..................$.$..............................................+-+
> | $ $ |
> | $ $ |
> 5 +-+..................$.$..............................................+-+
> | $ $ |
> | $ $ |
> 4 +-+..................$.$..............................................+-+
> | $ $ |
> | +++ $ $ |
> 3 +-+........$$+.......$.$..............................................+-+
> | $$ $ $ |
> | $$ $ $ $$$ |
> 2 +-+........$$........$.$.................................$.$..........+-+
> | $$ $ $ $ $ +$$ |
> | $$ $$+ $ $ $$$ +$$ $ $ $$$ $$ |
> 1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
> | * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ |
> | * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ |
> 0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
> 401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean
This description line is hard to read ;)
> png: https://imgur.com/a/b1wn3wc
>
> That is, a 1.53x average speedup over master, with a max speedup of 7.13x.
>
> Note that "indirection" (i.e. the first patch in this series) incurs
> no overhead, on average.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> include/exec/cpu-defs.h | 1 +
> accel/tcg/cputlb.c | 36 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 37 insertions(+)
>
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 27b9433976..4d1d6b2b8b 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
> size_t size;
> size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
> size_t used;
> + size_t n_flushes_low_rate;
> } CPUTLBDesc;
>
> #define CPU_COMMON_TLB \
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 1ca71ecfc4..afb61e9c2b 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
> desc->size = MIN_CPU_TLB_SIZE;
> desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
> desc->used = 0;
> + desc->n_flushes_low_rate = 0;
> env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
> env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
> }
> @@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
> return count;
> }
>
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> + CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> + size_t rate = desc->used * 100 / desc->size;
> + size_t new_size = desc->size;
> +
> + if (rate == 100) {
> + new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> + } else if (rate > 70) {
> + new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> + } else if (rate < 30) {
I wonder if those thresholds might be per TCG_TARGET.
Btw, the paper used 40% here; did you try it too?
Regards,
Phil.
> + desc->n_flushes_low_rate++;
> + if (desc->n_flushes_low_rate == 100) {
> + new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
> + desc->n_flushes_low_rate = 0;
> + }
> + }
> +
> + if (new_size == desc->size) {
> + return;
> + }
> +
> + g_free(env->tlb_table[mmu_idx]);
> + g_free(env->iotlb[mmu_idx]);
> +
> + desc->size = new_size;
> + desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
> + desc->n_flushes_low_rate = 0;
> + env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
> + env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
> +}
> +
> /* This is OK because CPU architectures generally permit an
> * implementation to drop entries from the TLB at any time, so
> * flushing more entries than required is only an efficiency issue,
> @@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
> */
> qemu_spin_lock(&env->tlb_lock);
> for (i = 0; i < NB_MMU_MODES; i++) {
> + tlb_mmu_resize_locked(env, i);
> memset(env->tlb_table[i], -1,
> env->tlb_desc[i].size * sizeof(CPUTLBEntry));
> env->tlb_desc[i].used = 0;
> @@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
> if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
> tlb_debug("%d\n", mmu_idx);
>
> + tlb_mmu_resize_locked(env, mmu_idx);
> memset(env->tlb_table[mmu_idx], -1,
> env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
> memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
>
* Re: [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing
2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
@ 2018-10-08 1:47 ` Richard Henderson
0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 1:47 UTC (permalink / raw)
To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée
On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> -#define CPU_COMMON_TLB \
> +typedef struct CPUTLBDesc {
> + size_t size;
> + size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
> +} CPUTLBDesc;
I think you don't need both size and mask. Size is (or ought to be) used
infrequently enough that I think it can be computed on demand. But on a
related note, if you remember the out-of-line tlb changes, it'll be easier for
x86 if you index an array of scalars.
> - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> + int index = (addr >> TARGET_PAGE_BITS) & (env->tlb_desc[mmu_idx].size - 1);
I just posted a patch to functionalize this. But I imagine
static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx,
target_ulong addr)
{
uintptr_t ofs = (addr >> TARGET_PAGE_BITS) & env->tlb_mask[mmu_idx];
return ofs >> CPU_TLB_BITS;
}
static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx,
target_ulong addr)
{
uintptr_t ofs = (addr >> TARGET_PAGE_BITS) & env->tlb_mask[mmu_idx];
return (void *)&env->tlb_table[mmu_idx][0] + ofs;
}
In the few places where we use both of these functions, the compiler ought to
be able to pull out the common sub-expression.
> @@ -139,7 +149,10 @@ static void tlb_flush_nocheck(CPUState *cpu)
> * that do not hold the lock are performed by the same owner thread.
> */
> qemu_spin_lock(&env->tlb_lock);
> - memset(env->tlb_table, -1, sizeof(env->tlb_table));
> + for (i = 0; i < NB_MMU_MODES; i++) {
> + memset(env->tlb_table[i], -1,
> + env->tlb_desc[i].size * sizeof(CPUTLBEntry));
Here we can use something like
static inline uintptr_t sizeof_tlb(CPUArchState *env, uintptr_t mmu_idx)
{
return env->tlb_mask[mmu_idx] + (1 << CPU_TLB_BITS);
}
memset(env->tlb_table[i], -1, sizeof_tlb(env, i));
> - for (i = 0; i < CPU_TLB_SIZE; i++) {
> + for (i = 0; i < env->tlb_desc[mmu_idx].size; i++) {
This seems to be one of the few places where you need the number of entries
rather than the size. Perhaps
for (i = sizeof_tlb(env, mmu_idx) >> CPU_TLB_BITS; i-- > 0; ) {
Or if you can think of a name for the function of the same effect...
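One possible shape for that helper pair, assuming the mask keeps the (n_entries - 1) << CPU_TLB_ENTRY_BITS encoding from the patch (a sketch with illustrative names and constants, not actual QEMU code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPU_TLB_ENTRY_BITS 5   /* log2(sizeof(CPUTLBEntry)) on 64-bit hosts */

/* Number of entries, recoverable from the mask alone, so CPUTLBDesc
 * would not need to store a separate .size field. */
static size_t tlb_n_entries(uintptr_t mask)
{
    return (mask >> CPU_TLB_ENTRY_BITS) + 1;
}

/* Size of the table in bytes, for the memset() on flush. */
static size_t sizeof_tlb(uintptr_t mask)
{
    return mask + (1 << CPU_TLB_ENTRY_BITS);
}
```

For a 256-entry TLB the mask is 255 << 5 = 8160, which yields 256 entries and an 8192-byte table.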
r~
* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
2018-10-07 17:37 ` Philippe Mathieu-Daudé
@ 2018-10-08 1:48 ` Emilio G. Cota
2018-10-08 13:46 ` Emilio G. Cota
0 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08 1:48 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: qemu-devel, Alex Bennée, Pranith Kumar, Richard Henderson
On Sun, Oct 07, 2018 at 19:37:50 +0200, Philippe Mathieu-Daudé wrote:
> On 10/6/18 11:45 PM, Emilio G. Cota wrote:
> > 2. System boot + shutdown, ubuntu 18.04 x86_64:
>
> You can also run the VM tests to build QEMU:
>
> $ make vm-test
Thanks, will give that a look.
> > + if (rate == 100) {
> > + new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> > + } else if (rate > 70) {
> > + new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> > + } else if (rate < 30) {
>
> I wonder if those thresholds might be per TCG_TARGET.
Do you mean to tune the growth rate based on each TCG target?
(max and min are already determined by the TCG target).
The optimal growth rate is mostly dependent on the guest
workload, so I wouldn't expect the TCG target to matter
much.
That said, we could spend quite some time tweaking the
TLB sizing algorithm. But with this RFC I wanted to see
(a) whether this approach is a good idea at all, and (b) show
what 'easy' speedups might look like (because converting
all TCG targets is a pain, so it better be justified).
> Btw, the paper used 40% here; did you try it too?
Yes, I tried several alternatives including what the
paper describes, i.e. (skipping the min/max checks
for simplicity):
if (rate > 70) {
new_size = 2 * old_size;
} else if (rate < 40) {
new_size = old_size / 2;
}
But that didn't give great speedups (see "resizing-paper"
set):
https://imgur.com/a/w3AqHP7
A few points stand out to me:
- We get very different speedups even if we implement
the algorithm they describe (not sure that's exactly
what they implemented though). But there are many
variables that could explain that, e.g. different guest
images (and therefore different TLB flush rates) and
different QEMU baselines (ours is faster than the paper's,
so getting speedups is harder).
- 70/40% use rate for growing/shrinking the TLB does not
seem a great choice, if one wants to avoid a pathological
case that can induce constant resizing. Imagine we got
exactly 70% use rate, and all TLB misses were compulsory
(i.e. a direct-mapped TLB would have not prevented a
single miss). We'd then double the TLB size:
size_new = 2*size_old
But then the use rate will halve:
use_new = 0.7/2 = 0.35
So we'd then end up in a grow-shrink loop!
Picking a "shrink threshold" below 0.70/2=0.35 avoids this.
- Aggressively increasing the TLB size when usage is high
makes sense. However, reducing the size at the same
rate does not make much sense. Imagine the following
scenario with two processes being scheduled: one process
uses a lot of memory, and the other one uses little, but
both are CPU-intensive and therefore being assigned similar
time slices by the scheduler. Ideally you'd resize the TLB
to meet each process' memory demands. However, at flush
time we don't even know what process is running or about
to run, so we have to size the TLB exclusively based on
recent use rates. In this scenario you're probably close
to optimal if you size the TLB to meet the demands of the
most memory-hungry process. You'll lose some extra time
flushing the (now larger) TLB, but your net gain is likely
to be positive given the TLB fills you won't have to do
when the memory-heavy process is scheduled in.
So to me it's quite likely that in the paper they
could have gotten even better results by reducing the
shrink rate, like we did.
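The oscillation argument above can be checked with a toy model of the resize decision (illustrative thresholds only; the min/max clamping and slow-shrink counter of the actual patch are omitted):

```c
#include <assert.h>
#include <stddef.h>

#define GROW_PCT 70   /* grow when more than 70% of entries are in use */

/* One resize decision at flush time (sketch, not the actual patch). */
static size_t resize_step(size_t size, size_t used, unsigned shrink_pct)
{
    size_t rate = used * 100 / size;

    if (rate > GROW_PCT) {
        return size * 2;
    } else if (rate < shrink_pct) {
        return size / 2;
    }
    return size;
}
```

With 182 live entries in a 256-entry TLB (~71% use) the table grows to 512 entries, where the same working set is only ~35% use: a 40% shrink threshold immediately halves it back, producing the grow-shrink loop, while a 30% threshold (below 70/2) leaves it stable.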
Thanks,
Emilio
* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
@ 2018-10-08 2:09 ` Richard Henderson
2018-10-08 14:42 ` Emilio G. Cota
0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 2:09 UTC (permalink / raw)
To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée
On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> Currently we evict an entry to the victim TLB when it doesn't match
> the current address. But it could be that there's no match because
> the current entry is invalid. Do not evict the entry to the vtlb
> in that case.
>
> This change will help us keep track of the TLB's use rate.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> include/exec/cpu-all.h | 14 ++++++++++++++
> accel/tcg/cputlb.c | 2 +-
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> index 117d2fbbca..d938dedafc 100644
> --- a/include/exec/cpu-all.h
> +++ b/include/exec/cpu-all.h
> @@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
> return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
> }
>
> +/**
> + * tlb_is_valid - return true if at least one of the addresses is valid
> + * @te: pointer to CPUTLBEntry
> + *
> + * This is useful when we don't have a particular address to compare against,
> + * and we just want to know whether any entry holds valid data.
> + */
> +static inline bool tlb_is_valid(const CPUTLBEntry *te)
> +{
> + return !(te->addr_read & TLB_INVALID_MASK) ||
> + !(te->addr_write & TLB_INVALID_MASK) ||
> + !(te->addr_code & TLB_INVALID_MASK);
> +}
No, I think you misunderstand.
First, TLB_INVALID_MASK is only ever set for addr_write, in response to
PAGE_WRITE_INV. Second, an entry that is invalid for write is still valid for
read+exec. So there is benefit to swapping it out to the victim cache.
This is used by the s390x target to make the "lowpage" read-only, which is a
special architected 512 byte range within pages 0 and 1. This is done by
forcing writes, but not reads, back through tlb_fill.
r~
* Re: [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates
2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
@ 2018-10-08 2:54 ` Richard Henderson
0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 2:54 UTC (permalink / raw)
To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée
On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> @@ -753,6 +763,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
> }
>
> copy_tlb_helper_locked(te, &tn);
> + env->tlb_desc[mmu_idx].used++;
> qemu_spin_unlock(&env->tlb_lock);
> }
So.. you're computing what? Total TLB occupancy?
Perhaps a better name than "used"?
Otherwise the patch looks plausible.
r~
* Re: [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS
2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
@ 2018-10-08 2:56 ` Richard Henderson
0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 2:56 UTC (permalink / raw)
To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée
On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> From: Pranith Kumar <bobby.prani@gmail.com>
>
> This paves the way for implementing a dynamically-sized softmmu.
>
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
There's no point in this, since the original constraint was due to encoding
immediates in the and instruction. Now we're using tlb_mask.
r~
* Re: [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE
2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
@ 2018-10-08 3:01 ` Richard Henderson
0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 3:01 UTC (permalink / raw)
To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée
On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> @@ -89,7 +89,7 @@ typedef uint64_t target_ulong;
> * 0x18 (the offset of the addend field in each TLB entry) plus the offset
> * of tlb_table inside env (which is non-trivial but not huge).
> */
> -#define CPU_TLB_BITS \
> +#define MIN_CPU_TLB_BITS \
> MIN(8, \
> TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \
> (NB_MMU_MODES <= 1 ? 0 : \
There's no point in this either, since the original constraint was due to the
immediate offset into an add instruction. Now we're loading the base address
from an array. The actual size of the tlb is immaterial now, since the size of
the tlb does not affect the size of CPUArchState.
r~
* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
2018-10-07 17:37 ` Philippe Mathieu-Daudé
@ 2018-10-08 3:21 ` Richard Henderson
1 sibling, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 3:21 UTC (permalink / raw)
To: Emilio G. Cota, qemu-devel; +Cc: Pranith Kumar, Alex Bennée
On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> @@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
> return count;
> }
>
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> + CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> + size_t rate = desc->used * 100 / desc->size;
> + size_t new_size = desc->size;
> +
> + if (rate == 100) {
> + new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> + } else if (rate > 70) {
> + new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> + } else if (rate < 30) {
> + desc->n_flushes_low_rate++;
> + if (desc->n_flushes_low_rate == 100) {
> + new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
> + desc->n_flushes_low_rate = 0;
> + }
> + }
> +
> + if (new_size == desc->size) {
s/desc->size/old_size/g
Otherwise it looks plausible as a first cut.
r~
* Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
2018-10-08 1:48 ` Emilio G. Cota
@ 2018-10-08 13:46 ` Emilio G. Cota
0 siblings, 0 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08 13:46 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: qemu-devel, Alex Bennée, Pranith Kumar, Richard Henderson
On Sun, Oct 07, 2018 at 21:48:34 -0400, Emilio G. Cota wrote:
> - 70/40% use rate for growing/shrinking the TLB does not
> seem a great choice, if one wants to avoid a pathological
> case that can induce constant resizing. Imagine we got
> exactly 70% use rate, and all TLB misses were compulsory
> (i.e. a direct-mapped TLB would have not prevented a
^^^
> single miss). We'd then double the TLB size:
I meant fully associative.
E.
* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
2018-10-08 2:09 ` Richard Henderson
@ 2018-10-08 14:42 ` Emilio G. Cota
2018-10-08 19:46 ` Richard Henderson
0 siblings, 1 reply; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08 14:42 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, Pranith Kumar, Alex Bennée
On Sun, Oct 07, 2018 at 19:09:01 -0700, Richard Henderson wrote:
> On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> > Currently we evict an entry to the victim TLB when it doesn't match
> > the current address. But it could be that there's no match because
> > the current entry is invalid. Do not evict the entry to the vtlb
> > in that case.
> >
> > This change will help us keep track of the TLB's use rate.
> >
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> > include/exec/cpu-all.h | 14 ++++++++++++++
> > accel/tcg/cputlb.c | 2 +-
> > 2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> > index 117d2fbbca..d938dedafc 100644
> > --- a/include/exec/cpu-all.h
> > +++ b/include/exec/cpu-all.h
> > @@ -362,6 +362,20 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr)
> > return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK);
> > }
> >
> > +/**
> > + * tlb_is_valid - return true if at least one of the addresses is valid
> > + * @te: pointer to CPUTLBEntry
> > + *
> > + * This is useful when we don't have a particular address to compare against,
> > + * and we just want to know whether any entry holds valid data.
> > + */
> > +static inline bool tlb_is_valid(const CPUTLBEntry *te)
> > +{
> > + return !(te->addr_read & TLB_INVALID_MASK) ||
> > + !(te->addr_write & TLB_INVALID_MASK) ||
> > + !(te->addr_code & TLB_INVALID_MASK);
> > +}
>
> No, I think you misunderstand.
>
> First, TLB_INVALID_MASK is only ever set for addr_write, in response to
> PAGE_WRITE_INV. Second, an entry that is invalid for write is still valid for
> read+exec. So there is benefit to swapping it out to the victim cache.
>
> This is used by the s390x target to make the "lowpage" read-only, which is a
> special architected 512 byte range within pages 0 and 1. This is done by
> forcing writes, but not reads, back through tlb_fill.
Aah I see. The point is to avoid pushing to the victim cache
an entry that is all invalid, not just partially invalid.
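A check along those lines might look like the following sketch, which uses a minimal stand-in for the real CPUTLBEntry struct: flushing memsets entries to -1, so an entry that was never filled has all three address comparators equal to (target_ulong)-1, while a write-invalid entry does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t target_ulong;   /* assuming a 64-bit guest */

/* Minimal stand-in for the real CPUTLBEntry; only the comparators matter. */
typedef struct CPUTLBEntry {
    target_ulong addr_read;
    target_ulong addr_write;
    target_ulong addr_code;
} CPUTLBEntry;

/* True only when every comparator still holds the -1 flush pattern, i.e.
 * the entry was never filled.  An entry whose addr_write carries
 * TLB_INVALID_MASK is still in use for read+exec and stays eviction-worthy. */
static bool tlb_entry_is_empty(const CPUTLBEntry *te)
{
    return te->addr_read == (target_ulong)-1 &&
           te->addr_write == (target_ulong)-1 &&
           te->addr_code == (target_ulong)-1;
}
```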
Thanks for the clarification!
Emilio
* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
2018-10-08 14:42 ` Emilio G. Cota
@ 2018-10-08 19:46 ` Richard Henderson
2018-10-08 20:23 ` Emilio G. Cota
0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2018-10-08 19:46 UTC (permalink / raw)
To: Emilio G. Cota; +Cc: qemu-devel, Pranith Kumar, Alex Bennée
On 10/8/18 7:42 AM, Emilio G. Cota wrote:
> On Sun, Oct 07, 2018 at 19:09:01 -0700, Richard Henderson wrote:
>> On 10/6/18 2:45 PM, Emilio G. Cota wrote:
>>> [patch adding tlb_is_valid() snipped; quoted in full upthread]
>>
>> No, I think you misunderstand.
>>
>> First, TLB_INVALID_MASK is only ever set for addr_write, in response to
>> PAGE_WRITE_INV. Second, an entry that is invalid for write is still valid for
>> read+exec. So there is benefit to swapping it out to the victim cache.
>>
>> This is used by the s390x target to make the "lowpage" read-only, which is a
>> special architected 512 byte range within pages 0 and 1. This is done by
>> forcing writes, but not reads, back through tlb_fill.
>
> Aah I see. The point is to avoid pushing to the victim cache
> an entry that is all invalid, not just partially invalid.
I slightly misspoke here. It is (ab)used for the s390 thing, but that bit
is also set by the memset to -1 done at flush time. Perhaps you might check
static inline bool tlb_is_invalid(const CPUTLBEntry *te)
{
    return te->addr_read & te->addr_write & te->addr_code & TLB_INVALID_MASK;
}
That is, all forms of access have the INVALID bit set.
r~
* Re: [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb
2018-10-08 19:46 ` Richard Henderson
@ 2018-10-08 20:23 ` Emilio G. Cota
0 siblings, 0 replies; 19+ messages in thread
From: Emilio G. Cota @ 2018-10-08 20:23 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, Pranith Kumar, Alex Bennée
On Mon, Oct 08, 2018 at 12:46:26 -0700, Richard Henderson wrote:
> On 10/8/18 7:42 AM, Emilio G. Cota wrote:
> > On Sun, Oct 07, 2018 at 19:09:01 -0700, Richard Henderson wrote:
> >> On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> >>> [patch adding tlb_is_valid() snipped; quoted in full upthread]
> >>
> >> No, I think you misunderstand.
> >>
> >> First, TLB_INVALID_MASK is only ever set for addr_write, in response to
> >> PAGE_WRITE_INV. Second, an entry that is invalid for write is still valid for
> >> read+exec. So there is benefit to swapping it out to the victim cache.
> >>
> >> This is used by the s390x target to make the "lowpage" read-only, which is a
> >> special architected 512 byte range within pages 0 and 1. This is done by
> >> forcing writes, but not reads, back through tlb_fill.
> >
> > Aah I see. The point is to avoid pushing to the victim cache
> > an entry that is all invalid, not just partially invalid.
>
> I've slightly misspoken here. It is (ab)used for the s390 thing, but that bit
> is also set by memset -1. Perhaps you might check
>
> static inline bool tlb_is_invalid(const CPUTLBEntry *te)
> {
>     return te->addr_read & te->addr_write & te->addr_code & TLB_INVALID_MASK;
> }
>
> That is, all forms of access have the INVALID bit set.
I see. I just ran a few tests and the below gives the best
performance, so I'll go with it:
static inline bool tlb_is_empty(const CPUTLBEntry *te)
{
    return te->addr_read == -1 && te->addr_write == -1 && te->addr_code == -1;
}
I renamed it from "invalid" to "empty" to avoid even thinking about
the invalid flag.
Thanks,
Emilio
end of thread, other threads:[~2018-10-08 20:28 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
2018-10-08 1:47 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
2018-10-08 2:09 ` Richard Henderson
2018-10-08 14:42 ` Emilio G. Cota
2018-10-08 19:46 ` Richard Henderson
2018-10-08 20:23 ` Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
2018-10-08 2:54 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
2018-10-08 2:56 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
2018-10-08 3:01 ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
2018-10-07 17:37 ` Philippe Mathieu-Daudé
2018-10-08 1:48 ` Emilio G. Cota
2018-10-08 13:46 ` Emilio G. Cota
2018-10-08 3:21 ` Richard Henderson