* [PATCH 0/2] qemumips: speeding up
@ 2020-10-07 20:38 Victor Kamensky
  2020-10-07 20:38 ` [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type Victor Kamensky
  2020-10-07 20:38 ` [PATCH 2/2] qemumips: use 34Kf-64tlb CPU emulation Victor Kamensky
  0 siblings, 2 replies; 16+ messages in thread
From: Victor Kamensky @ 2020-10-07 20:38 UTC
  To: openembedded-core; +Cc: Richard Purdie, Khem Raj

Hi Folks,

I was looking at Yocto Project PR 13992 (qemumips
testimage keeps failing) [0]. My approach was to compare
and analyze the qemumips and qemumips64 machines when they
run the OE do_testimage load. Overall, it seems
that OE qemumips is around twice as slow as qemumips64.

Using perf, gdb, SystemTap and additional qemu instrumentation,
I observed that the soft MMU takes significantly more time in
the qemumips case. The difference could be explained in part by
the different 32-bit vs 64-bit CPU memory layouts, which are handled
by different code paths in qemu. The MIPS64 layout is more optimal, and
it does not seem we can do much about it. But another significant
difference is that in the qemumips64 case the emulated CPU,
MIPS64R2-generic, has 64 TLB entries, while in the qemumips case the
emulated CPU, 34Kf, has only 16. Naturally, in the qemumips case the
TLB is thrashed more (in my tests about 16x more TLB misses), and
since on MIPS TLB refill is handled in software, it causes more
code to run.
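
To illustrate why the TLB size matters so much, here is a minimal toy
model. It is not QEMU code: it models a fully associative TLB with
random replacement, similar in spirit to what tlbwr does. The function
names, the 48-page working set and the uniform access pattern are all
assumptions picked for illustration; a real MIPS TLB entry maps a pair
of pages and real access patterns are far from uniform.

```python
import random


def count_tlb_misses(accesses, nb_tlb, seed=0):
    """Count misses for a toy fully associative TLB with random replacement."""
    rng = random.Random(seed)      # fixed seed keeps runs repeatable
    slots = [None] * nb_tlb        # slot index -> virtual page number
    resident = set()               # pages currently held in some slot
    misses = 0
    for page in accesses:
        if page in resident:
            continue               # TLB hit, nothing to do
        misses += 1
        victim = rng.randrange(nb_tlb)   # random slot, tlbwr-like
        if slots[victim] is not None:
            resident.discard(slots[victim])
        slots[victim] = page
        resident.add(page)
    return misses


def compare_tlb_sizes(working_set=48, n_accesses=100_000, seed=1):
    """Return (misses with 16 entries, misses with 64 entries)."""
    rng = random.Random(seed)
    accesses = [rng.randrange(working_set) for _ in range(n_accesses)]
    return count_tlb_misses(accesses, 16), count_tlb_misses(accesses, 64)


if __name__ == "__main__":
    m16, m64 = compare_tlb_sizes()
    print(f"16 entries: {m16} misses, 64 entries: {m64} misses")
```

In this toy model, once the working set fits into the 64 entries the
misses stop after warm-up, while with 16 entries most accesses keep
missing. The real ratio depends on the workload; the sketch only shows
the mechanism.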

The idea of my fix, implemented by the two patches that
follow this cover letter, is to introduce a new fictitious CPU
type, 34Kf-64tlb, that is identical to 34Kf but has
64 TLB entries instead of the original 16. After all, adding
more TLB entries to a software MMU is very easy :).

With this approach in my limited tests I see that execution
time of core-image-full-cmdline:do_testimage improves by
40%.

I understand that it is not ideal to use a fictitious
CPU type that is not present out there in the wild, but
given the significant gains it produces, IMO it is worth going
this route.

For those who are interested in the notes from my investigation
journey and how/why I came up with this idea, please find
them below.

[0] https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992

Thanks,
Victor

Slow qemumips investigation notes
=================================

As the PR was reported against a poky autobuilder run, I pulled
the poky tree, adjusted the config to match the autobuilder case as
much as possible and built both qemumips and qemumips64 machine
images. The idea is to look at the differences between these two cases.

Starting Point
--------------

Running 'bitbake core-image-full-cmdline:do_testimage': many
tests are skipped, but it looks like a significant enough load
to investigate.

mips64:

real	3m51.953s
user	0m1.099s
sys	0m0.098s

mips:

real	8m29.485s
user	0m1.187s
sys	0m0.113s
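
As a quick sanity check of the "around twice as slow" claim, the two
wall-clock ('real') values above can be compared directly. This is a
minimal sketch; parse_time is a hypothetical helper name, and only the
'real' lines from the two runs are used.

```python
import re


def parse_time(text):
    """Convert a `time`-style value such as '8m29.485s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# 'real' values copied from the mips64 and mips runs above
mips64_real = parse_time("3m51.953s")
mips_real = parse_time("8m29.485s")
slowdown = mips_real / mips64_real

if __name__ == "__main__":
    print(f"qemumips takes {slowdown:.2f}x the wall-clock time of qemumips64")
```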

runqemu qemu CPU time (TIME column of the process listing):

mips64:

kamensky 26058 25963 93 10:05 pts/10   00:01:28 /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-mips64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -drive file=/wd3/yocto/20201002/build-qemumips64/tmp/deploy/images/qemumips64/core-image-full-cmdline-qemumips64-20201002212824.rootfs.ext4,if=virtio,format=raw -usb -device usb-tablet -vga std -machine malta -cpu MIPS64R2-generic -m 256 -serial mon:vc -serial null -kernel /wd3/yocto/20201002/build-qemumips64/tmp/deploy/images/qemumips64/vmlinux--5.8.9+git0+ffbfe61a19_4faa049b6b-r0-qemumips64-20201002212824.bin -append root=/dev/vda rw  ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyS0 console=tty

mips:

kamensky 25599 25547 98 09:58 pts/11   00:04:20 /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-mips -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -drive file=/wd3/yocto/20201002/build-qemumips/tmp/deploy/images/qemumips/core-image-full-cmdline-qemumips-20201003013835.rootfs.ext4,if=virtio,format=raw -usb -device usb-tablet -vga std -machine malta -cpu 34Kf -m 256 -serial mon:vc -serial null -kernel /wd3/yocto/20201002/build-qemumips/tmp/deploy/images/qemumips/vmlinux--5.8.9+git0+ffbfe61a19_93d29a7089-r0-qemumips-20201003013835.bin -append root=/dev/vda rw  ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyS0 console=tty

To get rid of the impact of graphics handling, use
'runqemu serial nographic':

mips64:

kamensky 26402 26347 94 10:12 pts/10   00:00:45 /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-mips64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -drive file=/wd3/yocto/20201002/build-qemumips64/tmp/deploy/images/qemumips64/core-image-full-cmdline-qemumips64-20201002212824.rootfs.ext4,if=virtio,format=raw -usb -device usb-tablet -vga std -nographic -machine malta -cpu MIPS64R2-generic -m 256 -serial mon:stdio -serial null -kernel /wd3/yocto/20201002/build-qemumips64/tmp/deploy/images/qemumips64/vmlinux--5.8.9+git0+ffbfe61a19_4faa049b6b-r0-qemumips64-20201002212824.bin -append root=/dev/vda rw  console=ttyS0 console=ttyS0 ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyS0 console=tty

mips:

kamensky 26728 26667 96 10:14 pts/11   00:01:24 /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-mips -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -drive file=/wd3/yocto/20201002/build-qemumips/tmp/deploy/images/qemumips/core-image-full-cmdline-qemumips-20201003013835.rootfs.ext4,if=virtio,format=raw -usb -device usb-tablet -vga std -nographic -machine malta -cpu 34Kf -m 256 -serial mon:stdio -serial null -kernel /wd3/yocto/20201002/build-qemumips/tmp/deploy/images/qemumips/vmlinux--5.8.9+git0+ffbfe61a19_93d29a7089-r0-qemumips-20201003013835.bin -append root=/dev/vda rw  console=ttyS0 ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyS0 console=tty

Running qemu under perf
-----------------------

Built qemu-system-native with symbols
(added DEBUG_BUILD_class-native = '1' in conf/local.conf)

'perf record -a' during 'runqemu serial nographic'

'perf report' top contributors snippets:

mips64:

   3.53%  qemu-system-mip  qemu-system-mips64       [.] helper_lookup_tb_ptr
   2.18%  qemu-system-mip  qemu-system-mips64       [.] r4k_map_address
   1.49%  qemu-system-mip  qemu-system-mips64       [.] qht_lookup_custom
   1.02%  qemu-system-mip  qemu-system-mips64       [.] la_func_end
   0.86%  qemu-system-mip  qemu-system-mips64       [.] tcg_optimize
   0.84%  qemu-system-mip  qemu-system-mips64       [.] tlb_set_page_with_attrs
   0.76%  qemu-system-mip  qemu-system-mips64       [.] cpu_exec
   0.64%  qemu-system-mip  qemu-system-mips64       [.] liveness_pass_1
   0.62%  qemu-system-mip  qemu-system-mips64       [.] la_bb_end
   0.62%  qemu-system-mip  qemu-system-mips64       [.] tb_htable_lookup
   0.56%  qemu-system-mip  qemu-system-mips64       [.] victim_tlb_hit
   0.52%  qemu-system-mip  qemu-system-mips64       [.] get_page_addr_code_hostp
   0.52%  qemu-system-mip  qemu-system-mips64       [.] tlb_flush_page_locked

mips:

   8.84%  qemu-system-mip  qemu-system-mips         [.] r4k_map_address
   4.41%  qemu-system-mip  qemu-system-mips         [.] tlb_flush_page_locked
   2.78%  qemu-system-mip  qemu-system-mips         [.] tb_jmp_cache_clear_page
   2.02%  qemu-system-mip  qemu-system-mips         [.] helper_lookup_tb_ptr
   1.82%  qemu-system-mip  qemu-system-mips         [.] tlb_set_page_with_attrs
   1.51%  qemu-system-mip  qemu-system-mips         [.] qht_lookup_custom
   1.27%  qemu-system-mip  qemu-system-mips         [.] ptr_cmp_tb_tc
   1.16%  qemu-system-mip  libglib-2.0.so.0.6400.5  [.] g_tree_find_node
   0.99%  qemu-system-mip  qemu-system-mips         [.] cpu_exec

There is a significant difference in how much the soft MMU code
contributes to execution time. Note that r4k_map_address and
tlb_flush_page_locked dominate the report in the mips case. Their
contribution in the mips64 case is noticeably smaller. Need to dig
into this.
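
To put a number on that difference, the overhead of the two functions
named above can be summed from the 'perf report' lines. The sketch
below is only an illustration: the parser assumes the five-column line
shape seen in the snippets, and the helper names are hypothetical.

```python
def parse_perf_report(text):
    """Map symbol name -> overhead percent from 'perf report' text lines."""
    overhead = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed line shape: '8.84%  comm  dso  [.]  symbol'
        if len(fields) >= 5 and fields[0].endswith("%"):
            overhead[fields[-1]] = float(fields[0].rstrip("%"))
    return overhead


def combined_overhead(report_text, symbols):
    """Sum the overhead percentages of the given symbols."""
    overhead = parse_perf_report(report_text)
    return sum(overhead.get(symbol, 0.0) for symbol in symbols)


# Lines copied from the two reports above
MIPS_REPORT = """
   8.84%  qemu-system-mip  qemu-system-mips         [.] r4k_map_address
   4.41%  qemu-system-mip  qemu-system-mips         [.] tlb_flush_page_locked
"""
MIPS64_REPORT = """
   2.18%  qemu-system-mip  qemu-system-mips64       [.] r4k_map_address
   0.52%  qemu-system-mip  qemu-system-mips64       [.] tlb_flush_page_locked
"""
SYMBOLS = ("r4k_map_address", "tlb_flush_page_locked")

if __name__ == "__main__":
    for name, report in (("mips", MIPS_REPORT), ("mips64", MIPS64_REPORT)):
        print(f"{name}: {combined_overhead(report, SYMBOLS):.2f}%")
```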

stap function counter script during 'runqemu serial nographic'
--------------------------------------------------------------

To get another view of how much r4k_map_address
contributes to qemu execution, and to get the difference between
mips64 and mips in how many times the functions are called, I
used the following SystemTap script to collect proper counters.
The use case is boot to login when 'runqemu serial nographic'
is executed.

SystemTap script:

[root@coreos-lnx2 systemtap]# cat qemu_func_count1.stp
global r4k_map_address_count = 0;
global la_func_end_count = 0;
global tcg_optimize_count = 0;

probe process(@1).function("r4k_map_address").call {
  r4k_map_address_count++;
}

probe process(@1).function("la_func_end").call {
  la_func_end_count++;
}

probe process(@1).function("tcg_optimize").call {
  tcg_optimize_count++;
}

probe end {
  printf("r4k_map_address = %d\n", r4k_map_address_count);
  printf("la_func_end = %d\n", la_func_end_count);
  printf("tcg_optimize = %d\n", tcg_optimize_count);
}

SystemTap invocation example:

stap -v qemu_func_count1.stp /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-mips

Results

mips64:

r4k_map_address = 5029890
la_func_end = 2610665
tcg_optimize = 544187

mips:

r4k_map_address = 55255391 = 10.98 * 5029890
la_func_end = 2725631
tcg_optimize = 567154
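
The per-function ratios can be computed from the two result sets with a
few lines of Python. This is a minimal sketch: parse_counters and
ratios are hypothetical helper names, and the input is the script
output copied from above.

```python
def parse_counters(text):
    """Parse 'name = value' lines as printed by the stap 'probe end' block."""
    counters = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        counters[name.strip()] = int(value.split()[0])
    return counters


def ratios(numerator, denominator):
    """Per-counter ratio for the counters present in both runs."""
    return {name: numerator[name] / denominator[name]
            for name in numerator if denominator.get(name)}


MIPS64 = parse_counters("""
r4k_map_address = 5029890
la_func_end = 2610665
tcg_optimize = 544187
""")
MIPS = parse_counters("""
r4k_map_address = 55255391
la_func_end = 2725631
tcg_optimize = 567154
""")

if __name__ == "__main__":
    for name, ratio in ratios(MIPS, MIPS64).items():
        print(f"{name}: {ratio:.2f}x")
```

The two translation-related counters (la_func_end, tcg_optimize) stay
within about 5% of each other between the two machines, while
r4k_map_address is called roughly 11 times more often in the mips case.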


Debugging qemu under gdb
------------------------

Learning more about the r4k_map_address function by attaching
gdb to the native qemu and stepping through the code.

Example breakpoint at r4k_map_address:

(gdb) bt
#0  0x00000000005164de in r4k_map_address (env=0x14d9830, physical=0x7f9dbfdfe300, prot=0x7f9dbfdfe2fc, address=2138944944, rw=1, access_type=32)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:73
#1  0x000000000051588e in get_seg_physical_address (env=env@entry=0x14d9830, physical=physical@entry=0x7f9dbfdfe300, prot=prot@entry=0x7f9dbfdfe2fc, 
    real_address=real_address@entry=2138944944, rw=rw@entry=1, access_type=access_type@entry=32, mmu_idx=0, am=3, eu=true, segmask=1073741823, 
    physical_base=1073741824) at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:192
#2  0x0000000000515907 in get_segctl_physical_address (env=env@entry=0x14d9830, physical=physical@entry=0x7f9dbfdfe300, prot=prot@entry=0x7f9dbfdfe2fc, 
    real_address=real_address@entry=2138944944, rw=rw@entry=1, access_type=access_type@entry=32, mmu_idx=0, segctl=1082, segmask=1073741823)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:211
#3  0x0000000000515996 in get_physical_address (env=env@entry=0x14d9830, physical=physical@entry=0x7f9dbfdfe300, prot=prot@entry=0x7f9dbfdfe2fc, 
    real_address=real_address@entry=2138944944, rw=rw@entry=1, access_type=access_type@entry=32, mmu_idx=0)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:264
#4  0x0000000000517a83 in mips_cpu_tlb_fill (cs=0x14d0e30, address=2138944944, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, 
    probe=<optimized out>, retaddr=140316022715390)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:909
#5  0x0000000000461596 in tlb_fill (cpu=cpu@entry=0x14d0e30, addr=addr@entry=2138944944, size=size@entry=4, access_type=access_type@entry=MMU_DATA_STORE, 
    mmu_idx=mmu_idx@entry=0, retaddr=retaddr@entry=140316022715390)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:1032
#6  0x0000000000467667 in store_helper (op=MO_BEUL, retaddr=140316022715390, oi=160, val=0, addr=2138944944, env=0x14d9830)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:2035
#7  helper_be_stl_mmu (env=0x14d9830, addr=2138944944, val=0, oi=160, retaddr=140316022715390)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:2192
#8  0x00007f9ddeb0b3fe in code_gen_buffer ()
#9  0x000000000047258a in cpu_tb_exec (itb=<optimized out>, cpu=<optimized out>)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:172
#10 cpu_loop_exec_tb (tb_exit=<synthetic pointer>, last_tb=<synthetic pointer>, tb=<optimized out>, cpu=<optimized out>)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:636
#11 cpu_exec (cpu=cpu@entry=0x14d0e30)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:749
#12 0x00000000004be149 in tcg_cpu_exec (cpu=cpu@entry=0x14d0e30)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/softmmu/cpus.c:1356
#13 0x00000000004bfa51 in qemu_tcg_cpu_thread_fn (arg=arg@entry=0x14d0e30)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/softmmu/cpus.c:1664
#14 0x00000000007f2f6b in qemu_thread_start (args=0x14e9f90)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/util/qemu-thread-posix.c:521
#15 0x00007f9e16320e5e in ?? () from /wd3/yocto/20201002/build-qemumips/tmp/sysroots-uninative/x86_64-linux/lib/libpthread.so.0
#16 0x00007f9e1624e64f in clone () from /wd3/yocto/20201002/build-qemumips/tmp/sysroots-uninative/x86_64-linux/lib/libc.so.6

What is inside the soft MMU data structure for the given
CPU environment in the mips case:

*env->tlb
---------
$8 = {
  nb_tlb = 16, 
  tlb_in_use = 128, 
  map_address = 0x5164dd <r4k_map_address>, 
  helper_tlbwi = 0x518b7a <r4k_helper_tlbwi>, 
  helper_tlbwr = 0x518d94 <r4k_helper_tlbwr>, 
  helper_tlbp = 0x518dc4 <r4k_helper_tlbp>, 
  helper_tlbr = 0x518f88 <r4k_helper_tlbr>, 
  helper_tlbinv = 0x518a82 <r4k_helper_tlbinv>, 
  helper_tlbinvf = 0x518b39 <r4k_helper_tlbinvf>, 
  mmu = {

Looking at get_seg_physical_address and get_physical_address
------------------------------------------------------------

In get_seg_physical_address there are two major cases:
when the segment is unmapped, the physical address is computed
directly and r4k_map_address is not called; when the segment is
TLB mapped, it calls r4k_map_address through env->tlb->map_address.

(gdb) s
get_seg_physical_address (env=env@entry=0x14d9830, physical=physical@entry=0x7f9dbfdfe790, prot=prot@entry=0x7f9dbfdfe78c, 
    real_address=real_address@entry=2168782848, rw=rw@entry=2, access_type=access_type@entry=32, mmu_idx=0, am=0, eu=false, segmask=536870911, 
    physical_base=0) at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:184
184	{
(gdb) n
185	    int mapped = is_seg_am_mapped(am, eu, mmu_idx);
(gdb) list
180	                                    int rw, int access_type, int mmu_idx,
181	                                    unsigned int am, bool eu,
182	                                    target_ulong segmask,
183	                                    hwaddr physical_base)
184	{
185	    int mapped = is_seg_am_mapped(am, eu, mmu_idx);
186	
187	    if (mapped < 0) {
188	        /* is_seg_am_mapped can report TLBRET_BADADDR */
189	        return mapped;
190	    } else if (mapped) {
191	        /* The segment is TLB mapped */
192	        return env->tlb->map_address(env, physical, prot, real_address, rw,
193	                                     access_type);
194	    } else {
195	        /* The segment is unmapped */
196	        *physical = physical_base | (real_address & segmask);
197	        *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
198	        return TLBRET_MATCH;
199	    }

Complete listing from source file:

static int get_seg_physical_address(CPUMIPSState *env, hwaddr *physical,
                                    int *prot, target_ulong real_address,
                                    int rw, int access_type, int mmu_idx,
                                    unsigned int am, bool eu,
                                    target_ulong segmask,
                                    hwaddr physical_base)
{
    int mapped = is_seg_am_mapped(am, eu, mmu_idx);

    if (mapped < 0) {
        /* is_seg_am_mapped can report TLBRET_BADADDR */
        return mapped;
    } else if (mapped) {
        /* The segment is TLB mapped */
        return env->tlb->map_address(env, physical, prot, real_address, rw,
                                     access_type);
    } else {
        /* The segment is unmapped */
        *physical = physical_base | (real_address & segmask);
        *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
        return TLBRET_MATCH;
    }
}
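
The mapped/unmapped split in the listing above can be modelled in a few
lines. This is a toy sketch, not the QEMU implementation: the `mapped`
flag stands in for the result of is_seg_am_mapped(), the `map_address`
callback stands in for env->tlb->map_address, and TLBRET_MATCH is just
a stand-in success code here.

```python
TLBRET_MATCH = 0  # stand-in success code for this sketch


def seg_physical_address(real_address, mapped, segmask, physical_base,
                         map_address):
    """Toy model of the mapped/unmapped split in get_seg_physical_address.

    `mapped` stands in for the result of is_seg_am_mapped(), and
    `map_address` for env->tlb->map_address (r4k_map_address here).
    """
    if mapped:
        # TLB-mapped segment: the (expensive) TLB lookup decides the result.
        return map_address(real_address)
    # Unmapped segment: pure arithmetic, no TLB lookup at all.
    return TLBRET_MATCH, physical_base | (real_address & segmask)


if __name__ == "__main__":
    def fake_tlb_lookup(address):
        raise AssertionError("unmapped access must not consult the TLB")

    # Argument values from the gdb session above (segmask=536870911,
    # physical_base=0), assuming the segment was classified as unmapped.
    ret, phys = seg_physical_address(2168782848, False, 536870911, 0,
                                     fake_tlb_lookup)
    print(hex(2168782848), "->", hex(phys))
```

With those values the virtual address 0x81450000 is translated by
masking alone, which is why only accesses to TLB-mapped segments show
up as r4k_map_address calls.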

tlb helper instructions count during 'runqemu serial nographic'
---------------------------------------------------------------

Similar to the previously collected counters, run a SystemTap script
to collect the number of calls to r4k_map_address and to the different
r4k_helper_tlb* functions. Note that each r4k_helper_xxxx function
emulates the corresponding xxxx TLB instruction (e.g. r4k_helper_tlbwr
emulates tlbwr).

mips64:

r4k_map_address = 4879624
r4k_helper_tlbinv = 0
r4k_helper_tlbinvf = 0
r4k_helper_tlbp = 156024
r4k_helper_tlbr = 0
r4k_helper_tlbwi = 101065
r4k_helper_tlbwr = 1678316

mips:

r4k_map_address = 58798121 = 12.04 * 4879624
r4k_helper_tlbinv = 0
r4k_helper_tlbinvf = 0
r4k_helper_tlbp = 162568
r4k_helper_tlbr = 0
r4k_helper_tlbwi = 80785
r4k_helper_tlbwr = 26343867 = 15.69 * 1678316

Note that, compared to mips64, in the mips case r4k_map_address is
called 12 times more often and the target issues the
tlbwr instruction almost 16 times more often.

Note that the tlbwr instruction writes ('w') a new TLB entry,
replacing a randomly ('r') chosen existing one. Typically
tlbwr is called from 'TLB refill' exception
handling. Note that on MIPS, TLB refill is handled in software,
unlike on other CPUs such as x86 and ARM.

Experiment with mips kernel that disables CONFIG_HIGHMEM
--------------------------------------------------------

With CONFIG_HIGHMEM disabled:

r4k_map_address = 89238250
r4k_helper_tlbinv = 0
r4k_helper_tlbinvf = 0
r4k_helper_tlbp = 175916
r4k_helper_tlbr = 0
r4k_helper_tlbwi = 87341
r4k_helper_tlbwr = 40213923

It does not look better; discarding this path.

/proc/cpuinfo
-------------

mips64:

root@qemumips64:~# cat /proc/cpuinfo 
system type		: MIPS Malta
machine			: mti,malta
processor		: 0
cpu model		: MIPS GENERIC QEMU V0.0  FPU V0.0
BogoMIPS		: 835.58
wait instruction	: yes
microsecond timers	: yes
tlb_entries		: 64 <--------------------------------
extra interrupt vector	: yes
hardware watchpoint	: yes, count: 1, address/irw mask: [0x0ff8]
isa			: mips1 mips2 mips3 mips4 mips5 mips32r1 mips32r2 mips64r1 mips64r2
ASEs implemented	: mips3d
shadow register sets	: 1
kscratch registers	: 0
package			: 0
core			: 0
VCED exceptions		: not available
VCEI exceptions		: not available

mips:

root@qemumips:~# cat /proc/cpuinfo 
system type		: MIPS Malta
machine			: mti,malta
processor		: 0
cpu model		: MIPS 34Kc V0.0  FPU V0.0
BogoMIPS		: 801.17
wait instruction	: yes
microsecond timers	: yes
tlb_entries		: 16 <--------------------------------
extra interrupt vector	: yes
hardware watchpoint	: yes, count: 1, address/irw mask: [0x0ff8]
isa			: mips1 mips2 mips32r1 mips32r2
ASEs implemented	: mips16 dsp mt
shadow register sets	: 16
kscratch registers	: 0
package			: 0
core			: 0
VPE			: 0
VCED exceptions		: not available
VCEI exceptions		: not available

Later note: unfortunately, the first time around I made an operator
error and captured output from the mips64 case thinking that
I was capturing the mips case. This was corrected later when, with
the instrumentation described below, I realized that in the mips
case we have just 16 TLB entries.
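
A small check like the following could help catch that kind of mix-up
when saving captures. It is only a sketch: tlb_entries and
check_capture are hypothetical helper names, and the input is assumed
to be the text of /proc/cpuinfo as shown above.

```python
def tlb_entries(cpuinfo_text):
    """Return the 'tlb_entries' value from /proc/cpuinfo text, or None."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "tlb_entries":
            return int(value.split()[0])
    return None


def check_capture(cpuinfo_text, expected_entries):
    """Raise if a capture does not come from the machine we think it does."""
    found = tlb_entries(cpuinfo_text)
    if found != expected_entries:
        raise ValueError(
            f"expected {expected_entries} TLB entries, found {found}")
    return found


if __name__ == "__main__":
    sample = "cpu model\t\t: MIPS 34Kc V0.0  FPU V0.0\ntlb_entries\t\t: 16\n"
    print(check_capture(sample, 16))
```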


Looking where tlbwr instruction is used
---------------------------------------

Looked at the places in the kernel where the tlbwr instruction
is used. It is in __update_tlb and in the generated TLB
refill exception handler. Besides the 32-bit vs 64-bit
difference, there is not much to see.

... removed my notes since it was a dead branch in
the investigation ...

Just for reference, I kept the mips TLB refill
exception handler code as an example. It is executed
on every TLB miss to update the TLB with new entries
from the page tables:


(gdb) x /30i ebase
   0x82890000:	mfc0	k1,c0_context
   0x82890004:	lui	k0,0x8112
   0x82890008:	srl	k1,k1,0x17
   0x8289000c:	addu	k1,k0,k1
   0x82890010:	mfc0	k0,c0_badvaddr
   0x82890014:	lw	k1,10064(k1)
   0x82890018:	srl	k0,k0,0x16
   0x8289001c:	sll	k0,k0,0x2
   0x82890020:	addu	k1,k1,k0
   0x82890024:	mfc0	k0,c0_context
   0x82890028:	lw	k1,0(k1)
   0x8289002c:	srl	k0,k0,0x1
   0x82890030:	andi	k0,k0,0xff8
   0x82890034:	addu	k1,k1,k0
   0x82890038:	lw	k0,0(k1)
   0x8289003c:	lw	k1,4(k1)
   0x82890040:	srl	k0,k0,0x6
   0x82890044:	mtc0	k0,c0_entrylo0
   0x82890048:	srl	k1,k1,0x6
   0x8289004c:	mtc0	k1,c0_entrylo1
   0x82890050:	ehb
=> 0x82890054:	tlbwr
   0x82890058:	eret

The mips target hits the above after qemu detects a TLB miss
in the tlb_fill code path and raises a TLB miss exception,
as per this backtrace:

(gdb) bt
#0  raise_mmu_exception (env=env@entry=0x2a2dc60, address=address@entry=1434011948, rw=rw@entry=0, tlb_error=tlb_error@entry=-2)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:467
#1  0x0000000000517c1e in mips_cpu_tlb_fill (cs=0x2a25260, address=1434011948, size=<optimized out>, access_type=MMU_DATA_LOAD, mmu_idx=2, 
    probe=<optimized out>, retaddr=140590720516388)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:957
#2  0x0000000000461596 in tlb_fill (cpu=cpu@entry=0x2a25260, addr=addr@entry=1434011948, size=size@entry=4, 
    access_type=access_type@entry=MMU_DATA_LOAD, mmu_idx=mmu_idx@entry=2, retaddr=retaddr@entry=140590720516388)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:1032
#3  0x0000000000462f19 in load_helper (full_load=0x462d91 <full_be_ldul_mmu>, code_read=false, op=MO_BEUL, retaddr=140590720516388, oi=162, 
    addr=1434011948, env=0x2a2dc60)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:1583
#4  full_be_ldul_mmu (env=0x2a2dc60, addr=1434011948, oi=162, retaddr=140590720516388)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:1724
#5  0x0000000000465a8f in helper_be_ldul_mmu (env=<optimized out>, addr=<optimized out>, oi=<optimized out>, retaddr=<optimized out>)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:1731
#6  0x00007fddd3f4819c in code_gen_buffer ()
#7  0x000000000047258a in cpu_tb_exec (itb=<optimized out>, cpu=<optimized out>)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:172
#8  cpu_loop_exec_tb (tb_exit=<synthetic pointer>, last_tb=<synthetic pointer>, tb=<optimized out>, cpu=<optimized out>)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:636
#9  cpu_exec (cpu=cpu@entry=0x2a25260)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:749
#10 0x00000000004be149 in tcg_cpu_exec (cpu=cpu@entry=0x2a25260)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/softmmu/cpus.c:1356
#11 0x00000000004bfa51 in qemu_tcg_cpu_thread_fn (arg=arg@entry=0x2a25260)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/softmmu/cpus.c:1664
#12 0x00000000007f2f6b in qemu_thread_start (args=0x2a3e3c0)
    at /wd3/yocto/20201002/build-qemumips/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/util/qemu-thread-posix.c:521
#13 0x00007fde13d73e5e in ?? () from /wd3/yocto/20201002/build-qemumips/tmp/sysroots-uninative/x86_64-linux/lib/libpthread.so.0
#14 0x00007fde13ca164f in clone () from /wd3/yocto/20201002/build-qemumips/tmp/sysroots-uninative/x86_64-linux/lib/libc.so.6
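
For a rough feel of what the extra refills cost the guest, the length
of the refill handler disassembled earlier in these notes can be
multiplied by the tlbwr count. This is a back-of-the-envelope sketch
and an upper bound only: it assumes every tlbwr comes from the refill
handler, while tlbwr is also issued from __update_tlb, and it ignores
the qemu-side cost of raising the exception.

```python
HANDLER_START = 0x82890000   # first instruction of the refill handler
HANDLER_LAST = 0x82890058    # address of the final eret
INSN_BYTES = 4               # MIPS32 instructions are 4 bytes each

handler_insns = (HANDLER_LAST - HANDLER_START) // INSN_BYTES + 1


def refill_insn_upper_bound(tlbwr_count, insns_per_refill=handler_insns):
    """Guest instructions spent in the handler if every tlbwr is a refill."""
    return tlbwr_count * insns_per_refill


if __name__ == "__main__":
    mips_tlbwr = 26343867    # r4k_helper_tlbwr count from the mips run
    print(f"{handler_insns} instructions per refill, "
          f"at most {refill_insn_upper_bound(mips_tlbwr)} in total")
```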


Need more tools and use case in target image
--------------------------------------------

Switched the image under investigation from core-image-full-cmdline
to core-image-sato-sdk, since I wanted to have more development tools
in the target image.

Tried running perf in the target; unfortunately qemu does not implement
CPU h/w event counters, so it was not very useful.

Trying to understand soft mmu code path between mips and mips64
---------------------------------------------------------------

In order to get insight into the typical code path of qemu mips
soft MMU handling, I rebuilt qemu-system with the -fno-omit-frame-pointer
option so I could use 'perf record -g' (i.e. capture full backtraces in
perf events).

Ran both cases under 'perf record -g' and studied the results. Here are
a couple of cases highlighting how the soft MMU works
differently between the mips and mips64 cases. The difference
is largely explained by the different CPU memory
layouts.

helper_ret_ldub_mmu

mips:

     3.37%     0.02%  qemu-system-mip  qemu-system-mips                       [.] helper_ret_ldub_mmu
            |          
             --3.35%--helper_ret_ldub_mmu
                       |          
                        --3.34%--full_ldub_mmu
                                  |          
                                   --3.11%--tlb_fill
                                             |          
                                              --3.04%--mips_cpu_tlb_fill
                                                        |          
                                                        |--1.85%--get_physical_address
                                                        |          |          
                                                        |           --1.79%--get_segctl_physical_address
                                                        |                     |          
                                                        |                      --1.74%--get_seg_physical_address
                                                        |                                |          
                                                        |                                 --1.72%--r4k_map_address
                                                        |          
                                                         --0.61%--tlb_set_page
                                                                   |          
                                                                    --0.58%--tlb_set_page_with_attrs

mips64:

    35.00%     0.00%  qemu-system-mip  [unknown]                              [.] 0x0000000000000001

               |--1.88%--0x7f0faedc6916
               |          |          
               |           --1.87%--helper_ret_ldub_mmu
               |                     |          
               |                      --1.87%--full_ldub_mmu
               |                                |          
               |                                 --1.77%--tlb_fill
               |                                           |          
               |                                            --1.69%--mips_cpu_tlb_fill
               |                                                      |          
               |                                                       --1.23%--get_physical_address
               |                                                                 |          
               |                                                                  --1.19%--r4k_map_address


Why in the mips64 case get_physical_address calls r4k_map_address directly:

(gdb) bt
#0  r4k_map_address (env=0x2886f90, physical=0x7f0ff24a42e0, prot=0x7f0ff24a42dc, address=733015402918, rw=1, access_type=32)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:73
#1  0x0000000000525504 in get_physical_address (env=env@entry=0x2886f90, physical=physical@entry=0x7f0ff24a42e0, prot=prot@entry=0x7f0ff24a42dc, 
    real_address=real_address@entry=733015402918, rw=rw@entry=1, access_type=access_type@entry=32, mmu_idx=0)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:271
#2  0x00000000005271ad in mips_cpu_tlb_fill (cs=0x287e570, address=733015402918, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, 
    probe=<optimized out>, retaddr=139705288065153)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:909
#3  0x00000000004629b4 in tlb_fill (cpu=cpu@entry=0x287e570, addr=addr@entry=733015402918, size=size@entry=2, 
    access_type=access_type@entry=MMU_DATA_STORE, mmu_idx=mmu_idx@entry=0, retaddr=retaddr@entry=139705288065153)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:1032
#4  0x0000000000468008 in store_helper (op=MO_BEUW, retaddr=139705288065153, oi=144, val=0, addr=733015402918, env=0x2886f90)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:2035
#5  helper_be_stw_mmu (env=0x2886f90, addr=733015402918, val=0, oi=144, retaddr=139705288065153)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cputlb.c:2180
#6  0x00007f0fac118081 in code_gen_buffer ()
#7  0x00000000004742c6 in cpu_tb_exec (itb=<optimized out>, cpu=0x287e570)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:172
#8  cpu_loop_exec_tb (tb_exit=<synthetic pointer>, last_tb=<synthetic pointer>, tb=<optimized out>, cpu=0x287e570)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:636
#9  cpu_exec (cpu=cpu@entry=0x287e570)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/accel/tcg/cpu-exec.c:749
#10 0x00000000004c349f in tcg_cpu_exec (cpu=cpu@entry=0x287e570)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/softmmu/cpus.c:1356
#11 0x00000000004c4bd4 in qemu_tcg_rr_cpu_thread_fn (arg=arg@entry=0x287e570)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/softmmu/cpus.c:1458
#12 0x000000000081df64 in qemu_thread_start (args=0x2898a00)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/util/qemu-thread-posix.c:521
#13 0x00007f0ff456de5e in ?? () from /wd3/yocto/20201002/build-qemumips64/tmp/sysroots-uninative/x86_64-linux/lib/libpthread.so.0
#14 0x00007f0ff449b64f in clone () from /wd3/yocto/20201002/build-qemumips64/tmp/sysroots-uninative/x86_64-linux/lib/libc.so.6
(gdb) up
#1  0x0000000000525504 in get_physical_address (env=env@entry=0x2886f90, physical=physical@entry=0x7f0ff24a42e0, prot=prot@entry=0x7f0ff24a42dc, 
    real_address=real_address@entry=733015402918, rw=rw@entry=1, access_type=access_type@entry=32, mmu_idx=0)
    at /wd3/yocto/20201002/build-qemumips64/tmp/work/x86_64-linux/qemu-system-native/5.1.0-r0/qemu-5.1.0/target/mips/helper.c:271
271	            ret = env->tlb->map_address(env, physical, prot,
(gdb) list
266	                                          mmu_idx, segctl, 0x3FFFFFFF);
267	#if defined(TARGET_MIPS64)
268	    } else if (address < 0x4000000000000000ULL) {
269	        /* xuseg */
270	        if (UX && address <= (0x3FFFFFFFFFFFFFFFULL & env->SEGMask)) {
271	            ret = env->tlb->map_address(env, physical, prot,                 // <------------------
272	                                        real_address, rw, access_type);
273	        } else {
274	            ret = TLBRET_BADADDR;
275	        }

Another example: the helper_be_ldul_mmu function, which calls tlb_fill.

mips:

     7.49%     0.18%  qemu-system-mip  qemu-system-mips                       [.] helper_be_ldul_mmu
            |          
             --7.30%--helper_be_ldul_mmu
                       |          
                        --7.28%--full_be_ldul_mmu
                                  |          
                                   --6.46%--tlb_fill
                                             |          
                                              --6.22%--mips_cpu_tlb_fill
                                                        |          
                                                        |--2.98%--get_physical_address
                                                        |          |          
                                                        |           --2.77%--get_segctl_physical_address
                                                        |                     |          
                                                        |                      --2.63%--get_seg_physical_address
                                                        |                                |          
                                                        |                                 --2.57%--r4k_map_address
                                                        |          
                                                        |--1.94%--tlb_set_page
                                                        |          |          
                                                        |           --1.84%--tlb_set_page_with_attrs
                                                        |          
                                                         --0.87%--do_raise_exception_err
                                                                   |          
                                                                    --0.84%--cpu_loop_exit_restore
                                                                              |          
                                                                               --0.75%--cpu_restore_state

mips64:

     1.25%     0.04%  qemu-system-mip  qemu-system-mips64                     [.] helper_be_ldul_mmu
            |          
             --1.22%--helper_be_ldul_mmu
                       |          
                        --1.20%--full_be_ldul_mmu
                                  |          
                                   --0.94%--tlb_fill
                                             |          
                                              --0.91%--mips_cpu_tlb_fill
                                                        |          
                                                         --0.50%--get_physical_address


Deeper dive into soft mmu behavior differences
----------------------------------------------

To investigate soft mmu behavior I collected memory access
statistics with the quick and dirty instrumentation patch
below. SystemTap cannot be used here because the event
rate is far too high.

The use case is a boot of the core-image-full-cmdline-sdk image
in "serial nographic" mode.

The counters were captured in gdb by attaching to the qemu
process after the image had booted.

[kamensky@coreos-lnx2 qemu-5.1.0]$ cat patches/mips_debugging_stats.patch
Index: qemu-5.1.0/target/mips/helper.c
===================================================================
--- qemu-5.1.0.orig/target/mips/helper.c
+++ qemu-5.1.0/target/mips/helper.c
@@ -175,6 +175,12 @@ static int is_seg_am_mapped(unsigned int
     };
 }
 
+struct {
+    unsigned long long mapped_negative;
+    unsigned long long mapped_positive;
+    unsigned long long mapped_zero;
+} get_seg_physical_address_stats;
+
 static int get_seg_physical_address(CPUMIPSState *env, hwaddr *physical,
                                     int *prot, target_ulong real_address,
                                     int rw, int access_type, int mmu_idx,
@@ -185,13 +191,16 @@ static int get_seg_physical_address(CPUM
     int mapped = is_seg_am_mapped(am, eu, mmu_idx);
 
     if (mapped < 0) {
+        get_seg_physical_address_stats.mapped_negative++;
         /* is_seg_am_mapped can report TLBRET_BADADDR */
         return mapped;
     } else if (mapped) {
+        get_seg_physical_address_stats.mapped_positive++;
         /* The segment is TLB mapped */
         return env->tlb->map_address(env, physical, prot, real_address, rw,
                                      access_type);
     } else {
+        get_seg_physical_address_stats.mapped_zero++;
         /* The segment is unmapped */
         *physical = physical_base | (real_address & segmask);
         *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
@@ -213,6 +222,17 @@ static int get_segctl_physical_address(C
                                     pa & ~(hwaddr)segmask);
 }
 
+struct {
+    unsigned long long int useg_limit;
+    unsigned long long int xuseg;
+    unsigned long long int xsseg;
+    unsigned long long int xkphys;
+    unsigned long long int xkseg;
+    unsigned long long int kseg0;
+    unsigned long long int kseg1;
+    unsigned long long int kseg2;
+} get_physical_address_stats;
+
 static int get_physical_address(CPUMIPSState *env, hwaddr *physical,
                                 int *prot, target_ulong real_address,
                                 int rw, int access_type, int mmu_idx)
@@ -264,6 +284,7 @@ static int get_physical_address(CPUMIPSS
         ret = get_segctl_physical_address(env, physical, prot,
                                           real_address, rw, access_type,
                                           mmu_idx, segctl, 0x3FFFFFFF);
+        get_physical_address_stats.useg_limit++;
 #if defined(TARGET_MIPS64)
     } else if (address < 0x4000000000000000ULL) {
         /* xuseg */
@@ -273,6 +294,7 @@ static int get_physical_address(CPUMIPSS
         } else {
             ret = TLBRET_BADADDR;
         }
+        get_physical_address_stats.xuseg++;
     } else if (address < 0x8000000000000000ULL) {
         /* xsseg */
         if ((supervisor_mode || kernel_mode) &&
@@ -282,6 +304,7 @@ static int get_physical_address(CPUMIPSS
         } else {
             ret = TLBRET_BADADDR;
         }
+        get_physical_address_stats.xsseg++;
     } else if (address < 0xC000000000000000ULL) {
         /* xkphys */
         if ((address & 0x07FFFFFFFFFFFFFFULL) <= env->PAMask) {
@@ -314,6 +337,7 @@ static int get_physical_address(CPUMIPSS
         } else {
             ret = TLBRET_BADADDR;
         }
+        get_physical_address_stats.xkphys++;
     } else if (address < 0xFFFFFFFF80000000ULL) {
         /* xkseg */
         if (kernel_mode && KX &&
@@ -323,22 +347,26 @@ static int get_physical_address(CPUMIPSS
         } else {
             ret = TLBRET_BADADDR;
         }
+        get_physical_address_stats.xkseg++;
 #endif
     } else if (address < KSEG1_BASE) {
         /* kseg0 */
         ret = get_segctl_physical_address(env, physical, prot, real_address, rw,
                                           access_type, mmu_idx,
                                           env->CP0_SegCtl1 >> 16, 0x1FFFFFFF);
+        get_physical_address_stats.kseg0++;
     } else if (address < KSEG2_BASE) {
         /* kseg1 */
         ret = get_segctl_physical_address(env, physical, prot, real_address, rw,
                                           access_type, mmu_idx,
                                           env->CP0_SegCtl1, 0x1FFFFFFF);
+        get_physical_address_stats.kseg1++;
     } else if (address < KSEG3_BASE) {
         /* sseg (kseg2) */
         ret = get_segctl_physical_address(env, physical, prot, real_address, rw,
                                           access_type, mmu_idx,
                                           env->CP0_SegCtl0 >> 16, 0x1FFFFFFF);
+        get_physical_address_stats.kseg2++;
     } else {
         /*
          * kseg3
Index: qemu-5.1.0/target/mips/op_helper.c
===================================================================
--- qemu-5.1.0.orig/target/mips/op_helper.c
+++ qemu-5.1.0/target/mips/op_helper.c
@@ -734,10 +734,19 @@ void r4k_helper_tlbwi(CPUMIPSState *env)
     r4k_fill_tlb(env, idx);
 }
 
+unsigned long long tlb_wr_index[128];
+unsigned long long tlb_wr_outside;
+
 void r4k_helper_tlbwr(CPUMIPSState *env)
 {
     int r = cpu_mips_get_random(env);
 
+    if (r < 128) {
+        tlb_wr_index[r]++;
+    } else {
+        tlb_wr_outside++;
+    }
+    
     r4k_invalidate_tlb(env, r, 1);
     r4k_fill_tlb(env, r);
 }
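
The instrumentation pattern used in the patch above is simply a set of
global counters bumped on the hot path and printed by name from gdb.
A stand-alone sketch of the same idea for the classic 32 bit MIPS
segment map, assuming the legacy boundaries (useg below 0x80000000,
then 512 MB each for kseg0, kseg1, kseg2 and kseg3) and ignoring
SegCtl remapping; all names here are hypothetical and not part of qemu:

```c
#include <stdint.h>

/* Hypothetical names; only the segment boundaries follow the MIPS32 map. */
enum mips32_seg {
    SEG_USEG,   /* 0x00000000..0x7FFFFFFF, TLB mapped            */
    SEG_KSEG0,  /* 0x80000000..0x9FFFFFFF, unmapped, cached      */
    SEG_KSEG1,  /* 0xA0000000..0xBFFFFFFF, unmapped, uncached    */
    SEG_KSEG2,  /* 0xC0000000..0xDFFFFFFF, TLB mapped            */
    SEG_KSEG3,  /* 0xE0000000..0xFFFFFFFF, TLB mapped            */
    SEG_COUNT
};

/* Global (not static) so that gdb can print it by name after attaching. */
unsigned long long seg_hits[SEG_COUNT];

/* Map a 32 bit virtual address to its legacy segment. */
enum mips32_seg classify_mips32(uint32_t addr)
{
    if (addr < 0x80000000u) return SEG_USEG;
    if (addr < 0xA0000000u) return SEG_KSEG0;
    if (addr < 0xC0000000u) return SEG_KSEG1;
    if (addr < 0xE0000000u) return SEG_KSEG2;
    return SEG_KSEG3;
}

/* Classify the address and bump the matching counter. */
enum mips32_seg count_access(uint32_t addr)
{
    enum mips32_seg s = classify_mips32(addr);
    seg_hits[s]++;
    return s;
}
```

With something like this compiled in, "p seg_hits" in gdb would give
one counter per segment, in the same spirit as the structures printed
in the results.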

Analyzing instrumentation results
---------------------------------

mips64:

(gdb) p get_seg_physical_address_stats
$1 = {mapped_negative = 0, mapped_positive = 9880, mapped_zero = 9956435}
(gdb) p get_physical_address_stats
$2 = {useg_limit = 2, xuseg = 5703053, xsseg = 0, xkphys = 5760602, xkseg = 423503, kseg0 = 4195824, kseg1 = 9, kseg2 = 16}

get_seg_physical_address_stats

mapped_positive =             9880
mapped_zero =              9956435
----------------------------------
total =                    9966315

get_physical_address_stats

useg_limit =                     2
xuseg =                    5703053
xsseg =                          0
xkphys =                   5760602
xkseg =                     423503
kseg0 =                    4195824
kseg1 =                          9
kseg2 =                         16
----------------------------------
total =                   16083009


mips:

(gdb) p get_seg_physical_address_stats
$1 = {mapped_negative = 0, mapped_positive = 18501772, mapped_zero = 11727856}
(gdb) p get_physical_address_stats
$2 = {useg_limit = 18008359, xuseg = 0, xsseg = 0, xkphys = 0, xkseg = 0, kseg0 = 11524734, kseg1 = 203122, kseg2 = 355583}

get_seg_physical_address

mapped_negative =                0
mapped_positive =         18501772
mapped_zero =             11727856
----------------------------------
total =                   30229628

get_physical_address

useg_limit =              18008359
xuseg =                          0
xsseg =                          0
xkphys =                         0
xkseg =                          0
kseg0 =                   11524734
kseg1 =                     203122
kseg2 =                     355583
----------------------------------
total =                   30091798

mips (after bumping the number of TLBs to 64):

These numbers were added to this section later, after the idea
for the fix materialized, so that they are easy to compare with
the baseline.

(gdb) p get_seg_physical_address_stats
$1 = {mapped_negative = 0, mapped_positive = 7873129, mapped_zero = 14561039}
(gdb) p get_physical_address_stats
$2 = {useg_limit = 7312564, xuseg = 0, xsseg = 0, xkphys = 0, xkseg = 0, kseg0 = 14351746, kseg1 = 209293, kseg2 = 353834}

get_seg_physical_address

mapped_negative =                0
mapped_positive =          7873129
mapped_zero =             14561039
----------------------------------
total =                   22434168 


get_physical_address

useg_limit =               7312564
xuseg =                          0
xsseg =                          0
xkphys =                         0
xkseg =                          0
kseg0 =                   14351746
kseg1 =                     209293
kseg2 =                     353834
----------------------------------
total =                   22227437
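
To compare the two mips runs, it helps to look at the share of
get_seg_physical_address calls that had to go through the guest TLB
(mapped_positive) rather than the raw counts. A minimal sketch, with a
hypothetical helper name, assuming mapped_negative stays 0 as it does
in both runs. It is only meaningful for the two mips runs, since in
the mips64 build xuseg accesses call map_address directly and never
reach this counter:

```c
/*
 * Share of get_seg_physical_address calls that needed a guest TLB
 * lookup, in percent. Returns 0 when there were no calls at all.
 */
double mapped_share_percent(unsigned long long mapped_positive,
                            unsigned long long mapped_zero)
{
    unsigned long long total = mapped_positive + mapped_zero;

    if (total == 0) {
        return 0.0;
    }
    return 100.0 * (double)mapped_positive / (double)total;
}
```

Fed with the counters above, this gives roughly 61% for the original
16 TLB case and roughly 35% after the bump to 64 TLBs.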

Instrumentation of r4k_helper_tlbwr function
--------------------------------------------

mips (16 TLB original case)

(gdb) p tlb_wr_index
$1 = {514054, 514256, 514005, 514149, 514100, 514067, 513906, 513965, 514025, 514076, 514243, 513932, 514119, 514000, 514059, 514191, 0 <repeats 112 times>}

total = 514054 + 514256 + 514005 + 514149 + 514100 + 514067 + 513906 + 513965 + 514025 + 514076 + 514243 + 513932 + 514119 + 514000 + 514059 + 514191 = 8225147

At this point I realized that in the mips case we have
just 16 TLBs, and the idea to bump that number up came up.

I ran an experiment on mips with a CPU identical to the original
one, but changed to have 64 soft mmu TLBs.

mips (64 TLB)

(gdb) p tlb_wr_index
$3 = {40034, 40318, 39982, 39981, 40028, 40010, 40109, 40315, 40237, 40178, 40293, 39995, 40210, 40073, 40088, 40100, 40172, 40011, 40182, 40190, 40096, 
  40244, 40151, 40171, 39916, 40245, 40302, 40136, 40026, 40255, 40006, 40395, 40079, 40029, 40204, 40171, 40171, 40089, 40215, 39991, 39961, 39912, 40122, 
  40255, 40025, 40274, 40168, 40051, 40165, 40220, 40015, 40125, 40267, 40037, 40048, 39932, 40295, 39960, 39887, 40035, 40118, 39936, 40200, 40069, 
  0 <repeats 64 times>}

total = 40034 + 40318 + 39982 + 39981 + 40028 + 40010 + 40109 + 40315 + 40237 + 40178 + 40293 + 39995 + 40210 + 40073 + 40088 + 40100 + 40172 + 40011 + 40182 + 40190 + 40096 + 40244 + 40151 + 40171 + 39916 + 40245 + 40302 + 40136 + 40026 + 40255 + 40006 + 40395 + 40079 + 40029 + 40204 + 40171 + 40171 + 40089 + 40215 + 39991 + 39961 + 39912 + 40122 + 40255 + 40025 + 40274 + 40168 + 40051 + 40165 + 40220 + 40015 + 40125 + 40267 + 40037 + 40048 + 39932 + 40295 + 39960 + 39887 + 40035 + 40118 + 39936 + 40200 + 40069 = 2567475

It looks like the number of TLB misses goes down significantly.
That means qemu needs to execute fewer instructions in the mips
software TLB refill handler.
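
The totals can be cross-checked mechanically. A small sketch, with
hypothetical helper names, that sums the per-index counters of the
16 TLB run (values copied from the gdb output above) and computes how
many times fewer tlbwr calls the 64 TLB run needed:

```c
#include <stddef.h>

/* tlb_wr_index[] of the original 16 TLB run, copied from gdb. */
const unsigned long long tlbwr_16[16] = {
    514054, 514256, 514005, 514149, 514100, 514067, 513906, 513965,
    514025, 514076, 514243, 513932, 514119, 514000, 514059, 514191,
};

/* Sum n counters. */
unsigned long long sum_counters(const unsigned long long *c, size_t n)
{
    unsigned long long total = 0;

    for (size_t i = 0; i < n; i++) {
        total += c[i];
    }
    return total;
}

/* How many times smaller 'after' is than 'before'; 0 if 'after' is 0. */
double reduction_factor(unsigned long long before, unsigned long long after)
{
    return after ? (double)before / (double)after : 0.0;
}
```

With the two totals above (8225147 and 2567475), the boot with 64 TLBs
executes roughly 3.2 times fewer tlbwr instructions than the boot with
16 TLBs.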

Now back to testing new CPU type with 64 TLBs under do_testimage
----------------------------------------------------------------

mips with 34Kf cpu (original)
-----------------------------

[kamensky@coreos-lnx2 build-qemumips]$ time bitbake core-image-full-cmdline:do_testimage >& t0.txt; time bitbake core-image-full-cmdline:do_testimage >& t1.txt; time bitbake core-image-full-cmdline:do_testimage >& t2.txt; time bitbake core-image-full-cmdline:do_testimage >& t3.txt

real	7m33.815s
user	0m1.009s
sys	0m0.089s

real	6m53.100s
user	0m1.019s
sys	0m0.086s

real	8m33.223s
user	0m1.052s
sys	0m0.080s

real	7m16.333s
user	0m1.030s
sys	0m0.085s

Discarding the first "warm up" run:

real avg = (413 + 513 + 436) / 3 = 454

mips with 34Kf-64tlb cpu
------------------------

[kamensky@coreos-lnx2 build-qemumips]$ time bitbake core-image-full-cmdline:do_testimage >& t1.txt; time bitbake core-image-full-cmdline:do_testimage >& t2.txt; time bitbake core-image-full-cmdline:do_testimage >& t3.txt

real	4m38.909s
user	0m0.983s
sys	0m0.095s

real	4m34.124s
user	0m0.962s
sys	0m0.084s

real	4m13.451s
user	0m0.952s
sys	0m0.094s

real avg = (278 + 274 + 253) / 3 = 268

Good improvement
----------------

Overall it looks like a 40% or so improvement.
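
The arithmetic behind that figure, as a minimal sketch. The helper
names are hypothetical, and the inputs are the "real" times from the
runs above rounded to whole seconds, with the warm up run left out:

```c
#include <stddef.h>

/* Wall clock times in seconds, taken from the "real" lines above. */
const double t_34kf[3]       = { 413.0, 513.0, 436.0 };
const double t_34kf_64tlb[3] = { 278.0, 274.0, 253.0 };

/* Arithmetic mean of n samples; 0 for an empty set. */
double mean_seconds(const double *t, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        sum += t[i];
    }
    return n ? sum / (double)n : 0.0;
}

/* Relative reduction of 'after' versus 'before', in percent. */
double improvement_percent(double before, double after)
{
    return before > 0.0 ? 100.0 * (before - after) / before : 0.0;
}
```

Applied to the two sets of runs, the mean drops from 454 seconds to a
bit over 268 seconds, which is a reduction of about 41%.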

Victor Kamensky (2):
  qemu: add 34Kf-64tlb fictitious cpu type
  qemumips: use 34Kf-64tlb CPU emulation

 meta/conf/machine/qemumips.conf                    |   2 +-
 meta/recipes-devtools/qemu/qemu.inc                |   1 +
 ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
 3 files changed, 120 insertions(+), 1 deletion(-)
 create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch

-- 
2.14.5


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 20:38 [PATCH 0/2] qemumips: speeding up Victor Kamensky
@ 2020-10-07 20:38 ` Victor Kamensky
  2020-10-07 20:46   ` [OE-core] " Paul Barker
                     ` (2 more replies)
  2020-10-07 20:38 ` [PATCH 2/2] qemumips: use 34Kf-64tlb CPU emulation Victor Kamensky
  1 sibling, 3 replies; 16+ messages in thread
From: Victor Kamensky @ 2020-10-07 20:38 UTC (permalink / raw)
  To: openembedded-core; +Cc: Richard Purdie, Khem Raj

In Yocto Project PR 13992 it was reported that qemumips
in autobuilder runs almost twice as slow as qemumips64 and
sometimes hits a timeout.

Upon investigation of qemu-system with perf, gdb, and
SystemTap, comparing the behavior of the qemumips and qemumips64
machines, it was noticed that the qemu soft mmu code behaves
quite differently: in the qemumips case the tlbwr instruction is
called 16 times more often. It turns out that in the qemumips64
case qemu runs with a cpu type that has 64 TLBs, but in the
qemumips case qemu runs with a cpu type that has only
16 TLBs.

The idea of the proposed qemu patch is to introduce a fictitious
34Kf-64tlb cpu type that is defined exactly as 34Kf but has
64 TLBs instead of the original 16 TLBs.

Testing of core-image-full-cmdline:do_testimage with
34Kf-64tlb shows a 40% or so improvement in test execution
real time.

Note for future porters of the patch: the easiest way to update
the patch and stay in sync with the 34Kf definition is to copy
the 34Kf cpu definition and apply the following changes to
it (just change the CP0C1_MMU bits value from 15 to 63):

[kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
2c2
<         .name = "34Kf",
>         .name = "34Kf-64tlb",
6c6
<         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
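
For reference, the MMU field of Config1 holds the number of TLB
entries minus one, which is why 15 means 16 TLBs and 63 means 64.
A minimal sketch, assuming the field sits at bits 30..25 as in the
MIPS32 privileged architecture. CP0C1_MMU is redefined locally here
and is assumed to match the qemu value; the two helper names are
hypothetical:

```c
#include <stdint.h>

/* Assumed to match qemu: Config1 "MMU Size - 1" field, bits 30..25. */
#define CP0C1_MMU 25

/* Number of TLB entries encoded in a Config1 value. */
unsigned tlb_entries_from_config1(uint32_t config1)
{
    return 1u + ((config1 >> CP0C1_MMU) & 0x3Fu);
}

/*
 * Return config1 with the MMU field rewritten for 'entries' TLB
 * entries. 'entries' is expected to be in the range 1..64; all
 * other Config1 bits are left untouched.
 */
uint32_t config1_with_tlb_entries(uint32_t config1, unsigned entries)
{
    config1 &= ~((uint32_t)0x3Fu << CP0C1_MMU);
    return config1 | ((uint32_t)((entries - 1u) & 0x3Fu) << CP0C1_MMU);
}
```

Since the field is 6 bits wide, 63 is the largest value it can hold,
so 64 TLBs is the most that can be expressed through this field alone.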

Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992

Upstream Status: Inappropriate

Signed-off-by: Victor Kamensky <kamensky@cisco.com>
---
 meta/recipes-devtools/qemu/qemu.inc                |   1 +
 ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
 2 files changed, 119 insertions(+)
 create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch

diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
index bbb9038961..6c0edcb706 100644
--- a/meta/recipes-devtools/qemu/qemu.inc
+++ b/meta/recipes-devtools/qemu/qemu.inc
@@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
            file://0001-qemu-Do-not-include-file-if-not-exists.patch \
            file://find_datadir.patch \
            file://usb-fix-setup_len-init.patch \
+           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
            "
 UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
 
diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
new file mode 100644
index 0000000000..b6312e1543
--- /dev/null
+++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
@@ -0,0 +1,118 @@
+From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
+From: Victor Kamensky <kamensky@cisco.com>
+Date: Wed, 7 Oct 2020 10:19:42 -0700
+Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
+ 64 TLBs
+
+In Yocto Project CI runs it was observed that a test run
+of the 32 bit mips image takes almost twice as long as the
+64 bit mips image with the same logical load, and CI
+execution hits a timeout.
+
+See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
+
+Yocto project uses 34Kf cpu type to run 32 bit mips image,
+and MIPS64R2-generic cpu type to run 64 bit mips64 image.
+
+Investigating the qemu behavior differences between mips
+and mips64 brought up two prominent observations: under a
+logically similar load (same definition and configuration
+of the user-land image) the get_physical_address function
+is called almost twice as often in the mips case, meaning
+twice as many memory accesses are involved. Also, the
+number of tlbwr instructions executed (r4k_helper_tlbwr
+qemu function) is almost 16 times bigger in the mips case
+than in mips64.
+
+It turns out that the 34Kf cpu has 16 TLBs, while
+MIPS64R2-generic has 64 TLBs. That explains why so many
+more tlbwr instructions had to be executed by the kernel
+TLB refill handler in the 32 bit mips case.
+
+The idea of the fix is to come up with a new fictitious
+34Kf-64tlb cpu type that behaves exactly as 34Kf but
+contains 64 TLBs to reduce TLB thrashing. After all, adding
+more TLBs to the soft mmu is easy.
+
+An experiment with a significant non-trivial load in the Yocto
+environment, running the do_testimage load, shows that the
+34Kf-64tlb cpu performs 40% or so better than the original 34Kf
+cpu with respect to test execution real time.
+
+It is not ideal to have cpu type that does not exist in the
+wild but given performance gains it seems to be justified.
+
+Signed-off-by: Victor Kamensky <kamensky@cisco.com>
+---
+ target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 55 insertions(+)
+
+diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
+index 637caccd89..b73ab48231 100644
+--- a/target/mips/translate_init.inc.c
++++ b/target/mips/translate_init.inc.c
+@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
+         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
+         .mmu_type = MMU_TYPE_R4000,
+     },
++    /*
++     * Verbatim copy of "34Kf" cpu, only bumped up number of TLB entries
++     * from 16 to 64 (see CP0_Config1 value at CP0C1_MMU bits) to improve
++     * performance by reducing number of TLB refill exceptions and
++     * eliminating need to run all corresponding TLB refill handling
++     * instructions.
++     */
++    {
++        .name = "34Kf-64tlb",
++        .CP0_PRid = 0x00019500,
++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
++                       (MMU_TYPE_R4000 << CP0C0_MT),
++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
++                       (1 << CP0C1_CA),
++        .CP0_Config2 = MIPS_CONFIG2,
++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
++                       (1 << CP0C3_DSPP),
++        .CP0_LLAddr_rw_bitmask = 0,
++        .CP0_LLAddr_shift = 0,
++        .SYNCI_Step = 32,
++        .CCRes = 2,
++        .CP0_Status_rw_bitmask = 0x3778FF1F,
++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
++                    (0xff << CP0TCSt_TASID),
++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
++        .CP1_fcr31 = 0,
++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
++        .SEGBITS = 32,
++        .PABITS = 32,
++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
++        .mmu_type = MMU_TYPE_R4000,
++    },
+     {
+         .name = "74Kf",
+         .CP0_PRid = 0x00019700,
+-- 
+2.14.5
+
-- 
2.14.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] qemumips: use 34Kf-64tlb CPU emulation
  2020-10-07 20:38 [PATCH 0/2] qemumips: speeding up Victor Kamensky
  2020-10-07 20:38 ` [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type Victor Kamensky
@ 2020-10-07 20:38 ` Victor Kamensky
  1 sibling, 0 replies; 16+ messages in thread
From: Victor Kamensky @ 2020-10-07 20:38 UTC (permalink / raw)
  To: openembedded-core; +Cc: Richard Purdie, Khem Raj

In order to improve performance of qemumips autobuilder
test runs, let's use the 34Kf-64tlb cpu type that was introduced
in the OE version of qemu. The 34Kf-64tlb cpu type is identical to
34Kf but has 64 TLBs configured instead of the original 16 TLBs.

The change in the number of TLBs in the emulated CPU reduces
TLB thrashing and the number of times the kernel TLB refill
code runs, and therefore significantly improves test
execution time.

Note that the 34Kf-64tlb qemu cpu type does not exist upstream;
so far it is only added as a patch in OE. See the qemu 34Kf-64tlb
cpu addition commit for more details.

Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992

Signed-off-by: Victor Kamensky <kamensky@cisco.com>
---
 meta/conf/machine/qemumips.conf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meta/conf/machine/qemumips.conf b/meta/conf/machine/qemumips.conf
index 1373e4cba0..b8c80f02ef 100644
--- a/meta/conf/machine/qemumips.conf
+++ b/meta/conf/machine/qemumips.conf
@@ -15,4 +15,4 @@ SERIAL_CONSOLES ?= "115200;ttyS0 115200;ttyS1"
 
 QB_SYSTEM_NAME = "qemu-system-mips"
 
-QB_CPU = "-cpu 34Kf"
+QB_CPU = "-cpu 34Kf-64tlb"
-- 
2.14.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 20:38 ` [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type Victor Kamensky
@ 2020-10-07 20:46   ` Paul Barker
  2020-10-07 21:52     ` Victor Kamensky
                       ` (2 more replies)
  2020-10-07 22:05   ` Khem Raj
  2020-10-08  7:29   ` [OE-core] " Ross Burton
  2 siblings, 3 replies; 16+ messages in thread
From: Paul Barker @ 2020-10-07 20:46 UTC (permalink / raw)
  To: kamensky; +Cc: openembedded-core, Richard Purdie, Khem Raj

On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
wrote:
>
> In Yocto Project PR 13992 it was reported that qemumips
> in autobuilder runs almost twice as slow as qemumips64 and
> sometimes hits a timeout.
>
> Upon investigation of qemu-system with perf, gdb, and
> SystemTap, comparing the behavior of the qemumips and qemumips64
> machines, it was noticed that the qemu soft mmu code behaves
> quite differently: in the qemumips case the tlbwr instruction is
> called 16 times more often. It turns out that in the qemumips64
> case qemu runs with a cpu type that has 64 TLBs, but in the
> qemumips case qemu runs with a cpu type that has only
> 16 TLBs.
>
> The idea of the proposed qemu patch is to introduce a fictitious
> 34Kf-64tlb cpu type that is defined exactly as 34Kf but has
> 64 TLBs instead of the original 16 TLBs.
>
> Testing of core-image-full-cmdline:do_testimage with
> 34Kf-64tlb shows a 40% or so improvement in test execution
> real time.
>
> Note for future porters of the patch: the easiest way to update
> the patch and stay in sync with the 34Kf definition is to copy
> the 34Kf cpu definition and apply the following changes to
> it (just change the CP0C1_MMU bits value from 15 to 63):
>
> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> 2c2
> <         .name = "34Kf",
> >         .name = "34Kf-64tlb",
> 6c6
> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
> >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>
> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992

Forgive my ignorance as to the range of MIPS processors available but
does any real MIPS CPU have 64 TLBs? If such a CPU model exists
shouldn't we be using this instead of inventing a new model?

I'm a bit worried that targeting a unique, fictitious CPU model will
lead to us wasting time debugging obscure failures that other projects
have never seen and that would never occur on real hardware.

-- 
Paul Barker
Konsulko Group

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 20:46   ` [OE-core] " Paul Barker
@ 2020-10-07 21:52     ` Victor Kamensky
  2020-10-07 22:11       ` Khem Raj
  2020-10-07 22:04     ` Richard Purdie
  2020-10-07 22:15     ` Khem Raj
  2 siblings, 1 reply; 16+ messages in thread
From: Victor Kamensky @ 2020-10-07 21:52 UTC (permalink / raw)
  To: Paul Barker; +Cc: openembedded-core, Richard Purdie, Khem Raj

Hi Paul,

Please forgive my horrible email agent that I have at work.
Please look for 'kamensky>' for responses inline.

________________________________________
From: Paul Barker <pbarker@konsulko.com>
Sent: Wednesday, October 7, 2020 1:46 PM
To: Victor Kamensky (kamensky)
Cc: openembedded-core; Richard Purdie; Khem Raj
Subject: Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type

On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
wrote:
>
> In Yocto Project PR 13992 it was reported that qemumips
> in autobuilder runs almost twice as slow as qemumips64 and
> sometimes hits a timeout.
>
> Upon investigation of qemu-system with perf, gdb, and
> SystemTap, comparing the behavior of the qemumips and qemumips64
> machines, it was noticed that the qemu soft mmu code behaves
> quite differently: in the qemumips case the tlbwr instruction is
> called 16 times more often. It turns out that in the qemumips64
> case qemu runs with a cpu type that has 64 TLBs, but in the
> qemumips case qemu runs with a cpu type that has only
> 16 TLBs.
>
> The idea of the proposed qemu patch is to introduce a fictitious
> 34Kf-64tlb cpu type that is defined exactly as 34Kf but has
> 64 TLBs instead of the original 16 TLBs.
>
> Testing of core-image-full-cmdline:do_testimage with
> 34Kf-64tlb shows a 40% or so improvement in test execution
> real time.
>
> Note for future porters of the patch: the easiest way to update
> the patch and stay in sync with the 34Kf definition is to copy
> the 34Kf cpu definition and apply the following changes to
> it (just change the CP0C1_MMU bits value from 15 to 63):
>
> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> 2c2
> <         .name = "34Kf",
> >         .name = "34Kf-64tlb",
> 6c6
> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
> >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>
> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992

Forgive my ignorance as to the range of MIPS processors available but
does any real MIPS CPU have 64 TLBs?

kamensky> I am not up to date wrt 32 bit only MIPS CPUs out there.
kamensky> I have experience with real MIPS64 CPUs that could
kamensky> operate in both 64 bit and 32 bit mode. In fact, in our
kamensky> case, even when we used a 32 bit user-land, the kernel
kamensky> itself was a 64 bit kernel (like the OE MIPS multi-lib test
kamensky> case). In the 32 bit case the h/w memory map is quite unfriendly
kamensky> to the linux kernel: the kernel can directly see only 512 MB of
kamensky> unmapped physical memory through KSEG0/KSEG1, and a bigger
kamensky> amount would require CONFIG_HIGHMEM, which drags in additional
kamensky> mappings for kernel access to the memory.

kamensky> The MIPS32 spec itself does allow 64 TLBs, for sure.

If such a CPU model exists
shouldn't we be using this instead of inventing a new model?

kamensky> See the note below about another existing 32 bit MIPS CPU
kamensky> model with a bigger number of TLBs.

I'm a bit worried that targeting a unique, fictitious CPU model will
lead to us wasting time debugging obscure failures that other projects
have never seen and that would never occur on real hardware.

kamensky> I do share your concern to some extent. But I think
kamensky> the risk is minimal.

kamensky> Please note that in the qemumips64 machine case we already
kamensky> use a fictitious CPU type, "MIPS64R2-generic". AFAIK
kamensky> there is no such real CPU out there; it is a generic thing
kamensky> inside of qemu. For sure, there is
kamensky> no real combination of the "MIPS64R2-generic" cpu type
kamensky> and the "malta" machine. So to some extent a lot of things
kamensky> under qemu are fictitious.

kamensky> I've looked at other MIPS only CPUs that qemu supports.
kamensky> There is a definition of "mips32r6-generic", another fictitious
kamensky> CPU type, but one already defined in qemu. That one is configured
kamensky> with 32 TLBs, but when I tried it with qemumips it flat out
kamensky> did not boot. Which brings us to other practical issues: a proper
kamensky> match between the qemu cpu type and our qemumips kernel
kamensky> configuration, the question of qemu CPU emulation correctness
kamensky> and whether it is bug free, etc.

kamensky> Modifying the 34Kf cpu type that worked for us before, with a
kamensky> minimal, low risk change in one small aspect of the
kamensky> system that brings us performance gains, is IMO worth doing.
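
As a rough illustration of why this is a one-field change: in the MIPS32
architecture the Config1 MMU size field is 6 bits wide and holds the number
of TLB entries minus one, so 15 means 16 entries and 63 (the field maximum)
means 64. The sketch below is not qemu code. The helper names are made up
for illustration; only the shift value 25 mirrors qemu's CP0C1_MMU constant.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position of the MMU size field in CP0 Config1 (bits 30..25). */
#define CP0C1_MMU 25
/* The field is 6 bits wide, so it can encode values 0..63. */
#define CONFIG1_MMU_MASK 0x3fu

/* Hypothetical helper: TLB entry count advertised by a Config1 value.
 * The field stores "entries - 1". */
static unsigned tlb_entries_from_config1(uint32_t config1)
{
    return ((config1 >> CP0C1_MMU) & CONFIG1_MMU_MASK) + 1;
}

/* Hypothetical helper: Config1 bits that advertise `entries` TLB entries.
 * Only 1..64 entries can be expressed in the 6-bit field. */
static uint32_t config1_mmu_bits(unsigned entries)
{
    assert(entries >= 1 && entries <= 64);
    return (uint32_t)(entries - 1) << CP0C1_MMU;
}
```

With these helpers, config1_mmu_bits(16) corresponds to the (15 << CP0C1_MMU)
term of the original 34Kf definition and config1_mmu_bits(64) to the
(63 << CP0C1_MMU) term of 34Kf-64tlb.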

Thanks,
Victor

--
Paul Barker
Konsulko Group

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 20:46   ` [OE-core] " Paul Barker
  2020-10-07 21:52     ` Victor Kamensky
@ 2020-10-07 22:04     ` Richard Purdie
  2020-10-07 22:15     ` Khem Raj
  2 siblings, 0 replies; 16+ messages in thread
From: Richard Purdie @ 2020-10-07 22:04 UTC (permalink / raw)
  To: Paul Barker, kamensky; +Cc: openembedded-core, Khem Raj

On Wed, 2020-10-07 at 21:46 +0100, Paul Barker wrote:
> On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
> lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
> wrote:
> > [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> > 2c2
> > <         .name = "34Kf",
> > >         .name = "34Kf-64tlb",
> > 6c6
> > <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 <<
> > CP0C1_MMU) |
> > >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 <<
> > > CP0C1_MMU) |
> > 
> > Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> 
> Forgive my ignorance as to the range of MIPS processors available but
> does any real MIPS CPU have 64 TLBs? If such a CPU model exists
> shouldn't we be using this instead of inventing a new model?
> 
> I'm a bit worried that targeting a unique, fictitious CPU model will
> lead to us wasting time debugging obscure failures that other
> projects have never seen and that would never occur on real hardware.

In this case we're playing with a single configuration value which is
dynamically handled by the kernel and used at this range in other areas
like mips64 so whilst I don't like doing this, a 40% speedup for
qemumips is very very tempting.

I think the maintenance burden and risk is low, the win is significant.

I have to thank Victor for some great analysis and what I think is
quite an ingenious solution!

I'll put it in for some testing and see where things go...

Cheers,

Richard



* Re: [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 20:38 ` [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type Victor Kamensky
  2020-10-07 20:46   ` [OE-core] " Paul Barker
@ 2020-10-07 22:05   ` Khem Raj
  2020-10-08  5:05     ` Victor Kamensky
  2020-10-08  7:29   ` [OE-core] " Ross Burton
  2 siblings, 1 reply; 16+ messages in thread
From: Khem Raj @ 2020-10-07 22:05 UTC (permalink / raw)
  To: Victor Kamensky, openembedded-core; +Cc: Richard Purdie

Hi Victor

Thanks for investigating this; these are hard problems to root cause.
I am fine with this patchset as it is. One comment/question I have is whether
you tried 32 TLBs. Since the r6 implementation does allow 32 TLBs, this would
make it less fictitious, and perhaps any 32 bit mips issues could be shared
with the mips32r6 implementation. I am curious whether that would result
in better performance or whether 64 TLB emulation is better.
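
One way to build intuition for how TLB size interacts with working-set size
is a toy model. The sketch below is not qemu's soft mmu and all names in it
are hypothetical. It assumes LRU replacement so the result is deterministic
(a real tlbwr writes to a random index) and one page per entry (a real MIPS
TLB entry maps a pair of pages).

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_MAX 64

/* Toy TLB: fully associative, LRU replacement (an assumption, see above). */
struct toy_tlb {
    uint32_t vpn[TOY_TLB_MAX];        /* virtual page number per entry */
    uint64_t last_used[TOY_TLB_MAX];  /* access stamp used for LRU */
    size_t size;                      /* modelled entry count, 1..TOY_TLB_MAX */
    size_t filled;                    /* entries holding a valid mapping */
    uint64_t clock;                   /* increments on every access */
    uint64_t misses;                  /* number of refills so far */
};

/* Look up one page; on a miss, refill the free or least recently used slot. */
static void toy_tlb_access(struct toy_tlb *t, uint32_t vpn)
{
    size_t victim = 0;

    t->clock++;
    for (size_t i = 0; i < t->filled; i++) {
        if (t->vpn[i] == vpn) {
            t->last_used[i] = t->clock;
            return;
        }
    }
    t->misses++;
    if (t->filled < t->size) {
        victim = t->filled++;
    } else {
        for (size_t i = 1; i < t->size; i++) {
            if (t->last_used[i] < t->last_used[victim])
                victim = i;
        }
    }
    t->vpn[victim] = vpn;
    t->last_used[victim] = t->clock;
}

/* Touch pages 0..pages-1 in order, `passes` times, and count the misses. */
static uint64_t toy_count_misses(size_t tlb_size, uint32_t pages, unsigned passes)
{
    struct toy_tlb t = { .size = tlb_size };

    for (unsigned p = 0; p < passes; p++)
        for (uint32_t v = 0; v < pages; v++)
            toy_tlb_access(&t, v);
    return t.misses;
}
```

In this model a 48-page working set cycled 10 times misses on every access
with 16 entries (480 misses) but only on the first pass with 64 entries
(48 misses). Real workloads are far less regular, so this only shows the
shape of the effect, not its size.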

On 10/7/20 1:38 PM, Victor Kamensky wrote:
> In Yocto Project PR 13992 it was reported that qemumips
> in autobuilder runs almost twice as slow as qemumips64 and
> sometimes hits the timeout.
>
> Upon investigation of qemu-system with perf, gdb, and
> SystemTap, and comparing qemumips and qemumips64 machine
> behavior, it was noticed that qemu soft mmu code behaves
> quite differently: in the qemumips case the tlbwr instruction is
> called 16 times more often. It turns out that in the qemumips64
> case qemu runs with a cpu type that has 64 TLBs, but in the
> qemumips case qemu runs with a cpu type that has only
> 16 TLBs.
>
> The idea of the proposed qemu patch is to introduce a fictitious
> 34Kf-64tlb cpu type that is defined exactly as 34Kf but has
> 64 TLBs instead of the original 16 TLBs.
> 
> Testing of core-image-full-cmdline:do_testimage with
> 34Kf-64tlb shows 40% or so test execution real time
> improvement.
> 
> Note for future porters of the patch: easiest way to update
> the patch and be in sync with 34Kf definition is to copy
> 34Kf machine definition and apply the following changes to
> it (just change 15 to 63 of CP0C1_MMU bits value)
> 
> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> 2c2
> <         .name = "34Kf",
>>          .name = "34Kf-64tlb",
> 6c6
> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>>          .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> 
> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> 
> Upstream Status: Inappropriate
> 
> Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> ---
>   meta/recipes-devtools/qemu/qemu.inc                |   1 +
>   ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
>   2 files changed, 119 insertions(+)
>   create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> 
> diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
> index bbb9038961..6c0edcb706 100644
> --- a/meta/recipes-devtools/qemu/qemu.inc
> +++ b/meta/recipes-devtools/qemu/qemu.inc
> @@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
>              file://0001-qemu-Do-not-include-file-if-not-exists.patch \
>              file://find_datadir.patch \
>              file://usb-fix-setup_len-init.patch \
> +           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
>              "
>   UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
>   
> diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> new file mode 100644
> index 0000000000..b6312e1543
> --- /dev/null
> +++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> @@ -0,0 +1,118 @@
> +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
> +From: Victor Kamensky <kamensky@cisco.com>
> +Date: Wed, 7 Oct 2020 10:19:42 -0700
> +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
> + 64 TLBs
> +
> +In Yocto Project CI runs it was observed that a test run
> +of the 32 bit mips image takes almost twice as long as the 64 bit
> +mips image with the same logical load, and CI execution
> +hits the timeout.
> +
> +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> +
> +Yocto project uses 34Kf cpu type to run 32 bit mips image,
> +and MIPS64R2-generic cpu type to run 64 bit mips64 image.
> +
> +Investigating qemu behavior differences between mips
> +and mips64 produced two prominent observations: under
> +logically similar load (same definition and configuration
> +of user-land image) in the mips case the get_physical_address
> +function is called almost twice as often, meaning twice as
> +many memory accesses are involved. Also, the number of
> +tlbwr instructions executed (r4k_helper_tlbwr qemu
> +function) is almost 16 times bigger in the mips case than
> +in mips64.
> +
> +It turns out that the 34Kf cpu has 16 TLBs, but
> +MIPS64R2-generic has 64 TLBs. That explains why
> +so many more tlbwr had to be executed by the kernel TLB refill
> +handler in the 32 bit mips case.
> +
> +The idea of the fix is to come up with a new 34Kf-64tlb fictitious
> +cpu type that would behave exactly as 34Kf but would
> +contain 64 TLBs to reduce TLB thrashing. After all, adding
> +more TLBs to soft mmu is easy.
> +
> +An experiment with a significant non-trivial load in the Yocto
> +environment, running the do_testimage load, shows that the 34Kf-64tlb
> +cpu performs 40% or so better than the original 34Kf cpu wrt test
> +execution real time.
> +
> +It is not ideal to have cpu type that does not exist in the
> +wild but given performance gains it seems to be justified.
> +
> +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> +---
> + target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
> + 1 file changed, 55 insertions(+)
> +
> +diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
> +index 637caccd89..b73ab48231 100644
> +--- a/target/mips/translate_init.inc.c
> ++++ b/target/mips/translate_init.inc.c
> +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
> +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> +         .mmu_type = MMU_TYPE_R4000,
> +     },
> ++    /*
> ++     * Verbatim copy of "34Kf" cpu, only bumped up number of TLB entries
> ++     * from 16 to 64 (see CP0_Config1 value at CP0C1_MMU bits) to improve
> ++     * performance by reducing number of TLB refill exceptions and
> ++     * eliminating need to run all corresponding TLB refill handling
> ++     * instructions.
> ++     */
> ++    {
> ++        .name = "34Kf-64tlb",
> ++        .CP0_PRid = 0x00019500,
> ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
> ++                       (MMU_TYPE_R4000 << CP0C0_MT),
> ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
> ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
> ++                       (1 << CP0C1_CA),
> ++        .CP0_Config2 = MIPS_CONFIG2,
> ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
> ++                       (1 << CP0C3_DSPP),
> ++        .CP0_LLAddr_rw_bitmask = 0,
> ++        .CP0_LLAddr_shift = 0,
> ++        .SYNCI_Step = 32,
> ++        .CCRes = 2,
> ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
> ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
> ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
> ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
> ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
> ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
> ++                    (0xff << CP0TCSt_TASID),
> ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
> ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
> ++        .CP1_fcr31 = 0,
> ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
> ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
> ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
> ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
> ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
> ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
> ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
> ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
> ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
> ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
> ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
> ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
> ++        .SEGBITS = 32,
> ++        .PABITS = 32,
> ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> ++        .mmu_type = MMU_TYPE_R4000,
> ++    },
> +     {
> +         .name = "74Kf",
> +         .CP0_PRid = 0x00019700,
> +--
> +2.14.5
> +
> 


* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 21:52     ` Victor Kamensky
@ 2020-10-07 22:11       ` Khem Raj
  0 siblings, 0 replies; 16+ messages in thread
From: Khem Raj @ 2020-10-07 22:11 UTC (permalink / raw)
  To: Victor Kamensky (kamensky), Paul Barker; +Cc: openembedded-core, Richard Purdie



On 10/7/20 2:52 PM, Victor Kamensky (kamensky) wrote:
> Hi Paul,
> 
> Please forgive my horrible email agent that I have at work.
> Please look for 'kamensky>' for responses inline.
> 
> ________________________________________
> From: Paul Barker <pbarker@konsulko.com>
> Sent: Wednesday, October 7, 2020 1:46 PM
> To: Victor Kamensky (kamensky)
> Cc: openembedded-core; Richard Purdie; Khem Raj
> Subject: Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
> 
> On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
> lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
> wrote:
>>
>> In Yocto Project PR 13992 it was reported that qemumips
>> in autobuilder runs almost twice as slow as qemumips64 and
>> sometimes hits the timeout.
>>
>> Upon investigation of qemu-system with perf, gdb, and
>> SystemTap, and comparing qemumips and qemumips64 machine
>> behavior, it was noticed that qemu soft mmu code behaves
>> quite differently: in the qemumips case the tlbwr instruction is
>> called 16 times more often. It turns out that in the qemumips64
>> case qemu runs with a cpu type that has 64 TLBs, but in the
>> qemumips case qemu runs with a cpu type that has only
>> 16 TLBs.
>>
>> The idea of the proposed qemu patch is to introduce a fictitious
>> 34Kf-64tlb cpu type that is defined exactly as 34Kf but has
>> 64 TLBs instead of the original 16 TLBs.
>>
>> Testing of core-image-full-cmdline:do_testimage with
>> 34Kf-64tlb shows 40% or so test execution real time
>> improvement.
>>
>> Note for future porters of the patch: easiest way to update
>> the patch and be in sync with 34Kf definition is to copy
>> 34Kf machine definition and apply the following changes to
>> it (just change 15 to 63 of CP0C1_MMU bits value)
>>
>> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
>> 2c2
>> <         .name = "34Kf",
>>>          .name = "34Kf-64tlb",
>> 6c6
>> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>>>          .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>>
>> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> 
> Forgive my ignorance as to the range of MIPS processors available but
> does any real MIPS CPU have 64 TLBs?
> 
> kamensky> I am not up to date wrt 32 bit only MIPS CPUs out there.
> kamensky> I have experience with real MIPS64 CPUs that could
> kamensky> operate in both 64 bit and 32 bit mode. In fact, in our
> kamensky> case, even when we used a 32 bit user-land, the kernel
> kamensky> itself was 64 bit (as in the OE MIPS multi-lib test
> kamensky> case). In the 32 bit case the h/w memory map is quite unfriendly
> kamensky> to the linux kernel: the kernel can directly see only 512MB of
> kamensky> unmapped physical memory through KSEG0/KSEG1, and a bigger amount
> kamensky> would require CONFIG_HIGHMEM, which drags in additional mappings
> kamensky> for kernel access to that memory.
> 
> kamensky> MIPS32 spec itself does allow 64 TLBs, for sure.
> 
> If such a CPU model exists
> shouldn't we be using this instead of inventing a new model?
> 
> kamensky> See the note below about another existing 32 bit MIPS CPU
> kamensky> model with a bigger number of TLBs.
> 
> I'm a bit worried that targeting a unique, fictitious CPU model will
> lead to us wasting time debugging obscure failures that other projects
> have never seen and that would never occur on real hardware.
> 
> kamensky> I do share your concern to some extent. But I think
> kamensky> the risk is minimal.
> 
> kamensky> Please note that in the qemumips64 machine case we already
> kamensky> use a fictitious CPU type, "MIPS64R2-generic". AFAIK
> kamensky> there is no such real CPU out there; it is a generic thing
> kamensky> inside of qemu. For sure, there is
> kamensky> no real combination of the "MIPS64R2-generic" cpu type
> kamensky> and the "malta" machine. So to some extent a lot of things
> kamensky> under qemu are fictitious.
> 
> kamensky> I've looked at other MIPS only CPUs that qemu supports.
> kamensky> There is a definition of "mips32r6-generic", another fictitious
> kamensky> CPU type, but one already defined in qemu. That one is configured
> kamensky> with 32 TLBs, but when I tried it with qemumips it flat out
> kamensky> did not boot. Which brings us to other practical issues: a proper
> kamensky> match between the qemu cpu type and our qemumips kernel
> kamensky> configuration, the question of qemu CPU emulation correctness
> kamensky> and whether it is bug free, etc.

mips32r6 is ABI incompatible with prior mips32 implementations. We do
have include/tune-mips32r6.inc with r6 tunes, so in order to use
mips32r6 based CPU emulation we have to use a proper DEFAULTTUNE,
something like mipsisa32r6.

Having said that, we might run into other issues with the new machine
type, and moreover I am not sure how common it is, whereas mips1 and
mips32 are quite common when it comes to 32 bit. Therefore this patch is
perhaps better.

> 
> kamensky> Modifying the 34Kf cpu type that worked for us before, with a
> kamensky> minimal, low risk change in one small aspect of the
> kamensky> system that brings us performance gains, is IMO worth doing.
> 
> Thanks,
> Victor
> 
> --
> Paul Barker
> Konsulko Group
> 


* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 20:46   ` [OE-core] " Paul Barker
  2020-10-07 21:52     ` Victor Kamensky
  2020-10-07 22:04     ` Richard Purdie
@ 2020-10-07 22:15     ` Khem Raj
  2020-10-07 22:24       ` Paul Barker
  2 siblings, 1 reply; 16+ messages in thread
From: Khem Raj @ 2020-10-07 22:15 UTC (permalink / raw)
  To: Paul Barker, kamensky; +Cc: openembedded-core, Richard Purdie



On 10/7/20 1:46 PM, Paul Barker wrote:
> On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
> lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
> wrote:
>>
>> In Yocto Project PR 13992 it was reported that qemumips
>> in autobuilder runs almost twice as slow as qemumips64 and
>> sometimes hits the timeout.
>>
>> Upon investigation of qemu-system with perf, gdb, and
>> SystemTap, and comparing qemumips and qemumips64 machine
>> behavior, it was noticed that qemu soft mmu code behaves
>> quite differently: in the qemumips case the tlbwr instruction is
>> called 16 times more often. It turns out that in the qemumips64
>> case qemu runs with a cpu type that has 64 TLBs, but in the
>> qemumips case qemu runs with a cpu type that has only
>> 16 TLBs.
>>
>> The idea of the proposed qemu patch is to introduce a fictitious
>> 34Kf-64tlb cpu type that is defined exactly as 34Kf but has
>> 64 TLBs instead of the original 16 TLBs.
>>
>> Testing of core-image-full-cmdline:do_testimage with
>> 34Kf-64tlb shows 40% or so test execution real time
>> improvement.
>>
>> Note for future porters of the patch: easiest way to update
>> the patch and be in sync with 34Kf definition is to copy
>> 34Kf machine definition and apply the following changes to
>> it (just change 15 to 63 of CP0C1_MMU bits value)
>>
>> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
>> 2c2
>> <         .name = "34Kf",
>>>          .name = "34Kf-64tlb",
>> 6c6
>> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>>>          .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>>
>> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> 
> Forgive my ignorance as to the range of MIPS processors available but
> does any real MIPS CPU have 64 TLBs? If such a CPU model exists
> shouldn't we be using this instead of inventing a new model?
> 
> I'm a bit worried that targeting a unique, fictitious CPU model will
> lead to us wasting time debugging obscure failures that other projects
> have never seen and that would never occur on real hardware.
> 

There are many assumptions in various qemu machine emulations; this won't
be the only one, so I am less worried about that.


* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 22:15     ` Khem Raj
@ 2020-10-07 22:24       ` Paul Barker
  0 siblings, 0 replies; 16+ messages in thread
From: Paul Barker @ 2020-10-07 22:24 UTC (permalink / raw)
  To: Khem Raj; +Cc: kamensky, openembedded-core, Richard Purdie

On Wed, 7 Oct 2020 at 23:15, Khem Raj <raj.khem@gmail.com> wrote:
>
>
>
> On 10/7/20 1:46 PM, Paul Barker wrote:
> > On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
> > lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
> > wrote:
> >>
> >> In Yocto Project PR 13992 it was reported that qemumips
> >> in autobuilder runs almost twice as slow as qemumips64 and
> >> sometimes hits the timeout.
> >>
> >> Upon investigation of qemu-system with perf, gdb, and
> >> SystemTap, and comparing qemumips and qemumips64 machine
> >> behavior, it was noticed that qemu soft mmu code behaves
> >> quite differently: in the qemumips case the tlbwr instruction is
> >> called 16 times more often. It turns out that in the qemumips64
> >> case qemu runs with a cpu type that has 64 TLBs, but in the
> >> qemumips case qemu runs with a cpu type that has only
> >> 16 TLBs.
> >>
> >> The idea of the proposed qemu patch is to introduce a fictitious
> >> 34Kf-64tlb cpu type that is defined exactly as 34Kf but has
> >> 64 TLBs instead of the original 16 TLBs.
> >>
> >> Testing of core-image-full-cmdline:do_testimage with
> >> 34Kf-64tlb shows 40% or so test execution real time
> >> improvement.
> >>
> >> Note for future porters of the patch: easiest way to update
> >> the patch and be in sync with 34Kf definition is to copy
> >> 34Kf machine definition and apply the following changes to
> >> it (just change 15 to 63 of CP0C1_MMU bits value)
> >>
> >> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> >> 2c2
> >> <         .name = "34Kf",
> >>>          .name = "34Kf-64tlb",
> >> 6c6
> >> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
> >>>          .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> >>
> >> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> >
> > Forgive my ignorance as to the range of MIPS processors available but
> > does any real MIPS CPU have 64 TLBs? If such a CPU model exists
> > shouldn't we be using this instead of inventing a new model?
> >
> > I'm a bit worried that targeting a unique, fictitious CPU model will
> > lead to us wasting time debugging obscure failures that other projects
> > have never seen and that would never occur on real hardware.
> >
>
> There are many assumptions in various qemu machine emulations; this won't
> be the only one, so I am less worried about that.

I think between Khem, Victor and Richard's responses my concerns have
been answered.

Thanks for the contribution and analysis Victor.

-- 
Paul Barker
Konsulko Group


* Re: [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 22:05   ` Khem Raj
@ 2020-10-08  5:05     ` Victor Kamensky
  2020-10-08  5:55       ` Khem Raj
  0 siblings, 1 reply; 16+ messages in thread
From: Victor Kamensky @ 2020-10-08  5:05 UTC (permalink / raw)
  To: Khem Raj, openembedded-core; +Cc: Richard Purdie

Hi Khem,

As you saw in my response to Paul, mips32r6-generic did not work
for me, and you offered an explanation why.

But I think the impact of 32 TLBs vs 16 TLBs vs 64 TLBs is a fair question.
It can be measured with 34Kf as the base cpu model.

I've done another round of testing in my setup, after removing
all my instrumentation and the -fno-omit-frame-pointer
options in qemu. I tested core-image-full-cmdline:do_testimage with a sample
of 5 runs. In the ratio column, 1 unit is the execution time of the qemumips64
machine. The 34Kf-xxx rows, of course, run qemumips. This table summarizes
my results and shows the difference between these options:

machine/cpu   avg (s)   stdev (s)    ratio
qemumips64        231          6      1.00
34Kf-16tlb        605         12      2.61
34Kf-32tlb        340         15      1.47
34Kf-64tlb        272         14      1.17

So, yes, 32 TLBs would give a noticeable improvement as well,
but 64 TLBs is still better.
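
For anyone re-deriving the ratio column or comparing other configurations,
the arithmetic is sketched below. The function names are hypothetical and
the inputs are just the rounded averages from the table, so results can
differ from the table in the last digit (605 / 231 is about 2.62).

```c
/* Execution time relative to a baseline run (1.00 means equal time). */
static double relative_time(double avg_s, double baseline_avg_s)
{
    return avg_s / baseline_avg_s;
}

/* Percentage of wall-clock time saved when going from before_s to after_s. */
static double time_saved_percent(double before_s, double after_s)
{
    return 100.0 * (before_s - after_s) / before_s;
}
```

Taking the averages at face value, time_saved_percent(605, 272) is about 55
and time_saved_percent(605, 340) is about 44, i.e. relative to 34Kf-16tlb
the 64 TLB variant saves roughly 55% of the test time and the 32 TLB variant
roughly 44%.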

Thanks,
Victor

________________________________________
From: Khem Raj <raj.khem@gmail.com>
Sent: Wednesday, October 7, 2020 3:05 PM
To: Victor Kamensky (kamensky); openembedded-core@lists.openembedded.org
Cc: Richard Purdie
Subject: Re: [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type

Hi Victor

Thanks for investigating this; these are hard problems to root cause.
I am fine with this patchset as it is. One comment/question I have is whether
you tried 32 TLBs. Since the r6 implementation does allow 32 TLBs, this would
make it less fictitious, and perhaps any 32 bit mips issues could be shared
with the mips32r6 implementation. I am curious whether that would result
in better performance or whether 64 TLB emulation is better.

On 10/7/20 1:38 PM, Victor Kamensky wrote:
> In Yocto Project PR 13992 it was reported that qemumips
> in autobuilder runs almost twice as slow as qemumips64 and
> sometimes hits the timeout.
>
> Upon investigation of qemu-system with perf, gdb, and
> SystemTap, and comparing qemumips and qemumips64 machine
> behavior, it was noticed that qemu soft mmu code behaves
> quite differently: in the qemumips case the tlbwr instruction is
> called 16 times more often. It turns out that in the qemumips64
> case qemu runs with a cpu type that has 64 TLBs, but in the
> qemumips case qemu runs with a cpu type that has only
> 16 TLBs.
>
> The idea of the proposed qemu patch is to introduce a fictitious
> 34Kf-64tlb cpu type that is defined exactly as 34Kf but has
> 64 TLBs instead of the original 16 TLBs.
>
> Testing of core-image-full-cmdline:do_testimage with
> 34Kf-64tlb shows 40% or so test execution real time
> improvement.
>
> Note for future porters of the patch: easiest way to update
> the patch and be in sync with 34Kf definition is to copy
> 34Kf machine definition and apply the following changes to
> it (just change 15 to 63 of CP0C1_MMU bits value)
>
> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> 2c2
> <         .name = "34Kf",
>>          .name = "34Kf-64tlb",
> 6c6
> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>>          .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>
> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>
> Upstream Status: Inappropriate
>
> Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> ---
>   meta/recipes-devtools/qemu/qemu.inc                |   1 +
>   ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
>   2 files changed, 119 insertions(+)
>   create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>
> diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
> index bbb9038961..6c0edcb706 100644
> --- a/meta/recipes-devtools/qemu/qemu.inc
> +++ b/meta/recipes-devtools/qemu/qemu.inc
> @@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
>              file://0001-qemu-Do-not-include-file-if-not-exists.patch \
>              file://find_datadir.patch \
>              file://usb-fix-setup_len-init.patch \
> +           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
>              "
>   UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
>
> diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> new file mode 100644
> index 0000000000..b6312e1543
> --- /dev/null
> +++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> @@ -0,0 +1,118 @@
> +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
> +From: Victor Kamensky <kamensky@cisco.com>
> +Date: Wed, 7 Oct 2020 10:19:42 -0700
> +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
> + 64 TLBs
> +
> +In Yocto Project CI runs it was observed that a test run
> +of the 32 bit mips image takes almost twice as long as the 64 bit
> +mips image with the same logical load, and CI execution
> +hits the timeout.
> +
> +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> +
> +Yocto project uses 34Kf cpu type to run 32 bit mips image,
> +and MIPS64R2-generic cpu type to run 64 bit mips64 image.
> +
> +Investigating qemu behavior differences between mips
> +and mips64 produced two prominent observations: under
> +logically similar load (same definition and configuration
> +of user-land image) in the mips case the get_physical_address
> +function is called almost twice as often, meaning twice as
> +many memory accesses are involved. Also, the number of
> +tlbwr instructions executed (r4k_helper_tlbwr qemu
> +function) is almost 16 times bigger in the mips case than
> +in mips64.
> +
> +It turns out that the 34Kf cpu has 16 TLBs, but
> +MIPS64R2-generic has 64 TLBs. That explains why
> +so many more tlbwr had to be executed by the kernel TLB refill
> +handler in the 32 bit mips case.
> +
> +The idea of the fix is to come up with a new 34Kf-64tlb fictitious
> +cpu type that would behave exactly as 34Kf but would
> +contain 64 TLBs to reduce TLB thrashing. After all, adding
> +more TLBs to soft mmu is easy.
> +
> +An experiment with a significant non-trivial load in the Yocto
> +environment, running the do_testimage load, shows that the 34Kf-64tlb
> +cpu performs 40% or so better than the original 34Kf cpu wrt test
> +execution real time.
> +
> +It is not ideal to have cpu type that does not exist in the
> +wild but given performance gains it seems to be justified.
> +
> +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> +---
> + target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
> + 1 file changed, 55 insertions(+)
> +
> +diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
> +index 637caccd89..b73ab48231 100644
> +--- a/target/mips/translate_init.inc.c
> ++++ b/target/mips/translate_init.inc.c
> +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
> +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> +         .mmu_type = MMU_TYPE_R4000,
> +     },
> ++    /*
> ++     * Verbatim copy of the "34Kf" cpu, with only the number of TLB entries
> ++     * bumped from 16 to 64 (see the CP0C1_MMU bits of CP0_Config1) to improve
> ++     * performance by reducing the number of TLB refill exceptions and
> ++     * eliminating the need to run the corresponding TLB refill handling
> ++     * instructions.
> ++     */
> ++    {
> ++        .name = "34Kf-64tlb",
> ++        .CP0_PRid = 0x00019500,
> ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
> ++                       (MMU_TYPE_R4000 << CP0C0_MT),
> ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
> ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
> ++                       (1 << CP0C1_CA),
> ++        .CP0_Config2 = MIPS_CONFIG2,
> ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
> ++                       (1 << CP0C3_DSPP),
> ++        .CP0_LLAddr_rw_bitmask = 0,
> ++        .CP0_LLAddr_shift = 0,
> ++        .SYNCI_Step = 32,
> ++        .CCRes = 2,
> ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
> ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
> ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
> ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
> ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
> ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
> ++                    (0xff << CP0TCSt_TASID),
> ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
> ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
> ++        .CP1_fcr31 = 0,
> ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
> ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
> ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
> ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
> ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
> ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
> ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
> ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
> ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
> ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
> ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
> ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
> ++        .SEGBITS = 32,
> ++        .PABITS = 32,
> ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> ++        .mmu_type = MMU_TYPE_R4000,
> ++    },
> +     {
> +         .name = "74Kf",
> +         .CP0_PRid = 0x00019700,
> +--
> +2.14.5
> +
>

* Re: [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-08  5:05     ` Victor Kamensky
@ 2020-10-08  5:55       ` Khem Raj
  0 siblings, 0 replies; 16+ messages in thread
From: Khem Raj @ 2020-10-08  5:55 UTC (permalink / raw)
  To: Victor Kamensky (kamensky); +Cc: openembedded-core, Richard Purdie

On Wed, Oct 7, 2020 at 10:05 PM Victor Kamensky (kamensky)
<kamensky@cisco.com> wrote:
>
> Hi Khem,
>
> As you saw in my response to Paul (and you offered an explanation
> why), mips32r6-generic did not work for me.
>
> But I think the impact of 32 TLBs vs 16 TLBs vs 64 TLBs is a fair question.
> It can be measured with 34Kf as the base cpu model.
>
> I have done another round of testing in my setup, after removing
> all my instrumentation and the -fno-omit-frame-pointer option
> from the qemu build. I tested core-image-full-cmdline:do_testimage with
> a sample of 5 runs. In the ratio column, 1 unit is the execution time of
> the qemumips64 machine. The 34Kf-xxx rows are, of course, the qemumips
> machine. This table summarizes my results and shows the difference
> between these options:
>
> machine/cpu   avg (s)   stdev (s)    ratio
> qemumips64        231          6      1.00
> 34Kf-16tlb        605         12      2.61
> 34Kf-32tlb        340         15      1.47
> 34Kf-64tlb        272         14      1.17
>
> So, yes, 32 TLBs would give a noticeable improvement as well,
> but 64 TLBs is still better.
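
The arithmetic behind the ratio column, and the run time saved by each larger TLB, can be reproduced with a minimal C sketch. The averages are copied from the table above; the `ex_`/`EX_` names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Average do_testimage run times in seconds, copied from the table above. */
#define EX_AVG_QEMUMIPS64 231.0
#define EX_AVG_34KF_16TLB 605.0
#define EX_AVG_34KF_32TLB 340.0
#define EX_AVG_34KF_64TLB 272.0

/* Run time relative to a baseline (the table's "ratio" column). */
double ex_ratio(double avg_s, double baseline_s)
{
    assert(baseline_s > 0.0);
    return avg_s / baseline_s;
}

/* Percentage of run time saved when going from old_s to new_s. */
double ex_time_saved_pct(double old_s, double new_s)
{
    assert(old_s > 0.0);
    return 100.0 * (old_s - new_s) / old_s;
}
```

With these numbers, `ex_ratio(EX_AVG_34KF_16TLB, EX_AVG_QEMUMIPS64)` comes to about 2.62, which agrees with the table to within rounding of the last digit. `ex_time_saved_pct(EX_AVG_34KF_16TLB, EX_AVG_34KF_64TLB)` is about 55%, and the same call for the 32-entry variant is about 44%.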


Thanks, let's use the 64 TLB model then. The one thing we will need to
take care of is keeping this patch forward-portable, which I think you
have already described, so it should be good to go as far as I can see.
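
The forward-porting recipe comes down to one field of CP0 Config1: the MMU-size field (bits 30..25 in the MIPS32 architecture) stores the number of TLB entries minus one, which is why 15 means 16 entries and 63 means 64. A minimal sketch of that encoding follows; the `ex_`/`EX_` names are hypothetical and local to this example, with the shift chosen to mirror qemu's CP0C1_MMU.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position and width mask of the MMU-size field in CP0 Config1.
 * The shift mirrors qemu's CP0C1_MMU; the EX_ names are local to this
 * example. */
#define EX_CP0C1_MMU      25
#define EX_CP0C1_MMU_MASK 0x3fu

/* Encode a TLB entry count (1..64) as the Config1 MMU-size field. */
uint32_t ex_config1_mmu_field(unsigned entries)
{
    assert(entries >= 1 && entries <= 64);
    return (uint32_t)(entries - 1) << EX_CP0C1_MMU;
}

/* Decode the TLB entry count from a Config1 value. */
unsigned ex_config1_tlb_entries(uint32_t config1)
{
    return ((config1 >> EX_CP0C1_MMU) & EX_CP0C1_MMU_MASK) + 1;
}
```

Since the field is 6 bits wide, 64 entries is also the largest count this encoding can express, so the 34Kf-64tlb model uses the maximum.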


>
> Thanks,
> Victor
>
> ________________________________________
> From: Khem Raj <raj.khem@gmail.com>
> Sent: Wednesday, October 7, 2020 3:05 PM
> To: Victor Kamensky (kamensky); openembedded-core@lists.openembedded.org
> Cc: Richard Purdie
> Subject: Re: [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
>
> Hi Victor
>
> Thanks for investigating this; these are hard problems to root-cause.
> I am fine with this patchset as it is. One comment/question I have is whether
> you tried 32 TLBs. Since the r6 implementation does allow 32 TLBs, this would
> make it less fictitious, and perhaps any 32-bit mips issues could be shared
> with the mips32r6 implementation. I am curious whether that would result
> in better performance or whether 64 TLB emulation is better.
>
> On 10/7/20 1:38 PM, Victor Kamensky wrote:
> > Yocto Project bug 13992 reports that qemumips on the
> > autobuilder runs almost twice as slowly as qemumips64 and
> > sometimes hits the timeout.
> >
> > Investigating qemu-system with perf, gdb, and SystemTap
> > and comparing the behavior of the qemumips and qemumips64
> > machines showed that the qemu soft MMU code behaves quite
> > differently: on qemumips the tlbwr instruction is executed
> > about 16 times more often. The qemumips64 machine runs
> > with a cpu type that has 64 TLB entries, whereas qemumips
> > runs with a cpu type that has only 16 TLB entries.
> >
> > The idea of the proposed qemu patch is to introduce a
> > fictitious 34Kf-64tlb cpu type that is defined exactly like
> > 34Kf but has 64 TLB entries instead of the original 16.
> >
> > Testing core-image-full-cmdline:do_testimage with
> > 34Kf-64tlb shows an improvement of about 40% in test
> > execution real time.
> >
> > Note for future porters of the patch: the easiest way to
> > update the patch and stay in sync with the 34Kf definition
> > is to copy the 34Kf definition and apply the following
> > changes to it (just change the CP0C1_MMU field value from
> > 15 to 63):
> >
> > [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> > 2c2
> > <         .name = "34Kf",
> > >         .name = "34Kf-64tlb",
> > 6c6
> > <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
> > >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> >
> > Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> >
> > Upstream Status: Inappropriate
> >
> > Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> > ---
> >   meta/recipes-devtools/qemu/qemu.inc                |   1 +
> >   ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
> >   2 files changed, 119 insertions(+)
> >   create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> >
> > diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
> > index bbb9038961..6c0edcb706 100644
> > --- a/meta/recipes-devtools/qemu/qemu.inc
> > +++ b/meta/recipes-devtools/qemu/qemu.inc
> > @@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
> >              file://0001-qemu-Do-not-include-file-if-not-exists.patch \
> >              file://find_datadir.patch \
> >              file://usb-fix-setup_len-init.patch \
> > +           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
> >              "
> >   UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
> >
> > diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> > new file mode 100644
> > index 0000000000..b6312e1543
> > --- /dev/null
> > +++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> > @@ -0,0 +1,118 @@
> > +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
> > +From: Victor Kamensky <kamensky@cisco.com>
> > +Date: Wed, 7 Oct 2020 10:19:42 -0700
> > +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
> > + 64 TLBs
> > +
> > +In Yocto Project CI runs it was observed that a test run
> > +of a 32 bit mips image takes almost twice as long as a 64 bit
> > +mips image under the same logical load, and CI execution
> > +hits the timeout.
> > +
> > +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> > +
> > +The Yocto Project uses the 34Kf cpu type to run 32 bit mips images
> > +and the MIPS64R2-generic cpu type to run 64 bit mips64 images.
> > +
> > +Investigating the differences in qemu behavior between mips
> > +and mips64 produced two prominent observations. Under a
> > +logically similar load (same definition and configuration
> > +of the user-land image), the get_physical_address function
> > +is called almost twice as often on mips, meaning about twice
> > +as many memory accesses are involved. Also, the number of
> > +tlbwr instructions executed (r4k_helper_tlbwr qemu function)
> > +is almost 16 times larger on mips than on
> > +mips64.
> > +
> > +It turns out that the 34Kf cpu has 16 TLB entries, whereas
> > +MIPS64R2-generic has 64. That explains why so many more
> > +tlbwr instructions have to be executed by the kernel TLB
> > +refill handler on 32 bit mips.
> > +
> > +The idea of the fix is to add a new fictitious 34Kf-64tlb
> > +cpu type that behaves exactly like 34Kf but contains
> > +64 TLB entries to reduce TLB thrashing. After all, adding
> > +more TLB entries to the soft MMU is easy.
> > +
> > +An experiment with a significant, non-trivial load in the Yocto
> > +environment, running do_testimage, shows that the 34Kf-64tlb
> > +cpu performs about 40% better than the original 34Kf cpu in
> > +terms of test execution real time.
> > +
> > +It is not ideal to have a cpu type that does not exist in the
> > +wild, but the performance gains seem to justify it.
> > +
> > +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> > +---
> > + target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
> > + 1 file changed, 55 insertions(+)
> > +
> > +diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
> > +index 637caccd89..b73ab48231 100644
> > +--- a/target/mips/translate_init.inc.c
> > ++++ b/target/mips/translate_init.inc.c
> > +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
> > +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> > +         .mmu_type = MMU_TYPE_R4000,
> > +     },
> > ++    /*
> > ++     * Verbatim copy of the "34Kf" cpu, with only the number of TLB entries
> > ++     * bumped from 16 to 64 (see the CP0C1_MMU bits of CP0_Config1) to improve
> > ++     * performance by reducing the number of TLB refill exceptions and
> > ++     * eliminating the need to run the corresponding TLB refill handling
> > ++     * instructions.
> > ++     */
> > ++    {
> > ++        .name = "34Kf-64tlb",
> > ++        .CP0_PRid = 0x00019500,
> > ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
> > ++                       (MMU_TYPE_R4000 << CP0C0_MT),
> > ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> > ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
> > ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
> > ++                       (1 << CP0C1_CA),
> > ++        .CP0_Config2 = MIPS_CONFIG2,
> > ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
> > ++                       (1 << CP0C3_DSPP),
> > ++        .CP0_LLAddr_rw_bitmask = 0,
> > ++        .CP0_LLAddr_shift = 0,
> > ++        .SYNCI_Step = 32,
> > ++        .CCRes = 2,
> > ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
> > ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
> > ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
> > ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
> > ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
> > ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
> > ++                    (0xff << CP0TCSt_TASID),
> > ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
> > ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
> > ++        .CP1_fcr31 = 0,
> > ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
> > ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
> > ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
> > ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
> > ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
> > ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
> > ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
> > ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
> > ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
> > ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
> > ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
> > ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
> > ++        .SEGBITS = 32,
> > ++        .PABITS = 32,
> > ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> > ++        .mmu_type = MMU_TYPE_R4000,
> > ++    },
> > +     {
> > +         .name = "74Kf",
> > +         .CP0_PRid = 0x00019700,
> > +--
> > +2.14.5
> > +
> >

* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-07 20:38 ` [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type Victor Kamensky
  2020-10-07 20:46   ` [OE-core] " Paul Barker
  2020-10-07 22:05   ` Khem Raj
@ 2020-10-08  7:29   ` Ross Burton
  2020-10-08 11:53     ` Alexander Kanavin
  2 siblings, 1 reply; 16+ messages in thread
From: Ross Burton @ 2020-10-08  7:29 UTC (permalink / raw)
  To: kamensky; +Cc: OE-core

Excellent work to identify a relatively simple way to dramatically
improve performance. Nice one!

Ross

On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
wrote:
>
> Yocto Project bug 13992 reports that qemumips on the
> autobuilder runs almost twice as slowly as qemumips64 and
> sometimes hits the timeout.
>
> Investigating qemu-system with perf, gdb, and SystemTap
> and comparing the behavior of the qemumips and qemumips64
> machines showed that the qemu soft MMU code behaves quite
> differently: on qemumips the tlbwr instruction is executed
> about 16 times more often. The qemumips64 machine runs
> with a cpu type that has 64 TLB entries, whereas qemumips
> runs with a cpu type that has only 16 TLB entries.
>
> The idea of the proposed qemu patch is to introduce a
> fictitious 34Kf-64tlb cpu type that is defined exactly like
> 34Kf but has 64 TLB entries instead of the original 16.
>
> Testing core-image-full-cmdline:do_testimage with
> 34Kf-64tlb shows an improvement of about 40% in test
> execution real time.
>
> Note for future porters of the patch: the easiest way to
> update the patch and stay in sync with the 34Kf definition
> is to copy the 34Kf definition and apply the following
> changes to it (just change the CP0C1_MMU field value from
> 15 to 63):
>
> [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> 2c2
> <         .name = "34Kf",
> >         .name = "34Kf-64tlb",
> 6c6
> <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
> >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>
> Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>
> Upstream Status: Inappropriate
>
> Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> ---
>  meta/recipes-devtools/qemu/qemu.inc                |   1 +
>  ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
>  2 files changed, 119 insertions(+)
>  create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>
> diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
> index bbb9038961..6c0edcb706 100644
> --- a/meta/recipes-devtools/qemu/qemu.inc
> +++ b/meta/recipes-devtools/qemu/qemu.inc
> @@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
>             file://0001-qemu-Do-not-include-file-if-not-exists.patch \
>             file://find_datadir.patch \
>             file://usb-fix-setup_len-init.patch \
> +           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
>             "
>  UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
>
> diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> new file mode 100644
> index 0000000000..b6312e1543
> --- /dev/null
> +++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> @@ -0,0 +1,118 @@
> +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
> +From: Victor Kamensky <kamensky@cisco.com>
> +Date: Wed, 7 Oct 2020 10:19:42 -0700
> +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
> + 64 TLBs
> +
> +In Yocto Project CI runs it was observed that a test run
> +of a 32 bit mips image takes almost twice as long as a 64 bit
> +mips image under the same logical load, and CI execution
> +hits the timeout.
> +
> +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> +
> +The Yocto Project uses the 34Kf cpu type to run 32 bit mips images
> +and the MIPS64R2-generic cpu type to run 64 bit mips64 images.
> +
> +Investigating the differences in qemu behavior between mips
> +and mips64 produced two prominent observations. Under a
> +logically similar load (same definition and configuration
> +of the user-land image), the get_physical_address function
> +is called almost twice as often on mips, meaning about twice
> +as many memory accesses are involved. Also, the number of
> +tlbwr instructions executed (r4k_helper_tlbwr qemu function)
> +is almost 16 times larger on mips than on
> +mips64.
> +
> +It turns out that the 34Kf cpu has 16 TLB entries, whereas
> +MIPS64R2-generic has 64. That explains why so many more
> +tlbwr instructions have to be executed by the kernel TLB
> +refill handler on 32 bit mips.
> +
> +The idea of the fix is to add a new fictitious 34Kf-64tlb
> +cpu type that behaves exactly like 34Kf but contains
> +64 TLB entries to reduce TLB thrashing. After all, adding
> +more TLB entries to the soft MMU is easy.
> +
> +An experiment with a significant, non-trivial load in the Yocto
> +environment, running do_testimage, shows that the 34Kf-64tlb
> +cpu performs about 40% better than the original 34Kf cpu in
> +terms of test execution real time.
> +
> +It is not ideal to have a cpu type that does not exist in the
> +wild, but the performance gains seem to justify it.
> +
> +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> +---
> + target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
> + 1 file changed, 55 insertions(+)
> +
> +diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
> +index 637caccd89..b73ab48231 100644
> +--- a/target/mips/translate_init.inc.c
> ++++ b/target/mips/translate_init.inc.c
> +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
> +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> +         .mmu_type = MMU_TYPE_R4000,
> +     },
> ++    /*
> ++     * Verbatim copy of the "34Kf" cpu, with only the number of TLB entries
> ++     * bumped from 16 to 64 (see the CP0C1_MMU bits of CP0_Config1) to improve
> ++     * performance by reducing the number of TLB refill exceptions and
> ++     * eliminating the need to run the corresponding TLB refill handling
> ++     * instructions.
> ++     */
> ++    {
> ++        .name = "34Kf-64tlb",
> ++        .CP0_PRid = 0x00019500,
> ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
> ++                       (MMU_TYPE_R4000 << CP0C0_MT),
> ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
> ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
> ++                       (1 << CP0C1_CA),
> ++        .CP0_Config2 = MIPS_CONFIG2,
> ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
> ++                       (1 << CP0C3_DSPP),
> ++        .CP0_LLAddr_rw_bitmask = 0,
> ++        .CP0_LLAddr_shift = 0,
> ++        .SYNCI_Step = 32,
> ++        .CCRes = 2,
> ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
> ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
> ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
> ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
> ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
> ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
> ++                    (0xff << CP0TCSt_TASID),
> ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
> ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
> ++        .CP1_fcr31 = 0,
> ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
> ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
> ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
> ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
> ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
> ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
> ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
> ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
> ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
> ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
> ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
> ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
> ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
> ++        .SEGBITS = 32,
> ++        .PABITS = 32,
> ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> ++        .mmu_type = MMU_TYPE_R4000,
> ++    },
> +     {
> +         .name = "74Kf",
> +         .CP0_PRid = 0x00019700,
> +--
> +2.14.5
> +
> --
> 2.14.5
>
>
> 
>

* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-08  7:29   ` [OE-core] " Ross Burton
@ 2020-10-08 11:53     ` Alexander Kanavin
  2020-10-08 16:05       ` Khem Raj
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Kanavin @ 2020-10-08 11:53 UTC (permalink / raw)
  To: Victor Kamensky (kamensky); +Cc: OE-core, Ross Burton

Thanks - I note that Upstream-Status is missing, are you planning to
approach qemu upstream with this?

Alex

On Thu, 8 Oct 2020 at 09:30, Ross Burton <ross@burtonini.com> wrote:

> Excellent work to identify a relatively simple way to dramatically
> improve performance. Nice one!
>
> Ross
>
> On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
> lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
> wrote:
> >
> > Yocto Project bug 13992 reports that qemumips on the
> > autobuilder runs almost twice as slowly as qemumips64 and
> > sometimes hits the timeout.
> >
> > Investigating qemu-system with perf, gdb, and SystemTap
> > and comparing the behavior of the qemumips and qemumips64
> > machines showed that the qemu soft MMU code behaves quite
> > differently: on qemumips the tlbwr instruction is executed
> > about 16 times more often. The qemumips64 machine runs
> > with a cpu type that has 64 TLB entries, whereas qemumips
> > runs with a cpu type that has only 16 TLB entries.
> >
> > The idea of the proposed qemu patch is to introduce a
> > fictitious 34Kf-64tlb cpu type that is defined exactly like
> > 34Kf but has 64 TLB entries instead of the original 16.
> >
> > Testing core-image-full-cmdline:do_testimage with
> > 34Kf-64tlb shows an improvement of about 40% in test
> > execution real time.
> >
> > Note for future porters of the patch: the easiest way to
> > update the patch and stay in sync with the 34Kf definition
> > is to copy the 34Kf definition and apply the following
> > changes to it (just change the CP0C1_MMU field value from
> > 15 to 63):
> >
> > [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
> > 2c2
> > <         .name = "34Kf",
> > >         .name = "34Kf-64tlb",
> > 6c6
> > <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
> > >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
> >
> > Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> >
> > Upstream Status: Inappropriate
> >
> > Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> > ---
> >  meta/recipes-devtools/qemu/qemu.inc                |   1 +
> >  ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118
> +++++++++++++++++++++
> >  2 files changed, 119 insertions(+)
> >  create mode 100644
> meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> >
> > diff --git a/meta/recipes-devtools/qemu/qemu.inc
> b/meta/recipes-devtools/qemu/qemu.inc
> > index bbb9038961..6c0edcb706 100644
> > --- a/meta/recipes-devtools/qemu/qemu.inc
> > +++ b/meta/recipes-devtools/qemu/qemu.inc
> > @@ -31,6 +31,7 @@ SRC_URI = "
> https://download.qemu.org/${BPN}-${PV}.tar.xz \
> >             file://0001-qemu-Do-not-include-file-if-not-exists.patch \
> >             file://find_datadir.patch \
> >             file://usb-fix-setup_len-init.patch \
> > +
>  file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
> >             "
> >  UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
> >
> > diff --git
> a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> > new file mode 100644
> > index 0000000000..b6312e1543
> > --- /dev/null
> > +++
> b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
> > @@ -0,0 +1,118 @@
> > +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
> > +From: Victor Kamensky <kamensky@cisco.com>
> > +Date: Wed, 7 Oct 2020 10:19:42 -0700
> > +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but
> with
> > + 64 TLBs
> > +
> > +In Yocto Project CI runs it was observed that a test run
> > +of a 32 bit mips image takes almost twice as long as a 64 bit
> > +mips image under the same logical load, and CI execution
> > +hits the timeout.
> > +
> > +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
> > +
> > +The Yocto Project uses the 34Kf cpu type to run 32 bit mips images
> > +and the MIPS64R2-generic cpu type to run 64 bit mips64 images.
> > +
> > +Investigating the differences in qemu behavior between mips
> > +and mips64 produced two prominent observations. Under a
> > +logically similar load (same definition and configuration
> > +of the user-land image), the get_physical_address function
> > +is called almost twice as often on mips, meaning about twice
> > +as many memory accesses are involved. Also, the number of
> > +tlbwr instructions executed (r4k_helper_tlbwr qemu function)
> > +is almost 16 times larger on mips than on
> > +mips64.
> > +
> > +It turns out that the 34Kf cpu has 16 TLB entries, whereas
> > +MIPS64R2-generic has 64. That explains why so many more
> > +tlbwr instructions have to be executed by the kernel TLB
> > +refill handler on 32 bit mips.
> > +
> > +The idea of the fix is to add a new fictitious 34Kf-64tlb
> > +cpu type that behaves exactly like 34Kf but contains
> > +64 TLB entries to reduce TLB thrashing. After all, adding
> > +more TLB entries to the soft MMU is easy.
> > +
> > +An experiment with a significant, non-trivial load in the Yocto
> > +environment, running do_testimage, shows that the 34Kf-64tlb
> > +cpu performs about 40% better than the original 34Kf cpu in
> > +terms of test execution real time.
> > +
> > +It is not ideal to have a cpu type that does not exist in the
> > +wild, but the performance gains seem to justify it.
> > +
> > +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
> > +---
> > + target/mips/translate_init.inc.c | 55
> ++++++++++++++++++++++++++++++++++++++++
> > + 1 file changed, 55 insertions(+)
> > +
> > +diff --git a/target/mips/translate_init.inc.c
> b/target/mips/translate_init.inc.c
> > +index 637caccd89..b73ab48231 100644
> > +--- a/target/mips/translate_init.inc.c
> > ++++ b/target/mips/translate_init.inc.c
> > +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
> > +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> > +         .mmu_type = MMU_TYPE_R4000,
> > +     },
> > ++    /*
> > ++     * Verbatim copy of the "34Kf" cpu, with only the number of TLB entries
> > ++     * bumped from 16 to 64 (see the CP0C1_MMU bits of CP0_Config1) to improve
> > ++     * performance by reducing the number of TLB refill exceptions and
> > ++     * eliminating the need to run the corresponding TLB refill handling
> > ++     * instructions.
> > ++     */
> > ++    {
> > ++        .name = "34Kf-64tlb",
> > ++        .CP0_PRid = 0x00019500,
> > ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
> > ++                       (MMU_TYPE_R4000 << CP0C0_MT),
> > ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 <<
> CP0C1_MMU) |
> > ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 <<
> CP0C1_IA) |
> > ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 <<
> CP0C1_DA) |
> > ++                       (1 << CP0C1_CA),
> > ++        .CP0_Config2 = MIPS_CONFIG2,
> > ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 <<
> CP0C3_MT) |
> > ++                       (1 << CP0C3_DSPP),
> > ++        .CP0_LLAddr_rw_bitmask = 0,
> > ++        .CP0_LLAddr_shift = 0,
> > ++        .SYNCI_Step = 32,
> > ++        .CCRes = 2,
> > ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
> > ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 <<
> CP0TCSt_TCU2) |
> > ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
> > ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
> > ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
> > ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
> > ++                    (0xff << CP0TCSt_TASID),
> > ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
> > ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
> > ++        .CP1_fcr31 = 0,
> > ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
> > ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
> > ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
> > ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
> > ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
> > ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
> > ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
> > ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
> > ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
> > ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
> > ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
> > ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
> > ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
> > ++        .SEGBITS = 32,
> > ++        .PABITS = 32,
> > ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
> > ++        .mmu_type = MMU_TYPE_R4000,
> > ++    },
> > +     {
> > +         .name = "74Kf",
> > +         .CP0_PRid = 0x00019700,
> > +--
> > +2.14.5
> > +
> > --
> > 2.14.5
> >
> >
> >
> >
>
> 
>
>



* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-08 11:53     ` Alexander Kanavin
@ 2020-10-08 16:05       ` Khem Raj
  2020-10-08 16:39         ` Victor Kamensky
  0 siblings, 1 reply; 16+ messages in thread
From: Khem Raj @ 2020-10-08 16:05 UTC (permalink / raw)
  To: Alexander Kanavin; +Cc: Victor Kamensky (kamensky), OE-core, Ross Burton

On Thu, Oct 8, 2020 at 4:53 AM Alexander Kanavin <alex.kanavin@gmail.com> wrote:
>
> Thanks - I note that Upstream-Status is missing, are you planning to approach qemu upstream with this?
>

On upstreaming: I think it is worth proposing this to qemu upstream.

> Alex
>
> On Thu, 8 Oct 2020 at 09:30, Ross Burton <ross@burtonini.com> wrote:
>>
>> Excellent work to identify a relatively simple way to dramatically
>> improve performance. Nice one!
>>
>> Ross
>>
>> On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
>> lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
>> wrote:
>> >
>> > In Yocto Project PR 13992 it was reported that qemumips
>> > in the autobuilder runs almost twice as slow as qemumips64 and
>> > sometimes hits the timeout.
>> >
>> > Investigating qemu-system with perf, gdb, and SystemTap and
>> > comparing the behavior of the qemumips and qemumips64 machines
>> > showed that the qemu soft mmu code behaves quite differently:
>> > in the qemumips case the tlbwr instruction is executed 16 times
>> > more often. It turns out that in the qemumips64 case qemu runs
>> > with a cpu type that has 64 TLB entries, but in the qemumips
>> > case qemu runs with a cpu type that has only 16 TLB entries.
>> >
>> > The idea of the proposed qemu patch is to introduce a fictitious
>> > 34Kf-64tlb cpu type that is defined exactly like 34Kf but has
>> > 64 TLB entries instead of the original 16.
>> >
>> > Testing core-image-full-cmdline:do_testimage with
>> > 34Kf-64tlb shows an improvement of 40% or so in test
>> > execution real time.
>> >
>> > Note for future porters of the patch: the easiest way to update
>> > the patch and stay in sync with the 34Kf definition is to copy
>> > the 34Kf cpu definition and apply the following changes to
>> > it (just change the CP0C1_MMU field value from 15 to 63):
>> >
>> > [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
>> > 2c2
>> > <         .name = "34Kf",
>> > >         .name = "34Kf-64tlb",
>> > 6c6
>> > <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>> > >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>> >
>> > Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>> >
>> > Upstream Status: Inappropriate
>> >
>> > Signed-off-by: Victor Kamensky <kamensky@cisco.com>
>> > ---
>> >  meta/recipes-devtools/qemu/qemu.inc                |   1 +
>> >  ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
>> >  2 files changed, 119 insertions(+)
>> >  create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> >
>> > diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
>> > index bbb9038961..6c0edcb706 100644
>> > --- a/meta/recipes-devtools/qemu/qemu.inc
>> > +++ b/meta/recipes-devtools/qemu/qemu.inc
>> > @@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
>> >             file://0001-qemu-Do-not-include-file-if-not-exists.patch \
>> >             file://find_datadir.patch \
>> >             file://usb-fix-setup_len-init.patch \
>> > +           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
>> >             "
>> >  UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
>> >
>> > diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> > new file mode 100644
>> > index 0000000000..b6312e1543
>> > --- /dev/null
>> > +++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> > @@ -0,0 +1,118 @@
>> > +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
>> > +From: Victor Kamensky <kamensky@cisco.com>
>> > +Date: Wed, 7 Oct 2020 10:19:42 -0700
>> > +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
>> > + 64 TLBs
>> > +
>> > +In Yocto Project CI runs it was observed that a test run
>> > +of the 32 bit mips image takes almost twice as long as the 64 bit
>> > +mips image with the same logical load, and CI execution
>> > +hits the timeout.
>> > +
>> > +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>> > +
>> > +Yocto project uses 34Kf cpu type to run 32 bit mips image,
>> > +and MIPS64R2-generic cpu type to run 64 bit mips64 image.
>> > +
>> > +Upon qemu behavior differences investigation between mips
>> > +and mips64 two prominent observations came up: under
>> > +logically similar load (same definition and configuration
>> > +of user-land image) in case of mips get_physical_address
>> > +function is called almost twice as often, meaning
>> > +twice as many memory accesses are involved in this case. Also
>> > +the number of tlbwr instructions executed (r4k_helper_tlbwr
>> > +qemu function) is almost 16 times bigger in the mips case than in
>> > +mips64.
>> > +
>> > +It turns out that the 34Kf cpu has 16 TLBs, but in the case of
>> > +MIPS64R2-generic it is 64 TLBs. That explains why
>> > +so many more tlbwr had to be executed by the kernel TLB refill
>> > +handler in the case of 32 bit mips.
>> > +
>> > +The idea of the fix is to come up with new 34Kf-64tlb fictitious
>> > +cpu type, that would behave exactly as 34Kf but it would
>> > +contain 64 TLBs to reduce TLB thrashing. After all, adding
>> > +more TLBs to soft mmu is easy.
>> > +
>> > +Experiment with some significant non-trivial load in Yocto
>> > +environment by running do_testimage load shows that 34Kf-64tlb
>> > +cpu performs 40% or so better than original 34Kf cpu wrt test
>> > +execution real time.
>> > +
>> > +It is not ideal to have cpu type that does not exist in the
>> > +wild but given performance gains it seems to be justified.
>> > +
>> > +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
>> > +---
>> > + target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
>> > + 1 file changed, 55 insertions(+)
>> > +
>> > +diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
>> > +index 637caccd89..b73ab48231 100644
>> > +--- a/target/mips/translate_init.inc.c
>> > ++++ b/target/mips/translate_init.inc.c
>> > +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
>> > +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
>> > +         .mmu_type = MMU_TYPE_R4000,
>> > +     },
>> > ++    /*
>> > ++     * Verbatim copy of "34Kf" cpu, only bumped up number of TLB entries
>> > ++     * from 16 to 64 (see CP0_Config1 value at CP0C1_MMU bits) to improve
>> > ++     * performance by reducing number of TLB refill exceptions and
>> > ++     * eliminating need to run all corresponding TLB refill handling
>> > ++     * instructions.
>> > ++     */
>> > ++    {
>> > ++        .name = "34Kf-64tlb",
>> > ++        .CP0_PRid = 0x00019500,
>> > ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
>> > ++                       (MMU_TYPE_R4000 << CP0C0_MT),
>> > ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>> > ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
>> > ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
>> > ++                       (1 << CP0C1_CA),
>> > ++        .CP0_Config2 = MIPS_CONFIG2,
>> > ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
>> > ++                       (1 << CP0C3_DSPP),
>> > ++        .CP0_LLAddr_rw_bitmask = 0,
>> > ++        .CP0_LLAddr_shift = 0,
>> > ++        .SYNCI_Step = 32,
>> > ++        .CCRes = 2,
>> > ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
>> > ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
>> > ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
>> > ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
>> > ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
>> > ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
>> > ++                    (0xff << CP0TCSt_TASID),
>> > ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
>> > ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
>> > ++        .CP1_fcr31 = 0,
>> > ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
>> > ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
>> > ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
>> > ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
>> > ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
>> > ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
>> > ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
>> > ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
>> > ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
>> > ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
>> > ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
>> > ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
>> > ++        .SEGBITS = 32,
>> > ++        .PABITS = 32,
>> > ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
>> > ++        .mmu_type = MMU_TYPE_R4000,
>> > ++    },
>> > +     {
>> > +         .name = "74Kf",
>> > +         .CP0_PRid = 0x00019700,
>> > +--
>> > +2.14.5
>> > +
>> > --
>> > 2.14.5
>> >
>> >
>> >
>> >
>>
>>
>>
>
> 
>


* Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
  2020-10-08 16:05       ` Khem Raj
@ 2020-10-08 16:39         ` Victor Kamensky
  0 siblings, 0 replies; 16+ messages in thread
From: Victor Kamensky @ 2020-10-08 16:39 UTC (permalink / raw)
  To: Khem Raj, Alexander Kanavin; +Cc: OE-core, Ross Burton

Hi Khem, Alexander,

Please see my responses inline; look for 'kamensky>'.

________________________________________
From: openembedded-core@lists.openembedded.org <openembedded-core@lists.openembedded.org> on behalf of Khem Raj <raj.khem@gmail.com>
Sent: Thursday, October 8, 2020 9:05 AM
To: Alexander Kanavin
Cc: Victor Kamensky (kamensky); OE-core; Ross Burton
Subject: Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type

On Thu, Oct 8, 2020 at 4:53 AM Alexander Kanavin <alex.kanavin@gmail.com> wrote:
>
> Thanks - I note that Upstream-Status is missing, are you planning to approach qemu upstream with this?
>

On upstreaming: I think it is worth proposing this to qemu upstream.

kamensky> Yes, the same was briefly discussed during today's bug triage meeting.
kamensky> I will try to submit it to qemu upstream and argue our case. It won't
kamensky> hurt to try anyway.

kamensky> As far as Upstream-Status is concerned: yes, it looks like I messed up.
kamensky> I added 'Upstream Status' to the OE patch and did not realize that it
kamensky> should go into the added qemu patch file itself.
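
A side note on the one-field change quoted below (15 to 63 in CP0C1_MMU): as far as I understand it, the Config1 MMU size field stores the number of TLB entries minus one, which is why 15 corresponds to 16 entries and 63 to 64. The following is only a minimal C sketch of that encoding, not qemu code. The bit position 25 is an assumption based on the MIPS32 Config1 layout (bits 30..25), and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed position of the "MMU size - 1" field in CP0 Config1 (bits 30..25). */
#define CP0C1_MMU      25
#define CP0C1_MMU_MASK 0x3fu

/* Hypothetical helper: number of TLB entries described by a Config1 value. */
static unsigned tlb_entries_from_config1(uint32_t config1)
{
    return ((config1 >> CP0C1_MMU) & CP0C1_MMU_MASK) + 1;
}

/* Hypothetical helper: return config1 with the MMU size field set for n entries. */
static uint32_t config1_with_tlb_entries(uint32_t config1, unsigned n)
{
    assert(n >= 1 && n <= 64);  /* a 6-bit field can describe 1..64 entries */
    config1 &= ~((uint32_t)CP0C1_MMU_MASK << CP0C1_MMU);
    return config1 | ((uint32_t)(n - 1) << CP0C1_MMU);
}
```

Under these assumptions, the 34Kf value (15 << CP0C1_MMU) decodes to 16 entries and the 34Kf-64tlb value (63 << CP0C1_MMU) decodes to 64, the maximum this 6-bit field can describe.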

Thanks,
Victor

> Alex
>
> On Thu, 8 Oct 2020 at 09:30, Ross Burton <ross@burtonini.com> wrote:
>>
>> Excellent work to identify a relatively simple way to dramatically
>> improve performance. Nice one!
>>
>> Ross
>>
>> On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
>> lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
>> wrote:
>> >
>> > In Yocto Project PR 13992 it was reported that qemumips
>> > in the autobuilder runs almost twice as slow as qemumips64 and
>> > sometimes hits the timeout.
>> >
>> > Investigating qemu-system with perf, gdb, and SystemTap and
>> > comparing the behavior of the qemumips and qemumips64 machines
>> > showed that the qemu soft mmu code behaves quite differently:
>> > in the qemumips case the tlbwr instruction is executed 16 times
>> > more often. It turns out that in the qemumips64 case qemu runs
>> > with a cpu type that has 64 TLB entries, but in the qemumips
>> > case qemu runs with a cpu type that has only 16 TLB entries.
>> >
>> > The idea of the proposed qemu patch is to introduce a fictitious
>> > 34Kf-64tlb cpu type that is defined exactly like 34Kf but has
>> > 64 TLB entries instead of the original 16.
>> >
>> > Testing core-image-full-cmdline:do_testimage with
>> > 34Kf-64tlb shows an improvement of 40% or so in test
>> > execution real time.
>> >
>> > Note for future porters of the patch: the easiest way to update
>> > the patch and stay in sync with the 34Kf definition is to copy
>> > the 34Kf cpu definition and apply the following changes to
>> > it (just change the CP0C1_MMU field value from 15 to 63):
>> >
>> > [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
>> > 2c2
>> > <         .name = "34Kf",
>> > >         .name = "34Kf-64tlb",
>> > 6c6
>> > <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>> > >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>> >
>> > Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>> >
>> > Upstream Status: Inappropriate
>> >
>> > Signed-off-by: Victor Kamensky <kamensky@cisco.com>
>> > ---
>> >  meta/recipes-devtools/qemu/qemu.inc                |   1 +
>> >  ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
>> >  2 files changed, 119 insertions(+)
>> >  create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> >
>> > diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
>> > index bbb9038961..6c0edcb706 100644
>> > --- a/meta/recipes-devtools/qemu/qemu.inc
>> > +++ b/meta/recipes-devtools/qemu/qemu.inc
>> > @@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
>> >             file://0001-qemu-Do-not-include-file-if-not-exists.patch \
>> >             file://find_datadir.patch \
>> >             file://usb-fix-setup_len-init.patch \
>> > +           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
>> >             "
>> >  UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
>> >
>> > diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> > new file mode 100644
>> > index 0000000000..b6312e1543
>> > --- /dev/null
>> > +++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> > @@ -0,0 +1,118 @@
>> > +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
>> > +From: Victor Kamensky <kamensky@cisco.com>
>> > +Date: Wed, 7 Oct 2020 10:19:42 -0700
>> > +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
>> > + 64 TLBs
>> > +
>> > +In Yocto Project CI runs it was observed that a test run
>> > +of the 32 bit mips image takes almost twice as long as the 64 bit
>> > +mips image with the same logical load, and CI execution
>> > +hits the timeout.
>> > +
>> > +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>> > +
>> > +Yocto project uses 34Kf cpu type to run 32 bit mips image,
>> > +and MIPS64R2-generic cpu type to run 64 bit mips64 image.
>> > +
>> > +Upon qemu behavior differences investigation between mips
>> > +and mips64 two prominent observations came up: under
>> > +logically similar load (same definition and configuration
>> > +of user-land image) in case of mips get_physical_address
>> > +function is called almost twice as often, meaning
>> > +twice as many memory accesses are involved in this case. Also
>> > +the number of tlbwr instructions executed (r4k_helper_tlbwr
>> > +qemu function) is almost 16 times bigger in the mips case than in
>> > +mips64.
>> > +
>> > +It turns out that the 34Kf cpu has 16 TLBs, but in the case of
>> > +MIPS64R2-generic it is 64 TLBs. That explains why
>> > +so many more tlbwr had to be executed by the kernel TLB refill
>> > +handler in the case of 32 bit mips.
>> > +
>> > +The idea of the fix is to come up with new 34Kf-64tlb fictitious
>> > +cpu type, that would behave exactly as 34Kf but it would
>> > +contain 64 TLBs to reduce TLB thrashing. After all, adding
>> > +more TLBs to soft mmu is easy.
>> > +
>> > +Experiment with some significant non-trivial load in Yocto
>> > +environment by running do_testimage load shows that 34Kf-64tlb
>> > +cpu performs 40% or so better than original 34Kf cpu wrt test
>> > +execution real time.
>> > +
>> > +It is not ideal to have cpu type that does not exist in the
>> > +wild but given performance gains it seems to be justified.
>> > +
>> > +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
>> > +---
>> > + target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
>> > + 1 file changed, 55 insertions(+)
>> > +
>> > +diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
>> > +index 637caccd89..b73ab48231 100644
>> > +--- a/target/mips/translate_init.inc.c
>> > ++++ b/target/mips/translate_init.inc.c
>> > +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
>> > +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
>> > +         .mmu_type = MMU_TYPE_R4000,
>> > +     },
>> > ++    /*
>> > ++     * Verbatim copy of "34Kf" cpu, only bumped up number of TLB entries
>> > ++     * from 16 to 64 (see CP0_Config1 value at CP0C1_MMU bits) to improve
>> > ++     * performance by reducing number of TLB refill exceptions and
>> > ++     * eliminating need to run all corresponding TLB refill handling
>> > ++     * instructions.
>> > ++     */
>> > ++    {
>> > ++        .name = "34Kf-64tlb",
>> > ++        .CP0_PRid = 0x00019500,
>> > ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
>> > ++                       (MMU_TYPE_R4000 << CP0C0_MT),
>> > ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>> > ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
>> > ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
>> > ++                       (1 << CP0C1_CA),
>> > ++        .CP0_Config2 = MIPS_CONFIG2,
>> > ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
>> > ++                       (1 << CP0C3_DSPP),
>> > ++        .CP0_LLAddr_rw_bitmask = 0,
>> > ++        .CP0_LLAddr_shift = 0,
>> > ++        .SYNCI_Step = 32,
>> > ++        .CCRes = 2,
>> > ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
>> > ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
>> > ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
>> > ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
>> > ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
>> > ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
>> > ++                    (0xff << CP0TCSt_TASID),
>> > ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
>> > ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
>> > ++        .CP1_fcr31 = 0,
>> > ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
>> > ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
>> > ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
>> > ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
>> > ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
>> > ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
>> > ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
>> > ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
>> > ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
>> > ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
>> > ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
>> > ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
>> > ++        .SEGBITS = 32,
>> > ++        .PABITS = 32,
>> > ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
>> > ++        .mmu_type = MMU_TYPE_R4000,
>> > ++    },
>> > +     {
>> > +         .name = "74Kf",
>> > +         .CP0_PRid = 0x00019700,
>> > +--
>> > +2.14.5
>> > +
>> > --
>> > 2.14.5
>> >
>> >
>> >
>> >
>>
>>
>>
>
>
>


end of thread, other threads:[~2020-10-08 16:40 UTC | newest]

Thread overview: 16+ messages
2020-10-07 20:38 [PATCH 0/2] qemumips: speeding up Victor Kamensky
2020-10-07 20:38 ` [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type Victor Kamensky
2020-10-07 20:46   ` [OE-core] " Paul Barker
2020-10-07 21:52     ` Victor Kamensky
2020-10-07 22:11       ` Khem Raj
2020-10-07 22:04     ` Richard Purdie
2020-10-07 22:15     ` Khem Raj
2020-10-07 22:24       ` Paul Barker
2020-10-07 22:05   ` Khem Raj
2020-10-08  5:05     ` Victor Kamensky
2020-10-08  5:55       ` Khem Raj
2020-10-08  7:29   ` [OE-core] " Ross Burton
2020-10-08 11:53     ` Alexander Kanavin
2020-10-08 16:05       ` Khem Raj
2020-10-08 16:39         ` Victor Kamensky
2020-10-07 20:38 ` [PATCH 2/2] qemumips: use 34Kf-64tlb CPU emulation Victor Kamensky
