* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
       [not found] <20190729095155.GP22106@shao2-debian>
@ 2019-07-30 17:50 ` Thomas Zimmermann
  2019-07-30 18:12   ` Daniel Vetter
  2019-08-04 18:39   ` Thomas Zimmermann
  0 siblings, 2 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-07-30 17:50 UTC (permalink / raw)
  To: kernel test robot, Noralf Trønnes, Daniel Vetter
  Cc: Stephen Rothwell, lkp, dri-devel


On 29.07.19 11:51, kernel test robot wrote:
> Greetings,
> 
> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:
> 
> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master

Daniel, Noralf, we may have to revert this patch.

I expected some change in display performance, but not in VM
performance. Since it's a server chipset, probably no one cares much
about display performance, so that seemed like a good trade-off for
reusing shared code.

Part of the patch set is that the generic fbdev emulation now maps and
unmaps the fbdev BO on every screen update. I suspect that's the cause
of the performance regression, and it should be visible with other
drivers as well if they use a shadow FB for fbdev emulation.

The problem is that we'd then need a separate generic fbdev emulation
for ast and mgag200 that handles this case properly.

Best regards
Thomas

> 
> in testcase: vm-scalability
> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
> with following parameters:
> 
> 	runtime: 300s
> 	size: 8T
> 	test: anon-cow-seq-hugetlb
> 	cpufreq_governor: performance
> 
> test-description: The motivation behind this suite is to exercise functions and regions of the Linux kernel's mm/ subsystem which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
> 
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> To reproduce:
> 
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>   gcc-7/performance/x86_64-rhel-7.6/debian-x86_64-2019-05-14.cgz/300s/8T/lkp-knm01/anon-cow-seq-hugetlb/vm-scalability
> 
> commit: 
>   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
>   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> 
> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 
> ---------------- --------------------------- 
>        fail:runs  %reproduction    fail:runs
>            |             |             |    
>           2:4          -50%            :4     dmesg.WARNING:at#for_ip_interrupt_entry/0x
>            :4           25%           1:4     dmesg.WARNING:at_ip___perf_sw_event/0x
>            :4           25%           1:4     dmesg.WARNING:at_ip__fsnotify_parent/0x
>          %stddev     %change         %stddev
>              \          |                \  
>      43955 ±  2%     -18.8%      35691        vm-scalability.median
>       0.06 ±  7%    +193.0%       0.16 ±  2%  vm-scalability.median_stddev
>   14906559 ±  2%     -17.9%   12237079        vm-scalability.throughput
>      87651 ±  2%     -17.4%      72374        vm-scalability.time.involuntary_context_switches
>    2086168           -23.6%    1594224        vm-scalability.time.minor_page_faults
>      15082 ±  2%     -10.4%      13517        vm-scalability.time.percent_of_cpu_this_job_got
>      29987            -8.9%      27327        vm-scalability.time.system_time
>      15755           -12.4%      13795        vm-scalability.time.user_time
>     122011           -19.3%      98418        vm-scalability.time.voluntary_context_switches
>  3.034e+09           -23.6%  2.318e+09        vm-scalability.workload
>     242478 ± 12%     +68.5%     408518 ± 23%  cpuidle.POLL.time
>       2788 ± 21%    +117.4%       6062 ± 26%  cpuidle.POLL.usage
>      56653 ± 10%     +64.4%      93144 ± 20%  meminfo.Mapped
>     120392 ±  7%     +14.0%     137212 ±  4%  meminfo.Shmem
>      47221 ± 11%     +77.1%      83634 ± 22%  numa-meminfo.node0.Mapped
>     120465 ±  7%     +13.9%     137205 ±  4%  numa-meminfo.node0.Shmem
>    2885513           -16.5%    2409384        numa-numastat.node0.local_node
>    2885471           -16.5%    2409354        numa-numastat.node0.numa_hit
>      11813 ± 11%     +76.3%      20824 ± 22%  numa-vmstat.node0.nr_mapped
>      30096 ±  7%     +13.8%      34238 ±  4%  numa-vmstat.node0.nr_shmem
>      43.72 ±  2%      +5.5       49.20        mpstat.cpu.all.idle%
>       0.03 ±  4%      +0.0        0.05 ±  6%  mpstat.cpu.all.soft%
>      19.51            -2.4       17.08        mpstat.cpu.all.usr%
>       1012            -7.9%     932.75        turbostat.Avg_MHz
>      32.38 ± 10%     +25.8%      40.73        turbostat.CPU%c1
>     145.51            -3.1%     141.01        turbostat.PkgWatt
>      15.09           -19.2%      12.19        turbostat.RAMWatt
>      43.50 ±  2%     +13.2%      49.25        vmstat.cpu.id
>      18.75 ±  2%     -13.3%      16.25 ±  2%  vmstat.cpu.us
>     152.00 ±  2%      -9.5%     137.50        vmstat.procs.r
>       4800           -13.1%       4173        vmstat.system.cs
>     156170           -11.9%     137594        slabinfo.anon_vma.active_objs
>       3395           -11.9%       2991        slabinfo.anon_vma.active_slabs
>     156190           -11.9%     137606        slabinfo.anon_vma.num_objs
>       3395           -11.9%       2991        slabinfo.anon_vma.num_slabs
>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.active_objs
>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.num_objs
>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.active_objs
>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.num_objs
>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.active_objs
>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.num_objs
>    1330122           -23.6%    1016557        proc-vmstat.htlb_buddy_alloc_success
>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_active_anon
>      67277            +2.9%      69246        proc-vmstat.nr_anon_pages
>     218.50 ±  3%     -10.6%     195.25        proc-vmstat.nr_dirtied
>     288628            +1.4%     292755        proc-vmstat.nr_file_pages
>     360.50            -2.7%     350.75        proc-vmstat.nr_inactive_file
>      14225 ±  9%     +63.8%      23304 ± 20%  proc-vmstat.nr_mapped
>      30109 ±  7%     +13.8%      34259 ±  4%  proc-vmstat.nr_shmem
>      99870            -1.3%      98597        proc-vmstat.nr_slab_unreclaimable
>     204.00 ±  4%     -12.1%     179.25        proc-vmstat.nr_written
>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_zone_active_anon
>     360.50            -2.7%     350.75        proc-vmstat.nr_zone_inactive_file
>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults
>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults_local
>    2904082           -16.4%    2427026        proc-vmstat.numa_hit
>    2904081           -16.4%    2427025        proc-vmstat.numa_local
>  6.828e+08           -23.5%  5.221e+08        proc-vmstat.pgalloc_normal
>    2900008           -17.2%    2400195        proc-vmstat.pgfault
>  6.827e+08           -23.5%   5.22e+08        proc-vmstat.pgfree
>  1.635e+10           -17.0%  1.357e+10        perf-stat.i.branch-instructions
>       1.53 ±  4%      -0.1        1.45 ±  3%  perf-stat.i.branch-miss-rate%
>  2.581e+08 ±  3%     -20.5%  2.051e+08 ±  2%  perf-stat.i.branch-misses
>      12.66            +1.1       13.78        perf-stat.i.cache-miss-rate%
>   72720849           -12.0%   63958986        perf-stat.i.cache-misses
>  5.766e+08           -18.6%  4.691e+08        perf-stat.i.cache-references
>       4674 ±  2%     -13.0%       4064        perf-stat.i.context-switches
>       4.29           +12.5%       4.83        perf-stat.i.cpi
>  2.573e+11            -7.4%  2.383e+11        perf-stat.i.cpu-cycles
>     231.35           -21.5%     181.56        perf-stat.i.cpu-migrations
>       3522            +4.4%       3677        perf-stat.i.cycles-between-cache-misses
>       0.09 ± 13%      +0.0        0.12 ±  5%  perf-stat.i.iTLB-load-miss-rate%
>  5.894e+10           -15.8%  4.961e+10        perf-stat.i.iTLB-loads
>  5.901e+10           -15.8%  4.967e+10        perf-stat.i.instructions
>       1291 ± 14%     -21.8%       1010        perf-stat.i.instructions-per-iTLB-miss
>       0.24           -11.0%       0.21        perf-stat.i.ipc
>       9476           -17.5%       7821        perf-stat.i.minor-faults
>       9478           -17.5%       7821        perf-stat.i.page-faults
>       9.76            -3.6%       9.41        perf-stat.overall.MPKI
>       1.59 ±  4%      -0.1        1.52        perf-stat.overall.branch-miss-rate%
>      12.61            +1.1       13.71        perf-stat.overall.cache-miss-rate%
>       4.38           +10.5%       4.83        perf-stat.overall.cpi
>       3557            +5.3%       3747        perf-stat.overall.cycles-between-cache-misses
>       0.08 ± 12%      +0.0        0.10        perf-stat.overall.iTLB-load-miss-rate%
>       1268 ± 15%     -23.0%     976.22        perf-stat.overall.instructions-per-iTLB-miss
>       0.23            -9.5%       0.21        perf-stat.overall.ipc
>       5815            +9.7%       6378        perf-stat.overall.path-length
>  1.634e+10           -17.5%  1.348e+10        perf-stat.ps.branch-instructions
>  2.595e+08 ±  3%     -21.2%  2.043e+08 ±  2%  perf-stat.ps.branch-misses
>   72565205           -12.2%   63706339        perf-stat.ps.cache-misses
>  5.754e+08           -19.2%  4.646e+08        perf-stat.ps.cache-references
>       4640 ±  2%     -12.5%       4060        perf-stat.ps.context-switches
>  2.581e+11            -7.5%  2.387e+11        perf-stat.ps.cpu-cycles
>     229.91           -22.0%     179.42        perf-stat.ps.cpu-migrations
>  5.889e+10           -16.3%  4.927e+10        perf-stat.ps.iTLB-loads
>  5.899e+10           -16.3%  4.938e+10        perf-stat.ps.instructions
>       9388           -18.2%       7677        perf-stat.ps.minor-faults
>       9389           -18.2%       7677        perf-stat.ps.page-faults
>  1.764e+13           -16.2%  1.479e+13        perf-stat.total.instructions
>      46803 ±  3%     -18.8%      37982 ±  6%  sched_debug.cfs_rq:/.exec_clock.min
>       5320 ±  3%     +23.7%       6581 ±  3%  sched_debug.cfs_rq:/.exec_clock.stddev
>       6737 ± 14%     +58.1%      10649 ± 10%  sched_debug.cfs_rq:/.load.avg
>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.load.max
>      46952 ± 16%     +64.8%      77388 ± 11%  sched_debug.cfs_rq:/.load.stddev
>       7.12 ±  4%     +49.1%      10.62 ±  6%  sched_debug.cfs_rq:/.load_avg.avg
>     474.40 ± 23%     +67.5%     794.60 ± 10%  sched_debug.cfs_rq:/.load_avg.max
>      37.70 ± 11%     +74.8%      65.90 ±  9%  sched_debug.cfs_rq:/.load_avg.stddev
>   13424269 ±  4%     -15.6%   11328098 ±  2%  sched_debug.cfs_rq:/.min_vruntime.avg
>   15411275 ±  3%     -12.4%   13505072 ±  2%  sched_debug.cfs_rq:/.min_vruntime.max
>    7939295 ±  6%     -17.5%    6551322 ±  7%  sched_debug.cfs_rq:/.min_vruntime.min
>      21.44 ±  7%     -56.1%       9.42 ±  4%  sched_debug.cfs_rq:/.nr_spread_over.avg
>     117.45 ± 11%     -60.6%      46.30 ± 14%  sched_debug.cfs_rq:/.nr_spread_over.max
>      19.33 ±  8%     -66.4%       6.49 ±  9%  sched_debug.cfs_rq:/.nr_spread_over.stddev
>       4.32 ± 15%     +84.4%       7.97 ±  3%  sched_debug.cfs_rq:/.runnable_load_avg.avg
>     353.85 ± 29%    +118.8%     774.35 ± 11%  sched_debug.cfs_rq:/.runnable_load_avg.max
>      27.30 ± 24%    +118.5%      59.64 ±  9%  sched_debug.cfs_rq:/.runnable_load_avg.stddev
>       6729 ± 14%     +58.2%      10644 ± 10%  sched_debug.cfs_rq:/.runnable_weight.avg
>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.runnable_weight.max
>      46950 ± 16%     +64.8%      77387 ± 11%  sched_debug.cfs_rq:/.runnable_weight.stddev
>    5305069 ±  4%     -17.4%    4380376 ±  7%  sched_debug.cfs_rq:/.spread0.avg
>    7328745 ±  3%      -9.9%    6600897 ±  3%  sched_debug.cfs_rq:/.spread0.max
>    2220837 ±  4%     +55.8%    3460596 ±  5%  sched_debug.cpu.avg_idle.avg
>    4590666 ±  9%     +76.8%    8117037 ± 15%  sched_debug.cpu.avg_idle.max
>     485052 ±  7%     +80.3%     874679 ± 10%  sched_debug.cpu.avg_idle.stddev
>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock.stddev
>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock_task.stddev
>       3.20 ± 10%    +109.6%       6.70 ±  3%  sched_debug.cpu.cpu_load[0].avg
>     309.10 ± 20%    +150.3%     773.75 ± 12%  sched_debug.cpu.cpu_load[0].max
>      21.02 ± 14%    +160.8%      54.80 ±  9%  sched_debug.cpu.cpu_load[0].stddev
>       3.19 ±  8%    +109.8%       6.70 ±  3%  sched_debug.cpu.cpu_load[1].avg
>     299.75 ± 19%    +158.0%     773.30 ± 12%  sched_debug.cpu.cpu_load[1].max
>      20.32 ± 12%    +168.7%      54.62 ±  9%  sched_debug.cpu.cpu_load[1].stddev
>       3.20 ±  8%    +109.1%       6.69 ±  4%  sched_debug.cpu.cpu_load[2].avg
>     288.90 ± 20%    +167.0%     771.40 ± 12%  sched_debug.cpu.cpu_load[2].max
>      19.70 ± 12%    +175.4%      54.27 ±  9%  sched_debug.cpu.cpu_load[2].stddev
>       3.16 ±  8%    +110.9%       6.66 ±  6%  sched_debug.cpu.cpu_load[3].avg
>     275.50 ± 24%    +178.4%     766.95 ± 12%  sched_debug.cpu.cpu_load[3].max
>      18.92 ± 15%    +184.2%      53.77 ± 10%  sched_debug.cpu.cpu_load[3].stddev
>       3.08 ±  8%    +115.7%       6.65 ±  7%  sched_debug.cpu.cpu_load[4].avg
>     263.55 ± 28%    +188.7%     760.85 ± 12%  sched_debug.cpu.cpu_load[4].max
>      18.03 ± 18%    +196.6%      53.46 ± 11%  sched_debug.cpu.cpu_load[4].stddev
>      14543            -9.6%      13150        sched_debug.cpu.curr->pid.max
>       5293 ± 16%     +74.7%       9248 ± 11%  sched_debug.cpu.load.avg
>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cpu.load.max
>      40887 ± 19%     +78.3%      72891 ±  9%  sched_debug.cpu.load.stddev
>    1141679 ±  4%     +56.9%    1790907 ±  5%  sched_debug.cpu.max_idle_balance_cost.avg
>    2432100 ±  9%     +72.6%    4196779 ± 13%  sched_debug.cpu.max_idle_balance_cost.max
>     745656           +29.3%     964170 ±  5%  sched_debug.cpu.max_idle_balance_cost.min
>     239032 ±  9%     +81.9%     434806 ± 10%  sched_debug.cpu.max_idle_balance_cost.stddev
>       0.00 ± 27%     +92.1%       0.00 ± 31%  sched_debug.cpu.next_balance.stddev
>       1030 ±  4%     -10.4%     924.00 ±  2%  sched_debug.cpu.nr_switches.min
>       0.04 ± 26%    +139.0%       0.09 ± 41%  sched_debug.cpu.nr_uninterruptible.avg
>     830.35 ±  6%     -12.0%     730.50 ±  2%  sched_debug.cpu.sched_count.min
>     912.00 ±  2%      -9.5%     825.38        sched_debug.cpu.ttwu_count.avg
>     433.05 ±  3%     -19.2%     350.05 ±  3%  sched_debug.cpu.ttwu_count.min
>     160.70 ±  3%     -12.5%     140.60 ±  4%  sched_debug.cpu.ttwu_local.min
>       9072 ± 11%     -36.4%       5767 ±  8%  softirqs.CPU1.RCU
>      12769 ±  5%     +15.3%      14718 ±  3%  softirqs.CPU101.SCHED
>      13198           +11.5%      14717 ±  3%  softirqs.CPU102.SCHED
>      12981 ±  4%     +13.9%      14788 ±  3%  softirqs.CPU105.SCHED
>      13486 ±  3%     +11.8%      15071 ±  4%  softirqs.CPU111.SCHED
>      12794 ±  4%     +14.1%      14601 ±  9%  softirqs.CPU112.SCHED
>      12999 ±  4%     +10.1%      14314 ±  4%  softirqs.CPU115.SCHED
>      12844 ±  4%     +10.6%      14202 ±  2%  softirqs.CPU120.SCHED
>      13336 ±  3%      +9.4%      14585 ±  3%  softirqs.CPU122.SCHED
>      12639 ±  4%     +20.2%      15195        softirqs.CPU123.SCHED
>      13040 ±  5%     +15.2%      15024 ±  5%  softirqs.CPU126.SCHED
>      13123           +15.1%      15106 ±  5%  softirqs.CPU127.SCHED
>       9188 ±  6%     -35.7%       5911 ±  2%  softirqs.CPU13.RCU
>      13054 ±  3%     +13.1%      14761 ±  5%  softirqs.CPU130.SCHED
>      13158 ±  2%     +13.9%      14985 ±  5%  softirqs.CPU131.SCHED
>      12797 ±  6%     +13.5%      14524 ±  3%  softirqs.CPU133.SCHED
>      12452 ±  5%     +14.8%      14297        softirqs.CPU134.SCHED
>      13078 ±  3%     +10.4%      14439 ±  3%  softirqs.CPU138.SCHED
>      12617 ±  2%     +14.5%      14442 ±  5%  softirqs.CPU139.SCHED
>      12974 ±  3%     +13.7%      14752 ±  4%  softirqs.CPU142.SCHED
>      12579 ±  4%     +19.1%      14983 ±  3%  softirqs.CPU143.SCHED
>       9122 ± 24%     -44.6%       5053 ±  5%  softirqs.CPU144.RCU
>      13366 ±  2%     +11.1%      14848 ±  3%  softirqs.CPU149.SCHED
>      13246 ±  2%     +22.0%      16162 ±  7%  softirqs.CPU150.SCHED
>      13452 ±  3%     +20.5%      16210 ±  7%  softirqs.CPU151.SCHED
>      13507           +10.1%      14869        softirqs.CPU156.SCHED
>      13808 ±  3%      +9.2%      15079 ±  4%  softirqs.CPU157.SCHED
>      13442 ±  2%     +13.4%      15248 ±  4%  softirqs.CPU160.SCHED
>      13311           +12.1%      14920 ±  2%  softirqs.CPU162.SCHED
>      13544 ±  3%      +8.5%      14695 ±  4%  softirqs.CPU163.SCHED
>      13648 ±  3%     +11.2%      15179 ±  2%  softirqs.CPU166.SCHED
>      13404 ±  4%     +12.5%      15079 ±  3%  softirqs.CPU168.SCHED
>      13421 ±  6%     +16.0%      15568 ±  8%  softirqs.CPU169.SCHED
>      13115 ±  3%     +23.1%      16139 ± 10%  softirqs.CPU171.SCHED
>      13424 ±  6%     +10.4%      14822 ±  3%  softirqs.CPU175.SCHED
>      13274 ±  3%     +13.7%      15087 ±  9%  softirqs.CPU185.SCHED
>      13409 ±  3%     +12.3%      15063 ±  3%  softirqs.CPU190.SCHED
>      13181 ±  7%     +13.4%      14946 ±  3%  softirqs.CPU196.SCHED
>      13578 ±  3%     +10.9%      15061        softirqs.CPU197.SCHED
>      13323 ±  5%     +24.8%      16627 ±  6%  softirqs.CPU198.SCHED
>      14072 ±  2%     +12.3%      15798 ±  7%  softirqs.CPU199.SCHED
>      12604 ± 13%     +17.9%      14865        softirqs.CPU201.SCHED
>      13380 ±  4%     +14.8%      15356 ±  3%  softirqs.CPU203.SCHED
>      13481 ±  8%     +14.2%      15390 ±  3%  softirqs.CPU204.SCHED
>      12921 ±  2%     +13.8%      14710 ±  3%  softirqs.CPU206.SCHED
>      13468           +13.0%      15218 ±  2%  softirqs.CPU208.SCHED
>      13253 ±  2%     +13.1%      14992        softirqs.CPU209.SCHED
>      13319 ±  2%     +14.3%      15225 ±  7%  softirqs.CPU210.SCHED
>      13673 ±  5%     +16.3%      15895 ±  3%  softirqs.CPU211.SCHED
>      13290           +17.0%      15556 ±  5%  softirqs.CPU212.SCHED
>      13455 ±  4%     +14.4%      15392 ±  3%  softirqs.CPU213.SCHED
>      13454 ±  4%     +14.3%      15377 ±  3%  softirqs.CPU215.SCHED
>      13872 ±  7%      +9.7%      15221 ±  5%  softirqs.CPU220.SCHED
>      13555 ±  4%     +17.3%      15896 ±  5%  softirqs.CPU222.SCHED
>      13411 ±  4%     +20.8%      16197 ±  6%  softirqs.CPU223.SCHED
>       8472 ± 21%     -44.8%       4680 ±  3%  softirqs.CPU224.RCU
>      13141 ±  3%     +16.2%      15265 ±  7%  softirqs.CPU225.SCHED
>      14084 ±  3%      +8.2%      15242 ±  2%  softirqs.CPU226.SCHED
>      13528 ±  4%     +11.3%      15063 ±  4%  softirqs.CPU228.SCHED
>      13218 ±  3%     +16.3%      15377 ±  4%  softirqs.CPU229.SCHED
>      14031 ±  4%     +10.2%      15467 ±  2%  softirqs.CPU231.SCHED
>      13770 ±  3%     +14.0%      15700 ±  3%  softirqs.CPU232.SCHED
>      13456 ±  3%     +12.3%      15105 ±  3%  softirqs.CPU233.SCHED
>      13137 ±  4%     +13.5%      14909 ±  3%  softirqs.CPU234.SCHED
>      13318 ±  2%     +14.7%      15280 ±  2%  softirqs.CPU235.SCHED
>      13690 ±  2%     +13.7%      15563 ±  7%  softirqs.CPU238.SCHED
>      13771 ±  5%     +20.8%      16634 ±  7%  softirqs.CPU241.SCHED
>      13317 ±  7%     +19.5%      15919 ±  9%  softirqs.CPU243.SCHED
>       8234 ± 16%     -43.9%       4616 ±  5%  softirqs.CPU244.RCU
>      13845 ±  6%     +13.0%      15643 ±  3%  softirqs.CPU244.SCHED
>      13179 ±  3%     +16.3%      15323        softirqs.CPU246.SCHED
>      13754           +12.2%      15438 ±  3%  softirqs.CPU248.SCHED
>      13769 ±  4%     +10.9%      15276 ±  2%  softirqs.CPU252.SCHED
>      13702           +10.5%      15147 ±  2%  softirqs.CPU254.SCHED
>      13315 ±  2%     +12.5%      14980 ±  3%  softirqs.CPU255.SCHED
>      13785 ±  3%     +12.9%      15568 ±  5%  softirqs.CPU256.SCHED
>      13307 ±  3%     +15.0%      15298 ±  3%  softirqs.CPU257.SCHED
>      13864 ±  3%     +10.5%      15313 ±  2%  softirqs.CPU259.SCHED
>      13879 ±  2%     +11.4%      15465        softirqs.CPU261.SCHED
>      13815           +13.6%      15687 ±  5%  softirqs.CPU264.SCHED
>     119574 ±  2%     +11.8%     133693 ± 11%  softirqs.CPU266.TIMER
>      13688           +10.9%      15180 ±  6%  softirqs.CPU267.SCHED
>      11716 ±  4%     +19.3%      13974 ±  8%  softirqs.CPU27.SCHED
>      13866 ±  3%     +13.7%      15765 ±  4%  softirqs.CPU271.SCHED
>      13887 ±  5%     +12.5%      15621        softirqs.CPU272.SCHED
>      13383 ±  3%     +19.8%      16031 ±  2%  softirqs.CPU274.SCHED
>      13347           +14.1%      15232 ±  3%  softirqs.CPU275.SCHED
>      12884 ±  2%     +21.0%      15593 ±  4%  softirqs.CPU276.SCHED
>      13131 ±  5%     +13.4%      14891 ±  5%  softirqs.CPU277.SCHED
>      12891 ±  2%     +19.2%      15371 ±  4%  softirqs.CPU278.SCHED
>      13313 ±  4%     +13.0%      15049 ±  2%  softirqs.CPU279.SCHED
>      13514 ±  3%     +10.2%      14897 ±  2%  softirqs.CPU280.SCHED
>      13501 ±  3%     +13.7%      15346        softirqs.CPU281.SCHED
>      13261           +17.5%      15577        softirqs.CPU282.SCHED
>       8076 ± 15%     -43.7%       4546 ±  5%  softirqs.CPU283.RCU
>      13686 ±  3%     +12.6%      15413 ±  2%  softirqs.CPU284.SCHED
>      13439 ±  2%      +9.2%      14670 ±  4%  softirqs.CPU285.SCHED
>       8878 ±  9%     -35.4%       5735 ±  4%  softirqs.CPU35.RCU
>      11690 ±  2%     +13.6%      13274 ±  5%  softirqs.CPU40.SCHED
>      11714 ±  2%     +19.3%      13975 ± 13%  softirqs.CPU41.SCHED
>      11763           +12.5%      13239 ±  4%  softirqs.CPU45.SCHED
>      11662 ±  2%      +9.4%      12757 ±  3%  softirqs.CPU46.SCHED
>      11805 ±  2%      +9.3%      12902 ±  2%  softirqs.CPU50.SCHED
>      12158 ±  3%     +12.3%      13655 ±  8%  softirqs.CPU55.SCHED
>      11716 ±  4%      +8.8%      12751 ±  3%  softirqs.CPU58.SCHED
>      11922 ±  2%      +9.9%      13100 ±  4%  softirqs.CPU64.SCHED
>       9674 ± 17%     -41.8%       5625 ±  6%  softirqs.CPU66.RCU
>      11818           +12.0%      13237        softirqs.CPU66.SCHED
>     124682 ±  7%      -6.1%     117088 ±  5%  softirqs.CPU66.TIMER
>       8637 ±  9%     -34.0%       5700 ±  7%  softirqs.CPU70.RCU
>      11624 ±  2%     +11.0%      12901 ±  2%  softirqs.CPU70.SCHED
>      12372 ±  2%     +13.2%      14003 ±  3%  softirqs.CPU71.SCHED
>       9949 ± 25%     -33.9%       6574 ± 31%  softirqs.CPU72.RCU
>      10392 ± 26%     -35.1%       6745 ± 35%  softirqs.CPU73.RCU
>      12766 ±  3%     +11.1%      14188 ±  3%  softirqs.CPU76.SCHED
>      12611 ±  2%     +18.8%      14984 ±  5%  softirqs.CPU78.SCHED
>      12786 ±  3%     +17.9%      15079 ±  7%  softirqs.CPU79.SCHED
>      11947 ±  4%      +9.7%      13103 ±  4%  softirqs.CPU8.SCHED
>      13379 ±  7%     +11.8%      14962 ±  4%  softirqs.CPU83.SCHED
>      13438 ±  5%      +9.7%      14738 ±  2%  softirqs.CPU84.SCHED
>      12768           +19.4%      15241 ±  6%  softirqs.CPU88.SCHED
>       8604 ± 13%     -39.3%       5222 ±  3%  softirqs.CPU89.RCU
>      13077 ±  2%     +17.1%      15308 ±  7%  softirqs.CPU89.SCHED
>      11887 ±  3%     +20.1%      14272 ±  5%  softirqs.CPU9.SCHED
>      12723 ±  3%     +11.3%      14165 ±  4%  softirqs.CPU90.SCHED
>       8439 ± 12%     -38.9%       5153 ±  4%  softirqs.CPU91.RCU
>      13429 ±  3%     +10.3%      14806 ±  2%  softirqs.CPU95.SCHED
>      12852 ±  4%     +10.3%      14174 ±  5%  softirqs.CPU96.SCHED
>      13010 ±  2%     +14.4%      14888 ±  5%  softirqs.CPU97.SCHED
>    2315644 ±  4%     -36.2%    1477200 ±  4%  softirqs.RCU
>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.NMI:Non-maskable_interrupts
>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.PMI:Performance_monitoring_interrupts
>     252.00 ± 11%     -35.2%     163.25 ± 13%  interrupts.CPU104.RES:Rescheduling_interrupts
>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.NMI:Non-maskable_interrupts
>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.PMI:Performance_monitoring_interrupts
>     245.75 ± 19%     -31.0%     169.50 ±  7%  interrupts.CPU105.RES:Rescheduling_interrupts
>     228.75 ± 13%     -24.7%     172.25 ± 19%  interrupts.CPU106.RES:Rescheduling_interrupts
>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.NMI:Non-maskable_interrupts
>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.PMI:Performance_monitoring_interrupts
>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.NMI:Non-maskable_interrupts
>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.PMI:Performance_monitoring_interrupts
>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.NMI:Non-maskable_interrupts
>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.PMI:Performance_monitoring_interrupts
>     311.50 ± 23%     -47.7%     163.00 ±  9%  interrupts.CPU122.RES:Rescheduling_interrupts
>     266.75 ± 19%     -31.6%     182.50 ± 15%  interrupts.CPU124.RES:Rescheduling_interrupts
>     293.75 ± 33%     -32.3%     198.75 ± 19%  interrupts.CPU125.RES:Rescheduling_interrupts
>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.NMI:Non-maskable_interrupts
>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.PMI:Performance_monitoring_interrupts
>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.NMI:Non-maskable_interrupts
>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.PMI:Performance_monitoring_interrupts
>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.NMI:Non-maskable_interrupts
>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.PMI:Performance_monitoring_interrupts
>     219.50 ± 27%     -23.0%     169.00 ± 21%  interrupts.CPU139.RES:Rescheduling_interrupts
>     290.25 ± 25%     -32.5%     196.00 ± 11%  interrupts.CPU14.RES:Rescheduling_interrupts
>     243.50 ±  4%     -16.0%     204.50 ± 12%  interrupts.CPU140.RES:Rescheduling_interrupts
>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.NMI:Non-maskable_interrupts
>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.PMI:Performance_monitoring_interrupts
>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.NMI:Non-maskable_interrupts
>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.PMI:Performance_monitoring_interrupts
>     292.25 ± 34%     -33.9%     193.25 ±  6%  interrupts.CPU15.RES:Rescheduling_interrupts
>     424.25 ± 37%     -58.5%     176.25 ± 14%  interrupts.CPU158.RES:Rescheduling_interrupts
>     312.50 ± 42%     -54.2%     143.00 ± 18%  interrupts.CPU159.RES:Rescheduling_interrupts
>     725.00 ±118%     -75.7%     176.25 ± 14%  interrupts.CPU163.RES:Rescheduling_interrupts
>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.NMI:Non-maskable_interrupts
>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.PMI:Performance_monitoring_interrupts
>     239.50 ± 30%     -46.6%     128.00 ± 14%  interrupts.CPU179.RES:Rescheduling_interrupts
>     320.75 ± 15%     -24.0%     243.75 ± 20%  interrupts.CPU20.RES:Rescheduling_interrupts
>     302.50 ± 17%     -47.2%     159.75 ±  8%  interrupts.CPU200.RES:Rescheduling_interrupts
>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.NMI:Non-maskable_interrupts
>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.PMI:Performance_monitoring_interrupts
>     217.00 ± 11%     -34.6%     142.00 ± 12%  interrupts.CPU214.RES:Rescheduling_interrupts
>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.NMI:Non-maskable_interrupts
>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.PMI:Performance_monitoring_interrupts
>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.NMI:Non-maskable_interrupts
>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.PMI:Performance_monitoring_interrupts
>     289.50 ± 28%     -41.1%     170.50 ±  8%  interrupts.CPU22.RES:Rescheduling_interrupts
>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.NMI:Non-maskable_interrupts
>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.PMI:Performance_monitoring_interrupts
>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.NMI:Non-maskable_interrupts
>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.PMI:Performance_monitoring_interrupts
>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.NMI:Non-maskable_interrupts
>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.PMI:Performance_monitoring_interrupts
>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.NMI:Non-maskable_interrupts
>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.PMI:Performance_monitoring_interrupts
>     248.00 ± 36%     -36.3%     158.00 ± 19%  interrupts.CPU228.RES:Rescheduling_interrupts
>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.NMI:Non-maskable_interrupts
>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.PMI:Performance_monitoring_interrupts
>     404.25 ± 69%     -65.5%     139.50 ± 17%  interrupts.CPU236.RES:Rescheduling_interrupts
>     566.50 ± 40%     -73.6%     149.50 ± 31%  interrupts.CPU237.RES:Rescheduling_interrupts
>     243.50 ± 26%     -37.1%     153.25 ± 21%  interrupts.CPU248.RES:Rescheduling_interrupts
>     258.25 ± 12%     -53.5%     120.00 ± 18%  interrupts.CPU249.RES:Rescheduling_interrupts
>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.NMI:Non-maskable_interrupts
>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.PMI:Performance_monitoring_interrupts
>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.NMI:Non-maskable_interrupts
>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.PMI:Performance_monitoring_interrupts
>     425.00 ± 59%     -60.3%     168.75 ± 34%  interrupts.CPU258.RES:Rescheduling_interrupts
>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.NMI:Non-maskable_interrupts
>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.PMI:Performance_monitoring_interrupts
>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.NMI:Non-maskable_interrupts
>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.PMI:Performance_monitoring_interrupts
>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.NMI:Non-maskable_interrupts
>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.PMI:Performance_monitoring_interrupts
>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.NMI:Non-maskable_interrupts
>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.PMI:Performance_monitoring_interrupts
>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.NMI:Non-maskable_interrupts
>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.PMI:Performance_monitoring_interrupts
>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.NMI:Non-maskable_interrupts
>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.PMI:Performance_monitoring_interrupts
>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.NMI:Non-maskable_interrupts
>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.PMI:Performance_monitoring_interrupts
>     331.75 ± 32%     -39.8%     199.75 ± 17%  interrupts.CPU29.RES:Rescheduling_interrupts
>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.NMI:Non-maskable_interrupts
>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.PMI:Performance_monitoring_interrupts
>     298.50 ± 30%     -39.7%     180.00 ±  6%  interrupts.CPU34.RES:Rescheduling_interrupts
>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.NMI:Non-maskable_interrupts
>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.PMI:Performance_monitoring_interrupts
>     270.50 ± 24%     -31.1%     186.25 ±  3%  interrupts.CPU36.RES:Rescheduling_interrupts
>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.NMI:Non-maskable_interrupts
>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.PMI:Performance_monitoring_interrupts
>     286.75 ± 36%     -32.4%     193.75 ±  7%  interrupts.CPU45.RES:Rescheduling_interrupts
>     259.00 ± 12%     -23.6%     197.75 ± 13%  interrupts.CPU46.RES:Rescheduling_interrupts
>     244.00 ± 21%     -35.6%     157.25 ± 11%  interrupts.CPU47.RES:Rescheduling_interrupts
>     230.00 ±  7%     -21.3%     181.00 ± 11%  interrupts.CPU48.RES:Rescheduling_interrupts
>     281.00 ± 13%     -27.4%     204.00 ± 15%  interrupts.CPU53.RES:Rescheduling_interrupts
>     256.75 ±  5%     -18.4%     209.50 ± 12%  interrupts.CPU54.RES:Rescheduling_interrupts
>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.NMI:Non-maskable_interrupts
>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.PMI:Performance_monitoring_interrupts
>     316.00 ± 25%     -41.4%     185.25 ± 13%  interrupts.CPU59.RES:Rescheduling_interrupts
>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.NMI:Non-maskable_interrupts
>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.PMI:Performance_monitoring_interrupts
>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.NMI:Non-maskable_interrupts
>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.PMI:Performance_monitoring_interrupts
>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.NMI:Non-maskable_interrupts
>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.PMI:Performance_monitoring_interrupts
>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.NMI:Non-maskable_interrupts
>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.PMI:Performance_monitoring_interrupts
>     319.00 ± 40%     -44.7%     176.25 ±  9%  interrupts.CPU67.RES:Rescheduling_interrupts
>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.NMI:Non-maskable_interrupts
>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.PMI:Performance_monitoring_interrupts
>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.NMI:Non-maskable_interrupts
>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.PMI:Performance_monitoring_interrupts
>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.NMI:Non-maskable_interrupts
>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.PMI:Performance_monitoring_interrupts
>     426.75 ± 61%     -67.7%     138.00 ±  8%  interrupts.CPU75.RES:Rescheduling_interrupts
>     192.50 ± 13%     +45.6%     280.25 ± 45%  interrupts.CPU76.RES:Rescheduling_interrupts
>     274.25 ± 34%     -42.2%     158.50 ± 34%  interrupts.CPU77.RES:Rescheduling_interrupts
>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.NMI:Non-maskable_interrupts
>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.PMI:Performance_monitoring_interrupts
>     348.50 ± 53%     -47.3%     183.75 ± 29%  interrupts.CPU80.RES:Rescheduling_interrupts
>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.NMI:Non-maskable_interrupts
>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.PMI:Performance_monitoring_interrupts
>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.NMI:Non-maskable_interrupts
>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.PMI:Performance_monitoring_interrupts
>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.NMI:Non-maskable_interrupts
>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.PMI:Performance_monitoring_interrupts
>     408.75 ± 58%     -56.8%     176.75 ± 25%  interrupts.CPU92.RES:Rescheduling_interrupts
>     399.00 ± 64%     -63.6%     145.25 ± 16%  interrupts.CPU93.RES:Rescheduling_interrupts
>     314.75 ± 36%     -44.2%     175.75 ± 13%  interrupts.CPU94.RES:Rescheduling_interrupts
>     191.00 ± 15%     -29.1%     135.50 ±  9%  interrupts.CPU97.RES:Rescheduling_interrupts
>      94.00 ±  8%     +50.0%     141.00 ± 12%  interrupts.IWI:IRQ_work_interrupts
>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.NMI:Non-maskable_interrupts
>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.PMI:Performance_monitoring_interrupts
>      12.75 ± 11%      -4.1        8.67 ± 31%  perf-profile.calltrace.cycles-pp.do_rw_once
>       1.02 ± 16%      -0.6        0.47 ± 59%  perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle
>       1.10 ± 15%      -0.4        0.66 ± 14%  perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
>       1.05 ± 16%      -0.4        0.61 ± 14%  perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter
>       1.58 ±  4%      +0.3        1.91 ±  7%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page
>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       2.11 ±  4%      +0.5        2.60 ±  7%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault
>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       1.90 ±  5%      +0.6        2.45 ±  7%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage
>       0.65 ± 62%      +0.6        1.20 ± 15%  perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
>       0.60 ± 62%      +0.6        1.16 ± 18%  perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap
>       0.95 ± 17%      +0.6        1.52 ±  8%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner
>       0.61 ± 62%      +0.6        1.18 ± 18%  perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput
>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit
>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit
>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group
>       1.30 ±  9%      +0.6        1.92 ±  8%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock
>       0.19 ±173%      +0.7        0.89 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu
>       0.19 ±173%      +0.7        0.90 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu
>       0.00            +0.8        0.77 ± 30%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page
>       0.00            +0.8        0.78 ± 30%  perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page
>       0.00            +0.8        0.79 ± 29%  perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
>       0.82 ± 67%      +0.9        1.72 ± 22%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault
>       0.84 ± 66%      +0.9        1.74 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
>       2.52 ±  6%      +0.9        3.44 ±  9%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page
>       0.83 ± 67%      +0.9        1.75 ± 21%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
>       0.84 ± 66%      +0.9        1.77 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
>       1.64 ± 12%      +1.0        2.67 ±  7%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault
>       1.65 ± 45%      +1.3        2.99 ± 18%  perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
>       1.74 ± 13%      +1.4        3.16 ±  6%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault
>       2.56 ± 48%      +2.2        4.81 ± 19%  perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault
>      12.64 ± 14%      +3.6       16.20 ±  8%  perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault
>       2.97 ±  7%      +3.8        6.74 ±  9%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow
>      19.99 ±  9%      +4.1       24.05 ±  6%  perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault
>       1.37 ± 15%      -0.5        0.83 ± 13%  perf-profile.children.cycles-pp.sched_clock_cpu
>       1.31 ± 16%      -0.5        0.78 ± 13%  perf-profile.children.cycles-pp.sched_clock
>       1.29 ± 16%      -0.5        0.77 ± 13%  perf-profile.children.cycles-pp.native_sched_clock
>       1.80 ±  2%      -0.3        1.47 ± 10%  perf-profile.children.cycles-pp.task_tick_fair
>       0.73 ±  2%      -0.2        0.54 ± 11%  perf-profile.children.cycles-pp.update_curr
>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.children.cycles-pp.account_process_tick
>       0.73 ± 10%      -0.2        0.58 ±  9%  perf-profile.children.cycles-pp.rcu_sched_clock_irq
>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.children.cycles-pp.__acct_update_integrals
>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs
>       0.40 ± 12%      -0.1        0.30 ± 14%  perf-profile.children.cycles-pp.__next_timer_interrupt
>       0.47 ±  7%      -0.1        0.39 ± 13%  perf-profile.children.cycles-pp.update_rq_clock
>       0.29 ± 12%      -0.1        0.21 ± 15%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
>       0.21 ±  7%      -0.1        0.14 ± 12%  perf-profile.children.cycles-pp.account_system_index_time
>       0.38 ±  2%      -0.1        0.31 ± 12%  perf-profile.children.cycles-pp.timerqueue_add
>       0.26 ± 11%      -0.1        0.20 ± 13%  perf-profile.children.cycles-pp.find_next_bit
>       0.23 ± 15%      -0.1        0.17 ± 15%  perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit
>       0.14 ±  8%      -0.1        0.07 ± 14%  perf-profile.children.cycles-pp.account_user_time
>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.children.cycles-pp.cpuacct_charge
>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.irq_work_tick
>       0.11 ± 13%      -0.0        0.07 ± 25%  perf-profile.children.cycles-pp.tick_sched_do_timer
>       0.12 ± 10%      -0.0        0.08 ± 15%  perf-profile.children.cycles-pp.get_cpu_device
>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.children.cycles-pp.raise_softirq
>       0.12 ±  3%      -0.0        0.09 ±  8%  perf-profile.children.cycles-pp.write
>       0.11 ± 13%      +0.0        0.14 ±  8%  perf-profile.children.cycles-pp.native_write_msr
>       0.09 ±  9%      +0.0        0.11 ±  7%  perf-profile.children.cycles-pp.finish_task_switch
>       0.10 ± 10%      +0.0        0.13 ±  5%  perf-profile.children.cycles-pp.schedule_idle
>       0.07 ±  6%      +0.0        0.10 ± 12%  perf-profile.children.cycles-pp.__read_nocancel
>       0.04 ± 58%      +0.0        0.07 ± 15%  perf-profile.children.cycles-pp.__free_pages_ok
>       0.06 ±  7%      +0.0        0.09 ± 13%  perf-profile.children.cycles-pp.perf_read
>       0.07            +0.0        0.11 ± 14%  perf-profile.children.cycles-pp.perf_evsel__read_counter
>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.cmd_stat
>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.__run_perf_stat
>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.process_interval
>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.read_counters
>       0.07 ± 22%      +0.0        0.11 ± 19%  perf-profile.children.cycles-pp.__handle_mm_fault
>       0.07 ± 19%      +0.1        0.13 ±  8%  perf-profile.children.cycles-pp.rb_erase
>       0.03 ±100%      +0.1        0.09 ±  9%  perf-profile.children.cycles-pp.smp_call_function_single
>       0.01 ±173%      +0.1        0.08 ± 11%  perf-profile.children.cycles-pp.perf_event_read
>       0.00            +0.1        0.07 ± 13%  perf-profile.children.cycles-pp.__perf_event_read_value
>       0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.__intel_pmu_enable_all
>       0.08 ± 17%      +0.1        0.15 ±  8%  perf-profile.children.cycles-pp.native_apic_msr_eoi_write
>       0.04 ±103%      +0.1        0.13 ± 58%  perf-profile.children.cycles-pp.shmem_getpage_gfp
>       0.38 ± 14%      +0.1        0.51 ±  6%  perf-profile.children.cycles-pp.run_timer_softirq
>       0.11 ±  4%      +0.3        0.37 ± 32%  perf-profile.children.cycles-pp.worker_thread
>       0.20 ±  5%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.ret_from_fork
>       0.20 ±  4%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.kthread
>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.memcpy_erms
>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.drm_fb_helper_dirty_work
>       0.00            +0.3        0.31 ± 37%  perf-profile.children.cycles-pp.process_one_work
>       0.47 ± 48%      +0.4        0.91 ± 19%  perf-profile.children.cycles-pp.prep_new_huge_page
>       0.70 ± 29%      +0.5        1.16 ± 18%  perf-profile.children.cycles-pp.free_huge_page
>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_flush_mmu
>       0.72 ± 29%      +0.5        1.18 ± 18%  perf-profile.children.cycles-pp.release_pages
>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_finish_mmu
>       0.76 ± 27%      +0.5        1.23 ± 18%  perf-profile.children.cycles-pp.exit_mmap
>       0.77 ± 27%      +0.5        1.24 ± 18%  perf-profile.children.cycles-pp.mmput
>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.__x64_sys_exit_group
>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_group_exit
>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_exit
>       1.28 ± 29%      +0.5        1.76 ±  9%  perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
>       0.77 ± 28%      +0.5        1.26 ± 13%  perf-profile.children.cycles-pp.alloc_fresh_huge_page
>       1.53 ± 15%      +0.7        2.26 ± 14%  perf-profile.children.cycles-pp.do_syscall_64
>       1.53 ± 15%      +0.7        2.27 ± 14%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
>       1.13 ±  3%      +0.9        2.07 ± 14%  perf-profile.children.cycles-pp.interrupt_entry
>       0.79 ±  9%      +1.0        1.76 ±  5%  perf-profile.children.cycles-pp.perf_event_task_tick
>       1.71 ± 39%      +1.4        3.08 ± 16%  perf-profile.children.cycles-pp.alloc_surplus_huge_page
>       2.66 ± 42%      +2.3        4.94 ± 17%  perf-profile.children.cycles-pp.alloc_huge_page
>       2.89 ± 45%      +2.7        5.54 ± 18%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>       3.34 ± 35%      +2.7        6.02 ± 17%  perf-profile.children.cycles-pp._raw_spin_lock
>      12.77 ± 14%      +3.9       16.63 ±  7%  perf-profile.children.cycles-pp.mutex_spin_on_owner
>      20.12 ±  9%      +4.0       24.16 ±  6%  perf-profile.children.cycles-pp.hugetlb_cow
>      15.40 ± 10%      -3.6       11.84 ± 28%  perf-profile.self.cycles-pp.do_rw_once
>       4.02 ±  9%      -1.3        2.73 ± 30%  perf-profile.self.cycles-pp.do_access
>       2.00 ± 14%      -0.6        1.41 ± 13%  perf-profile.self.cycles-pp.cpuidle_enter_state
>       1.26 ± 16%      -0.5        0.74 ± 13%  perf-profile.self.cycles-pp.native_sched_clock
>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.self.cycles-pp.account_process_tick
>       0.27 ± 19%      -0.2        0.12 ± 17%  perf-profile.self.cycles-pp.timerqueue_del
>       0.53 ±  3%      -0.1        0.38 ± 11%  perf-profile.self.cycles-pp.update_curr
>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.self.cycles-pp.__acct_update_integrals
>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs
>       0.61 ±  4%      -0.1        0.51 ±  8%  perf-profile.self.cycles-pp.task_tick_fair
>       0.20 ±  8%      -0.1        0.12 ± 14%  perf-profile.self.cycles-pp.account_system_index_time
>       0.23 ± 15%      -0.1        0.16 ± 17%  perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit
>       0.25 ± 11%      -0.1        0.18 ± 14%  perf-profile.self.cycles-pp.find_next_bit
>       0.10 ± 11%      -0.1        0.03 ±100%  perf-profile.self.cycles-pp.tick_sched_do_timer
>       0.29            -0.1        0.23 ± 11%  perf-profile.self.cycles-pp.timerqueue_add
>       0.12 ± 10%      -0.1        0.06 ± 17%  perf-profile.self.cycles-pp.account_user_time
>       0.22 ± 15%      -0.1        0.16 ±  6%  perf-profile.self.cycles-pp.scheduler_tick
>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.self.cycles-pp.cpuacct_charge
>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.self.cycles-pp.irq_work_tick
>       0.07 ± 13%      -0.0        0.03 ±100%  perf-profile.self.cycles-pp.update_process_times
>       0.12 ±  7%      -0.0        0.08 ± 15%  perf-profile.self.cycles-pp.get_cpu_device
>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.self.cycles-pp.raise_softirq
>       0.12 ± 11%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.tick_nohz_get_sleep_length
>       0.11 ± 11%      +0.0        0.14 ±  6%  perf-profile.self.cycles-pp.native_write_msr
>       0.10 ±  5%      +0.1        0.15 ±  8%  perf-profile.self.cycles-pp.__remove_hrtimer
>       0.07 ± 23%      +0.1        0.13 ±  8%  perf-profile.self.cycles-pp.rb_erase
>       0.08 ± 17%      +0.1        0.15 ±  7%  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
>       0.00            +0.1        0.08 ± 10%  perf-profile.self.cycles-pp.smp_call_function_single
>       0.32 ± 17%      +0.1        0.42 ±  7%  perf-profile.self.cycles-pp.run_timer_softirq
>       0.22 ±  5%      +0.1        0.34 ±  4%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
>       0.45 ± 15%      +0.2        0.60 ± 12%  perf-profile.self.cycles-pp.rcu_irq_enter
>       0.31 ±  8%      +0.2        0.46 ± 16%  perf-profile.self.cycles-pp.irq_enter
>       0.29 ± 10%      +0.2        0.44 ± 16%  perf-profile.self.cycles-pp.apic_timer_interrupt
>       0.71 ± 30%      +0.2        0.92 ±  8%  perf-profile.self.cycles-pp.perf_mux_hrtimer_handler
>       0.00            +0.3        0.28 ± 37%  perf-profile.self.cycles-pp.memcpy_erms
>       1.12 ±  3%      +0.9        2.02 ± 15%  perf-profile.self.cycles-pp.interrupt_entry
>       0.79 ±  9%      +0.9        1.73 ±  5%  perf-profile.self.cycles-pp.perf_event_task_tick
>       2.49 ± 45%      +2.1        4.55 ± 20%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>      10.95 ± 15%      +2.7       13.61 ±  8%  perf-profile.self.cycles-pp.mutex_spin_on_owner
> 
> 
>                                                                                 
>                                vm-scalability.throughput                        
>                                                                                 
>   1.6e+07 +-+---------------------------------------------------------------+   
>           |..+.+    +..+.+..+.+.   +.      +..+.+..+.+..+.+..+.+..+    +    |   
>   1.4e+07 +-+  :    :  O      O    O                           O            |   
>   1.2e+07 O-+O O  O O    O  O    O    O O  O  O    O    O    O      O  O O  O   
>           |     :   :                           O    O    O       O         |   
>     1e+07 +-+   :  :                                                        |   
>           |     :  :                                                        |   
>     8e+06 +-+   :  :                                                        |   
>           |      : :                                                        |   
>     6e+06 +-+    : :                                                        |   
>     4e+06 +-+    : :                                                        |   
>           |      ::                                                         |   
>     2e+06 +-+     :                                                         |   
>           |       :                                                         |   
>         0 +-+---------------------------------------------------------------+   
>                                                                                 
>                                                                                                                                                                 
>                          vm-scalability.time.minor_page_faults                  
>                                                                                 
>   2.5e+06 +-+---------------------------------------------------------------+   
>           |                                                                 |   
>           |..+.+    +..+.+..+.+..+.+..+.+..  .+.  .+.+..+.+..+.+..+.+..+    |   
>     2e+06 +-+  :    :                      +.   +.                          |   
>           O  O O: O O  O O  O O  O O                    O      O            |   
>           |     :   :                 O O  O  O O  O O    O  O    O O  O O  O   
>   1.5e+06 +-+   :  :                                                        |   
>           |     :  :                                                        |   
>     1e+06 +-+    : :                                                        |   
>           |      : :                                                        |   
>           |      : :                                                        |   
>    500000 +-+    : :                                                        |   
>           |       :                                                         |   
>           |       :                                                         |   
>         0 +-+---------------------------------------------------------------+   
>                                                                                 
>                                                                                                                                                                 
>                                 vm-scalability.workload                         
>                                                                                 
>   3.5e+09 +-+---------------------------------------------------------------+   
>           | .+.                      .+.+..                        .+..     |   
>     3e+09 +-+  +    +..+.+..+.+..+.+.      +..+.+..+.+..+.+..+.+..+    +    |   
>           |    :    :       O O                                O            |   
>   2.5e+09 O-+O O: O O  O O       O O  O    O            O                   |   
>           |     :   :                   O     O O  O O    O  O    O O  O O  O   
>     2e+09 +-+   :  :                                                        |   
>           |     :  :                                                        |   
>   1.5e+09 +-+    : :                                                        |   
>           |      : :                                                        |   
>     1e+09 +-+    : :                                                        |   
>           |      : :                                                        |   
>     5e+08 +-+     :                                                         |   
>           |       :                                                         |   
>         0 +-+---------------------------------------------------------------+   
>                                                                                 
>                                                                                 
> [*] bisect-good sample
> [O] bisect-bad  sample
> 
> 
> 
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
> 
> 
> Thanks,
> Rong Chen
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-30 17:50 ` [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression Thomas Zimmermann
@ 2019-07-30 18:12   ` Daniel Vetter
  2019-07-30 18:50     ` Thomas Zimmermann
  2019-08-04 18:39   ` Thomas Zimmermann
  1 sibling, 1 reply; 61+ messages in thread
From: Daniel Vetter @ 2019-07-30 18:12 UTC (permalink / raw)
  To: Thomas Zimmermann; +Cc: Stephen Rothwell, LKP, dri-devel, kernel test robot

On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> Am 29.07.19 um 11:51 schrieb kernel test robot:
> > Greeting,
> >
> > FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
> >
> > commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> > https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>
> Daniel, Noralf, we may have to revert this patch.
>
> I expected some change in display performance, but not in VM. Since it's
> a server chipset, probably no one cares much about display performance.
> So that seemed like a good trade-off for re-using shared code.
>
> Part of the patch set is that the generic fb emulation now maps and
> unmaps the fbdev BO when updating the screen. I guess that's the cause
> of the performance regression. And it should be visible with other
> drivers as well if they use a shadow FB for fbdev emulation.

For fbcon we shouldn't need to do any maps/unmaps at all; this is for the
fbdev mmap support only. If the testcase mentioned here tests fbdev
mmap handling it's pretty badly misnamed :-) And as long as you don't
have an fbdev mmap there shouldn't be any impact at all.
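
To make this concrete, here is a toy userspace model of the flush path
under discussion. It is not the drm_fb_helper code; bo_vmap(),
bo_vunmap() and flush_dirty_rect() are made-up stand-ins, and a heap
buffer stands in for the VRAM BO. It only illustrates the per-update
map/copy/unmap of a shadow-FB flush, i.e. the kind of work that shows
up as drm_fb_helper_dirty_work/memcpy_erms in the report's perf
profile:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WIDTH  1024
#define HEIGHT 768
#define CPP    4                       /* bytes per pixel (XRGB8888) */
#define PITCH  (WIDTH * CPP)

struct clip_rect { unsigned int x1, y1, x2, y2; };

/* Stand-ins for pinning and mapping the scanout BO; here it is just
 * heap memory so the model stays self-contained. */
static void *bo_vmap(void)        { return malloc((size_t)PITCH * HEIGHT); }
static void  bo_vunmap(void *map) { free(map); }

/* Shadow-FB flush: map the BO, copy the dirty lines from the shadow
 * buffer that fbcon draws into, then unmap again -- the per-update
 * map/unmap that the mail above points at. */
static int flush_dirty_rect(const uint8_t *shadow, const struct clip_rect *clip)
{
	uint8_t *vaddr = bo_vmap();                     /* map ...       */
	size_t off = (size_t)clip->y1 * PITCH + (size_t)clip->x1 * CPP;
	size_t len = (size_t)(clip->x2 - clip->x1) * CPP;
	unsigned int y;

	if (!vaddr)
		return -1;

	for (y = clip->y1; y < clip->y2; y++, off += PITCH)
		memcpy(vaddr + off, shadow + off, len); /* the memcpy    */

	bo_vunmap(vaddr);                               /* ... and unmap */
	return 0;
}

int main(void)
{
	uint8_t *shadow = calloc(HEIGHT, PITCH);        /* fbcon's shadow buffer */
	struct clip_rect clip = { 0, 0, WIDTH, 16 };    /* e.g. one text line */

	if (!shadow || flush_dirty_rect(shadow, &clip))
		return 1;

	puts("flushed one dirty rectangle");
	free(shadow);
	return 0;
}

What the model deliberately leaves open is how often fbcon actually
triggers such flushes while the VM benchmark runs.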

> The thing is that we'd need another generic fbdev emulation for ast and
> mgag200 that handles this issue properly.

Yeah, I don't think we want to jump the gun here. If you can try to
repro locally and profile where we're wasting cpu time, I hope that
should shed some light on what's going wrong here.
-Daniel

>
> Best regards
> Thomas
>
> >
> > in testcase: vm-scalability
> > on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
> > with following parameters:
> >
> >       runtime: 300s
> >       size: 8T
> >       test: anon-cow-seq-hugetlb
> >       cpufreq_governor: performance
> >
> > test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> > test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
> >
> >
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> >
> > To reproduce:
> >
> >         git clone https://github.com/intel/lkp-tests.git
> >         cd lkp-tests
> >         bin/lkp install job.yaml  # job file is attached in this email
> >         bin/lkp run     job.yaml
> >
> > =========================================================================================
> > compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
> >   gcc-7/performance/x86_64-rhel-7.6/debian-x86_64-2019-05-14.cgz/300s/8T/lkp-knm01/anon-cow-seq-hugetlb/vm-scalability
> >
> > commit:
> >   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
> >   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> >
> > f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9
> > ---------------- ---------------------------
> >        fail:runs  %reproduction    fail:runs
> >            |             |             |
> >           2:4          -50%            :4     dmesg.WARNING:at#for_ip_interrupt_entry/0x
> >            :4           25%           1:4     dmesg.WARNING:at_ip___perf_sw_event/0x
> >            :4           25%           1:4     dmesg.WARNING:at_ip__fsnotify_parent/0x
> >          %stddev     %change         %stddev
> >              \          |                \
> >      43955 ±  2%     -18.8%      35691        vm-scalability.median
> >       0.06 ±  7%    +193.0%       0.16 ±  2%  vm-scalability.median_stddev
> >   14906559 ±  2%     -17.9%   12237079        vm-scalability.throughput
> >      87651 ±  2%     -17.4%      72374        vm-scalability.time.involuntary_context_switches
> >    2086168           -23.6%    1594224        vm-scalability.time.minor_page_faults
> >      15082 ±  2%     -10.4%      13517        vm-scalability.time.percent_of_cpu_this_job_got
> >      29987            -8.9%      27327        vm-scalability.time.system_time
> >      15755           -12.4%      13795        vm-scalability.time.user_time
> >     122011           -19.3%      98418        vm-scalability.time.voluntary_context_switches
> >  3.034e+09           -23.6%  2.318e+09        vm-scalability.workload
> >     242478 ± 12%     +68.5%     408518 ± 23%  cpuidle.POLL.time
> >       2788 ± 21%    +117.4%       6062 ± 26%  cpuidle.POLL.usage
> >      56653 ± 10%     +64.4%      93144 ± 20%  meminfo.Mapped
> >     120392 ±  7%     +14.0%     137212 ±  4%  meminfo.Shmem
> >      47221 ± 11%     +77.1%      83634 ± 22%  numa-meminfo.node0.Mapped
> >     120465 ±  7%     +13.9%     137205 ±  4%  numa-meminfo.node0.Shmem
> >    2885513           -16.5%    2409384        numa-numastat.node0.local_node
> >    2885471           -16.5%    2409354        numa-numastat.node0.numa_hit
> >      11813 ± 11%     +76.3%      20824 ± 22%  numa-vmstat.node0.nr_mapped
> >      30096 ±  7%     +13.8%      34238 ±  4%  numa-vmstat.node0.nr_shmem
> >      43.72 ±  2%      +5.5       49.20        mpstat.cpu.all.idle%
> >       0.03 ±  4%      +0.0        0.05 ±  6%  mpstat.cpu.all.soft%
> >      19.51            -2.4       17.08        mpstat.cpu.all.usr%
> >       1012            -7.9%     932.75        turbostat.Avg_MHz
> >      32.38 ± 10%     +25.8%      40.73        turbostat.CPU%c1
> >     145.51            -3.1%     141.01        turbostat.PkgWatt
> >      15.09           -19.2%      12.19        turbostat.RAMWatt
> >      43.50 ±  2%     +13.2%      49.25        vmstat.cpu.id
> >      18.75 ±  2%     -13.3%      16.25 ±  2%  vmstat.cpu.us
> >     152.00 ±  2%      -9.5%     137.50        vmstat.procs.r
> >       4800           -13.1%       4173        vmstat.system.cs
> >     156170           -11.9%     137594        slabinfo.anon_vma.active_objs
> >       3395           -11.9%       2991        slabinfo.anon_vma.active_slabs
> >     156190           -11.9%     137606        slabinfo.anon_vma.num_objs
> >       3395           -11.9%       2991        slabinfo.anon_vma.num_slabs
> >       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.active_objs
> >       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.num_objs
> >       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.active_objs
> >       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.num_objs
> >       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.active_objs
> >       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.num_objs
> >    1330122           -23.6%    1016557        proc-vmstat.htlb_buddy_alloc_success
> >      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_active_anon
> >      67277            +2.9%      69246        proc-vmstat.nr_anon_pages
> >     218.50 ±  3%     -10.6%     195.25        proc-vmstat.nr_dirtied
> >     288628            +1.4%     292755        proc-vmstat.nr_file_pages
> >     360.50            -2.7%     350.75        proc-vmstat.nr_inactive_file
> >      14225 ±  9%     +63.8%      23304 ± 20%  proc-vmstat.nr_mapped
> >      30109 ±  7%     +13.8%      34259 ±  4%  proc-vmstat.nr_shmem
> >      99870            -1.3%      98597        proc-vmstat.nr_slab_unreclaimable
> >     204.00 ±  4%     -12.1%     179.25        proc-vmstat.nr_written
> >      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_zone_active_anon
> >     360.50            -2.7%     350.75        proc-vmstat.nr_zone_inactive_file
> >       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults
> >       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults_local
> >    2904082           -16.4%    2427026        proc-vmstat.numa_hit
> >    2904081           -16.4%    2427025        proc-vmstat.numa_local
> >  6.828e+08           -23.5%  5.221e+08        proc-vmstat.pgalloc_normal
> >    2900008           -17.2%    2400195        proc-vmstat.pgfault
> >  6.827e+08           -23.5%   5.22e+08        proc-vmstat.pgfree
> >  1.635e+10           -17.0%  1.357e+10        perf-stat.i.branch-instructions
> >       1.53 ±  4%      -0.1        1.45 ±  3%  perf-stat.i.branch-miss-rate%
> >  2.581e+08 ±  3%     -20.5%  2.051e+08 ±  2%  perf-stat.i.branch-misses
> >      12.66            +1.1       13.78        perf-stat.i.cache-miss-rate%
> >   72720849           -12.0%   63958986        perf-stat.i.cache-misses
> >  5.766e+08           -18.6%  4.691e+08        perf-stat.i.cache-references
> >       4674 ±  2%     -13.0%       4064        perf-stat.i.context-switches
> >       4.29           +12.5%       4.83        perf-stat.i.cpi
> >  2.573e+11            -7.4%  2.383e+11        perf-stat.i.cpu-cycles
> >     231.35           -21.5%     181.56        perf-stat.i.cpu-migrations
> >       3522            +4.4%       3677        perf-stat.i.cycles-between-cache-misses
> >       0.09 ± 13%      +0.0        0.12 ±  5%  perf-stat.i.iTLB-load-miss-rate%
> >  5.894e+10           -15.8%  4.961e+10        perf-stat.i.iTLB-loads
> >  5.901e+10           -15.8%  4.967e+10        perf-stat.i.instructions
> >       1291 ± 14%     -21.8%       1010        perf-stat.i.instructions-per-iTLB-miss
> >       0.24           -11.0%       0.21        perf-stat.i.ipc
> >       9476           -17.5%       7821        perf-stat.i.minor-faults
> >       9478           -17.5%       7821        perf-stat.i.page-faults
> >       9.76            -3.6%       9.41        perf-stat.overall.MPKI
> >       1.59 ±  4%      -0.1        1.52        perf-stat.overall.branch-miss-rate%
> >      12.61            +1.1       13.71        perf-stat.overall.cache-miss-rate%
> >       4.38           +10.5%       4.83        perf-stat.overall.cpi
> >       3557            +5.3%       3747        perf-stat.overall.cycles-between-cache-misses
> >       0.08 ± 12%      +0.0        0.10        perf-stat.overall.iTLB-load-miss-rate%
> >       1268 ± 15%     -23.0%     976.22        perf-stat.overall.instructions-per-iTLB-miss
> >       0.23            -9.5%       0.21        perf-stat.overall.ipc
> >       5815            +9.7%       6378        perf-stat.overall.path-length
> >  1.634e+10           -17.5%  1.348e+10        perf-stat.ps.branch-instructions
> >  2.595e+08 ±  3%     -21.2%  2.043e+08 ±  2%  perf-stat.ps.branch-misses
> >   72565205           -12.2%   63706339        perf-stat.ps.cache-misses
> >  5.754e+08           -19.2%  4.646e+08        perf-stat.ps.cache-references
> >       4640 ±  2%     -12.5%       4060        perf-stat.ps.context-switches
> >  2.581e+11            -7.5%  2.387e+11        perf-stat.ps.cpu-cycles
> >     229.91           -22.0%     179.42        perf-stat.ps.cpu-migrations
> >  5.889e+10           -16.3%  4.927e+10        perf-stat.ps.iTLB-loads
> >  5.899e+10           -16.3%  4.938e+10        perf-stat.ps.instructions
> >       9388           -18.2%       7677        perf-stat.ps.minor-faults
> >       9389           -18.2%       7677        perf-stat.ps.page-faults
> >  1.764e+13           -16.2%  1.479e+13        perf-stat.total.instructions
> >      46803 ±  3%     -18.8%      37982 ±  6%  sched_debug.cfs_rq:/.exec_clock.min
> >       5320 ±  3%     +23.7%       6581 ±  3%  sched_debug.cfs_rq:/.exec_clock.stddev
> >       6737 ± 14%     +58.1%      10649 ± 10%  sched_debug.cfs_rq:/.load.avg
> >     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.load.max
> >      46952 ± 16%     +64.8%      77388 ± 11%  sched_debug.cfs_rq:/.load.stddev
> >       7.12 ±  4%     +49.1%      10.62 ±  6%  sched_debug.cfs_rq:/.load_avg.avg
> >     474.40 ± 23%     +67.5%     794.60 ± 10%  sched_debug.cfs_rq:/.load_avg.max
> >      37.70 ± 11%     +74.8%      65.90 ±  9%  sched_debug.cfs_rq:/.load_avg.stddev
> >   13424269 ±  4%     -15.6%   11328098 ±  2%  sched_debug.cfs_rq:/.min_vruntime.avg
> >   15411275 ±  3%     -12.4%   13505072 ±  2%  sched_debug.cfs_rq:/.min_vruntime.max
> >    7939295 ±  6%     -17.5%    6551322 ±  7%  sched_debug.cfs_rq:/.min_vruntime.min
> >      21.44 ±  7%     -56.1%       9.42 ±  4%  sched_debug.cfs_rq:/.nr_spread_over.avg
> >     117.45 ± 11%     -60.6%      46.30 ± 14%  sched_debug.cfs_rq:/.nr_spread_over.max
> >      19.33 ±  8%     -66.4%       6.49 ±  9%  sched_debug.cfs_rq:/.nr_spread_over.stddev
> >       4.32 ± 15%     +84.4%       7.97 ±  3%  sched_debug.cfs_rq:/.runnable_load_avg.avg
> >     353.85 ± 29%    +118.8%     774.35 ± 11%  sched_debug.cfs_rq:/.runnable_load_avg.max
> >      27.30 ± 24%    +118.5%      59.64 ±  9%  sched_debug.cfs_rq:/.runnable_load_avg.stddev
> >       6729 ± 14%     +58.2%      10644 ± 10%  sched_debug.cfs_rq:/.runnable_weight.avg
> >     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.runnable_weight.max
> >      46950 ± 16%     +64.8%      77387 ± 11%  sched_debug.cfs_rq:/.runnable_weight.stddev
> >    5305069 ±  4%     -17.4%    4380376 ±  7%  sched_debug.cfs_rq:/.spread0.avg
> >    7328745 ±  3%      -9.9%    6600897 ±  3%  sched_debug.cfs_rq:/.spread0.max
> >    2220837 ±  4%     +55.8%    3460596 ±  5%  sched_debug.cpu.avg_idle.avg
> >    4590666 ±  9%     +76.8%    8117037 ± 15%  sched_debug.cpu.avg_idle.max
> >     485052 ±  7%     +80.3%     874679 ± 10%  sched_debug.cpu.avg_idle.stddev
> >     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock.stddev
> >     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock_task.stddev
> >       3.20 ± 10%    +109.6%       6.70 ±  3%  sched_debug.cpu.cpu_load[0].avg
> >     309.10 ± 20%    +150.3%     773.75 ± 12%  sched_debug.cpu.cpu_load[0].max
> >      21.02 ± 14%    +160.8%      54.80 ±  9%  sched_debug.cpu.cpu_load[0].stddev
> >       3.19 ±  8%    +109.8%       6.70 ±  3%  sched_debug.cpu.cpu_load[1].avg
> >     299.75 ± 19%    +158.0%     773.30 ± 12%  sched_debug.cpu.cpu_load[1].max
> >      20.32 ± 12%    +168.7%      54.62 ±  9%  sched_debug.cpu.cpu_load[1].stddev
> >       3.20 ±  8%    +109.1%       6.69 ±  4%  sched_debug.cpu.cpu_load[2].avg
> >     288.90 ± 20%    +167.0%     771.40 ± 12%  sched_debug.cpu.cpu_load[2].max
> >      19.70 ± 12%    +175.4%      54.27 ±  9%  sched_debug.cpu.cpu_load[2].stddev
> >       3.16 ±  8%    +110.9%       6.66 ±  6%  sched_debug.cpu.cpu_load[3].avg
> >     275.50 ± 24%    +178.4%     766.95 ± 12%  sched_debug.cpu.cpu_load[3].max
> >      18.92 ± 15%    +184.2%      53.77 ± 10%  sched_debug.cpu.cpu_load[3].stddev
> >       3.08 ±  8%    +115.7%       6.65 ±  7%  sched_debug.cpu.cpu_load[4].avg
> >     263.55 ± 28%    +188.7%     760.85 ± 12%  sched_debug.cpu.cpu_load[4].max
> >      18.03 ± 18%    +196.6%      53.46 ± 11%  sched_debug.cpu.cpu_load[4].stddev
> >      14543            -9.6%      13150        sched_debug.cpu.curr->pid.max
> >       5293 ± 16%     +74.7%       9248 ± 11%  sched_debug.cpu.load.avg
> >     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cpu.load.max
> >      40887 ± 19%     +78.3%      72891 ±  9%  sched_debug.cpu.load.stddev
> >    1141679 ±  4%     +56.9%    1790907 ±  5%  sched_debug.cpu.max_idle_balance_cost.avg
> >    2432100 ±  9%     +72.6%    4196779 ± 13%  sched_debug.cpu.max_idle_balance_cost.max
> >     745656           +29.3%     964170 ±  5%  sched_debug.cpu.max_idle_balance_cost.min
> >     239032 ±  9%     +81.9%     434806 ± 10%  sched_debug.cpu.max_idle_balance_cost.stddev
> >       0.00 ± 27%     +92.1%       0.00 ± 31%  sched_debug.cpu.next_balance.stddev
> >       1030 ±  4%     -10.4%     924.00 ±  2%  sched_debug.cpu.nr_switches.min
> >       0.04 ± 26%    +139.0%       0.09 ± 41%  sched_debug.cpu.nr_uninterruptible.avg
> >     830.35 ±  6%     -12.0%     730.50 ±  2%  sched_debug.cpu.sched_count.min
> >     912.00 ±  2%      -9.5%     825.38        sched_debug.cpu.ttwu_count.avg
> >     433.05 ±  3%     -19.2%     350.05 ±  3%  sched_debug.cpu.ttwu_count.min
> >     160.70 ±  3%     -12.5%     140.60 ±  4%  sched_debug.cpu.ttwu_local.min
> >       9072 ± 11%     -36.4%       5767 ±  8%  softirqs.CPU1.RCU
> >      12769 ±  5%     +15.3%      14718 ±  3%  softirqs.CPU101.SCHED
> >      13198           +11.5%      14717 ±  3%  softirqs.CPU102.SCHED
> >      12981 ±  4%     +13.9%      14788 ±  3%  softirqs.CPU105.SCHED
> >      13486 ±  3%     +11.8%      15071 ±  4%  softirqs.CPU111.SCHED
> >      12794 ±  4%     +14.1%      14601 ±  9%  softirqs.CPU112.SCHED
> >      12999 ±  4%     +10.1%      14314 ±  4%  softirqs.CPU115.SCHED
> >      12844 ±  4%     +10.6%      14202 ±  2%  softirqs.CPU120.SCHED
> >      13336 ±  3%      +9.4%      14585 ±  3%  softirqs.CPU122.SCHED
> >      12639 ±  4%     +20.2%      15195        softirqs.CPU123.SCHED
> >      13040 ±  5%     +15.2%      15024 ±  5%  softirqs.CPU126.SCHED
> >      13123           +15.1%      15106 ±  5%  softirqs.CPU127.SCHED
> >       9188 ±  6%     -35.7%       5911 ±  2%  softirqs.CPU13.RCU
> >      13054 ±  3%     +13.1%      14761 ±  5%  softirqs.CPU130.SCHED
> >      13158 ±  2%     +13.9%      14985 ±  5%  softirqs.CPU131.SCHED
> >      12797 ±  6%     +13.5%      14524 ±  3%  softirqs.CPU133.SCHED
> >      12452 ±  5%     +14.8%      14297        softirqs.CPU134.SCHED
> >      13078 ±  3%     +10.4%      14439 ±  3%  softirqs.CPU138.SCHED
> >      12617 ±  2%     +14.5%      14442 ±  5%  softirqs.CPU139.SCHED
> >      12974 ±  3%     +13.7%      14752 ±  4%  softirqs.CPU142.SCHED
> >      12579 ±  4%     +19.1%      14983 ±  3%  softirqs.CPU143.SCHED
> >       9122 ± 24%     -44.6%       5053 ±  5%  softirqs.CPU144.RCU
> >      13366 ±  2%     +11.1%      14848 ±  3%  softirqs.CPU149.SCHED
> >      13246 ±  2%     +22.0%      16162 ±  7%  softirqs.CPU150.SCHED
> >      13452 ±  3%     +20.5%      16210 ±  7%  softirqs.CPU151.SCHED
> >      13507           +10.1%      14869        softirqs.CPU156.SCHED
> >      13808 ±  3%      +9.2%      15079 ±  4%  softirqs.CPU157.SCHED
> >      13442 ±  2%     +13.4%      15248 ±  4%  softirqs.CPU160.SCHED
> >      13311           +12.1%      14920 ±  2%  softirqs.CPU162.SCHED
> >      13544 ±  3%      +8.5%      14695 ±  4%  softirqs.CPU163.SCHED
> >      13648 ±  3%     +11.2%      15179 ±  2%  softirqs.CPU166.SCHED
> >      13404 ±  4%     +12.5%      15079 ±  3%  softirqs.CPU168.SCHED
> >      13421 ±  6%     +16.0%      15568 ±  8%  softirqs.CPU169.SCHED
> >      13115 ±  3%     +23.1%      16139 ± 10%  softirqs.CPU171.SCHED
> >      13424 ±  6%     +10.4%      14822 ±  3%  softirqs.CPU175.SCHED
> >      13274 ±  3%     +13.7%      15087 ±  9%  softirqs.CPU185.SCHED
> >      13409 ±  3%     +12.3%      15063 ±  3%  softirqs.CPU190.SCHED
> >      13181 ±  7%     +13.4%      14946 ±  3%  softirqs.CPU196.SCHED
> >      13578 ±  3%     +10.9%      15061        softirqs.CPU197.SCHED
> >      13323 ±  5%     +24.8%      16627 ±  6%  softirqs.CPU198.SCHED
> >      14072 ±  2%     +12.3%      15798 ±  7%  softirqs.CPU199.SCHED
> >      12604 ± 13%     +17.9%      14865        softirqs.CPU201.SCHED
> >      13380 ±  4%     +14.8%      15356 ±  3%  softirqs.CPU203.SCHED
> >      13481 ±  8%     +14.2%      15390 ±  3%  softirqs.CPU204.SCHED
> >      12921 ±  2%     +13.8%      14710 ±  3%  softirqs.CPU206.SCHED
> >      13468           +13.0%      15218 ±  2%  softirqs.CPU208.SCHED
> >      13253 ±  2%     +13.1%      14992        softirqs.CPU209.SCHED
> >      13319 ±  2%     +14.3%      15225 ±  7%  softirqs.CPU210.SCHED
> >      13673 ±  5%     +16.3%      15895 ±  3%  softirqs.CPU211.SCHED
> >      13290           +17.0%      15556 ±  5%  softirqs.CPU212.SCHED
> >      13455 ±  4%     +14.4%      15392 ±  3%  softirqs.CPU213.SCHED
> >      13454 ±  4%     +14.3%      15377 ±  3%  softirqs.CPU215.SCHED
> >      13872 ±  7%      +9.7%      15221 ±  5%  softirqs.CPU220.SCHED
> >      13555 ±  4%     +17.3%      15896 ±  5%  softirqs.CPU222.SCHED
> >      13411 ±  4%     +20.8%      16197 ±  6%  softirqs.CPU223.SCHED
> >       8472 ± 21%     -44.8%       4680 ±  3%  softirqs.CPU224.RCU
> >      13141 ±  3%     +16.2%      15265 ±  7%  softirqs.CPU225.SCHED
> >      14084 ±  3%      +8.2%      15242 ±  2%  softirqs.CPU226.SCHED
> >      13528 ±  4%     +11.3%      15063 ±  4%  softirqs.CPU228.SCHED
> >      13218 ±  3%     +16.3%      15377 ±  4%  softirqs.CPU229.SCHED
> >      14031 ±  4%     +10.2%      15467 ±  2%  softirqs.CPU231.SCHED
> >      13770 ±  3%     +14.0%      15700 ±  3%  softirqs.CPU232.SCHED
> >      13456 ±  3%     +12.3%      15105 ±  3%  softirqs.CPU233.SCHED
> >      13137 ±  4%     +13.5%      14909 ±  3%  softirqs.CPU234.SCHED
> >      13318 ±  2%     +14.7%      15280 ±  2%  softirqs.CPU235.SCHED
> >      13690 ±  2%     +13.7%      15563 ±  7%  softirqs.CPU238.SCHED
> >      13771 ±  5%     +20.8%      16634 ±  7%  softirqs.CPU241.SCHED
> >      13317 ±  7%     +19.5%      15919 ±  9%  softirqs.CPU243.SCHED
> >       8234 ± 16%     -43.9%       4616 ±  5%  softirqs.CPU244.RCU
> >      13845 ±  6%     +13.0%      15643 ±  3%  softirqs.CPU244.SCHED
> >      13179 ±  3%     +16.3%      15323        softirqs.CPU246.SCHED
> >      13754           +12.2%      15438 ±  3%  softirqs.CPU248.SCHED
> >      13769 ±  4%     +10.9%      15276 ±  2%  softirqs.CPU252.SCHED
> >      13702           +10.5%      15147 ±  2%  softirqs.CPU254.SCHED
> >      13315 ±  2%     +12.5%      14980 ±  3%  softirqs.CPU255.SCHED
> >      13785 ±  3%     +12.9%      15568 ±  5%  softirqs.CPU256.SCHED
> >      13307 ±  3%     +15.0%      15298 ±  3%  softirqs.CPU257.SCHED
> >      13864 ±  3%     +10.5%      15313 ±  2%  softirqs.CPU259.SCHED
> >      13879 ±  2%     +11.4%      15465        softirqs.CPU261.SCHED
> >      13815           +13.6%      15687 ±  5%  softirqs.CPU264.SCHED
> >     119574 ±  2%     +11.8%     133693 ± 11%  softirqs.CPU266.TIMER
> >      13688           +10.9%      15180 ±  6%  softirqs.CPU267.SCHED
> >      11716 ±  4%     +19.3%      13974 ±  8%  softirqs.CPU27.SCHED
> >      13866 ±  3%     +13.7%      15765 ±  4%  softirqs.CPU271.SCHED
> >      13887 ±  5%     +12.5%      15621        softirqs.CPU272.SCHED
> >      13383 ±  3%     +19.8%      16031 ±  2%  softirqs.CPU274.SCHED
> >      13347           +14.1%      15232 ±  3%  softirqs.CPU275.SCHED
> >      12884 ±  2%     +21.0%      15593 ±  4%  softirqs.CPU276.SCHED
> >      13131 ±  5%     +13.4%      14891 ±  5%  softirqs.CPU277.SCHED
> >      12891 ±  2%     +19.2%      15371 ±  4%  softirqs.CPU278.SCHED
> >      13313 ±  4%     +13.0%      15049 ±  2%  softirqs.CPU279.SCHED
> >      13514 ±  3%     +10.2%      14897 ±  2%  softirqs.CPU280.SCHED
> >      13501 ±  3%     +13.7%      15346        softirqs.CPU281.SCHED
> >      13261           +17.5%      15577        softirqs.CPU282.SCHED
> >       8076 ± 15%     -43.7%       4546 ±  5%  softirqs.CPU283.RCU
> >      13686 ±  3%     +12.6%      15413 ±  2%  softirqs.CPU284.SCHED
> >      13439 ±  2%      +9.2%      14670 ±  4%  softirqs.CPU285.SCHED
> >       8878 ±  9%     -35.4%       5735 ±  4%  softirqs.CPU35.RCU
> >      11690 ±  2%     +13.6%      13274 ±  5%  softirqs.CPU40.SCHED
> >      11714 ±  2%     +19.3%      13975 ± 13%  softirqs.CPU41.SCHED
> >      11763           +12.5%      13239 ±  4%  softirqs.CPU45.SCHED
> >      11662 ±  2%      +9.4%      12757 ±  3%  softirqs.CPU46.SCHED
> >      11805 ±  2%      +9.3%      12902 ±  2%  softirqs.CPU50.SCHED
> >      12158 ±  3%     +12.3%      13655 ±  8%  softirqs.CPU55.SCHED
> >      11716 ±  4%      +8.8%      12751 ±  3%  softirqs.CPU58.SCHED
> >      11922 ±  2%      +9.9%      13100 ±  4%  softirqs.CPU64.SCHED
> >       9674 ± 17%     -41.8%       5625 ±  6%  softirqs.CPU66.RCU
> >      11818           +12.0%      13237        softirqs.CPU66.SCHED
> >     124682 ±  7%      -6.1%     117088 ±  5%  softirqs.CPU66.TIMER
> >       8637 ±  9%     -34.0%       5700 ±  7%  softirqs.CPU70.RCU
> >      11624 ±  2%     +11.0%      12901 ±  2%  softirqs.CPU70.SCHED
> >      12372 ±  2%     +13.2%      14003 ±  3%  softirqs.CPU71.SCHED
> >       9949 ± 25%     -33.9%       6574 ± 31%  softirqs.CPU72.RCU
> >      10392 ± 26%     -35.1%       6745 ± 35%  softirqs.CPU73.RCU
> >      12766 ±  3%     +11.1%      14188 ±  3%  softirqs.CPU76.SCHED
> >      12611 ±  2%     +18.8%      14984 ±  5%  softirqs.CPU78.SCHED
> >      12786 ±  3%     +17.9%      15079 ±  7%  softirqs.CPU79.SCHED
> >      11947 ±  4%      +9.7%      13103 ±  4%  softirqs.CPU8.SCHED
> >      13379 ±  7%     +11.8%      14962 ±  4%  softirqs.CPU83.SCHED
> >      13438 ±  5%      +9.7%      14738 ±  2%  softirqs.CPU84.SCHED
> >      12768           +19.4%      15241 ±  6%  softirqs.CPU88.SCHED
> >       8604 ± 13%     -39.3%       5222 ±  3%  softirqs.CPU89.RCU
> >      13077 ±  2%     +17.1%      15308 ±  7%  softirqs.CPU89.SCHED
> >      11887 ±  3%     +20.1%      14272 ±  5%  softirqs.CPU9.SCHED
> >      12723 ±  3%     +11.3%      14165 ±  4%  softirqs.CPU90.SCHED
> >       8439 ± 12%     -38.9%       5153 ±  4%  softirqs.CPU91.RCU
> >      13429 ±  3%     +10.3%      14806 ±  2%  softirqs.CPU95.SCHED
> >      12852 ±  4%     +10.3%      14174 ±  5%  softirqs.CPU96.SCHED
> >      13010 ±  2%     +14.4%      14888 ±  5%  softirqs.CPU97.SCHED
> >    2315644 ±  4%     -36.2%    1477200 ±  4%  softirqs.RCU
> >       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.NMI:Non-maskable_interrupts
> >       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.PMI:Performance_monitoring_interrupts
> >     252.00 ± 11%     -35.2%     163.25 ± 13%  interrupts.CPU104.RES:Rescheduling_interrupts
> >       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.NMI:Non-maskable_interrupts
> >       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.PMI:Performance_monitoring_interrupts
> >     245.75 ± 19%     -31.0%     169.50 ±  7%  interrupts.CPU105.RES:Rescheduling_interrupts
> >     228.75 ± 13%     -24.7%     172.25 ± 19%  interrupts.CPU106.RES:Rescheduling_interrupts
> >       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.NMI:Non-maskable_interrupts
> >       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.PMI:Performance_monitoring_interrupts
> >       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.NMI:Non-maskable_interrupts
> >       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.PMI:Performance_monitoring_interrupts
> >       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.NMI:Non-maskable_interrupts
> >       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.PMI:Performance_monitoring_interrupts
> >     311.50 ± 23%     -47.7%     163.00 ±  9%  interrupts.CPU122.RES:Rescheduling_interrupts
> >     266.75 ± 19%     -31.6%     182.50 ± 15%  interrupts.CPU124.RES:Rescheduling_interrupts
> >     293.75 ± 33%     -32.3%     198.75 ± 19%  interrupts.CPU125.RES:Rescheduling_interrupts
> >       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.NMI:Non-maskable_interrupts
> >       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.PMI:Performance_monitoring_interrupts
> >       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.NMI:Non-maskable_interrupts
> >       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.PMI:Performance_monitoring_interrupts
> >       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.NMI:Non-maskable_interrupts
> >       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.PMI:Performance_monitoring_interrupts
> >     219.50 ± 27%     -23.0%     169.00 ± 21%  interrupts.CPU139.RES:Rescheduling_interrupts
> >     290.25 ± 25%     -32.5%     196.00 ± 11%  interrupts.CPU14.RES:Rescheduling_interrupts
> >     243.50 ±  4%     -16.0%     204.50 ± 12%  interrupts.CPU140.RES:Rescheduling_interrupts
> >       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.NMI:Non-maskable_interrupts
> >       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.PMI:Performance_monitoring_interrupts
> >       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.NMI:Non-maskable_interrupts
> >       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.PMI:Performance_monitoring_interrupts
> >     292.25 ± 34%     -33.9%     193.25 ±  6%  interrupts.CPU15.RES:Rescheduling_interrupts
> >     424.25 ± 37%     -58.5%     176.25 ± 14%  interrupts.CPU158.RES:Rescheduling_interrupts
> >     312.50 ± 42%     -54.2%     143.00 ± 18%  interrupts.CPU159.RES:Rescheduling_interrupts
> >     725.00 ±118%     -75.7%     176.25 ± 14%  interrupts.CPU163.RES:Rescheduling_interrupts
> >       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.NMI:Non-maskable_interrupts
> >       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.PMI:Performance_monitoring_interrupts
> >     239.50 ± 30%     -46.6%     128.00 ± 14%  interrupts.CPU179.RES:Rescheduling_interrupts
> >     320.75 ± 15%     -24.0%     243.75 ± 20%  interrupts.CPU20.RES:Rescheduling_interrupts
> >     302.50 ± 17%     -47.2%     159.75 ±  8%  interrupts.CPU200.RES:Rescheduling_interrupts
> >       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.NMI:Non-maskable_interrupts
> >       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.PMI:Performance_monitoring_interrupts
> >     217.00 ± 11%     -34.6%     142.00 ± 12%  interrupts.CPU214.RES:Rescheduling_interrupts
> >       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.NMI:Non-maskable_interrupts
> >       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.PMI:Performance_monitoring_interrupts
> >       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.NMI:Non-maskable_interrupts
> >       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.PMI:Performance_monitoring_interrupts
> >     289.50 ± 28%     -41.1%     170.50 ±  8%  interrupts.CPU22.RES:Rescheduling_interrupts
> >       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.NMI:Non-maskable_interrupts
> >       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.PMI:Performance_monitoring_interrupts
> >       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.NMI:Non-maskable_interrupts
> >       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.PMI:Performance_monitoring_interrupts
> >       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.NMI:Non-maskable_interrupts
> >       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.PMI:Performance_monitoring_interrupts
> >       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.NMI:Non-maskable_interrupts
> >       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.PMI:Performance_monitoring_interrupts
> >     248.00 ± 36%     -36.3%     158.00 ± 19%  interrupts.CPU228.RES:Rescheduling_interrupts
> >       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.NMI:Non-maskable_interrupts
> >       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.PMI:Performance_monitoring_interrupts
> >     404.25 ± 69%     -65.5%     139.50 ± 17%  interrupts.CPU236.RES:Rescheduling_interrupts
> >     566.50 ± 40%     -73.6%     149.50 ± 31%  interrupts.CPU237.RES:Rescheduling_interrupts
> >     243.50 ± 26%     -37.1%     153.25 ± 21%  interrupts.CPU248.RES:Rescheduling_interrupts
> >     258.25 ± 12%     -53.5%     120.00 ± 18%  interrupts.CPU249.RES:Rescheduling_interrupts
> >       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.NMI:Non-maskable_interrupts
> >       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.PMI:Performance_monitoring_interrupts
> >       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.NMI:Non-maskable_interrupts
> >       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.PMI:Performance_monitoring_interrupts
> >     425.00 ± 59%     -60.3%     168.75 ± 34%  interrupts.CPU258.RES:Rescheduling_interrupts
> >       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.NMI:Non-maskable_interrupts
> >       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.PMI:Performance_monitoring_interrupts
> >       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.NMI:Non-maskable_interrupts
> >       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.PMI:Performance_monitoring_interrupts
> >       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.NMI:Non-maskable_interrupts
> >       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.PMI:Performance_monitoring_interrupts
> >       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.NMI:Non-maskable_interrupts
> >       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.PMI:Performance_monitoring_interrupts
> >       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.NMI:Non-maskable_interrupts
> >       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.PMI:Performance_monitoring_interrupts
> >       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.NMI:Non-maskable_interrupts
> >       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.PMI:Performance_monitoring_interrupts
> >       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.NMI:Non-maskable_interrupts
> >       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.PMI:Performance_monitoring_interrupts
> >     331.75 ± 32%     -39.8%     199.75 ± 17%  interrupts.CPU29.RES:Rescheduling_interrupts
> >       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.NMI:Non-maskable_interrupts
> >       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.PMI:Performance_monitoring_interrupts
> >     298.50 ± 30%     -39.7%     180.00 ±  6%  interrupts.CPU34.RES:Rescheduling_interrupts
> >       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.NMI:Non-maskable_interrupts
> >       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.PMI:Performance_monitoring_interrupts
> >     270.50 ± 24%     -31.1%     186.25 ±  3%  interrupts.CPU36.RES:Rescheduling_interrupts
> >       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.NMI:Non-maskable_interrupts
> >       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.PMI:Performance_monitoring_interrupts
> >     286.75 ± 36%     -32.4%     193.75 ±  7%  interrupts.CPU45.RES:Rescheduling_interrupts
> >     259.00 ± 12%     -23.6%     197.75 ± 13%  interrupts.CPU46.RES:Rescheduling_interrupts
> >     244.00 ± 21%     -35.6%     157.25 ± 11%  interrupts.CPU47.RES:Rescheduling_interrupts
> >     230.00 ±  7%     -21.3%     181.00 ± 11%  interrupts.CPU48.RES:Rescheduling_interrupts
> >     281.00 ± 13%     -27.4%     204.00 ± 15%  interrupts.CPU53.RES:Rescheduling_interrupts
> >     256.75 ±  5%     -18.4%     209.50 ± 12%  interrupts.CPU54.RES:Rescheduling_interrupts
> >       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.NMI:Non-maskable_interrupts
> >       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.PMI:Performance_monitoring_interrupts
> >     316.00 ± 25%     -41.4%     185.25 ± 13%  interrupts.CPU59.RES:Rescheduling_interrupts
> >       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.NMI:Non-maskable_interrupts
> >       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.PMI:Performance_monitoring_interrupts
> >       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.NMI:Non-maskable_interrupts
> >       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.PMI:Performance_monitoring_interrupts
> >       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.NMI:Non-maskable_interrupts
> >       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.PMI:Performance_monitoring_interrupts
> >       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.NMI:Non-maskable_interrupts
> >       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.PMI:Performance_monitoring_interrupts
> >     319.00 ± 40%     -44.7%     176.25 ±  9%  interrupts.CPU67.RES:Rescheduling_interrupts
> >       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.NMI:Non-maskable_interrupts
> >       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.PMI:Performance_monitoring_interrupts
> >       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.NMI:Non-maskable_interrupts
> >       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.PMI:Performance_monitoring_interrupts
> >       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.NMI:Non-maskable_interrupts
> >       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.PMI:Performance_monitoring_interrupts
> >     426.75 ± 61%     -67.7%     138.00 ±  8%  interrupts.CPU75.RES:Rescheduling_interrupts
> >     192.50 ± 13%     +45.6%     280.25 ± 45%  interrupts.CPU76.RES:Rescheduling_interrupts
> >     274.25 ± 34%     -42.2%     158.50 ± 34%  interrupts.CPU77.RES:Rescheduling_interrupts
> >       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.NMI:Non-maskable_interrupts
> >       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.PMI:Performance_monitoring_interrupts
> >     348.50 ± 53%     -47.3%     183.75 ± 29%  interrupts.CPU80.RES:Rescheduling_interrupts
> >       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.NMI:Non-maskable_interrupts
> >       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.PMI:Performance_monitoring_interrupts
> >       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.NMI:Non-maskable_interrupts
> >       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.PMI:Performance_monitoring_interrupts
> >       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.NMI:Non-maskable_interrupts
> >       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.PMI:Performance_monitoring_interrupts
> >     408.75 ± 58%     -56.8%     176.75 ± 25%  interrupts.CPU92.RES:Rescheduling_interrupts
> >     399.00 ± 64%     -63.6%     145.25 ± 16%  interrupts.CPU93.RES:Rescheduling_interrupts
> >     314.75 ± 36%     -44.2%     175.75 ± 13%  interrupts.CPU94.RES:Rescheduling_interrupts
> >     191.00 ± 15%     -29.1%     135.50 ±  9%  interrupts.CPU97.RES:Rescheduling_interrupts
> >      94.00 ±  8%     +50.0%     141.00 ± 12%  interrupts.IWI:IRQ_work_interrupts
> >     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.NMI:Non-maskable_interrupts
> >     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.PMI:Performance_monitoring_interrupts
> >      12.75 ± 11%      -4.1        8.67 ± 31%  perf-profile.calltrace.cycles-pp.do_rw_once
> >       1.02 ± 16%      -0.6        0.47 ± 59%  perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle
> >       1.10 ± 15%      -0.4        0.66 ± 14%  perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
> >       1.05 ± 16%      -0.4        0.61 ± 14%  perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter
> >       1.58 ±  4%      +0.3        1.91 ±  7%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page
> >       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       2.11 ±  4%      +0.5        2.60 ±  7%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault
> >       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
> >       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       1.90 ±  5%      +0.6        2.45 ±  7%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage
> >       0.65 ± 62%      +0.6        1.20 ± 15%  perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
> >       0.60 ± 62%      +0.6        1.16 ± 18%  perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap
> >       0.95 ± 17%      +0.6        1.52 ±  8%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner
> >       0.61 ± 62%      +0.6        1.18 ± 18%  perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput
> >       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit
> >       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit
> >       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
> >       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group
> >       1.30 ±  9%      +0.6        1.92 ±  8%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock
> >       0.19 ±173%      +0.7        0.89 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu
> >       0.19 ±173%      +0.7        0.90 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu
> >       0.00            +0.8        0.77 ± 30%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page
> >       0.00            +0.8        0.78 ± 30%  perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page
> >       0.00            +0.8        0.79 ± 29%  perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
> >       0.82 ± 67%      +0.9        1.72 ± 22%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault
> >       0.84 ± 66%      +0.9        1.74 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
> >       2.52 ±  6%      +0.9        3.44 ±  9%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page
> >       0.83 ± 67%      +0.9        1.75 ± 21%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
> >       0.84 ± 66%      +0.9        1.77 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
> >       1.64 ± 12%      +1.0        2.67 ±  7%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault
> >       1.65 ± 45%      +1.3        2.99 ± 18%  perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
> >       1.74 ± 13%      +1.4        3.16 ±  6%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault
> >       2.56 ± 48%      +2.2        4.81 ± 19%  perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault
> >      12.64 ± 14%      +3.6       16.20 ±  8%  perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault
> >       2.97 ±  7%      +3.8        6.74 ±  9%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow
> >      19.99 ±  9%      +4.1       24.05 ±  6%  perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault
> >       1.37 ± 15%      -0.5        0.83 ± 13%  perf-profile.children.cycles-pp.sched_clock_cpu
> >       1.31 ± 16%      -0.5        0.78 ± 13%  perf-profile.children.cycles-pp.sched_clock
> >       1.29 ± 16%      -0.5        0.77 ± 13%  perf-profile.children.cycles-pp.native_sched_clock
> >       1.80 ±  2%      -0.3        1.47 ± 10%  perf-profile.children.cycles-pp.task_tick_fair
> >       0.73 ±  2%      -0.2        0.54 ± 11%  perf-profile.children.cycles-pp.update_curr
> >       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.children.cycles-pp.account_process_tick
> >       0.73 ± 10%      -0.2        0.58 ±  9%  perf-profile.children.cycles-pp.rcu_sched_clock_irq
> >       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.children.cycles-pp.__acct_update_integrals
> >       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs
> >       0.40 ± 12%      -0.1        0.30 ± 14%  perf-profile.children.cycles-pp.__next_timer_interrupt
> >       0.47 ±  7%      -0.1        0.39 ± 13%  perf-profile.children.cycles-pp.update_rq_clock
> >       0.29 ± 12%      -0.1        0.21 ± 15%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
> >       0.21 ±  7%      -0.1        0.14 ± 12%  perf-profile.children.cycles-pp.account_system_index_time
> >       0.38 ±  2%      -0.1        0.31 ± 12%  perf-profile.children.cycles-pp.timerqueue_add
> >       0.26 ± 11%      -0.1        0.20 ± 13%  perf-profile.children.cycles-pp.find_next_bit
> >       0.23 ± 15%      -0.1        0.17 ± 15%  perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit
> >       0.14 ±  8%      -0.1        0.07 ± 14%  perf-profile.children.cycles-pp.account_user_time
> >       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.children.cycles-pp.cpuacct_charge
> >       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.irq_work_tick
> >       0.11 ± 13%      -0.0        0.07 ± 25%  perf-profile.children.cycles-pp.tick_sched_do_timer
> >       0.12 ± 10%      -0.0        0.08 ± 15%  perf-profile.children.cycles-pp.get_cpu_device
> >       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.children.cycles-pp.raise_softirq
> >       0.12 ±  3%      -0.0        0.09 ±  8%  perf-profile.children.cycles-pp.write
> >       0.11 ± 13%      +0.0        0.14 ±  8%  perf-profile.children.cycles-pp.native_write_msr
> >       0.09 ±  9%      +0.0        0.11 ±  7%  perf-profile.children.cycles-pp.finish_task_switch
> >       0.10 ± 10%      +0.0        0.13 ±  5%  perf-profile.children.cycles-pp.schedule_idle
> >       0.07 ±  6%      +0.0        0.10 ± 12%  perf-profile.children.cycles-pp.__read_nocancel
> >       0.04 ± 58%      +0.0        0.07 ± 15%  perf-profile.children.cycles-pp.__free_pages_ok
> >       0.06 ±  7%      +0.0        0.09 ± 13%  perf-profile.children.cycles-pp.perf_read
> >       0.07            +0.0        0.11 ± 14%  perf-profile.children.cycles-pp.perf_evsel__read_counter
> >       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.cmd_stat
> >       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.__run_perf_stat
> >       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.process_interval
> >       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.read_counters
> >       0.07 ± 22%      +0.0        0.11 ± 19%  perf-profile.children.cycles-pp.__handle_mm_fault
> >       0.07 ± 19%      +0.1        0.13 ±  8%  perf-profile.children.cycles-pp.rb_erase
> >       0.03 ±100%      +0.1        0.09 ±  9%  perf-profile.children.cycles-pp.smp_call_function_single
> >       0.01 ±173%      +0.1        0.08 ± 11%  perf-profile.children.cycles-pp.perf_event_read
> >       0.00            +0.1        0.07 ± 13%  perf-profile.children.cycles-pp.__perf_event_read_value
> >       0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.__intel_pmu_enable_all
> >       0.08 ± 17%      +0.1        0.15 ±  8%  perf-profile.children.cycles-pp.native_apic_msr_eoi_write
> >       0.04 ±103%      +0.1        0.13 ± 58%  perf-profile.children.cycles-pp.shmem_getpage_gfp
> >       0.38 ± 14%      +0.1        0.51 ±  6%  perf-profile.children.cycles-pp.run_timer_softirq
> >       0.11 ±  4%      +0.3        0.37 ± 32%  perf-profile.children.cycles-pp.worker_thread
> >       0.20 ±  5%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.ret_from_fork
> >       0.20 ±  4%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.kthread
> >       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.memcpy_erms
> >       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.drm_fb_helper_dirty_work
> >       0.00            +0.3        0.31 ± 37%  perf-profile.children.cycles-pp.process_one_work
> >       0.47 ± 48%      +0.4        0.91 ± 19%  perf-profile.children.cycles-pp.prep_new_huge_page
> >       0.70 ± 29%      +0.5        1.16 ± 18%  perf-profile.children.cycles-pp.free_huge_page
> >       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_flush_mmu
> >       0.72 ± 29%      +0.5        1.18 ± 18%  perf-profile.children.cycles-pp.release_pages
> >       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_finish_mmu
> >       0.76 ± 27%      +0.5        1.23 ± 18%  perf-profile.children.cycles-pp.exit_mmap
> >       0.77 ± 27%      +0.5        1.24 ± 18%  perf-profile.children.cycles-pp.mmput
> >       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.__x64_sys_exit_group
> >       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_group_exit
> >       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_exit
> >       1.28 ± 29%      +0.5        1.76 ±  9%  perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
> >       0.77 ± 28%      +0.5        1.26 ± 13%  perf-profile.children.cycles-pp.alloc_fresh_huge_page
> >       1.53 ± 15%      +0.7        2.26 ± 14%  perf-profile.children.cycles-pp.do_syscall_64
> >       1.53 ± 15%      +0.7        2.27 ± 14%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
> >       1.13 ±  3%      +0.9        2.07 ± 14%  perf-profile.children.cycles-pp.interrupt_entry
> >       0.79 ±  9%      +1.0        1.76 ±  5%  perf-profile.children.cycles-pp.perf_event_task_tick
> >       1.71 ± 39%      +1.4        3.08 ± 16%  perf-profile.children.cycles-pp.alloc_surplus_huge_page
> >       2.66 ± 42%      +2.3        4.94 ± 17%  perf-profile.children.cycles-pp.alloc_huge_page
> >       2.89 ± 45%      +2.7        5.54 ± 18%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> >       3.34 ± 35%      +2.7        6.02 ± 17%  perf-profile.children.cycles-pp._raw_spin_lock
> >      12.77 ± 14%      +3.9       16.63 ±  7%  perf-profile.children.cycles-pp.mutex_spin_on_owner
> >      20.12 ±  9%      +4.0       24.16 ±  6%  perf-profile.children.cycles-pp.hugetlb_cow
> >      15.40 ± 10%      -3.6       11.84 ± 28%  perf-profile.self.cycles-pp.do_rw_once
> >       4.02 ±  9%      -1.3        2.73 ± 30%  perf-profile.self.cycles-pp.do_access
> >       2.00 ± 14%      -0.6        1.41 ± 13%  perf-profile.self.cycles-pp.cpuidle_enter_state
> >       1.26 ± 16%      -0.5        0.74 ± 13%  perf-profile.self.cycles-pp.native_sched_clock
> >       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.self.cycles-pp.account_process_tick
> >       0.27 ± 19%      -0.2        0.12 ± 17%  perf-profile.self.cycles-pp.timerqueue_del
> >       0.53 ±  3%      -0.1        0.38 ± 11%  perf-profile.self.cycles-pp.update_curr
> >       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.self.cycles-pp.__acct_update_integrals
> >       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs
> >       0.61 ±  4%      -0.1        0.51 ±  8%  perf-profile.self.cycles-pp.task_tick_fair
> >       0.20 ±  8%      -0.1        0.12 ± 14%  perf-profile.self.cycles-pp.account_system_index_time
> >       0.23 ± 15%      -0.1        0.16 ± 17%  perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit
> >       0.25 ± 11%      -0.1        0.18 ± 14%  perf-profile.self.cycles-pp.find_next_bit
> >       0.10 ± 11%      -0.1        0.03 ±100%  perf-profile.self.cycles-pp.tick_sched_do_timer
> >       0.29            -0.1        0.23 ± 11%  perf-profile.self.cycles-pp.timerqueue_add
> >       0.12 ± 10%      -0.1        0.06 ± 17%  perf-profile.self.cycles-pp.account_user_time
> >       0.22 ± 15%      -0.1        0.16 ±  6%  perf-profile.self.cycles-pp.scheduler_tick
> >       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.self.cycles-pp.cpuacct_charge
> >       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.self.cycles-pp.irq_work_tick
> >       0.07 ± 13%      -0.0        0.03 ±100%  perf-profile.self.cycles-pp.update_process_times
> >       0.12 ±  7%      -0.0        0.08 ± 15%  perf-profile.self.cycles-pp.get_cpu_device
> >       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.self.cycles-pp.raise_softirq
> >       0.12 ± 11%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.tick_nohz_get_sleep_length
> >       0.11 ± 11%      +0.0        0.14 ±  6%  perf-profile.self.cycles-pp.native_write_msr
> >       0.10 ±  5%      +0.1        0.15 ±  8%  perf-profile.self.cycles-pp.__remove_hrtimer
> >       0.07 ± 23%      +0.1        0.13 ±  8%  perf-profile.self.cycles-pp.rb_erase
> >       0.08 ± 17%      +0.1        0.15 ±  7%  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
> >       0.00            +0.1        0.08 ± 10%  perf-profile.self.cycles-pp.smp_call_function_single
> >       0.32 ± 17%      +0.1        0.42 ±  7%  perf-profile.self.cycles-pp.run_timer_softirq
> >       0.22 ±  5%      +0.1        0.34 ±  4%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
> >       0.45 ± 15%      +0.2        0.60 ± 12%  perf-profile.self.cycles-pp.rcu_irq_enter
> >       0.31 ±  8%      +0.2        0.46 ± 16%  perf-profile.self.cycles-pp.irq_enter
> >       0.29 ± 10%      +0.2        0.44 ± 16%  perf-profile.self.cycles-pp.apic_timer_interrupt
> >       0.71 ± 30%      +0.2        0.92 ±  8%  perf-profile.self.cycles-pp.perf_mux_hrtimer_handler
> >       0.00            +0.3        0.28 ± 37%  perf-profile.self.cycles-pp.memcpy_erms
> >       1.12 ±  3%      +0.9        2.02 ± 15%  perf-profile.self.cycles-pp.interrupt_entry
> >       0.79 ±  9%      +0.9        1.73 ±  5%  perf-profile.self.cycles-pp.perf_event_task_tick
> >       2.49 ± 45%      +2.1        4.55 ± 20%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
> >      10.95 ± 15%      +2.7       13.61 ±  8%  perf-profile.self.cycles-pp.mutex_spin_on_owner
> >
> >
> >
> >                                vm-scalability.throughput
> >
> >   1.6e+07 +-+---------------------------------------------------------------+
> >           |..+.+    +..+.+..+.+.   +.      +..+.+..+.+..+.+..+.+..+    +    |
> >   1.4e+07 +-+  :    :  O      O    O                           O            |
> >   1.2e+07 O-+O O  O O    O  O    O    O O  O  O    O    O    O      O  O O  O
> >           |     :   :                           O    O    O       O         |
> >     1e+07 +-+   :  :                                                        |
> >           |     :  :                                                        |
> >     8e+06 +-+   :  :                                                        |
> >           |      : :                                                        |
> >     6e+06 +-+    : :                                                        |
> >     4e+06 +-+    : :                                                        |
> >           |      ::                                                         |
> >     2e+06 +-+     :                                                         |
> >           |       :                                                         |
> >         0 +-+---------------------------------------------------------------+
> >
> >
> >                          vm-scalability.time.minor_page_faults
> >
> >   2.5e+06 +-+---------------------------------------------------------------+
> >           |                                                                 |
> >           |..+.+    +..+.+..+.+..+.+..+.+..  .+.  .+.+..+.+..+.+..+.+..+    |
> >     2e+06 +-+  :    :                      +.   +.                          |
> >           O  O O: O O  O O  O O  O O                    O      O            |
> >           |     :   :                 O O  O  O O  O O    O  O    O O  O O  O
> >   1.5e+06 +-+   :  :                                                        |
> >           |     :  :                                                        |
> >     1e+06 +-+    : :                                                        |
> >           |      : :                                                        |
> >           |      : :                                                        |
> >    500000 +-+    : :                                                        |
> >           |       :                                                         |
> >           |       :                                                         |
> >         0 +-+---------------------------------------------------------------+
> >
> >
> >                                 vm-scalability.workload
> >
> >   3.5e+09 +-+---------------------------------------------------------------+
> >           | .+.                      .+.+..                        .+..     |
> >     3e+09 +-+  +    +..+.+..+.+..+.+.      +..+.+..+.+..+.+..+.+..+    +    |
> >           |    :    :       O O                                O            |
> >   2.5e+09 O-+O O: O O  O O       O O  O    O            O                   |
> >           |     :   :                   O     O O  O O    O  O    O O  O O  O
> >     2e+09 +-+   :  :                                                        |
> >           |     :  :                                                        |
> >   1.5e+09 +-+    : :                                                        |
> >           |      : :                                                        |
> >     1e+09 +-+    : :                                                        |
> >           |      : :                                                        |
> >     5e+08 +-+     :                                                         |
> >           |       :                                                         |
> >         0 +-+---------------------------------------------------------------+
> >
> >
> > [*] bisect-good sample
> > [O] bisect-bad  sample
> >
> >
> >
> > Disclaimer:
> > Results have been estimated based on internal Intel analysis and are provided
> > for informational purposes only. Any difference in system hardware or software
> > design or configuration may affect actual performance.
> >
> >
> > Thanks,
> > Rong Chen
> >
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-30 18:12   ` Daniel Vetter
@ 2019-07-30 18:50     ` Thomas Zimmermann
  2019-07-30 18:59       ` Daniel Vetter
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-07-30 18:50 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Stephen Rothwell, LKP, dri-devel, kernel test robot


[-- Attachment #1.1.1: Type: text/plain, Size: 63328 bytes --]

Hi

Am 30.07.19 um 20:12 schrieb Daniel Vetter:
> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>> Greeting,
>>>
>>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>>>
>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>
>> Daniel, Noralf, we may have to revert this patch.
>>
>> I expected some change in display performance, but not in VM. Since it's
>> a server chipset, probably no one cares much about display performance.
>> So that seemed like a good trade-off for re-using shared code.
>>
>> Part of the patch set is that the generic fb emulation now maps and
>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>> of the performance regression. And it should be visible with other
>> drivers as well if they use a shadow FB for fbdev emulation.
> 
> For fbcon we shouldn't need to do any maps/unmaps at all; this is for
> the fbdev mmap support only. If the testcase mentioned here tests fbdev
> mmap handling it's pretty badly misnamed :-) And as long as you don't
> have an fbdev mmap there shouldn't be any impact at all.

The ast and mgag200 chips have only a few MiB of VRAM, so we have to
get the fbdev BO out of VRAM while it's not being displayed. If it's
not mapped, it can be evicted to make room for X, etc.

To make this work, the BO's memory is mapped and unmapped in
drm_fb_helper_dirty_work() before it is updated from the shadow FB. [1]
That fbdev mapping is re-established on each screen update, more or
less. From my (as yet unverified) understanding, this is what causes
the performance regression in the VM code.
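
To illustrate, the per-update flow in [1] is roughly the following (a
simplified sketch only, not the exact upstream code; locking and the
damage-clip bookkeeping are left out, and the blit helper is written
out inline):

/*
 * Simplified sketch of the dirty worker in [1]; not the literal code.
 */
static void dirty_work_sketch(struct drm_fb_helper *helper,
			      struct drm_clip_rect *clip)
{
	struct drm_framebuffer *fb = helper->fb;
	unsigned int cpp = fb->format->cpp[0];
	size_t offset = clip->y1 * fb->pitches[0] + clip->x1 * cpp;
	void *vaddr, *src, *dst;
	unsigned int y;

	/* map the fbdev BO for this update ... */
	vaddr = drm_client_buffer_vmap(helper->buffer);
	if (IS_ERR(vaddr))
		return;

	/* ... copy the damaged lines from the shadow FB into the BO ... */
	src = helper->fbdev->screen_buffer + offset;
	dst = vaddr + offset;
	for (y = clip->y1; y < clip->y2; y++) {
		memcpy(dst, src, (clip->x2 - clip->x1) * cpp);
		src += fb->pitches[0];
		dst += fb->pitches[0];
	}

	if (fb->funcs->dirty)
		fb->funcs->dirty(fb, NULL, 0, 0, clip, 1);

	/* ... and drop the mapping again so the BO can be evicted */
	drm_client_buffer_vunmap(helper->buffer);
}

So every fbcon update that reaches the worker pays for a full
vmap/vunmap cycle of the BO on top of the memcpy.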

The original code in mgag200 used to kmap the fbdev BO while it's being
displayed; [2] and the drawing code only mapped it when necessary
(i.e., when the BO is not being displayed). [3]
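
In pseudocode, that old behaviour was roughly the following (a
hypothetical sketch based on [2] and [3]; all struct and helper names
below are made up for illustration, the real driver used its TTM-based
BO helpers):

/*
 * Hypothetical sketch of the pre-90f479ae51 mgag200 fbdev flow from
 * [2][3]. The names are invented; only the map-on-demand logic is the
 * point.
 */
struct old_mga_fbdev_sketch {
	void *vaddr;	/* kept mapped while the BO is being scanned out */
	void *shadow;	/* system-memory shadow framebuffer */
};

static void old_mga_dirty_update_sketch(struct old_mga_fbdev_sketch *mfbdev,
					int x, int y, int w, int h)
{
	void *dst = mfbdev->vaddr;
	bool unmap_after = false;

	if (!dst) {
		/* BO not currently displayed: map it just for this blit */
		dst = old_mga_bo_kmap_sketch(mfbdev);
		unmap_after = true;
	}

	old_mga_blit_rect_sketch(dst, mfbdev->shadow, x, y, w, h);

	if (unmap_after)
		old_mga_bo_kunmap_sketch(mfbdev);
}

While the BO is being displayed it stays pinned and kmapped, so the
common fbcon path never has to set up a mapping at all.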

I think this could be added to the VRAM helpers as well, but it would
still be a workaround, and non-VRAM drivers might run into a similar
performance regression if they use the fbdev emulation's shadow FB.
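
One possible direction (purely a sketch with made-up names, not an
actual patch): cache the kernel mapping in the client buffer and
refcount it, so the per-update vmap becomes a no-op while the buffer
is pinned for scanout, and the BO can still be unmapped and evicted
once it's no longer displayed:

/*
 * Hypothetical sketch only; struct and function names are invented.
 * It reuses the existing drm_client_buffer_vmap()/vunmap() calls.
 */
struct vmap_cache_sketch {
	void *vaddr;		/* NULL while unmapped */
	unsigned int count;	/* nested mapping references */
};

static void *client_vmap_cached_sketch(struct drm_client_buffer *buffer,
				       struct vmap_cache_sketch *cache)
{
	if (!cache->count) {
		void *vaddr = drm_client_buffer_vmap(buffer);

		if (IS_ERR(vaddr))
			return vaddr;
		cache->vaddr = vaddr;
	}
	cache->count++;
	return cache->vaddr;
}

static void client_vunmap_cached_sketch(struct drm_client_buffer *buffer,
					struct vmap_cache_sketch *cache)
{
	if (--cache->count)
		return;
	drm_client_buffer_vunmap(buffer);
	cache->vaddr = NULL;
}

The fbdev emulation could then take a long-lived reference while the
BO is displayed and a short-lived one around each blit, similar to
what the old mgag200 code did.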

Noralf mentioned that there are plans for other DRM clients besides the
console. They would run into similar problems as well.

>> The thing is that we'd need another generic fbdev emulation for ast and
>> mgag200 that handles this issue properly.
> 
> Yeah, I don't think we want to jump the gun here. If you can try to
> repro locally and profile where we're wasting CPU time, I hope that
> should shed some light on what's going wrong here.

I don't have much time ATM and I'm not even officially at work until
late Aug. I'd send you the revert and investigate later. I agree that
using generic fbdev emulation would be preferable.

Best regards
Thomas


[1]
https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/drm_fb_helper.c?id=90f479ae51afa45efab97afdde9b94b9660dd3e4#n419
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/mgag200/mgag200_mode.c?h=v5.2#n897
[3]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?h=v5.2#n75

> -Daniel
> 
>>
>> Best regards
>> Thomas
>>
>>>
>>> in testcase: vm-scalability
>>> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
>>> with following parameters:
>>>
>>>       runtime: 300s
>>>       size: 8T
>>>       test: anon-cow-seq-hugetlb
>>>       cpufreq_governor: performance
>>>
>>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
>>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>>>
>>>
>>>
>>> Details are as below:
>>> -------------------------------------------------------------------------------------------------->
>>>
>>>
>>> To reproduce:
>>>
>>>         git clone https://github.com/intel/lkp-tests.git
>>>         cd lkp-tests
>>>         bin/lkp install job.yaml  # job file is attached in this email
>>>         bin/lkp run     job.yaml
>>>
>>> =========================================================================================
>>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>>>   gcc-7/performance/x86_64-rhel-7.6/debian-x86_64-2019-05-14.cgz/300s/8T/lkp-knm01/anon-cow-seq-hugetlb/vm-scalability
>>>
>>> commit:
>>>   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
>>>   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>
>>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9
>>> ---------------- ---------------------------
>>>        fail:runs  %reproduction    fail:runs
>>>            |             |             |
>>>           2:4          -50%            :4     dmesg.WARNING:at#for_ip_interrupt_entry/0x
>>>            :4           25%           1:4     dmesg.WARNING:at_ip___perf_sw_event/0x
>>>            :4           25%           1:4     dmesg.WARNING:at_ip__fsnotify_parent/0x
>>>          %stddev     %change         %stddev
>>>              \          |                \
>>>      43955 ±  2%     -18.8%      35691        vm-scalability.median
>>>       0.06 ±  7%    +193.0%       0.16 ±  2%  vm-scalability.median_stddev
>>>   14906559 ±  2%     -17.9%   12237079        vm-scalability.throughput
>>>      87651 ±  2%     -17.4%      72374        vm-scalability.time.involuntary_context_switches
>>>    2086168           -23.6%    1594224        vm-scalability.time.minor_page_faults
>>>      15082 ±  2%     -10.4%      13517        vm-scalability.time.percent_of_cpu_this_job_got
>>>      29987            -8.9%      27327        vm-scalability.time.system_time
>>>      15755           -12.4%      13795        vm-scalability.time.user_time
>>>     122011           -19.3%      98418        vm-scalability.time.voluntary_context_switches
>>>  3.034e+09           -23.6%  2.318e+09        vm-scalability.workload
>>>     242478 ± 12%     +68.5%     408518 ± 23%  cpuidle.POLL.time
>>>       2788 ± 21%    +117.4%       6062 ± 26%  cpuidle.POLL.usage
>>>      56653 ± 10%     +64.4%      93144 ± 20%  meminfo.Mapped
>>>     120392 ±  7%     +14.0%     137212 ±  4%  meminfo.Shmem
>>>      47221 ± 11%     +77.1%      83634 ± 22%  numa-meminfo.node0.Mapped
>>>     120465 ±  7%     +13.9%     137205 ±  4%  numa-meminfo.node0.Shmem
>>>    2885513           -16.5%    2409384        numa-numastat.node0.local_node
>>>    2885471           -16.5%    2409354        numa-numastat.node0.numa_hit
>>>      11813 ± 11%     +76.3%      20824 ± 22%  numa-vmstat.node0.nr_mapped
>>>      30096 ±  7%     +13.8%      34238 ±  4%  numa-vmstat.node0.nr_shmem
>>>      43.72 ±  2%      +5.5       49.20        mpstat.cpu.all.idle%
>>>       0.03 ±  4%      +0.0        0.05 ±  6%  mpstat.cpu.all.soft%
>>>      19.51            -2.4       17.08        mpstat.cpu.all.usr%
>>>       1012            -7.9%     932.75        turbostat.Avg_MHz
>>>      32.38 ± 10%     +25.8%      40.73        turbostat.CPU%c1
>>>     145.51            -3.1%     141.01        turbostat.PkgWatt
>>>      15.09           -19.2%      12.19        turbostat.RAMWatt
>>>      43.50 ±  2%     +13.2%      49.25        vmstat.cpu.id
>>>      18.75 ±  2%     -13.3%      16.25 ±  2%  vmstat.cpu.us
>>>     152.00 ±  2%      -9.5%     137.50        vmstat.procs.r
>>>       4800           -13.1%       4173        vmstat.system.cs
>>>     156170           -11.9%     137594        slabinfo.anon_vma.active_objs
>>>       3395           -11.9%       2991        slabinfo.anon_vma.active_slabs
>>>     156190           -11.9%     137606        slabinfo.anon_vma.num_objs
>>>       3395           -11.9%       2991        slabinfo.anon_vma.num_slabs
>>>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.active_objs
>>>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.num_objs
>>>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.active_objs
>>>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.num_objs
>>>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.active_objs
>>>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.num_objs
>>>    1330122           -23.6%    1016557        proc-vmstat.htlb_buddy_alloc_success
>>>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_active_anon
>>>      67277            +2.9%      69246        proc-vmstat.nr_anon_pages
>>>     218.50 ±  3%     -10.6%     195.25        proc-vmstat.nr_dirtied
>>>     288628            +1.4%     292755        proc-vmstat.nr_file_pages
>>>     360.50            -2.7%     350.75        proc-vmstat.nr_inactive_file
>>>      14225 ±  9%     +63.8%      23304 ± 20%  proc-vmstat.nr_mapped
>>>      30109 ±  7%     +13.8%      34259 ±  4%  proc-vmstat.nr_shmem
>>>      99870            -1.3%      98597        proc-vmstat.nr_slab_unreclaimable
>>>     204.00 ±  4%     -12.1%     179.25        proc-vmstat.nr_written
>>>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_zone_active_anon
>>>     360.50            -2.7%     350.75        proc-vmstat.nr_zone_inactive_file
>>>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults
>>>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults_local
>>>    2904082           -16.4%    2427026        proc-vmstat.numa_hit
>>>    2904081           -16.4%    2427025        proc-vmstat.numa_local
>>>  6.828e+08           -23.5%  5.221e+08        proc-vmstat.pgalloc_normal
>>>    2900008           -17.2%    2400195        proc-vmstat.pgfault
>>>  6.827e+08           -23.5%   5.22e+08        proc-vmstat.pgfree
>>>  1.635e+10           -17.0%  1.357e+10        perf-stat.i.branch-instructions
>>>       1.53 ±  4%      -0.1        1.45 ±  3%  perf-stat.i.branch-miss-rate%
>>>  2.581e+08 ±  3%     -20.5%  2.051e+08 ±  2%  perf-stat.i.branch-misses
>>>      12.66            +1.1       13.78        perf-stat.i.cache-miss-rate%
>>>   72720849           -12.0%   63958986        perf-stat.i.cache-misses
>>>  5.766e+08           -18.6%  4.691e+08        perf-stat.i.cache-references
>>>       4674 ±  2%     -13.0%       4064        perf-stat.i.context-switches
>>>       4.29           +12.5%       4.83        perf-stat.i.cpi
>>>  2.573e+11            -7.4%  2.383e+11        perf-stat.i.cpu-cycles
>>>     231.35           -21.5%     181.56        perf-stat.i.cpu-migrations
>>>       3522            +4.4%       3677        perf-stat.i.cycles-between-cache-misses
>>>       0.09 ± 13%      +0.0        0.12 ±  5%  perf-stat.i.iTLB-load-miss-rate%
>>>  5.894e+10           -15.8%  4.961e+10        perf-stat.i.iTLB-loads
>>>  5.901e+10           -15.8%  4.967e+10        perf-stat.i.instructions
>>>       1291 ± 14%     -21.8%       1010        perf-stat.i.instructions-per-iTLB-miss
>>>       0.24           -11.0%       0.21        perf-stat.i.ipc
>>>       9476           -17.5%       7821        perf-stat.i.minor-faults
>>>       9478           -17.5%       7821        perf-stat.i.page-faults
>>>       9.76            -3.6%       9.41        perf-stat.overall.MPKI
>>>       1.59 ±  4%      -0.1        1.52        perf-stat.overall.branch-miss-rate%
>>>      12.61            +1.1       13.71        perf-stat.overall.cache-miss-rate%
>>>       4.38           +10.5%       4.83        perf-stat.overall.cpi
>>>       3557            +5.3%       3747        perf-stat.overall.cycles-between-cache-misses
>>>       0.08 ± 12%      +0.0        0.10        perf-stat.overall.iTLB-load-miss-rate%
>>>       1268 ± 15%     -23.0%     976.22        perf-stat.overall.instructions-per-iTLB-miss
>>>       0.23            -9.5%       0.21        perf-stat.overall.ipc
>>>       5815            +9.7%       6378        perf-stat.overall.path-length
>>>  1.634e+10           -17.5%  1.348e+10        perf-stat.ps.branch-instructions
>>>  2.595e+08 ±  3%     -21.2%  2.043e+08 ±  2%  perf-stat.ps.branch-misses
>>>   72565205           -12.2%   63706339        perf-stat.ps.cache-misses
>>>  5.754e+08           -19.2%  4.646e+08        perf-stat.ps.cache-references
>>>       4640 ±  2%     -12.5%       4060        perf-stat.ps.context-switches
>>>  2.581e+11            -7.5%  2.387e+11        perf-stat.ps.cpu-cycles
>>>     229.91           -22.0%     179.42        perf-stat.ps.cpu-migrations
>>>  5.889e+10           -16.3%  4.927e+10        perf-stat.ps.iTLB-loads
>>>  5.899e+10           -16.3%  4.938e+10        perf-stat.ps.instructions
>>>       9388           -18.2%       7677        perf-stat.ps.minor-faults
>>>       9389           -18.2%       7677        perf-stat.ps.page-faults
>>>  1.764e+13           -16.2%  1.479e+13        perf-stat.total.instructions
>>>      46803 ±  3%     -18.8%      37982 ±  6%  sched_debug.cfs_rq:/.exec_clock.min
>>>       5320 ±  3%     +23.7%       6581 ±  3%  sched_debug.cfs_rq:/.exec_clock.stddev
>>>       6737 ± 14%     +58.1%      10649 ± 10%  sched_debug.cfs_rq:/.load.avg
>>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.load.max
>>>      46952 ± 16%     +64.8%      77388 ± 11%  sched_debug.cfs_rq:/.load.stddev
>>>       7.12 ±  4%     +49.1%      10.62 ±  6%  sched_debug.cfs_rq:/.load_avg.avg
>>>     474.40 ± 23%     +67.5%     794.60 ± 10%  sched_debug.cfs_rq:/.load_avg.max
>>>      37.70 ± 11%     +74.8%      65.90 ±  9%  sched_debug.cfs_rq:/.load_avg.stddev
>>>   13424269 ±  4%     -15.6%   11328098 ±  2%  sched_debug.cfs_rq:/.min_vruntime.avg
>>>   15411275 ±  3%     -12.4%   13505072 ±  2%  sched_debug.cfs_rq:/.min_vruntime.max
>>>    7939295 ±  6%     -17.5%    6551322 ±  7%  sched_debug.cfs_rq:/.min_vruntime.min
>>>      21.44 ±  7%     -56.1%       9.42 ±  4%  sched_debug.cfs_rq:/.nr_spread_over.avg
>>>     117.45 ± 11%     -60.6%      46.30 ± 14%  sched_debug.cfs_rq:/.nr_spread_over.max
>>>      19.33 ±  8%     -66.4%       6.49 ±  9%  sched_debug.cfs_rq:/.nr_spread_over.stddev
>>>       4.32 ± 15%     +84.4%       7.97 ±  3%  sched_debug.cfs_rq:/.runnable_load_avg.avg
>>>     353.85 ± 29%    +118.8%     774.35 ± 11%  sched_debug.cfs_rq:/.runnable_load_avg.max
>>>      27.30 ± 24%    +118.5%      59.64 ±  9%  sched_debug.cfs_rq:/.runnable_load_avg.stddev
>>>       6729 ± 14%     +58.2%      10644 ± 10%  sched_debug.cfs_rq:/.runnable_weight.avg
>>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.runnable_weight.max
>>>      46950 ± 16%     +64.8%      77387 ± 11%  sched_debug.cfs_rq:/.runnable_weight.stddev
>>>    5305069 ±  4%     -17.4%    4380376 ±  7%  sched_debug.cfs_rq:/.spread0.avg
>>>    7328745 ±  3%      -9.9%    6600897 ±  3%  sched_debug.cfs_rq:/.spread0.max
>>>    2220837 ±  4%     +55.8%    3460596 ±  5%  sched_debug.cpu.avg_idle.avg
>>>    4590666 ±  9%     +76.8%    8117037 ± 15%  sched_debug.cpu.avg_idle.max
>>>     485052 ±  7%     +80.3%     874679 ± 10%  sched_debug.cpu.avg_idle.stddev
>>>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock.stddev
>>>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock_task.stddev
>>>       3.20 ± 10%    +109.6%       6.70 ±  3%  sched_debug.cpu.cpu_load[0].avg
>>>     309.10 ± 20%    +150.3%     773.75 ± 12%  sched_debug.cpu.cpu_load[0].max
>>>      21.02 ± 14%    +160.8%      54.80 ±  9%  sched_debug.cpu.cpu_load[0].stddev
>>>       3.19 ±  8%    +109.8%       6.70 ±  3%  sched_debug.cpu.cpu_load[1].avg
>>>     299.75 ± 19%    +158.0%     773.30 ± 12%  sched_debug.cpu.cpu_load[1].max
>>>      20.32 ± 12%    +168.7%      54.62 ±  9%  sched_debug.cpu.cpu_load[1].stddev
>>>       3.20 ±  8%    +109.1%       6.69 ±  4%  sched_debug.cpu.cpu_load[2].avg
>>>     288.90 ± 20%    +167.0%     771.40 ± 12%  sched_debug.cpu.cpu_load[2].max
>>>      19.70 ± 12%    +175.4%      54.27 ±  9%  sched_debug.cpu.cpu_load[2].stddev
>>>       3.16 ±  8%    +110.9%       6.66 ±  6%  sched_debug.cpu.cpu_load[3].avg
>>>     275.50 ± 24%    +178.4%     766.95 ± 12%  sched_debug.cpu.cpu_load[3].max
>>>      18.92 ± 15%    +184.2%      53.77 ± 10%  sched_debug.cpu.cpu_load[3].stddev
>>>       3.08 ±  8%    +115.7%       6.65 ±  7%  sched_debug.cpu.cpu_load[4].avg
>>>     263.55 ± 28%    +188.7%     760.85 ± 12%  sched_debug.cpu.cpu_load[4].max
>>>      18.03 ± 18%    +196.6%      53.46 ± 11%  sched_debug.cpu.cpu_load[4].stddev
>>>      14543            -9.6%      13150        sched_debug.cpu.curr->pid.max
>>>       5293 ± 16%     +74.7%       9248 ± 11%  sched_debug.cpu.load.avg
>>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cpu.load.max
>>>      40887 ± 19%     +78.3%      72891 ±  9%  sched_debug.cpu.load.stddev
>>>    1141679 ±  4%     +56.9%    1790907 ±  5%  sched_debug.cpu.max_idle_balance_cost.avg
>>>    2432100 ±  9%     +72.6%    4196779 ± 13%  sched_debug.cpu.max_idle_balance_cost.max
>>>     745656           +29.3%     964170 ±  5%  sched_debug.cpu.max_idle_balance_cost.min
>>>     239032 ±  9%     +81.9%     434806 ± 10%  sched_debug.cpu.max_idle_balance_cost.stddev
>>>       0.00 ± 27%     +92.1%       0.00 ± 31%  sched_debug.cpu.next_balance.stddev
>>>       1030 ±  4%     -10.4%     924.00 ±  2%  sched_debug.cpu.nr_switches.min
>>>       0.04 ± 26%    +139.0%       0.09 ± 41%  sched_debug.cpu.nr_uninterruptible.avg
>>>     830.35 ±  6%     -12.0%     730.50 ±  2%  sched_debug.cpu.sched_count.min
>>>     912.00 ±  2%      -9.5%     825.38        sched_debug.cpu.ttwu_count.avg
>>>     433.05 ±  3%     -19.2%     350.05 ±  3%  sched_debug.cpu.ttwu_count.min
>>>     160.70 ±  3%     -12.5%     140.60 ±  4%  sched_debug.cpu.ttwu_local.min
>>>       9072 ± 11%     -36.4%       5767 ±  8%  softirqs.CPU1.RCU
>>>      12769 ±  5%     +15.3%      14718 ±  3%  softirqs.CPU101.SCHED
>>>      13198           +11.5%      14717 ±  3%  softirqs.CPU102.SCHED
>>>      12981 ±  4%     +13.9%      14788 ±  3%  softirqs.CPU105.SCHED
>>>      13486 ±  3%     +11.8%      15071 ±  4%  softirqs.CPU111.SCHED
>>>      12794 ±  4%     +14.1%      14601 ±  9%  softirqs.CPU112.SCHED
>>>      12999 ±  4%     +10.1%      14314 ±  4%  softirqs.CPU115.SCHED
>>>      12844 ±  4%     +10.6%      14202 ±  2%  softirqs.CPU120.SCHED
>>>      13336 ±  3%      +9.4%      14585 ±  3%  softirqs.CPU122.SCHED
>>>      12639 ±  4%     +20.2%      15195        softirqs.CPU123.SCHED
>>>      13040 ±  5%     +15.2%      15024 ±  5%  softirqs.CPU126.SCHED
>>>      13123           +15.1%      15106 ±  5%  softirqs.CPU127.SCHED
>>>       9188 ±  6%     -35.7%       5911 ±  2%  softirqs.CPU13.RCU
>>>      13054 ±  3%     +13.1%      14761 ±  5%  softirqs.CPU130.SCHED
>>>      13158 ±  2%     +13.9%      14985 ±  5%  softirqs.CPU131.SCHED
>>>      12797 ±  6%     +13.5%      14524 ±  3%  softirqs.CPU133.SCHED
>>>      12452 ±  5%     +14.8%      14297        softirqs.CPU134.SCHED
>>>      13078 ±  3%     +10.4%      14439 ±  3%  softirqs.CPU138.SCHED
>>>      12617 ±  2%     +14.5%      14442 ±  5%  softirqs.CPU139.SCHED
>>>      12974 ±  3%     +13.7%      14752 ±  4%  softirqs.CPU142.SCHED
>>>      12579 ±  4%     +19.1%      14983 ±  3%  softirqs.CPU143.SCHED
>>>       9122 ± 24%     -44.6%       5053 ±  5%  softirqs.CPU144.RCU
>>>      13366 ±  2%     +11.1%      14848 ±  3%  softirqs.CPU149.SCHED
>>>      13246 ±  2%     +22.0%      16162 ±  7%  softirqs.CPU150.SCHED
>>>      13452 ±  3%     +20.5%      16210 ±  7%  softirqs.CPU151.SCHED
>>>      13507           +10.1%      14869        softirqs.CPU156.SCHED
>>>      13808 ±  3%      +9.2%      15079 ±  4%  softirqs.CPU157.SCHED
>>>      13442 ±  2%     +13.4%      15248 ±  4%  softirqs.CPU160.SCHED
>>>      13311           +12.1%      14920 ±  2%  softirqs.CPU162.SCHED
>>>      13544 ±  3%      +8.5%      14695 ±  4%  softirqs.CPU163.SCHED
>>>      13648 ±  3%     +11.2%      15179 ±  2%  softirqs.CPU166.SCHED
>>>      13404 ±  4%     +12.5%      15079 ±  3%  softirqs.CPU168.SCHED
>>>      13421 ±  6%     +16.0%      15568 ±  8%  softirqs.CPU169.SCHED
>>>      13115 ±  3%     +23.1%      16139 ± 10%  softirqs.CPU171.SCHED
>>>      13424 ±  6%     +10.4%      14822 ±  3%  softirqs.CPU175.SCHED
>>>      13274 ±  3%     +13.7%      15087 ±  9%  softirqs.CPU185.SCHED
>>>      13409 ±  3%     +12.3%      15063 ±  3%  softirqs.CPU190.SCHED
>>>      13181 ±  7%     +13.4%      14946 ±  3%  softirqs.CPU196.SCHED
>>>      13578 ±  3%     +10.9%      15061        softirqs.CPU197.SCHED
>>>      13323 ±  5%     +24.8%      16627 ±  6%  softirqs.CPU198.SCHED
>>>      14072 ±  2%     +12.3%      15798 ±  7%  softirqs.CPU199.SCHED
>>>      12604 ± 13%     +17.9%      14865        softirqs.CPU201.SCHED
>>>      13380 ±  4%     +14.8%      15356 ±  3%  softirqs.CPU203.SCHED
>>>      13481 ±  8%     +14.2%      15390 ±  3%  softirqs.CPU204.SCHED
>>>      12921 ±  2%     +13.8%      14710 ±  3%  softirqs.CPU206.SCHED
>>>      13468           +13.0%      15218 ±  2%  softirqs.CPU208.SCHED
>>>      13253 ±  2%     +13.1%      14992        softirqs.CPU209.SCHED
>>>      13319 ±  2%     +14.3%      15225 ±  7%  softirqs.CPU210.SCHED
>>>      13673 ±  5%     +16.3%      15895 ±  3%  softirqs.CPU211.SCHED
>>>      13290           +17.0%      15556 ±  5%  softirqs.CPU212.SCHED
>>>      13455 ±  4%     +14.4%      15392 ±  3%  softirqs.CPU213.SCHED
>>>      13454 ±  4%     +14.3%      15377 ±  3%  softirqs.CPU215.SCHED
>>>      13872 ±  7%      +9.7%      15221 ±  5%  softirqs.CPU220.SCHED
>>>      13555 ±  4%     +17.3%      15896 ±  5%  softirqs.CPU222.SCHED
>>>      13411 ±  4%     +20.8%      16197 ±  6%  softirqs.CPU223.SCHED
>>>       8472 ± 21%     -44.8%       4680 ±  3%  softirqs.CPU224.RCU
>>>      13141 ±  3%     +16.2%      15265 ±  7%  softirqs.CPU225.SCHED
>>>      14084 ±  3%      +8.2%      15242 ±  2%  softirqs.CPU226.SCHED
>>>      13528 ±  4%     +11.3%      15063 ±  4%  softirqs.CPU228.SCHED
>>>      13218 ±  3%     +16.3%      15377 ±  4%  softirqs.CPU229.SCHED
>>>      14031 ±  4%     +10.2%      15467 ±  2%  softirqs.CPU231.SCHED
>>>      13770 ±  3%     +14.0%      15700 ±  3%  softirqs.CPU232.SCHED
>>>      13456 ±  3%     +12.3%      15105 ±  3%  softirqs.CPU233.SCHED
>>>      13137 ±  4%     +13.5%      14909 ±  3%  softirqs.CPU234.SCHED
>>>      13318 ±  2%     +14.7%      15280 ±  2%  softirqs.CPU235.SCHED
>>>      13690 ±  2%     +13.7%      15563 ±  7%  softirqs.CPU238.SCHED
>>>      13771 ±  5%     +20.8%      16634 ±  7%  softirqs.CPU241.SCHED
>>>      13317 ±  7%     +19.5%      15919 ±  9%  softirqs.CPU243.SCHED
>>>       8234 ± 16%     -43.9%       4616 ±  5%  softirqs.CPU244.RCU
>>>      13845 ±  6%     +13.0%      15643 ±  3%  softirqs.CPU244.SCHED
>>>      13179 ±  3%     +16.3%      15323        softirqs.CPU246.SCHED
>>>      13754           +12.2%      15438 ±  3%  softirqs.CPU248.SCHED
>>>      13769 ±  4%     +10.9%      15276 ±  2%  softirqs.CPU252.SCHED
>>>      13702           +10.5%      15147 ±  2%  softirqs.CPU254.SCHED
>>>      13315 ±  2%     +12.5%      14980 ±  3%  softirqs.CPU255.SCHED
>>>      13785 ±  3%     +12.9%      15568 ±  5%  softirqs.CPU256.SCHED
>>>      13307 ±  3%     +15.0%      15298 ±  3%  softirqs.CPU257.SCHED
>>>      13864 ±  3%     +10.5%      15313 ±  2%  softirqs.CPU259.SCHED
>>>      13879 ±  2%     +11.4%      15465        softirqs.CPU261.SCHED
>>>      13815           +13.6%      15687 ±  5%  softirqs.CPU264.SCHED
>>>     119574 ±  2%     +11.8%     133693 ± 11%  softirqs.CPU266.TIMER
>>>      13688           +10.9%      15180 ±  6%  softirqs.CPU267.SCHED
>>>      11716 ±  4%     +19.3%      13974 ±  8%  softirqs.CPU27.SCHED
>>>      13866 ±  3%     +13.7%      15765 ±  4%  softirqs.CPU271.SCHED
>>>      13887 ±  5%     +12.5%      15621        softirqs.CPU272.SCHED
>>>      13383 ±  3%     +19.8%      16031 ±  2%  softirqs.CPU274.SCHED
>>>      13347           +14.1%      15232 ±  3%  softirqs.CPU275.SCHED
>>>      12884 ±  2%     +21.0%      15593 ±  4%  softirqs.CPU276.SCHED
>>>      13131 ±  5%     +13.4%      14891 ±  5%  softirqs.CPU277.SCHED
>>>      12891 ±  2%     +19.2%      15371 ±  4%  softirqs.CPU278.SCHED
>>>      13313 ±  4%     +13.0%      15049 ±  2%  softirqs.CPU279.SCHED
>>>      13514 ±  3%     +10.2%      14897 ±  2%  softirqs.CPU280.SCHED
>>>      13501 ±  3%     +13.7%      15346        softirqs.CPU281.SCHED
>>>      13261           +17.5%      15577        softirqs.CPU282.SCHED
>>>       8076 ± 15%     -43.7%       4546 ±  5%  softirqs.CPU283.RCU
>>>      13686 ±  3%     +12.6%      15413 ±  2%  softirqs.CPU284.SCHED
>>>      13439 ±  2%      +9.2%      14670 ±  4%  softirqs.CPU285.SCHED
>>>       8878 ±  9%     -35.4%       5735 ±  4%  softirqs.CPU35.RCU
>>>      11690 ±  2%     +13.6%      13274 ±  5%  softirqs.CPU40.SCHED
>>>      11714 ±  2%     +19.3%      13975 ± 13%  softirqs.CPU41.SCHED
>>>      11763           +12.5%      13239 ±  4%  softirqs.CPU45.SCHED
>>>      11662 ±  2%      +9.4%      12757 ±  3%  softirqs.CPU46.SCHED
>>>      11805 ±  2%      +9.3%      12902 ±  2%  softirqs.CPU50.SCHED
>>>      12158 ±  3%     +12.3%      13655 ±  8%  softirqs.CPU55.SCHED
>>>      11716 ±  4%      +8.8%      12751 ±  3%  softirqs.CPU58.SCHED
>>>      11922 ±  2%      +9.9%      13100 ±  4%  softirqs.CPU64.SCHED
>>>       9674 ± 17%     -41.8%       5625 ±  6%  softirqs.CPU66.RCU
>>>      11818           +12.0%      13237        softirqs.CPU66.SCHED
>>>     124682 ±  7%      -6.1%     117088 ±  5%  softirqs.CPU66.TIMER
>>>       8637 ±  9%     -34.0%       5700 ±  7%  softirqs.CPU70.RCU
>>>      11624 ±  2%     +11.0%      12901 ±  2%  softirqs.CPU70.SCHED
>>>      12372 ±  2%     +13.2%      14003 ±  3%  softirqs.CPU71.SCHED
>>>       9949 ± 25%     -33.9%       6574 ± 31%  softirqs.CPU72.RCU
>>>      10392 ± 26%     -35.1%       6745 ± 35%  softirqs.CPU73.RCU
>>>      12766 ±  3%     +11.1%      14188 ±  3%  softirqs.CPU76.SCHED
>>>      12611 ±  2%     +18.8%      14984 ±  5%  softirqs.CPU78.SCHED
>>>      12786 ±  3%     +17.9%      15079 ±  7%  softirqs.CPU79.SCHED
>>>      11947 ±  4%      +9.7%      13103 ±  4%  softirqs.CPU8.SCHED
>>>      13379 ±  7%     +11.8%      14962 ±  4%  softirqs.CPU83.SCHED
>>>      13438 ±  5%      +9.7%      14738 ±  2%  softirqs.CPU84.SCHED
>>>      12768           +19.4%      15241 ±  6%  softirqs.CPU88.SCHED
>>>       8604 ± 13%     -39.3%       5222 ±  3%  softirqs.CPU89.RCU
>>>      13077 ±  2%     +17.1%      15308 ±  7%  softirqs.CPU89.SCHED
>>>      11887 ±  3%     +20.1%      14272 ±  5%  softirqs.CPU9.SCHED
>>>      12723 ±  3%     +11.3%      14165 ±  4%  softirqs.CPU90.SCHED
>>>       8439 ± 12%     -38.9%       5153 ±  4%  softirqs.CPU91.RCU
>>>      13429 ±  3%     +10.3%      14806 ±  2%  softirqs.CPU95.SCHED
>>>      12852 ±  4%     +10.3%      14174 ±  5%  softirqs.CPU96.SCHED
>>>      13010 ±  2%     +14.4%      14888 ±  5%  softirqs.CPU97.SCHED
>>>    2315644 ±  4%     -36.2%    1477200 ±  4%  softirqs.RCU
>>>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.NMI:Non-maskable_interrupts
>>>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.PMI:Performance_monitoring_interrupts
>>>     252.00 ± 11%     -35.2%     163.25 ± 13%  interrupts.CPU104.RES:Rescheduling_interrupts
>>>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.NMI:Non-maskable_interrupts
>>>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.PMI:Performance_monitoring_interrupts
>>>     245.75 ± 19%     -31.0%     169.50 ±  7%  interrupts.CPU105.RES:Rescheduling_interrupts
>>>     228.75 ± 13%     -24.7%     172.25 ± 19%  interrupts.CPU106.RES:Rescheduling_interrupts
>>>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.NMI:Non-maskable_interrupts
>>>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.PMI:Performance_monitoring_interrupts
>>>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.NMI:Non-maskable_interrupts
>>>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.PMI:Performance_monitoring_interrupts
>>>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.NMI:Non-maskable_interrupts
>>>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.PMI:Performance_monitoring_interrupts
>>>     311.50 ± 23%     -47.7%     163.00 ±  9%  interrupts.CPU122.RES:Rescheduling_interrupts
>>>     266.75 ± 19%     -31.6%     182.50 ± 15%  interrupts.CPU124.RES:Rescheduling_interrupts
>>>     293.75 ± 33%     -32.3%     198.75 ± 19%  interrupts.CPU125.RES:Rescheduling_interrupts
>>>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.NMI:Non-maskable_interrupts
>>>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.PMI:Performance_monitoring_interrupts
>>>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.NMI:Non-maskable_interrupts
>>>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.PMI:Performance_monitoring_interrupts
>>>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.NMI:Non-maskable_interrupts
>>>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.PMI:Performance_monitoring_interrupts
>>>     219.50 ± 27%     -23.0%     169.00 ± 21%  interrupts.CPU139.RES:Rescheduling_interrupts
>>>     290.25 ± 25%     -32.5%     196.00 ± 11%  interrupts.CPU14.RES:Rescheduling_interrupts
>>>     243.50 ±  4%     -16.0%     204.50 ± 12%  interrupts.CPU140.RES:Rescheduling_interrupts
>>>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.NMI:Non-maskable_interrupts
>>>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.PMI:Performance_monitoring_interrupts
>>>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.NMI:Non-maskable_interrupts
>>>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.PMI:Performance_monitoring_interrupts
>>>     292.25 ± 34%     -33.9%     193.25 ±  6%  interrupts.CPU15.RES:Rescheduling_interrupts
>>>     424.25 ± 37%     -58.5%     176.25 ± 14%  interrupts.CPU158.RES:Rescheduling_interrupts
>>>     312.50 ± 42%     -54.2%     143.00 ± 18%  interrupts.CPU159.RES:Rescheduling_interrupts
>>>     725.00 ±118%     -75.7%     176.25 ± 14%  interrupts.CPU163.RES:Rescheduling_interrupts
>>>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.NMI:Non-maskable_interrupts
>>>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.PMI:Performance_monitoring_interrupts
>>>     239.50 ± 30%     -46.6%     128.00 ± 14%  interrupts.CPU179.RES:Rescheduling_interrupts
>>>     320.75 ± 15%     -24.0%     243.75 ± 20%  interrupts.CPU20.RES:Rescheduling_interrupts
>>>     302.50 ± 17%     -47.2%     159.75 ±  8%  interrupts.CPU200.RES:Rescheduling_interrupts
>>>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.NMI:Non-maskable_interrupts
>>>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.PMI:Performance_monitoring_interrupts
>>>     217.00 ± 11%     -34.6%     142.00 ± 12%  interrupts.CPU214.RES:Rescheduling_interrupts
>>>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.NMI:Non-maskable_interrupts
>>>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.PMI:Performance_monitoring_interrupts
>>>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.NMI:Non-maskable_interrupts
>>>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.PMI:Performance_monitoring_interrupts
>>>     289.50 ± 28%     -41.1%     170.50 ±  8%  interrupts.CPU22.RES:Rescheduling_interrupts
>>>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.NMI:Non-maskable_interrupts
>>>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.PMI:Performance_monitoring_interrupts
>>>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.NMI:Non-maskable_interrupts
>>>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.PMI:Performance_monitoring_interrupts
>>>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.NMI:Non-maskable_interrupts
>>>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.PMI:Performance_monitoring_interrupts
>>>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.NMI:Non-maskable_interrupts
>>>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.PMI:Performance_monitoring_interrupts
>>>     248.00 ± 36%     -36.3%     158.00 ± 19%  interrupts.CPU228.RES:Rescheduling_interrupts
>>>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.NMI:Non-maskable_interrupts
>>>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.PMI:Performance_monitoring_interrupts
>>>     404.25 ± 69%     -65.5%     139.50 ± 17%  interrupts.CPU236.RES:Rescheduling_interrupts
>>>     566.50 ± 40%     -73.6%     149.50 ± 31%  interrupts.CPU237.RES:Rescheduling_interrupts
>>>     243.50 ± 26%     -37.1%     153.25 ± 21%  interrupts.CPU248.RES:Rescheduling_interrupts
>>>     258.25 ± 12%     -53.5%     120.00 ± 18%  interrupts.CPU249.RES:Rescheduling_interrupts
>>>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.NMI:Non-maskable_interrupts
>>>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.PMI:Performance_monitoring_interrupts
>>>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.NMI:Non-maskable_interrupts
>>>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.PMI:Performance_monitoring_interrupts
>>>     425.00 ± 59%     -60.3%     168.75 ± 34%  interrupts.CPU258.RES:Rescheduling_interrupts
>>>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.NMI:Non-maskable_interrupts
>>>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.PMI:Performance_monitoring_interrupts
>>>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.NMI:Non-maskable_interrupts
>>>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.PMI:Performance_monitoring_interrupts
>>>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.NMI:Non-maskable_interrupts
>>>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.PMI:Performance_monitoring_interrupts
>>>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.NMI:Non-maskable_interrupts
>>>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.PMI:Performance_monitoring_interrupts
>>>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.NMI:Non-maskable_interrupts
>>>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.PMI:Performance_monitoring_interrupts
>>>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.NMI:Non-maskable_interrupts
>>>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.PMI:Performance_monitoring_interrupts
>>>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.NMI:Non-maskable_interrupts
>>>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.PMI:Performance_monitoring_interrupts
>>>     331.75 ± 32%     -39.8%     199.75 ± 17%  interrupts.CPU29.RES:Rescheduling_interrupts
>>>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.NMI:Non-maskable_interrupts
>>>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.PMI:Performance_monitoring_interrupts
>>>     298.50 ± 30%     -39.7%     180.00 ±  6%  interrupts.CPU34.RES:Rescheduling_interrupts
>>>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.NMI:Non-maskable_interrupts
>>>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.PMI:Performance_monitoring_interrupts
>>>     270.50 ± 24%     -31.1%     186.25 ±  3%  interrupts.CPU36.RES:Rescheduling_interrupts
>>>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.NMI:Non-maskable_interrupts
>>>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.PMI:Performance_monitoring_interrupts
>>>     286.75 ± 36%     -32.4%     193.75 ±  7%  interrupts.CPU45.RES:Rescheduling_interrupts
>>>     259.00 ± 12%     -23.6%     197.75 ± 13%  interrupts.CPU46.RES:Rescheduling_interrupts
>>>     244.00 ± 21%     -35.6%     157.25 ± 11%  interrupts.CPU47.RES:Rescheduling_interrupts
>>>     230.00 ±  7%     -21.3%     181.00 ± 11%  interrupts.CPU48.RES:Rescheduling_interrupts
>>>     281.00 ± 13%     -27.4%     204.00 ± 15%  interrupts.CPU53.RES:Rescheduling_interrupts
>>>     256.75 ±  5%     -18.4%     209.50 ± 12%  interrupts.CPU54.RES:Rescheduling_interrupts
>>>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.NMI:Non-maskable_interrupts
>>>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.PMI:Performance_monitoring_interrupts
>>>     316.00 ± 25%     -41.4%     185.25 ± 13%  interrupts.CPU59.RES:Rescheduling_interrupts
>>>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.NMI:Non-maskable_interrupts
>>>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.PMI:Performance_monitoring_interrupts
>>>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.NMI:Non-maskable_interrupts
>>>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.PMI:Performance_monitoring_interrupts
>>>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.NMI:Non-maskable_interrupts
>>>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.PMI:Performance_monitoring_interrupts
>>>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.NMI:Non-maskable_interrupts
>>>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.PMI:Performance_monitoring_interrupts
>>>     319.00 ± 40%     -44.7%     176.25 ±  9%  interrupts.CPU67.RES:Rescheduling_interrupts
>>>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.NMI:Non-maskable_interrupts
>>>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.PMI:Performance_monitoring_interrupts
>>>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.NMI:Non-maskable_interrupts
>>>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.PMI:Performance_monitoring_interrupts
>>>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.NMI:Non-maskable_interrupts
>>>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.PMI:Performance_monitoring_interrupts
>>>     426.75 ± 61%     -67.7%     138.00 ±  8%  interrupts.CPU75.RES:Rescheduling_interrupts
>>>     192.50 ± 13%     +45.6%     280.25 ± 45%  interrupts.CPU76.RES:Rescheduling_interrupts
>>>     274.25 ± 34%     -42.2%     158.50 ± 34%  interrupts.CPU77.RES:Rescheduling_interrupts
>>>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.NMI:Non-maskable_interrupts
>>>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.PMI:Performance_monitoring_interrupts
>>>     348.50 ± 53%     -47.3%     183.75 ± 29%  interrupts.CPU80.RES:Rescheduling_interrupts
>>>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.NMI:Non-maskable_interrupts
>>>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.PMI:Performance_monitoring_interrupts
>>>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.NMI:Non-maskable_interrupts
>>>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.PMI:Performance_monitoring_interrupts
>>>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.NMI:Non-maskable_interrupts
>>>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.PMI:Performance_monitoring_interrupts
>>>     408.75 ± 58%     -56.8%     176.75 ± 25%  interrupts.CPU92.RES:Rescheduling_interrupts
>>>     399.00 ± 64%     -63.6%     145.25 ± 16%  interrupts.CPU93.RES:Rescheduling_interrupts
>>>     314.75 ± 36%     -44.2%     175.75 ± 13%  interrupts.CPU94.RES:Rescheduling_interrupts
>>>     191.00 ± 15%     -29.1%     135.50 ±  9%  interrupts.CPU97.RES:Rescheduling_interrupts
>>>      94.00 ±  8%     +50.0%     141.00 ± 12%  interrupts.IWI:IRQ_work_interrupts
>>>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.NMI:Non-maskable_interrupts
>>>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.PMI:Performance_monitoring_interrupts
>>>      12.75 ± 11%      -4.1        8.67 ± 31%  perf-profile.calltrace.cycles-pp.do_rw_once
>>>       1.02 ± 16%      -0.6        0.47 ± 59%  perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle
>>>       1.10 ± 15%      -0.4        0.66 ± 14%  perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
>>>       1.05 ± 16%      -0.4        0.61 ± 14%  perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter
>>>       1.58 ±  4%      +0.3        1.91 ±  7%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page
>>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>>       2.11 ±  4%      +0.5        2.60 ±  7%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault
>>>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>>>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>>       1.90 ±  5%      +0.6        2.45 ±  7%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage
>>>       0.65 ± 62%      +0.6        1.20 ± 15%  perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
>>>       0.60 ± 62%      +0.6        1.16 ± 18%  perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap
>>>       0.95 ± 17%      +0.6        1.52 ±  8%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner
>>>       0.61 ± 62%      +0.6        1.18 ± 18%  perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput
>>>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit
>>>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit
>>>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
>>>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group
>>>       1.30 ±  9%      +0.6        1.92 ±  8%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock
>>>       0.19 ±173%      +0.7        0.89 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu
>>>       0.19 ±173%      +0.7        0.90 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu
>>>       0.00            +0.8        0.77 ± 30%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page
>>>       0.00            +0.8        0.78 ± 30%  perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page
>>>       0.00            +0.8        0.79 ± 29%  perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
>>>       0.82 ± 67%      +0.9        1.72 ± 22%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault
>>>       0.84 ± 66%      +0.9        1.74 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
>>>       2.52 ±  6%      +0.9        3.44 ±  9%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page
>>>       0.83 ± 67%      +0.9        1.75 ± 21%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
>>>       0.84 ± 66%      +0.9        1.77 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
>>>       1.64 ± 12%      +1.0        2.67 ±  7%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault
>>>       1.65 ± 45%      +1.3        2.99 ± 18%  perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
>>>       1.74 ± 13%      +1.4        3.16 ±  6%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault
>>>       2.56 ± 48%      +2.2        4.81 ± 19%  perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault
>>>      12.64 ± 14%      +3.6       16.20 ±  8%  perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault
>>>       2.97 ±  7%      +3.8        6.74 ±  9%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow
>>>      19.99 ±  9%      +4.1       24.05 ±  6%  perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault
>>>       1.37 ± 15%      -0.5        0.83 ± 13%  perf-profile.children.cycles-pp.sched_clock_cpu
>>>       1.31 ± 16%      -0.5        0.78 ± 13%  perf-profile.children.cycles-pp.sched_clock
>>>       1.29 ± 16%      -0.5        0.77 ± 13%  perf-profile.children.cycles-pp.native_sched_clock
>>>       1.80 ±  2%      -0.3        1.47 ± 10%  perf-profile.children.cycles-pp.task_tick_fair
>>>       0.73 ±  2%      -0.2        0.54 ± 11%  perf-profile.children.cycles-pp.update_curr
>>>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.children.cycles-pp.account_process_tick
>>>       0.73 ± 10%      -0.2        0.58 ±  9%  perf-profile.children.cycles-pp.rcu_sched_clock_irq
>>>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.children.cycles-pp.__acct_update_integrals
>>>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs
>>>       0.40 ± 12%      -0.1        0.30 ± 14%  perf-profile.children.cycles-pp.__next_timer_interrupt
>>>       0.47 ±  7%      -0.1        0.39 ± 13%  perf-profile.children.cycles-pp.update_rq_clock
>>>       0.29 ± 12%      -0.1        0.21 ± 15%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
>>>       0.21 ±  7%      -0.1        0.14 ± 12%  perf-profile.children.cycles-pp.account_system_index_time
>>>       0.38 ±  2%      -0.1        0.31 ± 12%  perf-profile.children.cycles-pp.timerqueue_add
>>>       0.26 ± 11%      -0.1        0.20 ± 13%  perf-profile.children.cycles-pp.find_next_bit
>>>       0.23 ± 15%      -0.1        0.17 ± 15%  perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit
>>>       0.14 ±  8%      -0.1        0.07 ± 14%  perf-profile.children.cycles-pp.account_user_time
>>>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.children.cycles-pp.cpuacct_charge
>>>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.irq_work_tick
>>>       0.11 ± 13%      -0.0        0.07 ± 25%  perf-profile.children.cycles-pp.tick_sched_do_timer
>>>       0.12 ± 10%      -0.0        0.08 ± 15%  perf-profile.children.cycles-pp.get_cpu_device
>>>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.children.cycles-pp.raise_softirq
>>>       0.12 ±  3%      -0.0        0.09 ±  8%  perf-profile.children.cycles-pp.write
>>>       0.11 ± 13%      +0.0        0.14 ±  8%  perf-profile.children.cycles-pp.native_write_msr
>>>       0.09 ±  9%      +0.0        0.11 ±  7%  perf-profile.children.cycles-pp.finish_task_switch
>>>       0.10 ± 10%      +0.0        0.13 ±  5%  perf-profile.children.cycles-pp.schedule_idle
>>>       0.07 ±  6%      +0.0        0.10 ± 12%  perf-profile.children.cycles-pp.__read_nocancel
>>>       0.04 ± 58%      +0.0        0.07 ± 15%  perf-profile.children.cycles-pp.__free_pages_ok
>>>       0.06 ±  7%      +0.0        0.09 ± 13%  perf-profile.children.cycles-pp.perf_read
>>>       0.07            +0.0        0.11 ± 14%  perf-profile.children.cycles-pp.perf_evsel__read_counter
>>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.cmd_stat
>>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.__run_perf_stat
>>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.process_interval
>>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.read_counters
>>>       0.07 ± 22%      +0.0        0.11 ± 19%  perf-profile.children.cycles-pp.__handle_mm_fault
>>>       0.07 ± 19%      +0.1        0.13 ±  8%  perf-profile.children.cycles-pp.rb_erase
>>>       0.03 ±100%      +0.1        0.09 ±  9%  perf-profile.children.cycles-pp.smp_call_function_single
>>>       0.01 ±173%      +0.1        0.08 ± 11%  perf-profile.children.cycles-pp.perf_event_read
>>>       0.00            +0.1        0.07 ± 13%  perf-profile.children.cycles-pp.__perf_event_read_value
>>>       0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.__intel_pmu_enable_all
>>>       0.08 ± 17%      +0.1        0.15 ±  8%  perf-profile.children.cycles-pp.native_apic_msr_eoi_write
>>>       0.04 ±103%      +0.1        0.13 ± 58%  perf-profile.children.cycles-pp.shmem_getpage_gfp
>>>       0.38 ± 14%      +0.1        0.51 ±  6%  perf-profile.children.cycles-pp.run_timer_softirq
>>>       0.11 ±  4%      +0.3        0.37 ± 32%  perf-profile.children.cycles-pp.worker_thread
>>>       0.20 ±  5%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.ret_from_fork
>>>       0.20 ±  4%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.kthread
>>>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.memcpy_erms
>>>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.drm_fb_helper_dirty_work
>>>       0.00            +0.3        0.31 ± 37%  perf-profile.children.cycles-pp.process_one_work
>>>       0.47 ± 48%      +0.4        0.91 ± 19%  perf-profile.children.cycles-pp.prep_new_huge_page
>>>       0.70 ± 29%      +0.5        1.16 ± 18%  perf-profile.children.cycles-pp.free_huge_page
>>>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_flush_mmu
>>>       0.72 ± 29%      +0.5        1.18 ± 18%  perf-profile.children.cycles-pp.release_pages
>>>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_finish_mmu
>>>       0.76 ± 27%      +0.5        1.23 ± 18%  perf-profile.children.cycles-pp.exit_mmap
>>>       0.77 ± 27%      +0.5        1.24 ± 18%  perf-profile.children.cycles-pp.mmput
>>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.__x64_sys_exit_group
>>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_group_exit
>>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_exit
>>>       1.28 ± 29%      +0.5        1.76 ±  9%  perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
>>>       0.77 ± 28%      +0.5        1.26 ± 13%  perf-profile.children.cycles-pp.alloc_fresh_huge_page
>>>       1.53 ± 15%      +0.7        2.26 ± 14%  perf-profile.children.cycles-pp.do_syscall_64
>>>       1.53 ± 15%      +0.7        2.27 ± 14%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
>>>       1.13 ±  3%      +0.9        2.07 ± 14%  perf-profile.children.cycles-pp.interrupt_entry
>>>       0.79 ±  9%      +1.0        1.76 ±  5%  perf-profile.children.cycles-pp.perf_event_task_tick
>>>       1.71 ± 39%      +1.4        3.08 ± 16%  perf-profile.children.cycles-pp.alloc_surplus_huge_page
>>>       2.66 ± 42%      +2.3        4.94 ± 17%  perf-profile.children.cycles-pp.alloc_huge_page
>>>       2.89 ± 45%      +2.7        5.54 ± 18%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>>>       3.34 ± 35%      +2.7        6.02 ± 17%  perf-profile.children.cycles-pp._raw_spin_lock
>>>      12.77 ± 14%      +3.9       16.63 ±  7%  perf-profile.children.cycles-pp.mutex_spin_on_owner
>>>      20.12 ±  9%      +4.0       24.16 ±  6%  perf-profile.children.cycles-pp.hugetlb_cow
>>>      15.40 ± 10%      -3.6       11.84 ± 28%  perf-profile.self.cycles-pp.do_rw_once
>>>       4.02 ±  9%      -1.3        2.73 ± 30%  perf-profile.self.cycles-pp.do_access
>>>       2.00 ± 14%      -0.6        1.41 ± 13%  perf-profile.self.cycles-pp.cpuidle_enter_state
>>>       1.26 ± 16%      -0.5        0.74 ± 13%  perf-profile.self.cycles-pp.native_sched_clock
>>>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.self.cycles-pp.account_process_tick
>>>       0.27 ± 19%      -0.2        0.12 ± 17%  perf-profile.self.cycles-pp.timerqueue_del
>>>       0.53 ±  3%      -0.1        0.38 ± 11%  perf-profile.self.cycles-pp.update_curr
>>>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.self.cycles-pp.__acct_update_integrals
>>>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs
>>>       0.61 ±  4%      -0.1        0.51 ±  8%  perf-profile.self.cycles-pp.task_tick_fair
>>>       0.20 ±  8%      -0.1        0.12 ± 14%  perf-profile.self.cycles-pp.account_system_index_time
>>>       0.23 ± 15%      -0.1        0.16 ± 17%  perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit
>>>       0.25 ± 11%      -0.1        0.18 ± 14%  perf-profile.self.cycles-pp.find_next_bit
>>>       0.10 ± 11%      -0.1        0.03 ±100%  perf-profile.self.cycles-pp.tick_sched_do_timer
>>>       0.29            -0.1        0.23 ± 11%  perf-profile.self.cycles-pp.timerqueue_add
>>>       0.12 ± 10%      -0.1        0.06 ± 17%  perf-profile.self.cycles-pp.account_user_time
>>>       0.22 ± 15%      -0.1        0.16 ±  6%  perf-profile.self.cycles-pp.scheduler_tick
>>>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.self.cycles-pp.cpuacct_charge
>>>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.self.cycles-pp.irq_work_tick
>>>       0.07 ± 13%      -0.0        0.03 ±100%  perf-profile.self.cycles-pp.update_process_times
>>>       0.12 ±  7%      -0.0        0.08 ± 15%  perf-profile.self.cycles-pp.get_cpu_device
>>>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.self.cycles-pp.raise_softirq
>>>       0.12 ± 11%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.tick_nohz_get_sleep_length
>>>       0.11 ± 11%      +0.0        0.14 ±  6%  perf-profile.self.cycles-pp.native_write_msr
>>>       0.10 ±  5%      +0.1        0.15 ±  8%  perf-profile.self.cycles-pp.__remove_hrtimer
>>>       0.07 ± 23%      +0.1        0.13 ±  8%  perf-profile.self.cycles-pp.rb_erase
>>>       0.08 ± 17%      +0.1        0.15 ±  7%  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
>>>       0.00            +0.1        0.08 ± 10%  perf-profile.self.cycles-pp.smp_call_function_single
>>>       0.32 ± 17%      +0.1        0.42 ±  7%  perf-profile.self.cycles-pp.run_timer_softirq
>>>       0.22 ±  5%      +0.1        0.34 ±  4%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
>>>       0.45 ± 15%      +0.2        0.60 ± 12%  perf-profile.self.cycles-pp.rcu_irq_enter
>>>       0.31 ±  8%      +0.2        0.46 ± 16%  perf-profile.self.cycles-pp.irq_enter
>>>       0.29 ± 10%      +0.2        0.44 ± 16%  perf-profile.self.cycles-pp.apic_timer_interrupt
>>>       0.71 ± 30%      +0.2        0.92 ±  8%  perf-profile.self.cycles-pp.perf_mux_hrtimer_handler
>>>       0.00            +0.3        0.28 ± 37%  perf-profile.self.cycles-pp.memcpy_erms
>>>       1.12 ±  3%      +0.9        2.02 ± 15%  perf-profile.self.cycles-pp.interrupt_entry
>>>       0.79 ±  9%      +0.9        1.73 ±  5%  perf-profile.self.cycles-pp.perf_event_task_tick
>>>       2.49 ± 45%      +2.1        4.55 ± 20%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>>>      10.95 ± 15%      +2.7       13.61 ±  8%  perf-profile.self.cycles-pp.mutex_spin_on_owner
>>>
>>>
>>>
>>>                                vm-scalability.throughput
>>>
>>>   1.6e+07 +-+---------------------------------------------------------------+
>>>           |..+.+    +..+.+..+.+.   +.      +..+.+..+.+..+.+..+.+..+    +    |
>>>   1.4e+07 +-+  :    :  O      O    O                           O            |
>>>   1.2e+07 O-+O O  O O    O  O    O    O O  O  O    O    O    O      O  O O  O
>>>           |     :   :                           O    O    O       O         |
>>>     1e+07 +-+   :  :                                                        |
>>>           |     :  :                                                        |
>>>     8e+06 +-+   :  :                                                        |
>>>           |      : :                                                        |
>>>     6e+06 +-+    : :                                                        |
>>>     4e+06 +-+    : :                                                        |
>>>           |      ::                                                         |
>>>     2e+06 +-+     :                                                         |
>>>           |       :                                                         |
>>>         0 +-+---------------------------------------------------------------+
>>>
>>>
>>>                          vm-scalability.time.minor_page_faults
>>>
>>>   2.5e+06 +-+---------------------------------------------------------------+
>>>           |                                                                 |
>>>           |..+.+    +..+.+..+.+..+.+..+.+..  .+.  .+.+..+.+..+.+..+.+..+    |
>>>     2e+06 +-+  :    :                      +.   +.                          |
>>>           O  O O: O O  O O  O O  O O                    O      O            |
>>>           |     :   :                 O O  O  O O  O O    O  O    O O  O O  O
>>>   1.5e+06 +-+   :  :                                                        |
>>>           |     :  :                                                        |
>>>     1e+06 +-+    : :                                                        |
>>>           |      : :                                                        |
>>>           |      : :                                                        |
>>>    500000 +-+    : :                                                        |
>>>           |       :                                                         |
>>>           |       :                                                         |
>>>         0 +-+---------------------------------------------------------------+
>>>
>>>
>>>                                 vm-scalability.workload
>>>
>>>   3.5e+09 +-+---------------------------------------------------------------+
>>>           | .+.                      .+.+..                        .+..     |
>>>     3e+09 +-+  +    +..+.+..+.+..+.+.      +..+.+..+.+..+.+..+.+..+    +    |
>>>           |    :    :       O O                                O            |
>>>   2.5e+09 O-+O O: O O  O O       O O  O    O            O                   |
>>>           |     :   :                   O     O O  O O    O  O    O O  O O  O
>>>     2e+09 +-+   :  :                                                        |
>>>           |     :  :                                                        |
>>>   1.5e+09 +-+    : :                                                        |
>>>           |      : :                                                        |
>>>     1e+09 +-+    : :                                                        |
>>>           |      : :                                                        |
>>>     5e+08 +-+     :                                                         |
>>>           |       :                                                         |
>>>         0 +-+---------------------------------------------------------------+
>>>
>>>
>>> [*] bisect-good sample
>>> [O] bisect-bad  sample
>>>
>>>
>>>
>>> Disclaimer:
>>> Results have been estimated based on internal Intel analysis and are provided
>>> for informational purposes only. Any difference in system hardware or software
>>> design or configuration may affect actual performance.
>>>
>>>
>>> Thanks,
>>> Rong Chen
>>>
>>
>> --
>> Thomas Zimmermann
>> Graphics Driver Developer
>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>> HRB 21284 (AG Nürnberg)
>>
> 
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-30 18:50     ` Thomas Zimmermann
@ 2019-07-30 18:59       ` Daniel Vetter
  2019-07-30 20:26         ` Dave Airlie
  0 siblings, 1 reply; 61+ messages in thread
From: Daniel Vetter @ 2019-07-30 18:59 UTC (permalink / raw)
  To: Thomas Zimmermann; +Cc: Stephen Rothwell, LKP, dri-devel, kernel test robot

On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
> > On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >> Am 29.07.19 um 11:51 schrieb kernel test robot:
> >>> Greeting,
> >>>
> >>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
> >>>
> >>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> >>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
> >>
> >> Daniel, Noralf, we may have to revert this patch.
> >>
> >> I expected some change in display performance, but not in VM. Since it's
> >> a server chipset, probably no one cares much about display performance.
> >> So that seemed like a good trade-off for re-using shared code.
> >>
> >> Part of the patch set is that the generic fb emulation now maps and
> >> unmaps the fbdev BO when updating the screen. I guess that's the cause
> >> of the performance regression. And it should be visible with other
> >> drivers as well if they use a shadow FB for fbdev emulation.
> >
> > For fbcon we shouldn't need to do any maps/unmaps at all, this is for the
> > fbdev mmap support only. If the testcase mentioned here tests fbdev
> > mmap handling it's pretty badly misnamed :-) And as long as you don't
> > have an fbdev mmap there shouldn't be any impact at all.
>
> The ast and mgag200 have only a few MiB of VRAM, so we have to move the
> fbdev BO out of VRAM when it's not being displayed. As long as it's not
> mapped, it can be evicted to make room for X, etc.
>
> To make this work, the BO's memory is mapped in
> drm_fb_helper_dirty_work() before being updated from the shadow FB, and
> unmapped again afterwards. [1] That fbdev mapping is re-established on
> each screen update, more or less. My (as yet unverified) understanding
> is that this causes the performance regression in the VM code.
>
> The original code in mgag200 used to keep the fbdev BO kmapped while it
> was being displayed; [2] the drawing code only mapped it on demand when
> it was not being displayed. [3]

Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
cache this.
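
To spell out the pattern (a simplified sketch written from memory, not
the exact code from [1]; helper names and signatures may be slightly
off):

/*
 * Roughly what the generic fbdev emulation does for every screen
 * update in drm_fb_helper_dirty_work(). Simplified sketch only.
 */
static void dirty_update_sketch(struct drm_fb_helper *helper,
				struct drm_clip_rect *clip)
{
	struct fb_info *info = helper->fbdev;
	void *vaddr;
	unsigned int y;

	/* Pins the BO and sets up a kernel mapping -- on every update. */
	vaddr = drm_client_buffer_vmap(helper->buffer);
	if (IS_ERR(vaddr))
		return;

	/* Copies the dirty lines from the shadow FB into the BO; this is
	 * the memcpy_erms that shows up in the profile above. */
	for (y = clip->y1; y < clip->y2; y++) {
		size_t off = y * info->fix.line_length;

		memcpy(vaddr + off, info->screen_buffer + off,
		       info->fix.line_length);
	}

	/* ... flush via the framebuffer's ->dirty() callback ... */

	/* Drops the mapping (and the pin) again right away. */
	drm_client_buffer_vunmap(helper->buffer);
}

So every fbcon update pays for a full map/unmap cycle of the BO.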

> I think this could be added for VRAM helpers as well, but it's still a
> workaround and non-VRAM drivers might also run into such a performance
> regression if they use the fbdev's shadow fb.

Yeah agreed, fbdev emulation should try to cache the vmap.
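
Something like the following might already help (just an idea sketch,
not tested; the cached_vaddr field is hypothetical):

/*
 * Idea sketch only: map lazily on first use and keep the mapping
 * around instead of tearing it down after every update. The
 * buffer->cached_vaddr field does not exist today.
 */
static void *client_buffer_vmap_cached(struct drm_client_buffer *buffer)
{
	void *vaddr;

	if (buffer->cached_vaddr)
		return buffer->cached_vaddr;

	vaddr = drm_client_buffer_vmap(buffer);
	if (!IS_ERR(vaddr))
		buffer->cached_vaddr = vaddr;

	return vaddr;
}

/* The real unmap would then only happen when the client releases the
 * buffer. */

The obvious downside is that the BO stays pinned (and unevictable)
between updates, which is exactly what you want to avoid on the
small-VRAM ast/mgag200 parts. So maybe the cached mapping needs to be
dropped when fbdev isn't being displayed, or from a delayed worker.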

> Noralf mentioned that there are plans for other DRM clients besides the
> console. They would run into similar problems as well.
>
> >> The thing is that we'd need another generic fbdev emulation for ast and
> >> mgag200 that handles this issue properly.
> >
> > Yeah I don't think we want to jump the gun here.  If you can try to
> > repro locally and profile where we're wasting cpu time, I hope that
> > should shed some light on what's going wrong here.
>
> I don't have much time ATM and I'm not even officially at work until
> late Aug. I'd rather send you the revert now and investigate later. I
> agree that using generic fbdev emulation would be preferable.

Still not sure that's the right thing to do, really. Yes, it's a
regression, but the vm testcases shouldn't run a single line of fbcon or
drm code. So why they are impacted so heavily by a silly drm change is
very confusing to me. We might be papering over a deeper and much more
serious issue ...
-Daniel

>
> Best regards
> Thomas
>
>
> [1]
> https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/drm_fb_helper.c?id=90f479ae51afa45efab97afdde9b94b9660dd3e4#n419
> [2]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/mgag200/mgag200_mode.c?h=v5.2#n897
> [3]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?h=v5.2#n75
>
> > -Daniel
> >
> >>
> >> Best regards
> >> Thomas
> >>
> >>>
> >>> in testcase: vm-scalability
> >>> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
> >>> with following parameters:
> >>>
> >>>       runtime: 300s
> >>>       size: 8T
> >>>       test: anon-cow-seq-hugetlb
> >>>       cpufreq_governor: performance
> >>>
> >>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> >>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
> >>>
> >>>
> >>>
> >>> Details are as below:
> >>> -------------------------------------------------------------------------------------------------->
> >>>
> >>>
> >>> To reproduce:
> >>>
> >>>         git clone https://github.com/intel/lkp-tests.git
> >>>         cd lkp-tests
> >>>         bin/lkp install job.yaml  # job file is attached in this email
> >>>         bin/lkp run     job.yaml
> >>>
> >>> =========================================================================================
> >>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
> >>>   gcc-7/performance/x86_64-rhel-7.6/debian-x86_64-2019-05-14.cgz/300s/8T/lkp-knm01/anon-cow-seq-hugetlb/vm-scalability
> >>>
> >>> commit:
> >>>   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
> >>>   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> >>>
> >>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9
> >>> ---------------- ---------------------------
> >>>        fail:runs  %reproduction    fail:runs
> >>>            |             |             |
> >>>           2:4          -50%            :4     dmesg.WARNING:at#for_ip_interrupt_entry/0x
> >>>            :4           25%           1:4     dmesg.WARNING:at_ip___perf_sw_event/0x
> >>>            :4           25%           1:4     dmesg.WARNING:at_ip__fsnotify_parent/0x
> >>>          %stddev     %change         %stddev
> >>>              \          |                \
> >>>      43955 ±  2%     -18.8%      35691        vm-scalability.median
> >>>       0.06 ±  7%    +193.0%       0.16 ±  2%  vm-scalability.median_stddev
> >>>   14906559 ±  2%     -17.9%   12237079        vm-scalability.throughput
> >>>      87651 ±  2%     -17.4%      72374        vm-scalability.time.involuntary_context_switches
> >>>    2086168           -23.6%    1594224        vm-scalability.time.minor_page_faults
> >>>      15082 ±  2%     -10.4%      13517        vm-scalability.time.percent_of_cpu_this_job_got
> >>>      29987            -8.9%      27327        vm-scalability.time.system_time
> >>>      15755           -12.4%      13795        vm-scalability.time.user_time
> >>>     122011           -19.3%      98418        vm-scalability.time.voluntary_context_switches
> >>>  3.034e+09           -23.6%  2.318e+09        vm-scalability.workload
> >>>     242478 ± 12%     +68.5%     408518 ± 23%  cpuidle.POLL.time
> >>>       2788 ± 21%    +117.4%       6062 ± 26%  cpuidle.POLL.usage
> >>>      56653 ± 10%     +64.4%      93144 ± 20%  meminfo.Mapped
> >>>     120392 ±  7%     +14.0%     137212 ±  4%  meminfo.Shmem
> >>>      47221 ± 11%     +77.1%      83634 ± 22%  numa-meminfo.node0.Mapped
> >>>     120465 ±  7%     +13.9%     137205 ±  4%  numa-meminfo.node0.Shmem
> >>>    2885513           -16.5%    2409384        numa-numastat.node0.local_node
> >>>    2885471           -16.5%    2409354        numa-numastat.node0.numa_hit
> >>>      11813 ± 11%     +76.3%      20824 ± 22%  numa-vmstat.node0.nr_mapped
> >>>      30096 ±  7%     +13.8%      34238 ±  4%  numa-vmstat.node0.nr_shmem
> >>>      43.72 ±  2%      +5.5       49.20        mpstat.cpu.all.idle%
> >>>       0.03 ±  4%      +0.0        0.05 ±  6%  mpstat.cpu.all.soft%
> >>>      19.51            -2.4       17.08        mpstat.cpu.all.usr%
> >>>       1012            -7.9%     932.75        turbostat.Avg_MHz
> >>>      32.38 ± 10%     +25.8%      40.73        turbostat.CPU%c1
> >>>     145.51            -3.1%     141.01        turbostat.PkgWatt
> >>>      15.09           -19.2%      12.19        turbostat.RAMWatt
> >>>      43.50 ±  2%     +13.2%      49.25        vmstat.cpu.id
> >>>      18.75 ±  2%     -13.3%      16.25 ±  2%  vmstat.cpu.us
> >>>     152.00 ±  2%      -9.5%     137.50        vmstat.procs.r
> >>>       4800           -13.1%       4173        vmstat.system.cs
> >>>     156170           -11.9%     137594        slabinfo.anon_vma.active_objs
> >>>       3395           -11.9%       2991        slabinfo.anon_vma.active_slabs
> >>>     156190           -11.9%     137606        slabinfo.anon_vma.num_objs
> >>>       3395           -11.9%       2991        slabinfo.anon_vma.num_slabs
> >>>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.active_objs
> >>>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.num_objs
> >>>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.active_objs
> >>>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.num_objs
> >>>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.active_objs
> >>>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.num_objs
> >>>    1330122           -23.6%    1016557        proc-vmstat.htlb_buddy_alloc_success
> >>>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_active_anon
> >>>      67277            +2.9%      69246        proc-vmstat.nr_anon_pages
> >>>     218.50 ±  3%     -10.6%     195.25        proc-vmstat.nr_dirtied
> >>>     288628            +1.4%     292755        proc-vmstat.nr_file_pages
> >>>     360.50            -2.7%     350.75        proc-vmstat.nr_inactive_file
> >>>      14225 ±  9%     +63.8%      23304 ± 20%  proc-vmstat.nr_mapped
> >>>      30109 ±  7%     +13.8%      34259 ±  4%  proc-vmstat.nr_shmem
> >>>      99870            -1.3%      98597        proc-vmstat.nr_slab_unreclaimable
> >>>     204.00 ±  4%     -12.1%     179.25        proc-vmstat.nr_written
> >>>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_zone_active_anon
> >>>     360.50            -2.7%     350.75        proc-vmstat.nr_zone_inactive_file
> >>>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults
> >>>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults_local
> >>>    2904082           -16.4%    2427026        proc-vmstat.numa_hit
> >>>    2904081           -16.4%    2427025        proc-vmstat.numa_local
> >>>  6.828e+08           -23.5%  5.221e+08        proc-vmstat.pgalloc_normal
> >>>    2900008           -17.2%    2400195        proc-vmstat.pgfault
> >>>  6.827e+08           -23.5%   5.22e+08        proc-vmstat.pgfree
> >>>  1.635e+10           -17.0%  1.357e+10        perf-stat.i.branch-instructions
> >>>       1.53 ±  4%      -0.1        1.45 ±  3%  perf-stat.i.branch-miss-rate%
> >>>  2.581e+08 ±  3%     -20.5%  2.051e+08 ±  2%  perf-stat.i.branch-misses
> >>>      12.66            +1.1       13.78        perf-stat.i.cache-miss-rate%
> >>>   72720849           -12.0%   63958986        perf-stat.i.cache-misses
> >>>  5.766e+08           -18.6%  4.691e+08        perf-stat.i.cache-references
> >>>       4674 ±  2%     -13.0%       4064        perf-stat.i.context-switches
> >>>       4.29           +12.5%       4.83        perf-stat.i.cpi
> >>>  2.573e+11            -7.4%  2.383e+11        perf-stat.i.cpu-cycles
> >>>     231.35           -21.5%     181.56        perf-stat.i.cpu-migrations
> >>>       3522            +4.4%       3677        perf-stat.i.cycles-between-cache-misses
> >>>       0.09 ± 13%      +0.0        0.12 ±  5%  perf-stat.i.iTLB-load-miss-rate%
> >>>  5.894e+10           -15.8%  4.961e+10        perf-stat.i.iTLB-loads
> >>>  5.901e+10           -15.8%  4.967e+10        perf-stat.i.instructions
> >>>       1291 ± 14%     -21.8%       1010        perf-stat.i.instructions-per-iTLB-miss
> >>>       0.24           -11.0%       0.21        perf-stat.i.ipc
> >>>       9476           -17.5%       7821        perf-stat.i.minor-faults
> >>>       9478           -17.5%       7821        perf-stat.i.page-faults
> >>>       9.76            -3.6%       9.41        perf-stat.overall.MPKI
> >>>       1.59 ±  4%      -0.1        1.52        perf-stat.overall.branch-miss-rate%
> >>>      12.61            +1.1       13.71        perf-stat.overall.cache-miss-rate%
> >>>       4.38           +10.5%       4.83        perf-stat.overall.cpi
> >>>       3557            +5.3%       3747        perf-stat.overall.cycles-between-cache-misses
> >>>       0.08 ± 12%      +0.0        0.10        perf-stat.overall.iTLB-load-miss-rate%
> >>>       1268 ± 15%     -23.0%     976.22        perf-stat.overall.instructions-per-iTLB-miss
> >>>       0.23            -9.5%       0.21        perf-stat.overall.ipc
> >>>       5815            +9.7%       6378        perf-stat.overall.path-length
> >>>  1.634e+10           -17.5%  1.348e+10        perf-stat.ps.branch-instructions
> >>>  2.595e+08 ±  3%     -21.2%  2.043e+08 ±  2%  perf-stat.ps.branch-misses
> >>>   72565205           -12.2%   63706339        perf-stat.ps.cache-misses
> >>>  5.754e+08           -19.2%  4.646e+08        perf-stat.ps.cache-references
> >>>       4640 ±  2%     -12.5%       4060        perf-stat.ps.context-switches
> >>>  2.581e+11            -7.5%  2.387e+11        perf-stat.ps.cpu-cycles
> >>>     229.91           -22.0%     179.42        perf-stat.ps.cpu-migrations
> >>>  5.889e+10           -16.3%  4.927e+10        perf-stat.ps.iTLB-loads
> >>>  5.899e+10           -16.3%  4.938e+10        perf-stat.ps.instructions
> >>>       9388           -18.2%       7677        perf-stat.ps.minor-faults
> >>>       9389           -18.2%       7677        perf-stat.ps.page-faults
> >>>  1.764e+13           -16.2%  1.479e+13        perf-stat.total.instructions
> >>>      46803 ±  3%     -18.8%      37982 ±  6%  sched_debug.cfs_rq:/.exec_clock.min
> >>>       5320 ±  3%     +23.7%       6581 ±  3%  sched_debug.cfs_rq:/.exec_clock.stddev
> >>>       6737 ± 14%     +58.1%      10649 ± 10%  sched_debug.cfs_rq:/.load.avg
> >>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.load.max
> >>>      46952 ± 16%     +64.8%      77388 ± 11%  sched_debug.cfs_rq:/.load.stddev
> >>>       7.12 ±  4%     +49.1%      10.62 ±  6%  sched_debug.cfs_rq:/.load_avg.avg
> >>>     474.40 ± 23%     +67.5%     794.60 ± 10%  sched_debug.cfs_rq:/.load_avg.max
> >>>      37.70 ± 11%     +74.8%      65.90 ±  9%  sched_debug.cfs_rq:/.load_avg.stddev
> >>>   13424269 ±  4%     -15.6%   11328098 ±  2%  sched_debug.cfs_rq:/.min_vruntime.avg
> >>>   15411275 ±  3%     -12.4%   13505072 ±  2%  sched_debug.cfs_rq:/.min_vruntime.max
> >>>    7939295 ±  6%     -17.5%    6551322 ±  7%  sched_debug.cfs_rq:/.min_vruntime.min
> >>>      21.44 ±  7%     -56.1%       9.42 ±  4%  sched_debug.cfs_rq:/.nr_spread_over.avg
> >>>     117.45 ± 11%     -60.6%      46.30 ± 14%  sched_debug.cfs_rq:/.nr_spread_over.max
> >>>      19.33 ±  8%     -66.4%       6.49 ±  9%  sched_debug.cfs_rq:/.nr_spread_over.stddev
> >>>       4.32 ± 15%     +84.4%       7.97 ±  3%  sched_debug.cfs_rq:/.runnable_load_avg.avg
> >>>     353.85 ± 29%    +118.8%     774.35 ± 11%  sched_debug.cfs_rq:/.runnable_load_avg.max
> >>>      27.30 ± 24%    +118.5%      59.64 ±  9%  sched_debug.cfs_rq:/.runnable_load_avg.stddev
> >>>       6729 ± 14%     +58.2%      10644 ± 10%  sched_debug.cfs_rq:/.runnable_weight.avg
> >>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.runnable_weight.max
> >>>      46950 ± 16%     +64.8%      77387 ± 11%  sched_debug.cfs_rq:/.runnable_weight.stddev
> >>>    5305069 ±  4%     -17.4%    4380376 ±  7%  sched_debug.cfs_rq:/.spread0.avg
> >>>    7328745 ±  3%      -9.9%    6600897 ±  3%  sched_debug.cfs_rq:/.spread0.max
> >>>    2220837 ±  4%     +55.8%    3460596 ±  5%  sched_debug.cpu.avg_idle.avg
> >>>    4590666 ±  9%     +76.8%    8117037 ± 15%  sched_debug.cpu.avg_idle.max
> >>>     485052 ±  7%     +80.3%     874679 ± 10%  sched_debug.cpu.avg_idle.stddev
> >>>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock.stddev
> >>>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock_task.stddev
> >>>       3.20 ± 10%    +109.6%       6.70 ±  3%  sched_debug.cpu.cpu_load[0].avg
> >>>     309.10 ± 20%    +150.3%     773.75 ± 12%  sched_debug.cpu.cpu_load[0].max
> >>>      21.02 ± 14%    +160.8%      54.80 ±  9%  sched_debug.cpu.cpu_load[0].stddev
> >>>       3.19 ±  8%    +109.8%       6.70 ±  3%  sched_debug.cpu.cpu_load[1].avg
> >>>     299.75 ± 19%    +158.0%     773.30 ± 12%  sched_debug.cpu.cpu_load[1].max
> >>>      20.32 ± 12%    +168.7%      54.62 ±  9%  sched_debug.cpu.cpu_load[1].stddev
> >>>       3.20 ±  8%    +109.1%       6.69 ±  4%  sched_debug.cpu.cpu_load[2].avg
> >>>     288.90 ± 20%    +167.0%     771.40 ± 12%  sched_debug.cpu.cpu_load[2].max
> >>>      19.70 ± 12%    +175.4%      54.27 ±  9%  sched_debug.cpu.cpu_load[2].stddev
> >>>       3.16 ±  8%    +110.9%       6.66 ±  6%  sched_debug.cpu.cpu_load[3].avg
> >>>     275.50 ± 24%    +178.4%     766.95 ± 12%  sched_debug.cpu.cpu_load[3].max
> >>>      18.92 ± 15%    +184.2%      53.77 ± 10%  sched_debug.cpu.cpu_load[3].stddev
> >>>       3.08 ±  8%    +115.7%       6.65 ±  7%  sched_debug.cpu.cpu_load[4].avg
> >>>     263.55 ± 28%    +188.7%     760.85 ± 12%  sched_debug.cpu.cpu_load[4].max
> >>>      18.03 ± 18%    +196.6%      53.46 ± 11%  sched_debug.cpu.cpu_load[4].stddev
> >>>      14543            -9.6%      13150        sched_debug.cpu.curr->pid.max
> >>>       5293 ± 16%     +74.7%       9248 ± 11%  sched_debug.cpu.load.avg
> >>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cpu.load.max
> >>>      40887 ± 19%     +78.3%      72891 ±  9%  sched_debug.cpu.load.stddev
> >>>    1141679 ±  4%     +56.9%    1790907 ±  5%  sched_debug.cpu.max_idle_balance_cost.avg
> >>>    2432100 ±  9%     +72.6%    4196779 ± 13%  sched_debug.cpu.max_idle_balance_cost.max
> >>>     745656           +29.3%     964170 ±  5%  sched_debug.cpu.max_idle_balance_cost.min
> >>>     239032 ±  9%     +81.9%     434806 ± 10%  sched_debug.cpu.max_idle_balance_cost.stddev
> >>>       0.00 ± 27%     +92.1%       0.00 ± 31%  sched_debug.cpu.next_balance.stddev
> >>>       1030 ±  4%     -10.4%     924.00 ±  2%  sched_debug.cpu.nr_switches.min
> >>>       0.04 ± 26%    +139.0%       0.09 ± 41%  sched_debug.cpu.nr_uninterruptible.avg
> >>>     830.35 ±  6%     -12.0%     730.50 ±  2%  sched_debug.cpu.sched_count.min
> >>>     912.00 ±  2%      -9.5%     825.38        sched_debug.cpu.ttwu_count.avg
> >>>     433.05 ±  3%     -19.2%     350.05 ±  3%  sched_debug.cpu.ttwu_count.min
> >>>     160.70 ±  3%     -12.5%     140.60 ±  4%  sched_debug.cpu.ttwu_local.min
> >>>       9072 ± 11%     -36.4%       5767 ±  8%  softirqs.CPU1.RCU
> >>>      12769 ±  5%     +15.3%      14718 ±  3%  softirqs.CPU101.SCHED
> >>>      13198           +11.5%      14717 ±  3%  softirqs.CPU102.SCHED
> >>>      12981 ±  4%     +13.9%      14788 ±  3%  softirqs.CPU105.SCHED
> >>>      13486 ±  3%     +11.8%      15071 ±  4%  softirqs.CPU111.SCHED
> >>>      12794 ±  4%     +14.1%      14601 ±  9%  softirqs.CPU112.SCHED
> >>>      12999 ±  4%     +10.1%      14314 ±  4%  softirqs.CPU115.SCHED
> >>>      12844 ±  4%     +10.6%      14202 ±  2%  softirqs.CPU120.SCHED
> >>>      13336 ±  3%      +9.4%      14585 ±  3%  softirqs.CPU122.SCHED
> >>>      12639 ±  4%     +20.2%      15195        softirqs.CPU123.SCHED
> >>>      13040 ±  5%     +15.2%      15024 ±  5%  softirqs.CPU126.SCHED
> >>>      13123           +15.1%      15106 ±  5%  softirqs.CPU127.SCHED
> >>>       9188 ±  6%     -35.7%       5911 ±  2%  softirqs.CPU13.RCU
> >>>      13054 ±  3%     +13.1%      14761 ±  5%  softirqs.CPU130.SCHED
> >>>      13158 ±  2%     +13.9%      14985 ±  5%  softirqs.CPU131.SCHED
> >>>      12797 ±  6%     +13.5%      14524 ±  3%  softirqs.CPU133.SCHED
> >>>      12452 ±  5%     +14.8%      14297        softirqs.CPU134.SCHED
> >>>      13078 ±  3%     +10.4%      14439 ±  3%  softirqs.CPU138.SCHED
> >>>      12617 ±  2%     +14.5%      14442 ±  5%  softirqs.CPU139.SCHED
> >>>      12974 ±  3%     +13.7%      14752 ±  4%  softirqs.CPU142.SCHED
> >>>      12579 ±  4%     +19.1%      14983 ±  3%  softirqs.CPU143.SCHED
> >>>       9122 ± 24%     -44.6%       5053 ±  5%  softirqs.CPU144.RCU
> >>>      13366 ±  2%     +11.1%      14848 ±  3%  softirqs.CPU149.SCHED
> >>>      13246 ±  2%     +22.0%      16162 ±  7%  softirqs.CPU150.SCHED
> >>>      13452 ±  3%     +20.5%      16210 ±  7%  softirqs.CPU151.SCHED
> >>>      13507           +10.1%      14869        softirqs.CPU156.SCHED
> >>>      13808 ±  3%      +9.2%      15079 ±  4%  softirqs.CPU157.SCHED
> >>>      13442 ±  2%     +13.4%      15248 ±  4%  softirqs.CPU160.SCHED
> >>>      13311           +12.1%      14920 ±  2%  softirqs.CPU162.SCHED
> >>>      13544 ±  3%      +8.5%      14695 ±  4%  softirqs.CPU163.SCHED
> >>>      13648 ±  3%     +11.2%      15179 ±  2%  softirqs.CPU166.SCHED
> >>>      13404 ±  4%     +12.5%      15079 ±  3%  softirqs.CPU168.SCHED
> >>>      13421 ±  6%     +16.0%      15568 ±  8%  softirqs.CPU169.SCHED
> >>>      13115 ±  3%     +23.1%      16139 ± 10%  softirqs.CPU171.SCHED
> >>>      13424 ±  6%     +10.4%      14822 ±  3%  softirqs.CPU175.SCHED
> >>>      13274 ±  3%     +13.7%      15087 ±  9%  softirqs.CPU185.SCHED
> >>>      13409 ±  3%     +12.3%      15063 ±  3%  softirqs.CPU190.SCHED
> >>>      13181 ±  7%     +13.4%      14946 ±  3%  softirqs.CPU196.SCHED
> >>>      13578 ±  3%     +10.9%      15061        softirqs.CPU197.SCHED
> >>>      13323 ±  5%     +24.8%      16627 ±  6%  softirqs.CPU198.SCHED
> >>>      14072 ±  2%     +12.3%      15798 ±  7%  softirqs.CPU199.SCHED
> >>>      12604 ± 13%     +17.9%      14865        softirqs.CPU201.SCHED
> >>>      13380 ±  4%     +14.8%      15356 ±  3%  softirqs.CPU203.SCHED
> >>>      13481 ±  8%     +14.2%      15390 ±  3%  softirqs.CPU204.SCHED
> >>>      12921 ±  2%     +13.8%      14710 ±  3%  softirqs.CPU206.SCHED
> >>>      13468           +13.0%      15218 ±  2%  softirqs.CPU208.SCHED
> >>>      13253 ±  2%     +13.1%      14992        softirqs.CPU209.SCHED
> >>>      13319 ±  2%     +14.3%      15225 ±  7%  softirqs.CPU210.SCHED
> >>>      13673 ±  5%     +16.3%      15895 ±  3%  softirqs.CPU211.SCHED
> >>>      13290           +17.0%      15556 ±  5%  softirqs.CPU212.SCHED
> >>>      13455 ±  4%     +14.4%      15392 ±  3%  softirqs.CPU213.SCHED
> >>>      13454 ±  4%     +14.3%      15377 ±  3%  softirqs.CPU215.SCHED
> >>>      13872 ±  7%      +9.7%      15221 ±  5%  softirqs.CPU220.SCHED
> >>>      13555 ±  4%     +17.3%      15896 ±  5%  softirqs.CPU222.SCHED
> >>>      13411 ±  4%     +20.8%      16197 ±  6%  softirqs.CPU223.SCHED
> >>>       8472 ± 21%     -44.8%       4680 ±  3%  softirqs.CPU224.RCU
> >>>      13141 ±  3%     +16.2%      15265 ±  7%  softirqs.CPU225.SCHED
> >>>      14084 ±  3%      +8.2%      15242 ±  2%  softirqs.CPU226.SCHED
> >>>      13528 ±  4%     +11.3%      15063 ±  4%  softirqs.CPU228.SCHED
> >>>      13218 ±  3%     +16.3%      15377 ±  4%  softirqs.CPU229.SCHED
> >>>      14031 ±  4%     +10.2%      15467 ±  2%  softirqs.CPU231.SCHED
> >>>      13770 ±  3%     +14.0%      15700 ±  3%  softirqs.CPU232.SCHED
> >>>      13456 ±  3%     +12.3%      15105 ±  3%  softirqs.CPU233.SCHED
> >>>      13137 ±  4%     +13.5%      14909 ±  3%  softirqs.CPU234.SCHED
> >>>      13318 ±  2%     +14.7%      15280 ±  2%  softirqs.CPU235.SCHED
> >>>      13690 ±  2%     +13.7%      15563 ±  7%  softirqs.CPU238.SCHED
> >>>      13771 ±  5%     +20.8%      16634 ±  7%  softirqs.CPU241.SCHED
> >>>      13317 ±  7%     +19.5%      15919 ±  9%  softirqs.CPU243.SCHED
> >>>       8234 ± 16%     -43.9%       4616 ±  5%  softirqs.CPU244.RCU
> >>>      13845 ±  6%     +13.0%      15643 ±  3%  softirqs.CPU244.SCHED
> >>>      13179 ±  3%     +16.3%      15323        softirqs.CPU246.SCHED
> >>>      13754           +12.2%      15438 ±  3%  softirqs.CPU248.SCHED
> >>>      13769 ±  4%     +10.9%      15276 ±  2%  softirqs.CPU252.SCHED
> >>>      13702           +10.5%      15147 ±  2%  softirqs.CPU254.SCHED
> >>>      13315 ±  2%     +12.5%      14980 ±  3%  softirqs.CPU255.SCHED
> >>>      13785 ±  3%     +12.9%      15568 ±  5%  softirqs.CPU256.SCHED
> >>>      13307 ±  3%     +15.0%      15298 ±  3%  softirqs.CPU257.SCHED
> >>>      13864 ±  3%     +10.5%      15313 ±  2%  softirqs.CPU259.SCHED
> >>>      13879 ±  2%     +11.4%      15465        softirqs.CPU261.SCHED
> >>>      13815           +13.6%      15687 ±  5%  softirqs.CPU264.SCHED
> >>>     119574 ±  2%     +11.8%     133693 ± 11%  softirqs.CPU266.TIMER
> >>>      13688           +10.9%      15180 ±  6%  softirqs.CPU267.SCHED
> >>>      11716 ±  4%     +19.3%      13974 ±  8%  softirqs.CPU27.SCHED
> >>>      13866 ±  3%     +13.7%      15765 ±  4%  softirqs.CPU271.SCHED
> >>>      13887 ±  5%     +12.5%      15621        softirqs.CPU272.SCHED
> >>>      13383 ±  3%     +19.8%      16031 ±  2%  softirqs.CPU274.SCHED
> >>>      13347           +14.1%      15232 ±  3%  softirqs.CPU275.SCHED
> >>>      12884 ±  2%     +21.0%      15593 ±  4%  softirqs.CPU276.SCHED
> >>>      13131 ±  5%     +13.4%      14891 ±  5%  softirqs.CPU277.SCHED
> >>>      12891 ±  2%     +19.2%      15371 ±  4%  softirqs.CPU278.SCHED
> >>>      13313 ±  4%     +13.0%      15049 ±  2%  softirqs.CPU279.SCHED
> >>>      13514 ±  3%     +10.2%      14897 ±  2%  softirqs.CPU280.SCHED
> >>>      13501 ±  3%     +13.7%      15346        softirqs.CPU281.SCHED
> >>>      13261           +17.5%      15577        softirqs.CPU282.SCHED
> >>>       8076 ± 15%     -43.7%       4546 ±  5%  softirqs.CPU283.RCU
> >>>      13686 ±  3%     +12.6%      15413 ±  2%  softirqs.CPU284.SCHED
> >>>      13439 ±  2%      +9.2%      14670 ±  4%  softirqs.CPU285.SCHED
> >>>       8878 ±  9%     -35.4%       5735 ±  4%  softirqs.CPU35.RCU
> >>>      11690 ±  2%     +13.6%      13274 ±  5%  softirqs.CPU40.SCHED
> >>>      11714 ±  2%     +19.3%      13975 ± 13%  softirqs.CPU41.SCHED
> >>>      11763           +12.5%      13239 ±  4%  softirqs.CPU45.SCHED
> >>>      11662 ±  2%      +9.4%      12757 ±  3%  softirqs.CPU46.SCHED
> >>>      11805 ±  2%      +9.3%      12902 ±  2%  softirqs.CPU50.SCHED
> >>>      12158 ±  3%     +12.3%      13655 ±  8%  softirqs.CPU55.SCHED
> >>>      11716 ±  4%      +8.8%      12751 ±  3%  softirqs.CPU58.SCHED
> >>>      11922 ±  2%      +9.9%      13100 ±  4%  softirqs.CPU64.SCHED
> >>>       9674 ± 17%     -41.8%       5625 ±  6%  softirqs.CPU66.RCU
> >>>      11818           +12.0%      13237        softirqs.CPU66.SCHED
> >>>     124682 ±  7%      -6.1%     117088 ±  5%  softirqs.CPU66.TIMER
> >>>       8637 ±  9%     -34.0%       5700 ±  7%  softirqs.CPU70.RCU
> >>>      11624 ±  2%     +11.0%      12901 ±  2%  softirqs.CPU70.SCHED
> >>>      12372 ±  2%     +13.2%      14003 ±  3%  softirqs.CPU71.SCHED
> >>>       9949 ± 25%     -33.9%       6574 ± 31%  softirqs.CPU72.RCU
> >>>      10392 ± 26%     -35.1%       6745 ± 35%  softirqs.CPU73.RCU
> >>>      12766 ±  3%     +11.1%      14188 ±  3%  softirqs.CPU76.SCHED
> >>>      12611 ±  2%     +18.8%      14984 ±  5%  softirqs.CPU78.SCHED
> >>>      12786 ±  3%     +17.9%      15079 ±  7%  softirqs.CPU79.SCHED
> >>>      11947 ±  4%      +9.7%      13103 ±  4%  softirqs.CPU8.SCHED
> >>>      13379 ±  7%     +11.8%      14962 ±  4%  softirqs.CPU83.SCHED
> >>>      13438 ±  5%      +9.7%      14738 ±  2%  softirqs.CPU84.SCHED
> >>>      12768           +19.4%      15241 ±  6%  softirqs.CPU88.SCHED
> >>>       8604 ± 13%     -39.3%       5222 ±  3%  softirqs.CPU89.RCU
> >>>      13077 ±  2%     +17.1%      15308 ±  7%  softirqs.CPU89.SCHED
> >>>      11887 ±  3%     +20.1%      14272 ±  5%  softirqs.CPU9.SCHED
> >>>      12723 ±  3%     +11.3%      14165 ±  4%  softirqs.CPU90.SCHED
> >>>       8439 ± 12%     -38.9%       5153 ±  4%  softirqs.CPU91.RCU
> >>>      13429 ±  3%     +10.3%      14806 ±  2%  softirqs.CPU95.SCHED
> >>>      12852 ±  4%     +10.3%      14174 ±  5%  softirqs.CPU96.SCHED
> >>>      13010 ±  2%     +14.4%      14888 ±  5%  softirqs.CPU97.SCHED
> >>>    2315644 ±  4%     -36.2%    1477200 ±  4%  softirqs.RCU
> >>>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.NMI:Non-maskable_interrupts
> >>>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.PMI:Performance_monitoring_interrupts
> >>>     252.00 ± 11%     -35.2%     163.25 ± 13%  interrupts.CPU104.RES:Rescheduling_interrupts
> >>>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.NMI:Non-maskable_interrupts
> >>>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.PMI:Performance_monitoring_interrupts
> >>>     245.75 ± 19%     -31.0%     169.50 ±  7%  interrupts.CPU105.RES:Rescheduling_interrupts
> >>>     228.75 ± 13%     -24.7%     172.25 ± 19%  interrupts.CPU106.RES:Rescheduling_interrupts
> >>>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.NMI:Non-maskable_interrupts
> >>>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.PMI:Performance_monitoring_interrupts
> >>>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.NMI:Non-maskable_interrupts
> >>>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.PMI:Performance_monitoring_interrupts
> >>>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.NMI:Non-maskable_interrupts
> >>>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.PMI:Performance_monitoring_interrupts
> >>>     311.50 ± 23%     -47.7%     163.00 ±  9%  interrupts.CPU122.RES:Rescheduling_interrupts
> >>>     266.75 ± 19%     -31.6%     182.50 ± 15%  interrupts.CPU124.RES:Rescheduling_interrupts
> >>>     293.75 ± 33%     -32.3%     198.75 ± 19%  interrupts.CPU125.RES:Rescheduling_interrupts
> >>>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.NMI:Non-maskable_interrupts
> >>>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.PMI:Performance_monitoring_interrupts
> >>>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.NMI:Non-maskable_interrupts
> >>>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.PMI:Performance_monitoring_interrupts
> >>>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.NMI:Non-maskable_interrupts
> >>>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.PMI:Performance_monitoring_interrupts
> >>>     219.50 ± 27%     -23.0%     169.00 ± 21%  interrupts.CPU139.RES:Rescheduling_interrupts
> >>>     290.25 ± 25%     -32.5%     196.00 ± 11%  interrupts.CPU14.RES:Rescheduling_interrupts
> >>>     243.50 ±  4%     -16.0%     204.50 ± 12%  interrupts.CPU140.RES:Rescheduling_interrupts
> >>>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.NMI:Non-maskable_interrupts
> >>>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.PMI:Performance_monitoring_interrupts
> >>>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.NMI:Non-maskable_interrupts
> >>>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.PMI:Performance_monitoring_interrupts
> >>>     292.25 ± 34%     -33.9%     193.25 ±  6%  interrupts.CPU15.RES:Rescheduling_interrupts
> >>>     424.25 ± 37%     -58.5%     176.25 ± 14%  interrupts.CPU158.RES:Rescheduling_interrupts
> >>>     312.50 ± 42%     -54.2%     143.00 ± 18%  interrupts.CPU159.RES:Rescheduling_interrupts
> >>>     725.00 ±118%     -75.7%     176.25 ± 14%  interrupts.CPU163.RES:Rescheduling_interrupts
> >>>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.NMI:Non-maskable_interrupts
> >>>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.PMI:Performance_monitoring_interrupts
> >>>     239.50 ± 30%     -46.6%     128.00 ± 14%  interrupts.CPU179.RES:Rescheduling_interrupts
> >>>     320.75 ± 15%     -24.0%     243.75 ± 20%  interrupts.CPU20.RES:Rescheduling_interrupts
> >>>     302.50 ± 17%     -47.2%     159.75 ±  8%  interrupts.CPU200.RES:Rescheduling_interrupts
> >>>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.NMI:Non-maskable_interrupts
> >>>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.PMI:Performance_monitoring_interrupts
> >>>     217.00 ± 11%     -34.6%     142.00 ± 12%  interrupts.CPU214.RES:Rescheduling_interrupts
> >>>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.NMI:Non-maskable_interrupts
> >>>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.PMI:Performance_monitoring_interrupts
> >>>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.NMI:Non-maskable_interrupts
> >>>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.PMI:Performance_monitoring_interrupts
> >>>     289.50 ± 28%     -41.1%     170.50 ±  8%  interrupts.CPU22.RES:Rescheduling_interrupts
> >>>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.NMI:Non-maskable_interrupts
> >>>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.PMI:Performance_monitoring_interrupts
> >>>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.NMI:Non-maskable_interrupts
> >>>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.PMI:Performance_monitoring_interrupts
> >>>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.NMI:Non-maskable_interrupts
> >>>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.PMI:Performance_monitoring_interrupts
> >>>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.NMI:Non-maskable_interrupts
> >>>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.PMI:Performance_monitoring_interrupts
> >>>     248.00 ± 36%     -36.3%     158.00 ± 19%  interrupts.CPU228.RES:Rescheduling_interrupts
> >>>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.NMI:Non-maskable_interrupts
> >>>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.PMI:Performance_monitoring_interrupts
> >>>     404.25 ± 69%     -65.5%     139.50 ± 17%  interrupts.CPU236.RES:Rescheduling_interrupts
> >>>     566.50 ± 40%     -73.6%     149.50 ± 31%  interrupts.CPU237.RES:Rescheduling_interrupts
> >>>     243.50 ± 26%     -37.1%     153.25 ± 21%  interrupts.CPU248.RES:Rescheduling_interrupts
> >>>     258.25 ± 12%     -53.5%     120.00 ± 18%  interrupts.CPU249.RES:Rescheduling_interrupts
> >>>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.NMI:Non-maskable_interrupts
> >>>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.PMI:Performance_monitoring_interrupts
> >>>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.NMI:Non-maskable_interrupts
> >>>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.PMI:Performance_monitoring_interrupts
> >>>     425.00 ± 59%     -60.3%     168.75 ± 34%  interrupts.CPU258.RES:Rescheduling_interrupts
> >>>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.NMI:Non-maskable_interrupts
> >>>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.PMI:Performance_monitoring_interrupts
> >>>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.NMI:Non-maskable_interrupts
> >>>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.PMI:Performance_monitoring_interrupts
> >>>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.NMI:Non-maskable_interrupts
> >>>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.PMI:Performance_monitoring_interrupts
> >>>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.NMI:Non-maskable_interrupts
> >>>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.PMI:Performance_monitoring_interrupts
> >>>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.NMI:Non-maskable_interrupts
> >>>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.PMI:Performance_monitoring_interrupts
> >>>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.NMI:Non-maskable_interrupts
> >>>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.PMI:Performance_monitoring_interrupts
> >>>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.NMI:Non-maskable_interrupts
> >>>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.PMI:Performance_monitoring_interrupts
> >>>     331.75 ± 32%     -39.8%     199.75 ± 17%  interrupts.CPU29.RES:Rescheduling_interrupts
> >>>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.NMI:Non-maskable_interrupts
> >>>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.PMI:Performance_monitoring_interrupts
> >>>     298.50 ± 30%     -39.7%     180.00 ±  6%  interrupts.CPU34.RES:Rescheduling_interrupts
> >>>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.NMI:Non-maskable_interrupts
> >>>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.PMI:Performance_monitoring_interrupts
> >>>     270.50 ± 24%     -31.1%     186.25 ±  3%  interrupts.CPU36.RES:Rescheduling_interrupts
> >>>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.NMI:Non-maskable_interrupts
> >>>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.PMI:Performance_monitoring_interrupts
> >>>     286.75 ± 36%     -32.4%     193.75 ±  7%  interrupts.CPU45.RES:Rescheduling_interrupts
> >>>     259.00 ± 12%     -23.6%     197.75 ± 13%  interrupts.CPU46.RES:Rescheduling_interrupts
> >>>     244.00 ± 21%     -35.6%     157.25 ± 11%  interrupts.CPU47.RES:Rescheduling_interrupts
> >>>     230.00 ±  7%     -21.3%     181.00 ± 11%  interrupts.CPU48.RES:Rescheduling_interrupts
> >>>     281.00 ± 13%     -27.4%     204.00 ± 15%  interrupts.CPU53.RES:Rescheduling_interrupts
> >>>     256.75 ±  5%     -18.4%     209.50 ± 12%  interrupts.CPU54.RES:Rescheduling_interrupts
> >>>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.NMI:Non-maskable_interrupts
> >>>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.PMI:Performance_monitoring_interrupts
> >>>     316.00 ± 25%     -41.4%     185.25 ± 13%  interrupts.CPU59.RES:Rescheduling_interrupts
> >>>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.NMI:Non-maskable_interrupts
> >>>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.PMI:Performance_monitoring_interrupts
> >>>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.NMI:Non-maskable_interrupts
> >>>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.PMI:Performance_monitoring_interrupts
> >>>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.NMI:Non-maskable_interrupts
> >>>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.PMI:Performance_monitoring_interrupts
> >>>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.NMI:Non-maskable_interrupts
> >>>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.PMI:Performance_monitoring_interrupts
> >>>     319.00 ± 40%     -44.7%     176.25 ±  9%  interrupts.CPU67.RES:Rescheduling_interrupts
> >>>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.NMI:Non-maskable_interrupts
> >>>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.PMI:Performance_monitoring_interrupts
> >>>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.NMI:Non-maskable_interrupts
> >>>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.PMI:Performance_monitoring_interrupts
> >>>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.NMI:Non-maskable_interrupts
> >>>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.PMI:Performance_monitoring_interrupts
> >>>     426.75 ± 61%     -67.7%     138.00 ±  8%  interrupts.CPU75.RES:Rescheduling_interrupts
> >>>     192.50 ± 13%     +45.6%     280.25 ± 45%  interrupts.CPU76.RES:Rescheduling_interrupts
> >>>     274.25 ± 34%     -42.2%     158.50 ± 34%  interrupts.CPU77.RES:Rescheduling_interrupts
> >>>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.NMI:Non-maskable_interrupts
> >>>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.PMI:Performance_monitoring_interrupts
> >>>     348.50 ± 53%     -47.3%     183.75 ± 29%  interrupts.CPU80.RES:Rescheduling_interrupts
> >>>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.NMI:Non-maskable_interrupts
> >>>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.PMI:Performance_monitoring_interrupts
> >>>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.NMI:Non-maskable_interrupts
> >>>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.PMI:Performance_monitoring_interrupts
> >>>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.NMI:Non-maskable_interrupts
> >>>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.PMI:Performance_monitoring_interrupts
> >>>     408.75 ± 58%     -56.8%     176.75 ± 25%  interrupts.CPU92.RES:Rescheduling_interrupts
> >>>     399.00 ± 64%     -63.6%     145.25 ± 16%  interrupts.CPU93.RES:Rescheduling_interrupts
> >>>     314.75 ± 36%     -44.2%     175.75 ± 13%  interrupts.CPU94.RES:Rescheduling_interrupts
> >>>     191.00 ± 15%     -29.1%     135.50 ±  9%  interrupts.CPU97.RES:Rescheduling_interrupts
> >>>      94.00 ±  8%     +50.0%     141.00 ± 12%  interrupts.IWI:IRQ_work_interrupts
> >>>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.NMI:Non-maskable_interrupts
> >>>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.PMI:Performance_monitoring_interrupts
> >>>      12.75 ± 11%      -4.1        8.67 ± 31%  perf-profile.calltrace.cycles-pp.do_rw_once
> >>>       1.02 ± 16%      -0.6        0.47 ± 59%  perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle
> >>>       1.10 ± 15%      -0.4        0.66 ± 14%  perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
> >>>       1.05 ± 16%      -0.4        0.61 ± 14%  perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter
> >>>       1.58 ±  4%      +0.3        1.91 ±  7%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page
> >>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >>>       2.11 ±  4%      +0.5        2.60 ±  7%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault
> >>>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
> >>>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >>>       1.90 ±  5%      +0.6        2.45 ±  7%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage
> >>>       0.65 ± 62%      +0.6        1.20 ± 15%  perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
> >>>       0.60 ± 62%      +0.6        1.16 ± 18%  perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap
> >>>       0.95 ± 17%      +0.6        1.52 ±  8%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner
> >>>       0.61 ± 62%      +0.6        1.18 ± 18%  perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput
> >>>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit
> >>>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit
> >>>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
> >>>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group
> >>>       1.30 ±  9%      +0.6        1.92 ±  8%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock
> >>>       0.19 ±173%      +0.7        0.89 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu
> >>>       0.19 ±173%      +0.7        0.90 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu
> >>>       0.00            +0.8        0.77 ± 30%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page
> >>>       0.00            +0.8        0.78 ± 30%  perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page
> >>>       0.00            +0.8        0.79 ± 29%  perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
> >>>       0.82 ± 67%      +0.9        1.72 ± 22%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault
> >>>       0.84 ± 66%      +0.9        1.74 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
> >>>       2.52 ±  6%      +0.9        3.44 ±  9%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page
> >>>       0.83 ± 67%      +0.9        1.75 ± 21%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
> >>>       0.84 ± 66%      +0.9        1.77 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
> >>>       1.64 ± 12%      +1.0        2.67 ±  7%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault
> >>>       1.65 ± 45%      +1.3        2.99 ± 18%  perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
> >>>       1.74 ± 13%      +1.4        3.16 ±  6%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault
> >>>       2.56 ± 48%      +2.2        4.81 ± 19%  perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault
> >>>      12.64 ± 14%      +3.6       16.20 ±  8%  perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault
> >>>       2.97 ±  7%      +3.8        6.74 ±  9%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow
> >>>      19.99 ±  9%      +4.1       24.05 ±  6%  perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault
> >>>       1.37 ± 15%      -0.5        0.83 ± 13%  perf-profile.children.cycles-pp.sched_clock_cpu
> >>>       1.31 ± 16%      -0.5        0.78 ± 13%  perf-profile.children.cycles-pp.sched_clock
> >>>       1.29 ± 16%      -0.5        0.77 ± 13%  perf-profile.children.cycles-pp.native_sched_clock
> >>>       1.80 ±  2%      -0.3        1.47 ± 10%  perf-profile.children.cycles-pp.task_tick_fair
> >>>       0.73 ±  2%      -0.2        0.54 ± 11%  perf-profile.children.cycles-pp.update_curr
> >>>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.children.cycles-pp.account_process_tick
> >>>       0.73 ± 10%      -0.2        0.58 ±  9%  perf-profile.children.cycles-pp.rcu_sched_clock_irq
> >>>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.children.cycles-pp.__acct_update_integrals
> >>>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs
> >>>       0.40 ± 12%      -0.1        0.30 ± 14%  perf-profile.children.cycles-pp.__next_timer_interrupt
> >>>       0.47 ±  7%      -0.1        0.39 ± 13%  perf-profile.children.cycles-pp.update_rq_clock
> >>>       0.29 ± 12%      -0.1        0.21 ± 15%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
> >>>       0.21 ±  7%      -0.1        0.14 ± 12%  perf-profile.children.cycles-pp.account_system_index_time
> >>>       0.38 ±  2%      -0.1        0.31 ± 12%  perf-profile.children.cycles-pp.timerqueue_add
> >>>       0.26 ± 11%      -0.1        0.20 ± 13%  perf-profile.children.cycles-pp.find_next_bit
> >>>       0.23 ± 15%      -0.1        0.17 ± 15%  perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit
> >>>       0.14 ±  8%      -0.1        0.07 ± 14%  perf-profile.children.cycles-pp.account_user_time
> >>>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.children.cycles-pp.cpuacct_charge
> >>>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.irq_work_tick
> >>>       0.11 ± 13%      -0.0        0.07 ± 25%  perf-profile.children.cycles-pp.tick_sched_do_timer
> >>>       0.12 ± 10%      -0.0        0.08 ± 15%  perf-profile.children.cycles-pp.get_cpu_device
> >>>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.children.cycles-pp.raise_softirq
> >>>       0.12 ±  3%      -0.0        0.09 ±  8%  perf-profile.children.cycles-pp.write
> >>>       0.11 ± 13%      +0.0        0.14 ±  8%  perf-profile.children.cycles-pp.native_write_msr
> >>>       0.09 ±  9%      +0.0        0.11 ±  7%  perf-profile.children.cycles-pp.finish_task_switch
> >>>       0.10 ± 10%      +0.0        0.13 ±  5%  perf-profile.children.cycles-pp.schedule_idle
> >>>       0.07 ±  6%      +0.0        0.10 ± 12%  perf-profile.children.cycles-pp.__read_nocancel
> >>>       0.04 ± 58%      +0.0        0.07 ± 15%  perf-profile.children.cycles-pp.__free_pages_ok
> >>>       0.06 ±  7%      +0.0        0.09 ± 13%  perf-profile.children.cycles-pp.perf_read
> >>>       0.07            +0.0        0.11 ± 14%  perf-profile.children.cycles-pp.perf_evsel__read_counter
> >>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.cmd_stat
> >>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.__run_perf_stat
> >>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.process_interval
> >>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.read_counters
> >>>       0.07 ± 22%      +0.0        0.11 ± 19%  perf-profile.children.cycles-pp.__handle_mm_fault
> >>>       0.07 ± 19%      +0.1        0.13 ±  8%  perf-profile.children.cycles-pp.rb_erase
> >>>       0.03 ±100%      +0.1        0.09 ±  9%  perf-profile.children.cycles-pp.smp_call_function_single
> >>>       0.01 ±173%      +0.1        0.08 ± 11%  perf-profile.children.cycles-pp.perf_event_read
> >>>       0.00            +0.1        0.07 ± 13%  perf-profile.children.cycles-pp.__perf_event_read_value
> >>>       0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.__intel_pmu_enable_all
> >>>       0.08 ± 17%      +0.1        0.15 ±  8%  perf-profile.children.cycles-pp.native_apic_msr_eoi_write
> >>>       0.04 ±103%      +0.1        0.13 ± 58%  perf-profile.children.cycles-pp.shmem_getpage_gfp
> >>>       0.38 ± 14%      +0.1        0.51 ±  6%  perf-profile.children.cycles-pp.run_timer_softirq
> >>>       0.11 ±  4%      +0.3        0.37 ± 32%  perf-profile.children.cycles-pp.worker_thread
> >>>       0.20 ±  5%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.ret_from_fork
> >>>       0.20 ±  4%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.kthread
> >>>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.memcpy_erms
> >>>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.drm_fb_helper_dirty_work
> >>>       0.00            +0.3        0.31 ± 37%  perf-profile.children.cycles-pp.process_one_work
> >>>       0.47 ± 48%      +0.4        0.91 ± 19%  perf-profile.children.cycles-pp.prep_new_huge_page
> >>>       0.70 ± 29%      +0.5        1.16 ± 18%  perf-profile.children.cycles-pp.free_huge_page
> >>>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_flush_mmu
> >>>       0.72 ± 29%      +0.5        1.18 ± 18%  perf-profile.children.cycles-pp.release_pages
> >>>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_finish_mmu
> >>>       0.76 ± 27%      +0.5        1.23 ± 18%  perf-profile.children.cycles-pp.exit_mmap
> >>>       0.77 ± 27%      +0.5        1.24 ± 18%  perf-profile.children.cycles-pp.mmput
> >>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.__x64_sys_exit_group
> >>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_group_exit
> >>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_exit
> >>>       1.28 ± 29%      +0.5        1.76 ±  9%  perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
> >>>       0.77 ± 28%      +0.5        1.26 ± 13%  perf-profile.children.cycles-pp.alloc_fresh_huge_page
> >>>       1.53 ± 15%      +0.7        2.26 ± 14%  perf-profile.children.cycles-pp.do_syscall_64
> >>>       1.53 ± 15%      +0.7        2.27 ± 14%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
> >>>       1.13 ±  3%      +0.9        2.07 ± 14%  perf-profile.children.cycles-pp.interrupt_entry
> >>>       0.79 ±  9%      +1.0        1.76 ±  5%  perf-profile.children.cycles-pp.perf_event_task_tick
> >>>       1.71 ± 39%      +1.4        3.08 ± 16%  perf-profile.children.cycles-pp.alloc_surplus_huge_page
> >>>       2.66 ± 42%      +2.3        4.94 ± 17%  perf-profile.children.cycles-pp.alloc_huge_page
> >>>       2.89 ± 45%      +2.7        5.54 ± 18%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> >>>       3.34 ± 35%      +2.7        6.02 ± 17%  perf-profile.children.cycles-pp._raw_spin_lock
> >>>      12.77 ± 14%      +3.9       16.63 ±  7%  perf-profile.children.cycles-pp.mutex_spin_on_owner
> >>>      20.12 ±  9%      +4.0       24.16 ±  6%  perf-profile.children.cycles-pp.hugetlb_cow
> >>>      15.40 ± 10%      -3.6       11.84 ± 28%  perf-profile.self.cycles-pp.do_rw_once
> >>>       4.02 ±  9%      -1.3        2.73 ± 30%  perf-profile.self.cycles-pp.do_access
> >>>       2.00 ± 14%      -0.6        1.41 ± 13%  perf-profile.self.cycles-pp.cpuidle_enter_state
> >>>       1.26 ± 16%      -0.5        0.74 ± 13%  perf-profile.self.cycles-pp.native_sched_clock
> >>>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.self.cycles-pp.account_process_tick
> >>>       0.27 ± 19%      -0.2        0.12 ± 17%  perf-profile.self.cycles-pp.timerqueue_del
> >>>       0.53 ±  3%      -0.1        0.38 ± 11%  perf-profile.self.cycles-pp.update_curr
> >>>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.self.cycles-pp.__acct_update_integrals
> >>>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs
> >>>       0.61 ±  4%      -0.1        0.51 ±  8%  perf-profile.self.cycles-pp.task_tick_fair
> >>>       0.20 ±  8%      -0.1        0.12 ± 14%  perf-profile.self.cycles-pp.account_system_index_time
> >>>       0.23 ± 15%      -0.1        0.16 ± 17%  perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit
> >>>       0.25 ± 11%      -0.1        0.18 ± 14%  perf-profile.self.cycles-pp.find_next_bit
> >>>       0.10 ± 11%      -0.1        0.03 ±100%  perf-profile.self.cycles-pp.tick_sched_do_timer
> >>>       0.29            -0.1        0.23 ± 11%  perf-profile.self.cycles-pp.timerqueue_add
> >>>       0.12 ± 10%      -0.1        0.06 ± 17%  perf-profile.self.cycles-pp.account_user_time
> >>>       0.22 ± 15%      -0.1        0.16 ±  6%  perf-profile.self.cycles-pp.scheduler_tick
> >>>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.self.cycles-pp.cpuacct_charge
> >>>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.self.cycles-pp.irq_work_tick
> >>>       0.07 ± 13%      -0.0        0.03 ±100%  perf-profile.self.cycles-pp.update_process_times
> >>>       0.12 ±  7%      -0.0        0.08 ± 15%  perf-profile.self.cycles-pp.get_cpu_device
> >>>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.self.cycles-pp.raise_softirq
> >>>       0.12 ± 11%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.tick_nohz_get_sleep_length
> >>>       0.11 ± 11%      +0.0        0.14 ±  6%  perf-profile.self.cycles-pp.native_write_msr
> >>>       0.10 ±  5%      +0.1        0.15 ±  8%  perf-profile.self.cycles-pp.__remove_hrtimer
> >>>       0.07 ± 23%      +0.1        0.13 ±  8%  perf-profile.self.cycles-pp.rb_erase
> >>>       0.08 ± 17%      +0.1        0.15 ±  7%  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
> >>>       0.00            +0.1        0.08 ± 10%  perf-profile.self.cycles-pp.smp_call_function_single
> >>>       0.32 ± 17%      +0.1        0.42 ±  7%  perf-profile.self.cycles-pp.run_timer_softirq
> >>>       0.22 ±  5%      +0.1        0.34 ±  4%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
> >>>       0.45 ± 15%      +0.2        0.60 ± 12%  perf-profile.self.cycles-pp.rcu_irq_enter
> >>>       0.31 ±  8%      +0.2        0.46 ± 16%  perf-profile.self.cycles-pp.irq_enter
> >>>       0.29 ± 10%      +0.2        0.44 ± 16%  perf-profile.self.cycles-pp.apic_timer_interrupt
> >>>       0.71 ± 30%      +0.2        0.92 ±  8%  perf-profile.self.cycles-pp.perf_mux_hrtimer_handler
> >>>       0.00            +0.3        0.28 ± 37%  perf-profile.self.cycles-pp.memcpy_erms
> >>>       1.12 ±  3%      +0.9        2.02 ± 15%  perf-profile.self.cycles-pp.interrupt_entry
> >>>       0.79 ±  9%      +0.9        1.73 ±  5%  perf-profile.self.cycles-pp.perf_event_task_tick
> >>>       2.49 ± 45%      +2.1        4.55 ± 20%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
> >>>      10.95 ± 15%      +2.7       13.61 ±  8%  perf-profile.self.cycles-pp.mutex_spin_on_owner
> >>>
> >>>
> >>>
> >>>                                vm-scalability.throughput
> >>>
> >>>   1.6e+07 +-+---------------------------------------------------------------+
> >>>           |..+.+    +..+.+..+.+.   +.      +..+.+..+.+..+.+..+.+..+    +    |
> >>>   1.4e+07 +-+  :    :  O      O    O                           O            |
> >>>   1.2e+07 O-+O O  O O    O  O    O    O O  O  O    O    O    O      O  O O  O
> >>>           |     :   :                           O    O    O       O         |
> >>>     1e+07 +-+   :  :                                                        |
> >>>           |     :  :                                                        |
> >>>     8e+06 +-+   :  :                                                        |
> >>>           |      : :                                                        |
> >>>     6e+06 +-+    : :                                                        |
> >>>     4e+06 +-+    : :                                                        |
> >>>           |      ::                                                         |
> >>>     2e+06 +-+     :                                                         |
> >>>           |       :                                                         |
> >>>         0 +-+---------------------------------------------------------------+
> >>>
> >>>
> >>>                          vm-scalability.time.minor_page_faults
> >>>
> >>>   2.5e+06 +-+---------------------------------------------------------------+
> >>>           |                                                                 |
> >>>           |..+.+    +..+.+..+.+..+.+..+.+..  .+.  .+.+..+.+..+.+..+.+..+    |
> >>>     2e+06 +-+  :    :                      +.   +.                          |
> >>>           O  O O: O O  O O  O O  O O                    O      O            |
> >>>           |     :   :                 O O  O  O O  O O    O  O    O O  O O  O
> >>>   1.5e+06 +-+   :  :                                                        |
> >>>           |     :  :                                                        |
> >>>     1e+06 +-+    : :                                                        |
> >>>           |      : :                                                        |
> >>>           |      : :                                                        |
> >>>    500000 +-+    : :                                                        |
> >>>           |       :                                                         |
> >>>           |       :                                                         |
> >>>         0 +-+---------------------------------------------------------------+
> >>>
> >>>
> >>>                                 vm-scalability.workload
> >>>
> >>>   3.5e+09 +-+---------------------------------------------------------------+
> >>>           | .+.                      .+.+..                        .+..     |
> >>>     3e+09 +-+  +    +..+.+..+.+..+.+.      +..+.+..+.+..+.+..+.+..+    +    |
> >>>           |    :    :       O O                                O            |
> >>>   2.5e+09 O-+O O: O O  O O       O O  O    O            O                   |
> >>>           |     :   :                   O     O O  O O    O  O    O O  O O  O
> >>>     2e+09 +-+   :  :                                                        |
> >>>           |     :  :                                                        |
> >>>   1.5e+09 +-+    : :                                                        |
> >>>           |      : :                                                        |
> >>>     1e+09 +-+    : :                                                        |
> >>>           |      : :                                                        |
> >>>     5e+08 +-+     :                                                         |
> >>>           |       :                                                         |
> >>>         0 +-+---------------------------------------------------------------+
> >>>
> >>>
> >>> [*] bisect-good sample
> >>> [O] bisect-bad  sample
> >>>
> >>>
> >>>
> >>> Disclaimer:
> >>> Results have been estimated based on internal Intel analysis and are provided
> >>> for informational purposes only. Any difference in system hardware or software
> >>> design or configuration may affect actual performance.
> >>>
> >>>
> >>> Thanks,
> >>> Rong Chen
> >>>
> >>
> >> --
> >> Thomas Zimmermann
> >> Graphics Driver Developer
> >> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> >> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> >> HRB 21284 (AG Nürnberg)
> >>
> >
> >
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-30 18:59       ` Daniel Vetter
@ 2019-07-30 20:26         ` Dave Airlie
  2019-07-31  8:13           ` Daniel Vetter
  0 siblings, 1 reply; 61+ messages in thread
From: Dave Airlie @ 2019-07-30 20:26 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Stephen Rothwell, LKP, dri-devel, Thomas Zimmermann, kernel test robot

On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >
> > Hi
> >
> > Am 30.07.19 um 20:12 schrieb Daniel Vetter:
> > > On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > >> Am 29.07.19 um 11:51 schrieb kernel test robot:
> > >>> Greeting,
> > >>>
> > >>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
> > >>>
> > >>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> > >>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
> > >>
> > >> Daniel, Noralf, we may have to revert this patch.
> > >>
> > >> I expected some change in display performance, but not in VM. Since it's
> > >> a server chipset, probably no one cares much about display performance.
> > >> So that seemed like a good trade-off for re-using shared code.
> > >>
> > >> Part of the patch set is that the generic fb emulation now maps and
> > >> unmaps the fbdev BO when updating the screen. I guess that's the cause
> > >> of the performance regression. And it should be visible with other
> > >> drivers as well if they use a shadow FB for fbdev emulation.
> > >
> > > For fbcon we should need to do any maps/unamps at all, this is for the
> > > fbdev mmap support only. If the testcase mentioned here tests fbdev
> > > mmap handling it's pretty badly misnamed :-) And as long as you don't
> > > have an fbdev mmap there shouldn't be any impact at all.
> >
> > The ast and mgag200 have only a few MiB of VRAM, so we have to get the
> > fbdev BO out if it's not being displayed. If not being mapped, it can be
> > evicted and make room for X, etc.
> >
> > To make this work, the BO's memory is mapped and unmapped in
> > drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
> > That fbdev mapping is established on each screen update, more or less.
> > From my (yet unverified) understanding, this causes the performance
> > regression in the VM code.
> >
> > The original code in mgag200 used to kmap the fbdev BO while it's being
> > displayed; [2] and the drawing code only mapped it when necessary (i.e.,
> > not being display). [3]
>
> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
> cache this.
>
> > I think this could be added for VRAM helpers as well, but it's still a
> > workaround and non-VRAM drivers might also run into such a performance
> > regression if they use the fbdev's shadow fb.
>
> Yeah agreed, fbdev emulation should try to cache the vmap.
>
> > Noralf mentioned that there are plans for other DRM clients besides the
> > console. They would as well run into similar problems.
> >
> > >> The thing is that we'd need another generic fbdev emulation for ast and
> > >> mgag200 that handles this issue properly.
> > >
> > > Yeah I dont think we want to jump the gun here.  If you can try to
> > > repro locally and profile where we're wasting cpu time I hope that
> > > should sched a light what's going wrong here.
> >
> > I don't have much time ATM and I'm not even officially at work until
> > late Aug. I'd send you the revert and investigate later. I agree that
> > using generic fbdev emulation would be preferable.
>
> Still not sure that's the right thing to do really. Yes it's a
> regression, but vm testcases shouldn run a single line of fbcon or drm
> code. So why this is impacted so heavily by a silly drm change is very
> confusing to me. We might be papering over a deeper and much more
> serious issue ...

It's a regression, the right thing is to revert first and then work
out the right thing to do.

It's likely the test runs on the console and printfs stuff out while running.

Dave.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-30 20:26         ` Dave Airlie
@ 2019-07-31  8:13           ` Daniel Vetter
  2019-07-31  9:25             ` [LKP] " Huang, Ying
  2019-07-31 10:10             ` Thomas Zimmermann
  0 siblings, 2 replies; 61+ messages in thread
From: Daniel Vetter @ 2019-07-31  8:13 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Stephen Rothwell, LKP, dri-devel, Thomas Zimmermann, kernel test robot

On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > >
> > > Hi
> > >
> > > Am 30.07.19 um 20:12 schrieb Daniel Vetter:
> > > > On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > > >> Am 29.07.19 um 11:51 schrieb kernel test robot:
> > > >>> Greeting,
> > > >>>
> > > >>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
> > > >>>
> > > >>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> > > >>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
> > > >>
> > > >> Daniel, Noralf, we may have to revert this patch.
> > > >>
> > > >> I expected some change in display performance, but not in VM. Since it's
> > > >> a server chipset, probably no one cares much about display performance.
> > > >> So that seemed like a good trade-off for re-using shared code.
> > > >>
> > > >> Part of the patch set is that the generic fb emulation now maps and
> > > >> unmaps the fbdev BO when updating the screen. I guess that's the cause
> > > >> of the performance regression. And it should be visible with other
> > > >> drivers as well if they use a shadow FB for fbdev emulation.
> > > >
> > > > For fbcon we should need to do any maps/unamps at all, this is for the
> > > > fbdev mmap support only. If the testcase mentioned here tests fbdev
> > > > mmap handling it's pretty badly misnamed :-) And as long as you don't
> > > > have an fbdev mmap there shouldn't be any impact at all.
> > >
> > > The ast and mgag200 have only a few MiB of VRAM, so we have to get the
> > > fbdev BO out if it's not being displayed. If not being mapped, it can be
> > > evicted and make room for X, etc.
> > >
> > > To make this work, the BO's memory is mapped and unmapped in
> > > drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
> > > That fbdev mapping is established on each screen update, more or less.
> > > From my (yet unverified) understanding, this causes the performance
> > > regression in the VM code.
> > >
> > > The original code in mgag200 used to kmap the fbdev BO while it's being
> > > displayed; [2] and the drawing code only mapped it when necessary (i.e.,
> > > not being display). [3]
> >
> > Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
> > cache this.
> >
> > > I think this could be added for VRAM helpers as well, but it's still a
> > > workaround and non-VRAM drivers might also run into such a performance
> > > regression if they use the fbdev's shadow fb.
> >
> > Yeah agreed, fbdev emulation should try to cache the vmap.
> >
> > > Noralf mentioned that there are plans for other DRM clients besides the
> > > console. They would as well run into similar problems.
> > >
> > > >> The thing is that we'd need another generic fbdev emulation for ast and
> > > >> mgag200 that handles this issue properly.
> > > >
> > > > Yeah I dont think we want to jump the gun here.  If you can try to
> > > > repro locally and profile where we're wasting cpu time I hope that
> > > > should sched a light what's going wrong here.
> > >
> > > I don't have much time ATM and I'm not even officially at work until
> > > late Aug. I'd send you the revert and investigate later. I agree that
> > > using generic fbdev emulation would be preferable.
> >
> > Still not sure that's the right thing to do really. Yes it's a
> > regression, but vm testcases shouldn run a single line of fbcon or drm
> > code. So why this is impacted so heavily by a silly drm change is very
> > confusing to me. We might be papering over a deeper and much more
> > serious issue ...
>
> It's a regression, the right thing is to revert first and then work
> out the right thing to do.

Sure, but I have no idea whether the testcase is doing something
reasonable. If it's accidentally testing vm scalability of fbdev and
there's no one else doing something this pointless, then it's not a
real bug. Plus I think we're shooting the messenger here.

> It's likely the test runs on the console and printfs stuff out while running.

But why did we not regress the world if a few prints on the console
have such a huge impact? We didn't get an entire stream of mails about
breaking stuff ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-31  8:13           ` Daniel Vetter
@ 2019-07-31  9:25             ` Huang, Ying
  2019-07-31 10:12               ` Thomas Zimmermann
  2019-07-31 10:21               ` Michel Dänzer
  2019-07-31 10:10             ` Thomas Zimmermann
  1 sibling, 2 replies; 61+ messages in thread
From: Huang, Ying @ 2019-07-31  9:25 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Stephen Rothwell, Rong A. Chen, LKP, dri-devel, Thomas Zimmermann

Hi, Daniel,

Daniel Vetter <daniel@ffwll.ch> writes:

> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>
>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>> >
>> > On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> > >
>> > > Hi
>> > >
>> > > Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>> > > > On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> > > >> Am 29.07.19 um 11:51 schrieb kernel test robot:
>> > > >>> Greeting,
>> > > >>>
>> > > >>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>> > > >>>
>> > > >>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>> > > >>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>> > > >>
>> > > >> Daniel, Noralf, we may have to revert this patch.
>> > > >>
>> > > >> I expected some change in display performance, but not in VM. Since it's
>> > > >> a server chipset, probably no one cares much about display performance.
>> > > >> So that seemed like a good trade-off for re-using shared code.
>> > > >>
>> > > >> Part of the patch set is that the generic fb emulation now maps and
>> > > >> unmaps the fbdev BO when updating the screen. I guess that's the cause
>> > > >> of the performance regression. And it should be visible with other
>> > > >> drivers as well if they use a shadow FB for fbdev emulation.
>> > > >
>> > > > For fbcon we should need to do any maps/unamps at all, this is for the
>> > > > fbdev mmap support only. If the testcase mentioned here tests fbdev
>> > > > mmap handling it's pretty badly misnamed :-) And as long as you don't
>> > > > have an fbdev mmap there shouldn't be any impact at all.
>> > >
>> > > The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>> > > fbdev BO out if it's not being displayed. If not being mapped, it can be
>> > > evicted and make room for X, etc.
>> > >
>> > > To make this work, the BO's memory is mapped and unmapped in
>> > > drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>> > > That fbdev mapping is established on each screen update, more or less.
>> > > From my (yet unverified) understanding, this causes the performance
>> > > regression in the VM code.
>> > >
>> > > The original code in mgag200 used to kmap the fbdev BO while it's being
>> > > displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>> > > not being display). [3]
>> >
>> > Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>> > cache this.
>> >
>> > > I think this could be added for VRAM helpers as well, but it's still a
>> > > workaround and non-VRAM drivers might also run into such a performance
>> > > regression if they use the fbdev's shadow fb.
>> >
>> > Yeah agreed, fbdev emulation should try to cache the vmap.
>> >
>> > > Noralf mentioned that there are plans for other DRM clients besides the
>> > > console. They would as well run into similar problems.
>> > >
>> > > >> The thing is that we'd need another generic fbdev emulation for ast and
>> > > >> mgag200 that handles this issue properly.
>> > > >
>> > > > Yeah I dont think we want to jump the gun here.  If you can try to
>> > > > repro locally and profile where we're wasting cpu time I hope that
>> > > > should sched a light what's going wrong here.
>> > >
>> > > I don't have much time ATM and I'm not even officially at work until
>> > > late Aug. I'd send you the revert and investigate later. I agree that
>> > > using generic fbdev emulation would be preferable.
>> >
>> > Still not sure that's the right thing to do really. Yes it's a
>> > regression, but vm testcases shouldn run a single line of fbcon or drm
>> > code. So why this is impacted so heavily by a silly drm change is very
>> > confusing to me. We might be papering over a deeper and much more
>> > serious issue ...
>>
>> It's a regression, the right thing is to revert first and then work
>> out the right thing to do.
>
> Sure, but I have no idea whether the testcase is doing something
> reasonable. If it's accidentally testing vm scalability of fbdev and
> there's no one else doing something this pointless, then it's not a
> real bug. Plus I think we're shooting the messenger here.
>
>> It's likely the test runs on the console and printfs stuff out while running.
>
> But why did we not regress the world if a few prints on the console
> have such a huge impact? We didn't get an entire stream of mails about
> breaking stuff ...

The regression does not seem to be related to the commit, but we have
retested and confirmed it.  It's hard to understand what is happening.

Best Regards,
Huang, Ying
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-31  8:13           ` Daniel Vetter
  2019-07-31  9:25             ` [LKP] " Huang, Ying
@ 2019-07-31 10:10             ` Thomas Zimmermann
  2019-08-02  9:11               ` Daniel Vetter
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-07-31 10:10 UTC (permalink / raw)
  To: Daniel Vetter, Dave Airlie
  Cc: Stephen Rothwell, LKP, dri-devel, kernel test robot


[-- Attachment #1.1.1: Type: text/plain, Size: 5020 bytes --]

Hi

Am 31.07.19 um 10:13 schrieb Daniel Vetter:
> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>
>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>
>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>
>>>> Hi
>>>>
>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>> Greeting,
>>>>>>>
>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>>>>>>>
>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>
>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>
>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>
>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>> of the performance regression. And it should be visible with other
>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>
>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>
>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>> evicted and make room for X, etc.
>>>>
>>>> To make this work, the BO's memory is mapped and unmapped in
>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>> That fbdev mapping is established on each screen update, more or less.
>>>> From my (yet unverified) understanding, this causes the performance
>>>> regression in the VM code.
>>>>
>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>> not being display). [3]
>>>
>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>> cache this.
>>>
>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>> workaround and non-VRAM drivers might also run into such a performance
>>>> regression if they use the fbdev's shadow fb.
>>>
>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>
>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>> console. They would as well run into similar problems.
>>>>
>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>> mgag200 that handles this issue properly.
>>>>>
>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>> should sched a light what's going wrong here.
>>>>
>>>> I don't have much time ATM and I'm not even officially at work until
>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>> using generic fbdev emulation would be preferable.
>>>
>>> Still not sure that's the right thing to do really. Yes it's a
>>> regression, but vm testcases shouldn run a single line of fbcon or drm
>>> code. So why this is impacted so heavily by a silly drm change is very
>>> confusing to me. We might be papering over a deeper and much more
>>> serious issue ...
>>
>> It's a regression, the right thing is to revert first and then work
>> out the right thing to do.
> 
> Sure, but I have no idea whether the testcase is doing something
> reasonable. If it's accidentally testing vm scalability of fbdev and
> there's no one else doing something this pointless, then it's not a
> real bug. Plus I think we're shooting the messenger here.
> 
>> It's likely the test runs on the console and printfs stuff out while running.
> 
> But why did we not regress the world if a few prints on the console
> have such a huge impact? We didn't get an entire stream of mails about
> breaking stuff ...

The vmap/vunmap pair is only executed for fbdev emulation with a shadow
FB, and most of those drivers use the shmem helpers, which ref-count the
vmap calls internally. My guess is that BOs managed by the VRAM helpers
are currently the only ones triggering this problem.
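
For illustration, a minimal userspace sketch of that ref-counting idea
(the names fake_bo, bo_vmap, etc. are hypothetical and not the actual
shmem or VRAM helper API): the first user pays for the expensive
mapping, later users just reuse it, and the mapping is only torn down
when the last user drops it.  While a client holds a long-term mapping,
the per-update vmap/vunmap in the dirty worker degenerates to a counter
increment/decrement.

/* build: cc -pthread refcount-vmap-sketch.c (hypothetical file name) */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_bo {
        size_t size;
        void *vaddr;             /* cached mapping, NULL while unmapped */
        unsigned int vmap_count; /* number of active vmap users */
        pthread_mutex_t lock;
};

/* Stand-ins for the expensive mapping/unmapping of VRAM pages. */
static void *expensive_map(struct fake_bo *bo)
{
        printf("mapping %zu bytes\n", bo->size);
        return malloc(bo->size);
}

static void expensive_unmap(struct fake_bo *bo)
{
        printf("unmapping\n");
        free(bo->vaddr);
        bo->vaddr = NULL;
}

static void *bo_vmap(struct fake_bo *bo)
{
        void *vaddr;

        pthread_mutex_lock(&bo->lock);
        if (bo->vmap_count++ == 0)
                bo->vaddr = expensive_map(bo);
        vaddr = bo->vaddr;
        pthread_mutex_unlock(&bo->lock);

        return vaddr;
}

static void bo_vunmap(struct fake_bo *bo)
{
        pthread_mutex_lock(&bo->lock);
        if (--bo->vmap_count == 0)
                expensive_unmap(bo);
        pthread_mutex_unlock(&bo->lock);
}

int main(void)
{
        struct fake_bo bo = {
                .size = 4u << 20,
                .lock = PTHREAD_MUTEX_INITIALIZER,
        };
        void *cached;
        int i;

        /* Uncached: every screen update pays for a full map/unmap cycle. */
        for (i = 0; i < 3; i++) {
                void *dst = bo_vmap(&bo);
                /* ... copy the damaged region from the shadow FB to dst ... */
                (void)dst;
                bo_vunmap(&bo);
        }

        /*
         * Cached: a client holds one long-term mapping, so the per-update
         * vmap/vunmap calls above reduce to refcount operations.
         */
        cached = bo_vmap(&bo);
        for (i = 0; i < 3; i++) {
                void *dst = bo_vmap(&bo);
                (void)dst;
                bo_vunmap(&bo);
        }
        (void)cached;
        bo_vunmap(&bo);

        return 0;
}

The eviction constraint for the small VRAM BOs would still have to be
honoured on top of this, e.g. by dropping the cached mapping (and with
it the pin) whenever the buffer is not being displayed.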

Best regards
Thomas

> -Daniel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-31  9:25             ` [LKP] " Huang, Ying
@ 2019-07-31 10:12               ` Thomas Zimmermann
  2019-07-31 10:21               ` Michel Dänzer
  1 sibling, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-07-31 10:12 UTC (permalink / raw)
  To: Huang, Ying, Daniel Vetter; +Cc: Stephen Rothwell, LKP, dri-devel, Rong A. Chen


[-- Attachment #1.1.1: Type: text/plain, Size: 5329 bytes --]

Hi

Am 31.07.19 um 11:25 schrieb Huang, Ying:
> Hi, Daniel,
> 
> Daniel Vetter <daniel@ffwll.ch> writes:
> 
>> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>>
>>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>
>>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>>> Greeting,
>>>>>>>>
>>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>>>>>>>>
>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>>
>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>
>>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>
>>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>>> of the performance regression. And it should be visible with other
>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>
>>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>
>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>>> evicted and make room for X, etc.
>>>>>
>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>>> That fbdev mapping is established on each screen update, more or less.
>>>>> From my (yet unverified) understanding, this causes the performance
>>>>> regression in the VM code.
>>>>>
>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>>> not being display). [3]
>>>>
>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>> cache this.
>>>>
>>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>>> workaround and non-VRAM drivers might also run into such a performance
>>>>> regression if they use the fbdev's shadow fb.
>>>>
>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>
>>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>>> console. They would as well run into similar problems.
>>>>>
>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>>> mgag200 that handles this issue properly.
>>>>>>
>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>> should sched a light what's going wrong here.
>>>>>
>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>>> using generic fbdev emulation would be preferable.
>>>>
>>>> Still not sure that's the right thing to do really. Yes it's a
>>>> regression, but vm testcases shouldn run a single line of fbcon or drm
>>>> code. So why this is impacted so heavily by a silly drm change is very
>>>> confusing to me. We might be papering over a deeper and much more
>>>> serious issue ...
>>>
>>> It's a regression, the right thing is to revert first and then work
>>> out the right thing to do.
>>
>> Sure, but I have no idea whether the testcase is doing something
>> reasonable. If it's accidentally testing vm scalability of fbdev and
>> there's no one else doing something this pointless, then it's not a
>> real bug. Plus I think we're shooting the messenger here.
>>
>>> It's likely the test runs on the console and printfs stuff out while running.
>>
>> But why did we not regress the world if a few prints on the console
>> have such a huge impact? We didn't get an entire stream of mails about
>> breaking stuff ...
> 
> The regression seems not related to the commit.  But we have retested
> and confirmed the regression.  Hard to understand what happens.

Take a look at commit cf1ca9aeb930df074bb5bbcde55f935fec04e529

Best regards
Thomas

> 
> Best Regards,
> Huang, Ying
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-31  9:25             ` [LKP] " Huang, Ying
  2019-07-31 10:12               ` Thomas Zimmermann
@ 2019-07-31 10:21               ` Michel Dänzer
  2019-08-01  6:19                 ` Rong Chen
  1 sibling, 1 reply; 61+ messages in thread
From: Michel Dänzer @ 2019-07-31 10:21 UTC (permalink / raw)
  To: Huang, Ying, Daniel Vetter
  Cc: Stephen Rothwell, LKP, Thomas Zimmermann, dri-devel, Rong A. Chen

On 2019-07-31 11:25 a.m., Huang, Ying wrote:
> Hi, Daniel,
> 
> Daniel Vetter <daniel@ffwll.ch> writes:
> 
>> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>>
>>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>
>>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>>> Greeting,
>>>>>>>>
>>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>>>>>>>>
>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>>
>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>
>>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>
>>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>>> of the performance regression. And it should be visible with other
>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>
>>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>
>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>>> evicted and make room for X, etc.
>>>>>
>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>>> That fbdev mapping is established on each screen update, more or less.
>>>>> From my (yet unverified) understanding, this causes the performance
>>>>> regression in the VM code.
>>>>>
>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>>> not being display). [3]
>>>>
>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>> cache this.
>>>>
>>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>>> workaround and non-VRAM drivers might also run into such a performance
>>>>> regression if they use the fbdev's shadow fb.
>>>>
>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>
>>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>>> console. They would as well run into similar problems.
>>>>>
>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>>> mgag200 that handles this issue properly.
>>>>>>
>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>> should sched a light what's going wrong here.
>>>>>
>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>>> using generic fbdev emulation would be preferable.
>>>>
>>>> Still not sure that's the right thing to do really. Yes it's a
>>>> regression, but vm testcases shouldn run a single line of fbcon or drm
>>>> code. So why this is impacted so heavily by a silly drm change is very
>>>> confusing to me. We might be papering over a deeper and much more
>>>> serious issue ...
>>>
>>> It's a regression, the right thing is to revert first and then work
>>> out the right thing to do.
>>
>> Sure, but I have no idea whether the testcase is doing something
>> reasonable. If it's accidentally testing vm scalability of fbdev and
>> there's no one else doing something this pointless, then it's not a
>> real bug. Plus I think we're shooting the messenger here.
>>
>>> It's likely the test runs on the console and printfs stuff out while running.
>>
>> But why did we not regress the world if a few prints on the console
>> have such a huge impact? We didn't get an entire stream of mails about
>> breaking stuff ...
> 
> The regression seems not related to the commit.  But we have retested
> and confirmed the regression.  Hard to understand what happens.

Does the regressed test cause any output on console while it's
measuring? If so, it's probably accidentally measuring fbcon/DRM code in
addition to the workload it's trying to measure.


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-31 10:21               ` Michel Dänzer
@ 2019-08-01  6:19                 ` Rong Chen
  2019-08-01  8:37                   ` Feng Tang
                                     ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Rong Chen @ 2019-08-01  6:19 UTC (permalink / raw)
  To: Michel Dänzer, Huang, Ying, Daniel Vetter
  Cc: Stephen Rothwell, LKP, Thomas Zimmermann, dri-devel

[-- Attachment #1: Type: text/plain, Size: 5213 bytes --]

Hi,

On 7/31/19 6:21 PM, Michel Dänzer wrote:
> On 2019-07-31 11:25 a.m., Huang, Ying wrote:
>> Hi, Daniel,
>>
>> Daniel Vetter <daniel@ffwll.ch> writes:
>>
>>> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>> Hi
>>>>>>
>>>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>>>> Greeting,
>>>>>>>>>
>>>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>>>>>>>>>
>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>
>>>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>
>>>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>>>> evicted and make room for X, etc.
>>>>>>
>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>>>> That fbdev mapping is established on each screen update, more or less.
>>>>>>  From my (yet unverified) understanding, this causes the performance
>>>>>> regression in the VM code.
>>>>>>
>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>>>> not being display). [3]
>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>> cache this.
>>>>>
>>>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>>>> workaround and non-VRAM drivers might also run into such a performance
>>>>>> regression if they use the fbdev's shadow fb.
>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>
>>>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>>>> console. They would as well run into similar problems.
>>>>>>
>>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>>>> mgag200 that handles this issue properly.
>>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>> should sched a light what's going wrong here.
>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>>>> using generic fbdev emulation would be preferable.
>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>> regression, but vm testcases shouldn run a single line of fbcon or drm
>>>>> code. So why this is impacted so heavily by a silly drm change is very
>>>>> confusing to me. We might be papering over a deeper and much more
>>>>> serious issue ...
>>>> It's a regression, the right thing is to revert first and then work
>>>> out the right thing to do.
>>> Sure, but I have no idea whether the testcase is doing something
>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>> there's no one else doing something this pointless, then it's not a
>>> real bug. Plus I think we're shooting the messenger here.
>>>
>>>> It's likely the test runs on the console and printfs stuff out while running.
>>> But why did we not regress the world if a few prints on the console
>>> have such a huge impact? We didn't get an entire stream of mails about
>>> breaking stuff ...
>> The regression seems not related to the commit.  But we have retested
>> and confirmed the regression.  Hard to understand what happens.
> Does the regressed test cause any output on console while it's
> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
> addition to the workload it's trying to measure.
>

Sorry, I'm not familiar with DRM. We enabled the console to output logs;
please find the log file attached.

"Command line: ... console=tty0 earlyprintk=ttyS0,115200 
console=ttyS0,115200 vga=normal rw"

Best Regards,
Rong Chen


[-- Attachment #2: kmsg.xz --]
[-- Type: application/x-xz, Size: 82252 bytes --]

[-- Attachment #3: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01  6:19                 ` Rong Chen
@ 2019-08-01  8:37                   ` Feng Tang
  2019-08-01  9:59                     ` Thomas Zimmermann
  2019-08-01  9:57                   ` Thomas Zimmermann
  2019-08-01 13:30                   ` Michel Dänzer
  2 siblings, 1 reply; 61+ messages in thread
From: Feng Tang @ 2019-08-01  8:37 UTC (permalink / raw)
  To: Rong Chen
  Cc: Stephen Rothwell, Michel Dänzer, dri-devel,
	Thomas Zimmermann, Huang, Ying, LKP

On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
> >>>>>>>>>
> >>>>>>>>>commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> >>>>>>>>>https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
> >>>>>>>>Daniel, Noralf, we may have to revert this patch.
> >>>>>>>>
> >>>>>>>>I expected some change in display performance, but not in VM. Since it's
> >>>>>>>>a server chipset, probably no one cares much about display performance.
> >>>>>>>>So that seemed like a good trade-off for re-using shared code.
> >>>>>>>>
> >>>>>>>>Part of the patch set is that the generic fb emulation now maps and
> >>>>>>>>unmaps the fbdev BO when updating the screen. I guess that's the cause
> >>>>>>>>of the performance regression. And it should be visible with other
> >>>>>>>>drivers as well if they use a shadow FB for fbdev emulation.
> >>>>>>>For fbcon we should need to do any maps/unamps at all, this is for the
> >>>>>>>fbdev mmap support only. If the testcase mentioned here tests fbdev
> >>>>>>>mmap handling it's pretty badly misnamed :-) And as long as you don't
> >>>>>>>have an fbdev mmap there shouldn't be any impact at all.
> >>>>>>The ast and mgag200 have only a few MiB of VRAM, so we have to get the
> >>>>>>fbdev BO out if it's not being displayed. If not being mapped, it can be
> >>>>>>evicted and make room for X, etc.
> >>>>>>
> >>>>>>To make this work, the BO's memory is mapped and unmapped in
> >>>>>>drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
> >>>>>>That fbdev mapping is established on each screen update, more or less.
> >>>>>> From my (yet unverified) understanding, this causes the performance
> >>>>>>regression in the VM code.
> >>>>>>
> >>>>>>The original code in mgag200 used to kmap the fbdev BO while it's being
> >>>>>>displayed; [2] and the drawing code only mapped it when necessary (i.e.,
> >>>>>>not being display). [3]
> >>>>>Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
> >>>>>cache this.
> >>>>>
> >>>>>>I think this could be added for VRAM helpers as well, but it's still a
> >>>>>>workaround and non-VRAM drivers might also run into such a performance
> >>>>>>regression if they use the fbdev's shadow fb.
> >>>>>Yeah agreed, fbdev emulation should try to cache the vmap.
> >>>>>
> >>>>>>Noralf mentioned that there are plans for other DRM clients besides the
> >>>>>>console. They would as well run into similar problems.
> >>>>>>
> >>>>>>>>The thing is that we'd need another generic fbdev emulation for ast and
> >>>>>>>>mgag200 that handles this issue properly.
> >>>>>>>Yeah I dont think we want to jump the gun here.  If you can try to
> >>>>>>>repro locally and profile where we're wasting cpu time I hope that
> >>>>>>>should sched a light what's going wrong here.
> >>>>>>I don't have much time ATM and I'm not even officially at work until
> >>>>>>late Aug. I'd send you the revert and investigate later. I agree that
> >>>>>>using generic fbdev emulation would be preferable.
> >>>>>Still not sure that's the right thing to do really. Yes it's a
> >>>>>regression, but vm testcases shouldn run a single line of fbcon or drm
> >>>>>code. So why this is impacted so heavily by a silly drm change is very
> >>>>>confusing to me. We might be papering over a deeper and much more
> >>>>>serious issue ...
> >>>>It's a regression, the right thing is to revert first and then work
> >>>>out the right thing to do.
> >>>Sure, but I have no idea whether the testcase is doing something
> >>>reasonable. If it's accidentally testing vm scalability of fbdev and
> >>>there's no one else doing something this pointless, then it's not a
> >>>real bug. Plus I think we're shooting the messenger here.
> >>>
> >>>>It's likely the test runs on the console and printfs stuff out while running.
> >>>But why did we not regress the world if a few prints on the console
> >>>have such a huge impact? We didn't get an entire stream of mails about
> >>>breaking stuff ...
> >>The regression seems not related to the commit.  But we have retested
> >>and confirmed the regression.  Hard to understand what happens.
> >Does the regressed test cause any output on console while it's
> >measuring? If so, it's probably accidentally measuring fbcon/DRM code in
> >addition to the workload it's trying to measure.
> >
> 
> Sorry, I'm not familiar with DRM, we enabled the console to output logs, and
> attached please find the log file.
> 
> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
> console=ttyS0,115200 vga=normal rw"

We did more checking and found that this test machine does use the
mgag200 driver.

We suspect the regression is caused by

commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
Author: Thomas Zimmermann <tzimmermann@suse.de>
Date:   Wed Jul 3 09:58:24 2019 +0200

    drm/fb-helper: Map DRM client buffer only when required
    
    This patch changes DRM clients to not map the buffer by default. The
    buffer, like any buffer object, should be mapped and unmapped when
    needed.
    
    An unmapped buffer object can be evicted to system memory and does
    not consume video ram until displayed. This allows to use generic fbdev
    emulation with drivers for low-memory devices, such as ast and mgag200.
    
    This change affects the generic framebuffer console. HW-based consoles
    map their console buffer once and keep it mapped. Userspace can mmap this
    buffer into its address space. The shadow-buffered framebuffer console
    only needs the buffer object to be mapped during updates. While not being
    updated from the shadow buffer, the buffer object can remain unmapped.
    Userspace will always mmap the shadow buffer.
 
which may add more load when fbcon is busy printing out messages.
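
If we read drm_fb_helper_dirty_work() correctly, every flush of the shadow
framebuffer now pays for a full map and unmap of the real buffer object.
A paraphrased sketch of that pattern (based on our reading of
drm_fb_helper.c, not the exact kernel code; helper names may differ between
versions):

    #include <linux/err.h>
    #include <drm/drm_client.h>
    #include <drm/drm_fb_helper.h>

    /* Sketch of the per-update work for a shadow-buffered fbdev. */
    static void dirty_work_sketch(struct drm_fb_helper *helper,
                                  struct drm_clip_rect *clip)
    {
            void *vaddr;

            /* map the BO's memory for this flush only ... */
            vaddr = drm_client_buffer_vmap(helper->buffer);
            if (IS_ERR(vaddr))
                    return;

            /* ... copy the dirty scanlines from the shadow FB into vaddr ... */

            /* ... and drop the mapping so the BO can be evicted again. */
            drm_client_buffer_vunmap(helper->buffer);
    }

With fbcon printing the test's output, this map/blit/unmap cycle runs for
every batch of console updates,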

We are doing more tests inside 0day to confirm.

Thanks,
Feng
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01  6:19                 ` Rong Chen
  2019-08-01  8:37                   ` Feng Tang
@ 2019-08-01  9:57                   ` Thomas Zimmermann
  2019-08-01 13:30                   ` Michel Dänzer
  2 siblings, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-01  9:57 UTC (permalink / raw)
  To: Rong Chen, Michel Dänzer, Huang, Ying, Daniel Vetter
  Cc: Stephen Rothwell, LKP, dri-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 6373 bytes --]

Hi

Am 01.08.19 um 08:19 schrieb Rong Chen:
> Hi,
> 
> On 7/31/19 6:21 PM, Michel Dänzer wrote:
>> On 2019-07-31 11:25 a.m., Huang, Ying wrote:
>>> Hi, Daniel,
>>>
>>> Daniel Vetter <daniel@ffwll.ch> writes:
>>>
>>>> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>>>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann
>>>>>> <tzimmermann@suse.de> wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann
>>>>>>>> <tzimmermann@suse.de> wrote:
>>>>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>>>>> Greeting,
>>>>>>>>>>
>>>>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median
>>>>>>>>>> due to commit:>
>>>>>>>>>>
>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4
>>>>>>>>>> ("drm/mgag200: Replace struct mga_fbdev with generic
>>>>>>>>>> framebuffer emulation")
>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
>>>>>>>>>> master
>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>
>>>>>>>>> I expected some change in display performance, but not in VM.
>>>>>>>>> Since it's
>>>>>>>>> a server chipset, probably no one cares much about display
>>>>>>>>> performance.
>>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>>
>>>>>>>>> Part of the patch set is that the generic fb emulation now maps
>>>>>>>>> and
>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's
>>>>>>>>> the cause
>>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>> For fbcon we should need to do any maps/unamps at all, this is
>>>>>>>> for the
>>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you
>>>>>>>> don't
>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to
>>>>>>> get the
>>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it
>>>>>>> can be
>>>>>>> evicted and make room for X, etc.
>>>>>>>
>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow
>>>>>>> FB. [1]
>>>>>>> That fbdev mapping is established on each screen update, more or
>>>>>>> less.
>>>>>>>  From my (yet unverified) understanding, this causes the performance
>>>>>>> regression in the VM code.
>>>>>>>
>>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's
>>>>>>> being
>>>>>>> displayed; [2] and the drawing code only mapped it when necessary
>>>>>>> (i.e.,
>>>>>>> not being display). [3]
>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>>> cache this.
>>>>>>
>>>>>>> I think this could be added for VRAM helpers as well, but it's
>>>>>>> still a
>>>>>>> workaround and non-VRAM drivers might also run into such a
>>>>>>> performance
>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>
>>>>>>> Noralf mentioned that there are plans for other DRM clients
>>>>>>> besides the
>>>>>>> console. They would as well run into similar problems.
>>>>>>>
>>>>>>>>> The thing is that we'd need another generic fbdev emulation for
>>>>>>>>> ast and
>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>>> should sched a light what's going wrong here.
>>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>>> late Aug. I'd send you the revert and investigate later. I agree
>>>>>>> that
>>>>>>> using generic fbdev emulation would be preferable.
>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>> regression, but vm testcases shouldn run a single line of fbcon or
>>>>>> drm
>>>>>> code. So why this is impacted so heavily by a silly drm change is
>>>>>> very
>>>>>> confusing to me. We might be papering over a deeper and much more
>>>>>> serious issue ...
>>>>> It's a regression, the right thing is to revert first and then work
>>>>> out the right thing to do.
>>>> Sure, but I have no idea whether the testcase is doing something
>>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>>> there's no one else doing something this pointless, then it's not a
>>>> real bug. Plus I think we're shooting the messenger here.
>>>>
>>>>> It's likely the test runs on the console and printfs stuff out
>>>>> while running.
>>>> But why did we not regress the world if a few prints on the console
>>>> have such a huge impact? We didn't get an entire stream of mails about
>>>> breaking stuff ...
>>> The regression seems not related to the commit.  But we have retested
>>> and confirmed the regression.  Hard to understand what happens.
>> Does the regressed test cause any output on console while it's
>> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
>> addition to the workload it's trying to measure.
>>
> 
> Sorry, I'm not familiar with DRM, we enabled the console to output logs,
> and attached please find the log file.

I have a patch set that should fix this problem, but I cannot reproduce
the issue locally because my machine is not suited to scalability testing.

If I send you the patches, could you run them on the machine to test
whether they solve the problem?

Best regards
Thomas

> 
> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
> console=ttyS0,115200 vga=normal rw"
> 
> Best Regards,
> Rong Chen
> 
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01  8:37                   ` Feng Tang
@ 2019-08-01  9:59                     ` Thomas Zimmermann
  2019-08-01 11:25                       ` Feng Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-01  9:59 UTC (permalink / raw)
  To: Feng Tang, Rong Chen
  Cc: Stephen Rothwell, Michel Dänzer, LKP, dri-devel, Huang, Ying


[-- Attachment #1.1.1: Type: text/plain, Size: 6796 bytes --]

Hi

Am 01.08.19 um 10:37 schrieb Feng Tang:
> On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>>>>>>>
>>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>>
>>>>>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>>>
>>>>>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
>>>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>>>>>> evicted and make room for X, etc.
>>>>>>>>
>>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>>>>>> That fbdev mapping is established on each screen update, more or less.
>>>>>>>> From my (yet unverified) understanding, this causes the performance
>>>>>>>> regression in the VM code.
>>>>>>>>
>>>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>>>>>> not being display). [3]
>>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>>>> cache this.
>>>>>>>
>>>>>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>>>>>> workaround and non-VRAM drivers might also run into such a performance
>>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>>
>>>>>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>>>>>> console. They would as well run into similar problems.
>>>>>>>>
>>>>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>>>> should sched a light what's going wrong here.
>>>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>>>>>> using generic fbdev emulation would be preferable.
>>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>>> regression, but vm testcases shouldn run a single line of fbcon or drm
>>>>>>> code. So why this is impacted so heavily by a silly drm change is very
>>>>>>> confusing to me. We might be papering over a deeper and much more
>>>>>>> serious issue ...
>>>>>> It's a regression, the right thing is to revert first and then work
>>>>>> out the right thing to do.
>>>>> Sure, but I have no idea whether the testcase is doing something
>>>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>>>> there's no one else doing something this pointless, then it's not a
>>>>> real bug. Plus I think we're shooting the messenger here.
>>>>>
>>>>>> It's likely the test runs on the console and printfs stuff out while running.
>>>>> But why did we not regress the world if a few prints on the console
>>>>> have such a huge impact? We didn't get an entire stream of mails about
>>>>> breaking stuff ...
>>>> The regression seems not related to the commit.  But we have retested
>>>> and confirmed the regression.  Hard to understand what happens.
>>> Does the regressed test cause any output on console while it's
>>> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
>>> addition to the workload it's trying to measure.
>>>
>>
>> Sorry, I'm not familiar with DRM, we enabled the console to output logs, and
>> attached please find the log file.
>>
>> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
>> console=ttyS0,115200 vga=normal rw"
> 
> We did more check, and found this test machine does use the
> mgag200 driver. 
> 
> And we are suspecting the regression is caused by 
> 
> commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
> Author: Thomas Zimmermann <tzimmermann@suse.de>
> Date:   Wed Jul 3 09:58:24 2019 +0200

Yes, that's the commit. Unfortunately, reverting it would require
reverting a handful of other patches as well.

I have a potential fix for the problem. Could you run it and verify that
it resolves the issue?

Best regards
Thomas

> 
>     drm/fb-helper: Map DRM client buffer only when required
>     
>     This patch changes DRM clients to not map the buffer by default. The
>     buffer, like any buffer object, should be mapped and unmapped when
>     needed.
>     
>     An unmapped buffer object can be evicted to system memory and does
>     not consume video ram until displayed. This allows to use generic fbdev
>     emulation with drivers for low-memory devices, such as ast and mgag200.
>     
>     This change affects the generic framebuffer console. HW-based consoles
>     map their console buffer once and keep it mapped. Userspace can mmap this
>     buffer into its address space. The shadow-buffered framebuffer console
>     only needs the buffer object to be mapped during updates. While not being
>     updated from the shadow buffer, the buffer object can remain unmapped.
>     Userspace will always mmap the shadow buffer.
>  
> which may add more load when fbcon is busy printing out messages.
> 
> We are doing more test inside 0day to confirm.
> 
> Thanks,
> Feng
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01  9:59                     ` Thomas Zimmermann
@ 2019-08-01 11:25                       ` Feng Tang
  2019-08-01 11:58                         ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Feng Tang @ 2019-08-01 11:25 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Stephen Rothwell, Rong Chen, Michel Dänzer, dri-devel,
	Huang, Ying, LKP

Hi Thomas,

On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
> Hi
> 
> Am 01.08.19 um 10:37 schrieb Feng Tang:
> > On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> >>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
> >>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
> >>>>>>>>>>
> >>>>>>>>>> I expected some change in display performance, but not in VM. Since it's
> >>>>>>>>>> a server chipset, probably no one cares much about display performance.
> >>>>>>>>>> So that seemed like a good trade-off for re-using shared code.
> >>>>>>>>>>
> >>>>>>>>>> Part of the patch set is that the generic fb emulation now maps and
> >>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
> >>>>>>>>>> of the performance regression. And it should be visible with other
> >>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
> >>>>>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
> >>>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
> >>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
> >>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
> >>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
> >>>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
> >>>>>>>> evicted and make room for X, etc.
> >>>>>>>>
> >>>>>>>> To make this work, the BO's memory is mapped and unmapped in
> >>>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
> >>>>>>>> That fbdev mapping is established on each screen update, more or less.
> >>>>>>>> From my (yet unverified) understanding, this causes the performance
> >>>>>>>> regression in the VM code.
> >>>>>>>>
> >>>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
> >>>>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
> >>>>>>>> not being display). [3]
> >>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
> >>>>>>> cache this.
> >>>>>>>
> >>>>>>>> I think this could be added for VRAM helpers as well, but it's still a
> >>>>>>>> workaround and non-VRAM drivers might also run into such a performance
> >>>>>>>> regression if they use the fbdev's shadow fb.
> >>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
> >>>>>>>
> >>>>>>>> Noralf mentioned that there are plans for other DRM clients besides the
> >>>>>>>> console. They would as well run into similar problems.
> >>>>>>>>
> >>>>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
> >>>>>>>>>> mgag200 that handles this issue properly.
> >>>>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
> >>>>>>>>> repro locally and profile where we're wasting cpu time I hope that
> >>>>>>>>> should sched a light what's going wrong here.
> >>>>>>>> I don't have much time ATM and I'm not even officially at work until
> >>>>>>>> late Aug. I'd send you the revert and investigate later. I agree that
> >>>>>>>> using generic fbdev emulation would be preferable.
> >>>>>>> Still not sure that's the right thing to do really. Yes it's a
> >>>>>>> regression, but vm testcases shouldn run a single line of fbcon or drm
> >>>>>>> code. So why this is impacted so heavily by a silly drm change is very
> >>>>>>> confusing to me. We might be papering over a deeper and much more
> >>>>>>> serious issue ...
> >>>>>> It's a regression, the right thing is to revert first and then work
> >>>>>> out the right thing to do.
> >>>>> Sure, but I have no idea whether the testcase is doing something
> >>>>> reasonable. If it's accidentally testing vm scalability of fbdev and
> >>>>> there's no one else doing something this pointless, then it's not a
> >>>>> real bug. Plus I think we're shooting the messenger here.
> >>>>>
> >>>>>> It's likely the test runs on the console and printfs stuff out while running.
> >>>>> But why did we not regress the world if a few prints on the console
> >>>>> have such a huge impact? We didn't get an entire stream of mails about
> >>>>> breaking stuff ...
> >>>> The regression seems not related to the commit.  But we have retested
> >>>> and confirmed the regression.  Hard to understand what happens.
> >>> Does the regressed test cause any output on console while it's
> >>> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
> >>> addition to the workload it's trying to measure.
> >>>
> >>
> >> Sorry, I'm not familiar with DRM, we enabled the console to output logs, and
> >> attached please find the log file.
> >>
> >> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
> >> console=ttyS0,115200 vga=normal rw"
> > 
> > We did more check, and found this test machine does use the
> > mgag200 driver. 
> > 
> > And we are suspecting the regression is caused by 
> > 
> > commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
> > Author: Thomas Zimmermann <tzimmermann@suse.de>
> > Date:   Wed Jul 3 09:58:24 2019 +0200
> 
> Yes, that's the commit. Unfortunately reverting it would require
> reverting a hand full of other patches as well.
> 
> I have a potential fix for the problem. Could you run and verify that it
> resolves the problem?

Sure, please send it to us. Rong and I will try it.

Thanks,
Feng


> Best regards
> Thomas
> 
> > 
> >     drm/fb-helper: Map DRM client buffer only when required
> >     
> >     This patch changes DRM clients to not map the buffer by default. The
> >     buffer, like any buffer object, should be mapped and unmapped when
> >     needed.
> >     
> >     An unmapped buffer object can be evicted to system memory and does
> >     not consume video ram until displayed. This allows to use generic fbdev
> >     emulation with drivers for low-memory devices, such as ast and mgag200.
> >     
> >     This change affects the generic framebuffer console. HW-based consoles
> >     map their console buffer once and keep it mapped. Userspace can mmap this
> >     buffer into its address space. The shadow-buffered framebuffer console
> >     only needs the buffer object to be mapped during updates. While not being
> >     updated from the shadow buffer, the buffer object can remain unmapped.
> >     Userspace will always mmap the shadow buffer.
> >  
> > which may add more load when fbcon is busy printing out messages.
> > 
> > We are doing more test inside 0day to confirm.
> > 
> > Thanks,
> > Feng
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > 
> 
> -- 
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
> 



_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01 11:25                       ` Feng Tang
@ 2019-08-01 11:58                         ` Thomas Zimmermann
  2019-08-02  7:11                           ` Rong Chen
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-01 11:58 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, Rong Chen, Michel Dänzer, dri-devel,
	Huang, Ying, LKP


[-- Attachment #1.1.1: Type: text/plain, Size: 7850 bytes --]

Hi

Am 01.08.19 um 13:25 schrieb Feng Tang:
> Hi Thomas,
> 
> On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 01.08.19 um 10:37 schrieb Feng Tang:
>>> On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>>>>
>>>>>>>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>>>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>>>>>
>>>>>>>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
>>>>>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>>>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>>>>>>>> evicted and make room for X, etc.
>>>>>>>>>>
>>>>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>>>>>>>> That fbdev mapping is established on each screen update, more or less.
>>>>>>>>>> From my (yet unverified) understanding, this causes the performance
>>>>>>>>>> regression in the VM code.
>>>>>>>>>>
>>>>>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>>>>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>>>>>>>> not being display). [3]
>>>>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>>>>>> cache this.
>>>>>>>>>
>>>>>>>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>>>>>>>> workaround and non-VRAM drivers might also run into such a performance
>>>>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>>>>
>>>>>>>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>>>>>>>> console. They would as well run into similar problems.
>>>>>>>>>>
>>>>>>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>>>>>> should sched a light what's going wrong here.
>>>>>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>>>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>>>>>>>> using generic fbdev emulation would be preferable.
>>>>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>>>>> regression, but vm testcases shouldn run a single line of fbcon or drm
>>>>>>>>> code. So why this is impacted so heavily by a silly drm change is very
>>>>>>>>> confusing to me. We might be papering over a deeper and much more
>>>>>>>>> serious issue ...
>>>>>>>> It's a regression, the right thing is to revert first and then work
>>>>>>>> out the right thing to do.
>>>>>>> Sure, but I have no idea whether the testcase is doing something
>>>>>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>>>>>> there's no one else doing something this pointless, then it's not a
>>>>>>> real bug. Plus I think we're shooting the messenger here.
>>>>>>>
>>>>>>>> It's likely the test runs on the console and printfs stuff out while running.
>>>>>>> But why did we not regress the world if a few prints on the console
>>>>>>> have such a huge impact? We didn't get an entire stream of mails about
>>>>>>> breaking stuff ...
>>>>>> The regression seems not related to the commit.  But we have retested
>>>>>> and confirmed the regression.  Hard to understand what happens.
>>>>> Does the regressed test cause any output on console while it's
>>>>> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
>>>>> addition to the workload it's trying to measure.
>>>>>
>>>>
>>>> Sorry, I'm not familiar with DRM, we enabled the console to output logs, and
>>>> attached please find the log file.
>>>>
>>>> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
>>>> console=ttyS0,115200 vga=normal rw"
>>>
>>> We did more check, and found this test machine does use the
>>> mgag200 driver. 
>>>
>>> And we are suspecting the regression is caused by 
>>>
>>> commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
>>> Author: Thomas Zimmermann <tzimmermann@suse.de>
>>> Date:   Wed Jul 3 09:58:24 2019 +0200
>>
>> Yes, that's the commit. Unfortunately reverting it would require
>> reverting a hand full of other patches as well.
>>
>> I have a potential fix for the problem. Could you run and verify that it
>> resolves the problem?
> 
> Sure, please send it to us. Rong and I will try it.

Fantastic, thank you! The patch set is available on dri-devel at

  https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html
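
The idea is to let the VRAM helpers ref-count the kmap, so the fbdev BO
stays mapped while it is being displayed and the per-update vmap/vunmap in
the fbdev worker degenerates into a counter update. Very roughly (an
illustrative sketch only; the real patches touch the GEM VRAM helpers and
use different names, locking and error handling):

    /* Illustrative sketch of ref-counted mapping, not the actual patch. */
    struct sketch_bo {
            void *vaddr;                 /* cached kernel mapping */
            unsigned int map_use_count;  /* users of the mapping */
    };

    /*
     * hw_map_bo()/hw_unmap_bo() are hypothetical placeholders for the
     * expensive, driver-specific map and unmap operations.
     */
    void *hw_map_bo(struct sketch_bo *bo);
    void hw_unmap_bo(struct sketch_bo *bo);

    static void *sketch_bo_map(struct sketch_bo *bo)
    {
            if (bo->map_use_count++ == 0)
                    bo->vaddr = hw_map_bo(bo);   /* expensive, only on 0 -> 1 */
            return bo->vaddr;
    }

    static void sketch_bo_unmap(struct sketch_bo *bo)
    {
            if (--bo->map_use_count == 0)
                    hw_unmap_bo(bo);             /* expensive, only on 1 -> 0 */
    }

While the buffer is being scanned out, the driver holds one long-lived
reference, so the console's frequent map/unmap calls no longer touch the
hardware mapping at all.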

Best regards
Thomas

> 
> Thanks,
> Feng
> 
> 
>> Best regards
>> Thomas
>>
>>>
>>>     drm/fb-helper: Map DRM client buffer only when required
>>>     
>>>     This patch changes DRM clients to not map the buffer by default. The
>>>     buffer, like any buffer object, should be mapped and unmapped when
>>>     needed.
>>>     
>>>     An unmapped buffer object can be evicted to system memory and does
>>>     not consume video ram until displayed. This allows to use generic fbdev
>>>     emulation with drivers for low-memory devices, such as ast and mgag200.
>>>     
>>>     This change affects the generic framebuffer console. HW-based consoles
>>>     map their console buffer once and keep it mapped. Userspace can mmap this
>>>     buffer into its address space. The shadow-buffered framebuffer console
>>>     only needs the buffer object to be mapped during updates. While not being
>>>     updated from the shadow buffer, the buffer object can remain unmapped.
>>>     Userspace will always mmap the shadow buffer.
>>>  
>>> which may add more load when fbcon is busy printing out messages.
>>>
>>> We are doing more test inside 0day to confirm.
>>>
>>> Thanks,
>>> Feng
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>>
>> -- 
>> Thomas Zimmermann
>> Graphics Driver Developer
>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>> HRB 21284 (AG Nürnberg)
>>
> 
> 
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01  6:19                 ` Rong Chen
  2019-08-01  8:37                   ` Feng Tang
  2019-08-01  9:57                   ` Thomas Zimmermann
@ 2019-08-01 13:30                   ` Michel Dänzer
  2019-08-02  8:17                     ` Thomas Zimmermann
  2 siblings, 1 reply; 61+ messages in thread
From: Michel Dänzer @ 2019-08-01 13:30 UTC (permalink / raw)
  To: Rong Chen, Huang, Ying, Daniel Vetter
  Cc: Stephen Rothwell, LKP, dri-devel, Thomas Zimmermann

On 2019-08-01 8:19 a.m., Rong Chen wrote:
> Hi,
> 
> On 7/31/19 6:21 PM, Michel Dänzer wrote:
>> On 2019-07-31 11:25 a.m., Huang, Ying wrote:
>>> Hi, Daniel,
>>>
>>> Daniel Vetter <daniel@ffwll.ch> writes:
>>>
>>>> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>>>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann
>>>>>> <tzimmermann@suse.de> wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann
>>>>>>>> <tzimmermann@suse.de> wrote:
>>>>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>>>>> Greeting,
>>>>>>>>>>
>>>>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median
>>>>>>>>>> due to commit:>
>>>>>>>>>>
>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4
>>>>>>>>>> ("drm/mgag200: Replace struct mga_fbdev with generic
>>>>>>>>>> framebuffer emulation")
>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
>>>>>>>>>> master
>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>
>>>>>>>>> I expected some change in display performance, but not in VM.
>>>>>>>>> Since it's
>>>>>>>>> a server chipset, probably no one cares much about display
>>>>>>>>> performance.
>>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>>
>>>>>>>>> Part of the patch set is that the generic fb emulation now maps
>>>>>>>>> and
>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's
>>>>>>>>> the cause
>>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>> For fbcon we should need to do any maps/unamps at all, this is
>>>>>>>> for the
>>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you
>>>>>>>> don't
>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to
>>>>>>> get the
>>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it
>>>>>>> can be
>>>>>>> evicted and make room for X, etc.
>>>>>>>
>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow
>>>>>>> FB. [1]
>>>>>>> That fbdev mapping is established on each screen update, more or
>>>>>>> less.
>>>>>>>  From my (yet unverified) understanding, this causes the performance
>>>>>>> regression in the VM code.
>>>>>>>
>>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's
>>>>>>> being
>>>>>>> displayed; [2] and the drawing code only mapped it when necessary
>>>>>>> (i.e.,
>>>>>>> not being display). [3]
>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>>> cache this.
>>>>>>
>>>>>>> I think this could be added for VRAM helpers as well, but it's
>>>>>>> still a
>>>>>>> workaround and non-VRAM drivers might also run into such a
>>>>>>> performance
>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>
>>>>>>> Noralf mentioned that there are plans for other DRM clients
>>>>>>> besides the
>>>>>>> console. They would as well run into similar problems.
>>>>>>>
>>>>>>>>> The thing is that we'd need another generic fbdev emulation for
>>>>>>>>> ast and
>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>>> should sched a light what's going wrong here.
>>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>>> late Aug. I'd send you the revert and investigate later. I agree
>>>>>>> that
>>>>>>> using generic fbdev emulation would be preferable.
>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>> regression, but vm testcases shouldn run a single line of fbcon or
>>>>>> drm
>>>>>> code. So why this is impacted so heavily by a silly drm change is
>>>>>> very
>>>>>> confusing to me. We might be papering over a deeper and much more
>>>>>> serious issue ...
>>>>> It's a regression, the right thing is to revert first and then work
>>>>> out the right thing to do.
>>>> Sure, but I have no idea whether the testcase is doing something
>>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>>> there's no one else doing something this pointless, then it's not a
>>>> real bug. Plus I think we're shooting the messenger here.
>>>>
>>>>> It's likely the test runs on the console and printfs stuff out
>>>>> while running.
>>>> But why did we not regress the world if a few prints on the console
>>>> have such a huge impact? We didn't get an entire stream of mails about
>>>> breaking stuff ...
>>> The regression seems not related to the commit.  But we have retested
>>> and confirmed the regression.  Hard to understand what happens.
>> Does the regressed test cause any output on console while it's
>> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
>> addition to the workload it's trying to measure.
>>
> 
> Sorry, I'm not familiar with DRM, we enabled the console to output logs,
> and attached please find the log file.
> 
> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
> console=ttyS0,115200 vga=normal rw"

I assume the

user  :notice: [  xxx.xxxx] xxxxxxxxx bytes / xxxxxxx usecs = xxxxx KB/s

lines are generated by the test?

If so, unless the test is intended to measure console performance, it
should be fixed not to generate output to console (while it's measuring).
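
For example (an illustrative sketch only, not the actual vm-scalability
code), the throughput lines could be buffered during the timed region and
written out only after timing has stopped:

    /* Illustrative sketch, not the actual vm-scalability code. */
    #include <stdio.h>

    #define MAX_ROUNDS 1024

    static char results[MAX_ROUNDS][80];
    static int nr_results;

    static void record_result(long bytes, long usecs)
    {
            /* no console output here, so fbcon stays idle while measuring */
            long long kbps = usecs ? (long long)bytes / 1024 * 1000000 / usecs : 0;

            if (nr_results >= MAX_ROUNDS)
                    return;
            snprintf(results[nr_results++], sizeof(results[0]),
                     "%ld bytes / %ld usecs = %lld KB/s", bytes, usecs, kbps);
    }

    static void flush_results(void)
    {
            int i;

            /* print everything once the measurement has finished */
            for (i = 0; i < nr_results; i++)
                    puts(results[i]);
    }

That keeps the timed region free of fbcon/DRM work regardless of which
driver backs the console.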


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01 11:58                         ` Thomas Zimmermann
@ 2019-08-02  7:11                           ` Rong Chen
  2019-08-02  8:23                             ` Thomas Zimmermann
  2019-08-02  9:20                             ` Thomas Zimmermann
  0 siblings, 2 replies; 61+ messages in thread
From: Rong Chen @ 2019-08-02  7:11 UTC (permalink / raw)
  To: Thomas Zimmermann, Feng Tang
  Cc: Stephen Rothwell, Michel Dänzer, LKP, dri-devel, Huang, Ying

[-- Attachment #1: Type: text/plain, Size: 9173 bytes --]

Hi,

On 8/1/19 7:58 PM, Thomas Zimmermann wrote:
> Hi
>
> Am 01.08.19 um 13:25 schrieb Feng Tang:
>> Hi Thomas,
>>
>> On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> Am 01.08.19 um 10:37 schrieb Feng Tang:
>>>> On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>>>>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>>>>>> For fbcon we should need to do any maps/unamps at all, this is for the
>>>>>>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>>>>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>>>>>>>>> evicted and make room for X, etc.
>>>>>>>>>>>
>>>>>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>>>>>>>>> That fbdev mapping is established on each screen update, more or less.
>>>>>>>>>>>  From my (yet unverified) understanding, this causes the performance
>>>>>>>>>>> regression in the VM code.
>>>>>>>>>>>
>>>>>>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>>>>>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>>>>>>>>> not being display). [3]
>>>>>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>>>>>>> cache this.
>>>>>>>>>>
>>>>>>>>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>>>>>>>>> workaround and non-VRAM drivers might also run into such a performance
>>>>>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>>>>>
>>>>>>>>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>>>>>>>>> console. They would as well run into similar problems.
>>>>>>>>>>>
>>>>>>>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>>>>>> Yeah I dont think we want to jump the gun here.  If you can try to
>>>>>>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>>>>>>> should sched a light what's going wrong here.
>>>>>>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>>>>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>>>>>>>>> using generic fbdev emulation would be preferable.
>>>>>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>>>>>> regression, but vm testcases shouldn run a single line of fbcon or drm
>>>>>>>>>> code. So why this is impacted so heavily by a silly drm change is very
>>>>>>>>>> confusing to me. We might be papering over a deeper and much more
>>>>>>>>>> serious issue ...
>>>>>>>>> It's a regression, the right thing is to revert first and then work
>>>>>>>>> out the right thing to do.
>>>>>>>> Sure, but I have no idea whether the testcase is doing something
>>>>>>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>>>>>>> there's no one else doing something this pointless, then it's not a
>>>>>>>> real bug. Plus I think we're shooting the messenger here.
>>>>>>>>
>>>>>>>>> It's likely the test runs on the console and printfs stuff out while running.
>>>>>>>> But why did we not regress the world if a few prints on the console
>>>>>>>> have such a huge impact? We didn't get an entire stream of mails about
>>>>>>>> breaking stuff ...
>>>>>>> The regression seems not related to the commit.  But we have retested
>>>>>>> and confirmed the regression.  Hard to understand what happens.
>>>>>> Does the regressed test cause any output on console while it's
>>>>>> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
>>>>>> addition to the workload it's trying to measure.
>>>>>>
>>>>> Sorry, I'm not familiar with DRM, we enabled the console to output logs, and
>>>>> attached please find the log file.
>>>>>
>>>>> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
>>>>> console=ttyS0,115200 vga=normal rw"
>>>> We did more check, and found this test machine does use the
>>>> mgag200 driver.
>>>>
>>>> And we are suspecting the regression is caused by
>>>>
>>>> commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
>>>> Author: Thomas Zimmermann <tzimmermann@suse.de>
>>>> Date:   Wed Jul 3 09:58:24 2019 +0200
>>> Yes, that's the commit. Unfortunately reverting it would require
>>> reverting a hand full of other patches as well.
>>>
>>> I have a potential fix for the problem. Could you run and verify that it
>>> resolves the problem?
>> Sure, please send it to us. Rong and I will try it.
> Fantastic, thank you! The patch set is available on dri-devel at
>
>    https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html

The patch set improves performance slightly, but the change is not very
significant.

$ git log --oneline 8f7ec6bcc7 -5
8f7ec6bcc75a9 drm/mgag200: Map fbdev framebuffer while it's being displayed
abcb1cf24033a drm/ast: Map fbdev framebuffer while it's being displayed
a92f80044c623 drm/vram-helpers: Add kmap ref-counting to GEM VRAM objects
90f479ae51afa drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a7 drm/bochs: Use shadow buffer for bochs framebuffer console

commit:
   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic 
framebuffer emulation")
   8f7ec6bcc7 ("drm/mgag200: Map fbdev framebuffer while it's being 
displayed")

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  8f7ec6bcc75a996f5c6b39a9cf  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
         %stddev      change         %stddev      change         %stddev
               \           |               \           |               \
           43921        -18%           35884        -17%           36629  vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
           43921        -18%           35884        -17%           36629  GEO-MEAN vm-scalability.median

Best Regards,
Rong Chen

>
> Best regards
> Thomas
>
>> Thanks,
>> Feng
>>
>>
>>> Best regards
>>> Thomas
>>>
>>>>      drm/fb-helper: Map DRM client buffer only when required
>>>>      
>>>>      This patch changes DRM clients to not map the buffer by default. The
>>>>      buffer, like any buffer object, should be mapped and unmapped when
>>>>      needed.
>>>>      
>>>>      An unmapped buffer object can be evicted to system memory and does
>>>>      not consume video ram until displayed. This allows to use generic fbdev
>>>>      emulation with drivers for low-memory devices, such as ast and mgag200.
>>>>      
>>>>      This change affects the generic framebuffer console. HW-based consoles
>>>>      map their console buffer once and keep it mapped. Userspace can mmap this
>>>>      buffer into its address space. The shadow-buffered framebuffer console
>>>>      only needs the buffer object to be mapped during updates. While not being
>>>>      updated from the shadow buffer, the buffer object can remain unmapped.
>>>>      Userspace will always mmap the shadow buffer.
>>>>   
>>>> which may add more load when fbcon is busy printing out messages.
>>>>
>>>> We are doing more test inside 0day to confirm.
>>>>
>>>> Thanks,
>>>> Feng
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>
>>> -- 
>>> Thomas Zimmermann
>>> Graphics Driver Developer
>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>> HRB 21284 (AG Nürnberg)
>>>
>>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>


[-- Attachment #2: kmsg.xz --]
[-- Type: application/x-xz, Size: 82932 bytes --]

[-- Attachment #3: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-01 13:30                   ` Michel Dänzer
@ 2019-08-02  8:17                     ` Thomas Zimmermann
  0 siblings, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-02  8:17 UTC (permalink / raw)
  To: Michel Dänzer, Rong Chen, Huang, Ying, Daniel Vetter
  Cc: Stephen Rothwell, LKP, dri-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 6463 bytes --]

Hi

Am 01.08.19 um 15:30 schrieb Michel Dänzer:
> On 2019-08-01 8:19 a.m., Rong Chen wrote:
>> Hi,
>>
>> On 7/31/19 6:21 PM, Michel Dänzer wrote:
>>> On 2019-07-31 11:25 a.m., Huang, Ying wrote:
>>>> Hi, Daniel,
>>>>
>>>> Daniel Vetter <daniel@ffwll.ch> writes:
>>>>
>>>>> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>>>>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann
>>>>>>> <tzimmermann@suse.de> wrote:
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann
>>>>>>>>> <tzimmermann@suse.de> wrote:
>>>>>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>>>>>> Greeting,
>>>>>>>>>>>
>>>>>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median
>>>>>>>>>>> due to commit:>
>>>>>>>>>>>
>>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4
>>>>>>>>>>> ("drm/mgag200: Replace struct mga_fbdev with generic
>>>>>>>>>>> framebuffer emulation")
>>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
>>>>>>>>>>> master
>>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>>
>>>>>>>>>> I expected some change in display performance, but not in VM.
>>>>>>>>>> Since it's
>>>>>>>>>> a server chipset, probably no one cares much about display
>>>>>>>>>> performance.
>>>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>>>
>>>>>>>>>> Part of the patch set is that the generic fb emulation now maps
>>>>>>>>>> and
>>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's
>>>>>>>>>> the cause
>>>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>>> For fbcon we shouldn't need to do any maps/unmaps at all, this is
>>>>>>>>> for the
>>>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you
>>>>>>>>> don't
>>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to
>>>>>>>> get the
>>>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it
>>>>>>>> can be
>>>>>>>> evicted and make room for X, etc.
>>>>>>>>
>>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow
>>>>>>>> FB. [1]
>>>>>>>> That fbdev mapping is established on each screen update, more or
>>>>>>>> less.
>>>>>>>>  From my (yet unverified) understanding, this causes the performance
>>>>>>>> regression in the VM code.
>>>>>>>>
>>>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's
>>>>>>>> being
>>>>>>>> displayed; [2] and the drawing code only mapped it when necessary
>>>>>>>> (i.e.,
>>>>>>>> not being displayed). [3]
>>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>>>> cache this.
>>>>>>>
>>>>>>>> I think this could be added for VRAM helpers as well, but it's
>>>>>>>> still a
>>>>>>>> workaround and non-VRAM drivers might also run into such a
>>>>>>>> performance
>>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>>
>>>>>>>> Noralf mentioned that there are plans for other DRM clients
>>>>>>>> besides the
>>>>>>>> console. They would as well run into similar problems.
>>>>>>>>
>>>>>>>>>> The thing is that we'd need another generic fbdev emulation for
>>>>>>>>>> ast and
>>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>>> Yeah I don't think we want to jump the gun here.  If you can try to
>>>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>>>> should shed some light on what's going wrong here.
>>>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>>>> late Aug. I'd send you the revert and investigate later. I agree
>>>>>>>> that
>>>>>>>> using generic fbdev emulation would be preferable.
>>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>>> regression, but vm testcases shouldn't run a single line of fbcon or
>>>>>>> drm
>>>>>>> code. So why this is impacted so heavily by a silly drm change is
>>>>>>> very
>>>>>>> confusing to me. We might be papering over a deeper and much more
>>>>>>> serious issue ...
>>>>>> It's a regression, the right thing is to revert first and then work
>>>>>> out the right thing to do.
>>>>> Sure, but I have no idea whether the testcase is doing something
>>>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>>>> there's no one else doing something this pointless, then it's not a
>>>>> real bug. Plus I think we're shooting the messenger here.
>>>>>
>>>>>> It's likely the test runs on the console and printfs stuff out
>>>>>> while running.
>>>>> But why did we not regress the world if a few prints on the console
>>>>> have such a huge impact? We didn't get an entire stream of mails about
>>>>> breaking stuff ...
>>>> The regression seems not related to the commit.  But we have retested
>>>> and confirmed the regression.  Hard to understand what happens.
>>> Does the regressed test cause any output on console while it's
>>> measuring? If so, it's probably accidentally measuring fbcon/DRM code in
>>> addition to the workload it's trying to measure.
>>>
>>
>> Sorry, I'm not familiar with DRM, we enabled the console to output logs,
>> and attached please find the log file.
>>
>> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
>> console=ttyS0,115200 vga=normal rw"
> 
> I assume the
> 
> user  :notice: [  xxx.xxxx] xxxxxxxxx bytes / xxxxxxx usecs = xxxxx KB/s
> 
> lines are generated by the test?
> 
> If so, unless the test is intended to measure console performance, it
> should be fixed not to generate output to console (while it's measuring).

Yes, the test prints quite a lot of text to the console. It shouldn't do
that.

Best regards
Thomas

> 
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-02  7:11                           ` Rong Chen
@ 2019-08-02  8:23                             ` Thomas Zimmermann
  2019-08-02  9:20                             ` Thomas Zimmermann
  1 sibling, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-02  8:23 UTC (permalink / raw)
  To: Rong Chen, Feng Tang
  Cc: Stephen Rothwell, Michel Dänzer, LKP, dri-devel, Huang, Ying


[-- Attachment #1.1.1: Type: text/plain, Size: 10888 bytes --]

Hi

Am 02.08.19 um 09:11 schrieb Rong Chen:
> Hi,
> 
> On 8/1/19 7:58 PM, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 01.08.19 um 13:25 schrieb Feng Tang:
>>> Hi Thomas,
>>>
>>> On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> Am 01.08.19 um 10:37 schrieb Feng Tang:
>>>>> On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4
>>>>>>>>>>>>>>> ("drm/mgag200: Replace struct mga_fbdev with generic
>>>>>>>>>>>>>>> framebuffer emulation")
>>>>>>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
>>>>>>>>>>>>>>> master
>>>>>>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I expected some change in display performance, but not in
>>>>>>>>>>>>>> VM. Since it's
>>>>>>>>>>>>>> a server chipset, probably no one cares much about display
>>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>> So that seemed like a good trade-off for re-using shared
>>>>>>>>>>>>>> code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Part of the patch set is that the generic fb emulation now
>>>>>>>>>>>>>> maps and
>>>>>>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess
>>>>>>>>>>>>>> that's the cause
>>>>>>>>>>>>>> of the performance regression. And it should be visible
>>>>>>>>>>>>>> with other
>>>>>>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>>>>>>> For fbcon we shouldn't need to do any maps/unmaps at all, this
>>>>>>>>>>>>> is for the
>>>>>>>>>>>>> fbdev mmap support only. If the testcase mentioned here
>>>>>>>>>>>>> tests fbdev
>>>>>>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as
>>>>>>>>>>>>> you don't
>>>>>>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have
>>>>>>>>>>>> to get the
>>>>>>>>>>>> fbdev BO out if it's not being displayed. If not being
>>>>>>>>>>>> mapped, it can be
>>>>>>>>>>>> evicted and make room for X, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>>>>>>> drm_fb_helper_dirty_work() before being updated from the
>>>>>>>>>>>> shadow FB. [1]
>>>>>>>>>>>> That fbdev mapping is established on each screen update,
>>>>>>>>>>>> more or less.
>>>>>>>>>>>>  From my (yet unverified) understanding, this causes the
>>>>>>>>>>>> performance
>>>>>>>>>>>> regression in the VM code.
>>>>>>>>>>>>
>>>>>>>>>>>> The original code in mgag200 used to kmap the fbdev BO while
>>>>>>>>>>>> it's being
>>>>>>>>>>>> displayed; [2] and the drawing code only mapped it when
>>>>>>>>>>>> necessary (i.e.,
>>>>>>>>>>>> not being displayed). [3]
>>>>>>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We
>>>>>>>>>>> indeed should
>>>>>>>>>>> cache this.
>>>>>>>>>>>
>>>>>>>>>>>> I think this could be added for VRAM helpers as well, but
>>>>>>>>>>>> it's still a
>>>>>>>>>>>> workaround and non-VRAM drivers might also run into such a
>>>>>>>>>>>> performance
>>>>>>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>>>>>>
>>>>>>>>>>>> Noralf mentioned that there are plans for other DRM clients
>>>>>>>>>>>> besides the
>>>>>>>>>>>> console. They would as well run into similar problems.
>>>>>>>>>>>>
>>>>>>>>>>>>>> The thing is that we'd need another generic fbdev
>>>>>>>>>>>>>> emulation for ast and
>>>>>>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>>>>>>> Yeah I don't think we want to jump the gun here.  If you can
>>>>>>>>>>>>> try to
>>>>>>>>>>>>> repro locally and profile where we're wasting cpu time I
>>>>>>>>>>>>> hope that
>>>>>>>>>>>>> should shed some light on what's going wrong here.
>>>>>>>>>>>> I don't have much time ATM and I'm not even officially at
>>>>>>>>>>>> work until
>>>>>>>>>>>> late Aug. I'd send you the revert and investigate later. I
>>>>>>>>>>>> agree that
>>>>>>>>>>>> using generic fbdev emulation would be preferable.
>>>>>>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>>>>>>> regression, but vm testcases shouldn't run a single line of
>>>>>>>>>>> fbcon or drm
>>>>>>>>>>> code. So why this is impacted so heavily by a silly drm
>>>>>>>>>>> change is very
>>>>>>>>>>> confusing to me. We might be papering over a deeper and much
>>>>>>>>>>> more
>>>>>>>>>>> serious issue ...
>>>>>>>>>> It's a regression, the right thing is to revert first and then
>>>>>>>>>> work
>>>>>>>>>> out the right thing to do.
>>>>>>>>> Sure, but I have no idea whether the testcase is doing something
>>>>>>>>> reasonable. If it's accidentally testing vm scalability of
>>>>>>>>> fbdev and
>>>>>>>>> there's no one else doing something this pointless, then it's
>>>>>>>>> not a
>>>>>>>>> real bug. Plus I think we're shooting the messenger here.
>>>>>>>>>
>>>>>>>>>> It's likely the test runs on the console and printfs stuff out
>>>>>>>>>> while running.
>>>>>>>>> But why did we not regress the world if a few prints on the
>>>>>>>>> console
>>>>>>>>> have such a huge impact? We didn't get an entire stream of
>>>>>>>>> mails about
>>>>>>>>> breaking stuff ...
>>>>>>>> The regression seems not related to the commit.  But we have
>>>>>>>> retested
>>>>>>>> and confirmed the regression.  Hard to understand what happens.
>>>>>>> Does the regressed test cause any output on console while it's
>>>>>>> measuring? If so, it's probably accidentally measuring fbcon/DRM
>>>>>>> code in
>>>>>>> addition to the workload it's trying to measure.
>>>>>>>
>>>>>> Sorry, I'm not familiar with DRM, we enabled the console to output
>>>>>> logs, and
>>>>>> attached please find the log file.
>>>>>>
>>>>>> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
>>>>>> console=ttyS0,115200 vga=normal rw"
>>>>> We did more checks, and found this test machine does use the
>>>>> mgag200 driver.
>>>>>
>>>>> And we are suspecting the regression is caused by
>>>>>
>>>>> commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
>>>>> Author: Thomas Zimmermann <tzimmermann@suse.de>
>>>>> Date:   Wed Jul 3 09:58:24 2019 +0200
>>>> Yes, that's the commit. Unfortunately reverting it would require
>>>> reverting a hand full of other patches as well.
>>>>
>>>> I have a potential fix for the problem. Could you run and verify
>>>> that it
>>>> resolves the problem?
>>> Sure, please send it to us. Rong and I will try it.
>> Fantastic, thank you! The patch set is available on dri-devel at
>>
>>   
>> https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html
> 
> The patch set improves the performance slightly, but the change is not
> very obvious.
> 
> $ git log --oneline 8f7ec6bcc7 -5
> 8f7ec6bcc75a9 drm/mgag200: Map fbdev framebuffer while it's being displayed
> abcb1cf24033a drm/ast: Map fbdev framebuffer while it's being displayed
> a92f80044c623 drm/vram-helpers: Add kmap ref-counting to GEM VRAM objects
> 90f479ae51afa drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
> f1f8555dfb9a7 drm/bochs: Use shadow buffer for bochs framebuffer console
> 
> commit:
>   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
>   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation")
>   8f7ec6bcc7 ("drm/mgag200: Map fbdev framebuffer while it's being
> displayed")
> 
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  8f7ec6bcc75a996f5c6b39a9cf  testcase/testparams/testbox
> ----------------  --------------------------  --------------------------  ---------------------------
>          %stddev      change        %stddev       change        %stddev
>            43921        -18%          35884         -17%          36629   vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>            43921        -18%          35884         -17%          36629   GEO-MEAN vm-scalability.median
> 

The regression goes from -18% to -17%, if I understand this correctly.
This is strange, because the patch set restores the way that the
original code worked. The heavy map/unmap calls in the fbdev code are
gone. Performance should have been back to normal.
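
(For illustration only: a minimal sketch of the kmap ref-counting idea, with
made-up names and driver hooks; this is not the actual vram-helper code.)

  #include <linux/mutex.h>

  /* Hypothetical BO with a cached, ref-counted kernel mapping. */
  struct sketch_bo {
          struct mutex lock;
          unsigned int kmap_use_count;
          void *kmap_virt;                         /* cached mapping */
          void *(*map_hw)(struct sketch_bo *bo);   /* driver-provided map */
          void (*unmap_hw)(struct sketch_bo *bo);  /* driver-provided unmap */
  };

  static void *sketch_bo_kmap(struct sketch_bo *bo)
  {
          void *virt;

          mutex_lock(&bo->lock);
          if (bo->kmap_use_count++ == 0)
                  bo->kmap_virt = bo->map_hw(bo);  /* first user maps */
          virt = bo->kmap_virt;
          mutex_unlock(&bo->lock);
          return virt;
  }

  static void sketch_bo_kunmap(struct sketch_bo *bo)
  {
          mutex_lock(&bo->lock);
          if (--bo->kmap_use_count == 0) {
                  bo->unmap_hw(bo);                /* last user unmaps */
                  bo->kmap_virt = NULL;
          }
          mutex_unlock(&bo->lock);
  }

With something along these lines, the fbdev code can keep calling
kmap/kunmap around every screen update, but the expensive mapping is only
established once while another user (e.g. the scanout path) holds a
reference.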

I'd like to prepare a patch set for entirely reverting all changes. Can
I send it to you for testing?

Best regards
Thomas

> Best Regards,
> Rong Chen
> 
>>
>> Best regards
>> Thomas
>>
>>> Thanks,
>>> Feng
>>>
>>>
>>>> Best regards
>>>> Thomas
>>>>
>>>>>      drm/fb-helper: Map DRM client buffer only when required
>>>>>
>>>>>      This patch changes DRM clients to not map the buffer by default. The
>>>>>      buffer, like any buffer object, should be mapped and unmapped when
>>>>>      needed.
>>>>>
>>>>>      An unmapped buffer object can be evicted to system memory and does
>>>>>      not consume video ram until displayed. This allows to use generic fbdev
>>>>>      emulation with drivers for low-memory devices, such as ast and mgag200.
>>>>>
>>>>>      This change affects the generic framebuffer console. HW-based consoles
>>>>>      map their console buffer once and keep it mapped. Userspace can mmap this
>>>>>      buffer into its address space. The shadow-buffered framebuffer console
>>>>>      only needs the buffer object to be mapped during updates. While not being
>>>>>      updated from the shadow buffer, the buffer object can remain unmapped.
>>>>>      Userspace will always mmap the shadow buffer.
>>>>>
>>>>> which may add more load when fbcon is busy printing out messages.
>>>>>
>>>>> We are doing more test inside 0day to confirm.
>>>>>
>>>>> Thanks,
>>>>> Feng
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>
>>>> -- 
>>>> Thomas Zimmermann
>>>> Graphics Driver Developer
>>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>>> HRB 21284 (AG Nürnberg)
>>>>
>>>
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-31 10:10             ` Thomas Zimmermann
@ 2019-08-02  9:11               ` Daniel Vetter
  2019-08-02  9:26                 ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Daniel Vetter @ 2019-08-02  9:11 UTC (permalink / raw)
  To: Thomas Zimmermann; +Cc: Stephen Rothwell, kernel test robot, LKP, dri-devel

On Wed, Jul 31, 2019 at 12:10:54PM +0200, Thomas Zimmermann wrote:
> Hi
> 
> Am 31.07.19 um 10:13 schrieb Daniel Vetter:
> > On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
> >>
> >> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>
> >>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >>>>
> >>>> Hi
> >>>>
> >>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
> >>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
> >>>>>>> Greeting,
> >>>>>>>
> >>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
> >>>>>>>
> >>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> >>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
> >>>>>>
> >>>>>> Daniel, Noralf, we may have to revert this patch.
> >>>>>>
> >>>>>> I expected some change in display performance, but not in VM. Since it's
> >>>>>> a server chipset, probably no one cares much about display performance.
> >>>>>> So that seemed like a good trade-off for re-using shared code.
> >>>>>>
> >>>>>> Part of the patch set is that the generic fb emulation now maps and
> >>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
> >>>>>> of the performance regression. And it should be visible with other
> >>>>>> drivers as well if they use a shadow FB for fbdev emulation.
> >>>>>
> >>>>> For fbcon we shouldn't need to do any maps/unmaps at all, this is for the
> >>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
> >>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
> >>>>> have an fbdev mmap there shouldn't be any impact at all.
> >>>>
> >>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
> >>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
> >>>> evicted and make room for X, etc.
> >>>>
> >>>> To make this work, the BO's memory is mapped and unmapped in
> >>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
> >>>> That fbdev mapping is established on each screen update, more or less.
> >>>> From my (yet unverified) understanding, this causes the performance
> >>>> regression in the VM code.
> >>>>
> >>>> The original code in mgag200 used to kmap the fbdev BO while it's being
> >>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
> >>>> not being displayed). [3]
> >>>
> >>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
> >>> cache this.
> >>>
> >>>> I think this could be added for VRAM helpers as well, but it's still a
> >>>> workaround and non-VRAM drivers might also run into such a performance
> >>>> regression if they use the fbdev's shadow fb.
> >>>
> >>> Yeah agreed, fbdev emulation should try to cache the vmap.
> >>>
> >>>> Noralf mentioned that there are plans for other DRM clients besides the
> >>>> console. They would as well run into similar problems.
> >>>>
> >>>>>> The thing is that we'd need another generic fbdev emulation for ast and
> >>>>>> mgag200 that handles this issue properly.
> >>>>>
> >>>>> Yeah I don't think we want to jump the gun here.  If you can try to
> >>>>> repro locally and profile where we're wasting cpu time I hope that
> >>>>> should shed some light on what's going wrong here.
> >>>>
> >>>> I don't have much time ATM and I'm not even officially at work until
> >>>> late Aug. I'd send you the revert and investigate later. I agree that
> >>>> using generic fbdev emulation would be preferable.
> >>>
> >>> Still not sure that's the right thing to do really. Yes it's a
> >>> regression, but vm testcases shouldn't run a single line of fbcon or drm
> >>> code. So why this is impacted so heavily by a silly drm change is very
> >>> confusing to me. We might be papering over a deeper and much more
> >>> serious issue ...
> >>
> >> It's a regression, the right thing is to revert first and then work
> >> out the right thing to do.
> > 
> > Sure, but I have no idea whether the testcase is doing something
> > reasonable. If it's accidentally testing vm scalability of fbdev and
> > there's no one else doing something this pointless, then it's not a
> > real bug. Plus I think we're shooting the messenger here.
> > 
> >> It's likely the test runs on the console and printfs stuff out while running.
> > 
> > But why did we not regress the world if a few prints on the console
> > have such a huge impact? We didn't get an entire stream of mails about
> > breaking stuff ...
> 
> The vmap/vunmap pair is only executed for fbdev emulation with a shadow
> FB. And most of those are with shmem helpers, which ref-count the vmap
> calls internally. My guess is that VRAM helpers are currently the only
> BOs triggering this problem.

I meant that surely this vm-scalability testcase isn't the only thing
that's being run by 0day on a machine with mgag200. If a few printks to
dmesg/console cause such a huge regression, I'd expect everything to
regress on that box. But that seems to not be the case.
-Daniel

> 
> Best regards
> Thomas
> 
> > -Daniel
> > 
> 
> -- 
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
> 




-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-02  7:11                           ` Rong Chen
  2019-08-02  8:23                             ` Thomas Zimmermann
@ 2019-08-02  9:20                             ` Thomas Zimmermann
  1 sibling, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-02  9:20 UTC (permalink / raw)
  To: Rong Chen, Feng Tang
  Cc: Stephen Rothwell, Michel Dänzer, LKP, dri-devel, Huang, Ying


[-- Attachment #1.1.1: Type: text/plain, Size: 10914 bytes --]

Hi

Am 02.08.19 um 09:11 schrieb Rong Chen:
> Hi,
> 
> On 8/1/19 7:58 PM, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 01.08.19 um 13:25 schrieb Feng Tang:
>>> Hi Thomas,
>>>
>>> On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> Am 01.08.19 um 10:37 schrieb Feng Tang:
>>>>> On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>>>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4
>>>>>>>>>>>>>>> ("drm/mgag200: Replace struct mga_fbdev with generic
>>>>>>>>>>>>>>> framebuffer emulation")
>>>>>>>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
>>>>>>>>>>>>>>> master
>>>>>>>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I expected some change in display performance, but not in
>>>>>>>>>>>>>> VM. Since it's
>>>>>>>>>>>>>> a server chipset, probably no one cares much about display
>>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>> So that seemed like a good trade-off for re-using shared
>>>>>>>>>>>>>> code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Part of the patch set is that the generic fb emulation now
>>>>>>>>>>>>>> maps and
>>>>>>>>>>>>>> unmaps the fbdev BO when updating the screen. I guess
>>>>>>>>>>>>>> that's the cause
>>>>>>>>>>>>>> of the performance regression. And it should be visible
>>>>>>>>>>>>>> with other
>>>>>>>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>>>>>>> For fbcon we shouldn't need to do any maps/unmaps at all, this
>>>>>>>>>>>>> is for the
>>>>>>>>>>>>> fbdev mmap support only. If the testcase mentioned here
>>>>>>>>>>>>> tests fbdev
>>>>>>>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as
>>>>>>>>>>>>> you don't
>>>>>>>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have
>>>>>>>>>>>> to get the
>>>>>>>>>>>> fbdev BO out if it's not being displayed. If not being
>>>>>>>>>>>> mapped, it can be
>>>>>>>>>>>> evicted and make room for X, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>>>>>>>> drm_fb_helper_dirty_work() before being updated from the
>>>>>>>>>>>> shadow FB. [1]
>>>>>>>>>>>> That fbdev mapping is established on each screen update,
>>>>>>>>>>>> more or less.
>>>>>>>>>>>>  From my (yet unverified) understanding, this causes the
>>>>>>>>>>>> performance
>>>>>>>>>>>> regression in the VM code.
>>>>>>>>>>>>
>>>>>>>>>>>> The original code in mgag200 used to kmap the fbdev BO while
>>>>>>>>>>>> it's being
>>>>>>>>>>>> displayed; [2] and the drawing code only mapped it when
>>>>>>>>>>>> necessary (i.e.,
>>>>>>>>>>>> not being displayed). [3]
>>>>>>>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We
>>>>>>>>>>> indeed should
>>>>>>>>>>> cache this.
>>>>>>>>>>>
>>>>>>>>>>>> I think this could be added for VRAM helpers as well, but
>>>>>>>>>>>> it's still a
>>>>>>>>>>>> workaround and non-VRAM drivers might also run into such a
>>>>>>>>>>>> performance
>>>>>>>>>>>> regression if they use the fbdev's shadow fb.
>>>>>>>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>>>>>>>
>>>>>>>>>>>> Noralf mentioned that there are plans for other DRM clients
>>>>>>>>>>>> besides the
>>>>>>>>>>>> console. They would as well run into similar problems.
>>>>>>>>>>>>
>>>>>>>>>>>>>> The thing is that we'd need another generic fbdev
>>>>>>>>>>>>>> emulation for ast and
>>>>>>>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>>>>>>> Yeah I don't think we want to jump the gun here.  If you can
>>>>>>>>>>>>> try to
>>>>>>>>>>>>> repro locally and profile where we're wasting cpu time I
>>>>>>>>>>>>> hope that
>>>>>>>>>>>>> should shed some light on what's going wrong here.
>>>>>>>>>>>> I don't have much time ATM and I'm not even officially at
>>>>>>>>>>>> work until
>>>>>>>>>>>> late Aug. I'd send you the revert and investigate later. I
>>>>>>>>>>>> agree that
>>>>>>>>>>>> using generic fbdev emulation would be preferable.
>>>>>>>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>>>>>>>> regression, but vm testcases shouldn't run a single line of
>>>>>>>>>>> fbcon or drm
>>>>>>>>>>> code. So why this is impacted so heavily by a silly drm
>>>>>>>>>>> change is very
>>>>>>>>>>> confusing to me. We might be papering over a deeper and much
>>>>>>>>>>> more
>>>>>>>>>>> serious issue ...
>>>>>>>>>> It's a regression, the right thing is to revert first and then
>>>>>>>>>> work
>>>>>>>>>> out the right thing to do.
>>>>>>>>> Sure, but I have no idea whether the testcase is doing something
>>>>>>>>> reasonable. If it's accidentally testing vm scalability of
>>>>>>>>> fbdev and
>>>>>>>>> there's no one else doing something this pointless, then it's
>>>>>>>>> not a
>>>>>>>>> real bug. Plus I think we're shooting the messenger here.
>>>>>>>>>
>>>>>>>>>> It's likely the test runs on the console and printfs stuff out
>>>>>>>>>> while running.
>>>>>>>>> But why did we not regress the world if a few prints on the
>>>>>>>>> console
>>>>>>>>> have such a huge impact? We didn't get an entire stream of
>>>>>>>>> mails about
>>>>>>>>> breaking stuff ...
>>>>>>>> The regression seems not related to the commit.  But we have
>>>>>>>> retested
>>>>>>>> and confirmed the regression.  Hard to understand what happens.
>>>>>>> Does the regressed test cause any output on console while it's
>>>>>>> measuring? If so, it's probably accidentally measuring fbcon/DRM
>>>>>>> code in
>>>>>>> addition to the workload it's trying to measure.
>>>>>>>
>>>>>> Sorry, I'm not familiar with DRM, we enabled the console to output
>>>>>> logs, and
>>>>>> attached please find the log file.
>>>>>>
>>>>>> "Command line: ... console=tty0 earlyprintk=ttyS0,115200
>>>>>> console=ttyS0,115200 vga=normal rw"
>>>>> We did more checks, and found this test machine does use the
>>>>> mgag200 driver.
>>>>>
>>>>> And we are suspecting the regression is caused by
>>>>>
>>>>> commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
>>>>> Author: Thomas Zimmermann <tzimmermann@suse.de>
>>>>> Date:   Wed Jul 3 09:58:24 2019 +0200
>>>> Yes, that's the commit. Unfortunately reverting it would require
>>>> reverting a hand full of other patches as well.
>>>>
>>>> I have a potential fix for the problem. Could you run and verify
>>>> that it
>>>> resolves the problem?
>>> Sure, please send it to us. Rong and I will try it.
>> Fantastic, thank you! The patch set is available on dri-devel at
>>
>>   
>> https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html
> 
> The patch set improves the performance slightly, but the change is not
> very obvious.
> 
> $ git log --oneline 8f7ec6bcc7 -5
> 8f7ec6bcc75a9 drm/mgag200: Map fbdev framebuffer while it's being displayed
> abcb1cf24033a drm/ast: Map fbdev framebuffer while it's being displayed
> a92f80044c623 drm/vram-helpers: Add kmap ref-counting to GEM VRAM objects
> 90f479ae51afa drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
> f1f8555dfb9a7 drm/bochs: Use shadow buffer for bochs framebuffer console
> 
> commit:
>   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
>   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation")
>   8f7ec6bcc7 ("drm/mgag200: Map fbdev framebuffer while it's being
> displayed")
> 
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  8f7ec6bcc75a996f5c6b39a9cf  testcase/testparams/testbox
> ----------------  --------------------------  --------------------------  ---------------------------
>          %stddev      change        %stddev       change        %stddev
>            43921        -18%          35884         -17%          36629   vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>            43921        -18%          35884         -17%          36629   GEO-MEAN vm-scalability.median

Thank you for testing.

There's another thing I'd like to ask: could you run the test without
console output on drm-tip (i.e., disable it or pipe it into /dev/null)?
I'd like to see how that impacts performance.

Best regards
Thomas

> Best Regards,
> Rong Chen
> 
>>
>> Best regards
>> Thomas
>>
>>> Thanks,
>>> Feng
>>>
>>>
>>>> Best regards
>>>> Thomas
>>>>
>>>>>      drm/fb-helper: Map DRM client buffer only when required
>>>>>
>>>>>      This patch changes DRM clients to not map the buffer by default. The
>>>>>      buffer, like any buffer object, should be mapped and unmapped when
>>>>>      needed.
>>>>>
>>>>>      An unmapped buffer object can be evicted to system memory and does
>>>>>      not consume video ram until displayed. This allows to use generic fbdev
>>>>>      emulation with drivers for low-memory devices, such as ast and mgag200.
>>>>>
>>>>>      This change affects the generic framebuffer console. HW-based consoles
>>>>>      map their console buffer once and keep it mapped. Userspace can mmap this
>>>>>      buffer into its address space. The shadow-buffered framebuffer console
>>>>>      only needs the buffer object to be mapped during updates. While not being
>>>>>      updated from the shadow buffer, the buffer object can remain unmapped.
>>>>>      Userspace will always mmap the shadow buffer.
>>>>>
>>>>> which may add more load when fbcon is busy printing out messages.
>>>>>
>>>>> We are doing more test inside 0day to confirm.
>>>>>
>>>>> Thanks,
>>>>> Feng
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>
>>>> -- 
>>>> Thomas Zimmermann
>>>> Graphics Driver Developer
>>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>>> HRB 21284 (AG Nürnberg)
>>>>
>>>
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
> 
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-02  9:11               ` Daniel Vetter
@ 2019-08-02  9:26                 ` Thomas Zimmermann
  0 siblings, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-02  9:26 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Stephen Rothwell, LKP, dri-devel, kernel test robot


[-- Attachment #1.1.1: Type: text/plain, Size: 6304 bytes --]

Hi

Am 02.08.19 um 11:11 schrieb Daniel Vetter:
> On Wed, Jul 31, 2019 at 12:10:54PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 31.07.19 um 10:13 schrieb Daniel Vetter:
>>> On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie <airlied@gmail.com> wrote:
>>>>
>>>> On Wed, 31 Jul 2019 at 05:00, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>
>>>>> On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Am 30.07.19 um 20:12 schrieb Daniel Vetter:
>>>>>>> On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>>>> Am 29.07.19 um 11:51 schrieb kernel test robot:
>>>>>>>>> Greeting,
>>>>>>>>>
>>>>>>>>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>>>>>>>>>
>>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>>>>
>>>>>>>> Daniel, Noralf, we may have to revert this patch.
>>>>>>>>
>>>>>>>> I expected some change in display performance, but not in VM. Since it's
>>>>>>>> a server chipset, probably no one cares much about display performance.
>>>>>>>> So that seemed like a good trade-off for re-using shared code.
>>>>>>>>
>>>>>>>> Part of the patch set is that the generic fb emulation now maps and
>>>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause
>>>>>>>> of the performance regression. And it should be visible with other
>>>>>>>> drivers as well if they use a shadow FB for fbdev emulation.
>>>>>>>
>>>>>>> For fbcon we shouldn't need to do any maps/unmaps at all, this is for the
>>>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev
>>>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't
>>>>>>> have an fbdev mmap there shouldn't be any impact at all.
>>>>>>
>>>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the
>>>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be
>>>>>> evicted and make room for X, etc.
>>>>>>
>>>>>> To make this work, the BO's memory is mapped and unmapped in
>>>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1]
>>>>>> That fbdev mapping is established on each screen update, more or less.
>>>>>> From my (yet unverified) understanding, this causes the performance
>>>>>> regression in the VM code.
>>>>>>
>>>>>> The original code in mgag200 used to kmap the fbdev BO while it's being
>>>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e.,
>>>>>> not being displayed). [3]
>>>>>
>>>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should
>>>>> cache this.
>>>>>
>>>>>> I think this could be added for VRAM helpers as well, but it's still a
>>>>>> workaround and non-VRAM drivers might also run into such a performance
>>>>>> regression if they use the fbdev's shadow fb.
>>>>>
>>>>> Yeah agreed, fbdev emulation should try to cache the vmap.
>>>>>
>>>>>> Noralf mentioned that there are plans for other DRM clients besides the
>>>>>> console. They would as well run into similar problems.
>>>>>>
>>>>>>>> The thing is that we'd need another generic fbdev emulation for ast and
>>>>>>>> mgag200 that handles this issue properly.
>>>>>>>
>>>>>>> Yeah I don't think we want to jump the gun here.  If you can try to
>>>>>>> repro locally and profile where we're wasting cpu time I hope that
>>>>>>> should shed some light on what's going wrong here.
>>>>>>
>>>>>> I don't have much time ATM and I'm not even officially at work until
>>>>>> late Aug. I'd send you the revert and investigate later. I agree that
>>>>>> using generic fbdev emulation would be preferable.
>>>>>
>>>>> Still not sure that's the right thing to do really. Yes it's a
>>>>> regression, but vm testcases shouldn't run a single line of fbcon or drm
>>>>> code. So why this is impacted so heavily by a silly drm change is very
>>>>> confusing to me. We might be papering over a deeper and much more
>>>>> serious issue ...
>>>>
>>>> It's a regression, the right thing is to revert first and then work
>>>> out the right thing to do.
>>>
>>> Sure, but I have no idea whether the testcase is doing something
>>> reasonable. If it's accidentally testing vm scalability of fbdev and
>>> there's no one else doing something this pointless, then it's not a
>>> real bug. Plus I think we're shooting the messenger here.
>>>
>>>> It's likely the test runs on the console and printfs stuff out while running.
>>>
>>> But why did we not regress the world if a few prints on the console
>>> have such a huge impact? We didn't get an entire stream of mails about
>>> breaking stuff ...
>>
>> The vmap/vunmap pair is only executed for fbdev emulation with a shadow
>> FB. And most of those are with shmem helpers, which ref-count the vmap
>> calls internally. My guess is that VRAM helpers are currently the only
>> BOs triggering this problem.
> 
> I meant that surely this vm-scalability testcase isn't the only thing
> that's being run by 0day on a machine with mgag200. If a few printks to
> dmesg/console cause such a huge regression, I'd expect everything to
> regress on that box. But that seems to not be the case.

True. And according to Rong Chen's feedback, vmap and vunmap have only a
small impact. The other difference is that there's now a shadow FB for
the console, including the dirty worker with an additional memcpy.
mgag200 used to update the console directly in VRAM.
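
(A rough sketch of what such a shadow-FB dirty worker does, with hypothetical
names; the real drm_fb_helper code differs in its details.)

  #include <linux/kernel.h>
  #include <linux/string.h>
  #include <linux/types.h>
  #include <linux/workqueue.h>

  struct sketch_bo;
  /* hypothetical stand-ins for mapping/unmapping the on-device BO */
  extern void *sketch_bo_vmap(struct sketch_bo *bo);
  extern void sketch_bo_vunmap(struct sketch_bo *bo);

  struct sketch_fbdev {
          struct work_struct dirty_work;
          void *shadow;           /* what fbcon and userspace write to */
          size_t pitch;           /* bytes per scanline */
          unsigned int y1, y2;    /* dirty scanline range */
          struct sketch_bo *bo;   /* framebuffer BO in VRAM */
  };

  static void sketch_dirty_worker(struct work_struct *work)
  {
          struct sketch_fbdev *fb =
                  container_of(work, struct sketch_fbdev, dirty_work);
          void *dst = sketch_bo_vmap(fb->bo);     /* mapping per update */
          unsigned int y;

          /* the additional memcpy: shadow buffer -> VRAM */
          for (y = fb->y1; y < fb->y2; y++)
                  memcpy(dst + y * fb->pitch,
                         fb->shadow + y * fb->pitch, fb->pitch);

          sketch_bo_vunmap(fb->bo);               /* unmapping per update */
  }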

I'd expect to see every driver with shadow-FB console to show bad
performance, but that doesn't seem to be the case either.

Best regards
Thomas

> -Daniel
> 
>>
>> Best regards
>> Thomas
>>
>>> -Daniel
>>>
>>
>> -- 
>> Thomas Zimmermann
>> Graphics Driver Developer
>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>> HRB 21284 (AG Nürnberg)
>>
> 
> 
> 
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-07-30 17:50 ` [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression Thomas Zimmermann
  2019-07-30 18:12   ` Daniel Vetter
@ 2019-08-04 18:39   ` Thomas Zimmermann
  2019-08-05  7:02     ` Feng Tang
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-04 18:39 UTC (permalink / raw)
  To: Noralf Trønnes, Daniel Vetter
  Cc: Stephen Rothwell, Feng Tang, rong.a.chen, michel, dri-devel,
	ying.huang, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 63604 bytes --]

Hi

I did some further analysis on this problem and found that the blinking
cursor affects performance of the vm-scalability test case.

I only have a 4-core machine, so scalability is not really testable. Yet
I see the effects of running vm-scalibility against drm-tip, a revert of
the mgag200 patch and the vmap fixes that I posted a few days ago.

After reverting the mgag200 patch, running the test as described in the
report

  bin/lkp run job.yaml

gives results like

  2019-08-02 19:34:37  ./case-anon-cow-seq-hugetlb
  2019-08-02 19:34:37  ./usemem --runtime 300 -n 4 --prealloc --prefault
    -O -U 815395225
  917319627 bytes / 756534 usecs = 1184110 KB/s
  917319627 bytes / 764675 usecs = 1171504 KB/s
  917319627 bytes / 766414 usecs = 1168846 KB/s
  917319627 bytes / 777990 usecs = 1151454 KB/s

Running the test against current drm-tip gives slightly worse results,
such as:

  2019-08-03 19:17:06  ./case-anon-cow-seq-hugetlb
  2019-08-03 19:17:06  ./usemem --runtime 300 -n 4 --prealloc --prefault
    -O -U 815394406
  917318700 bytes / 871607 usecs = 1027778 KB/s
  917318700 bytes / 894173 usecs = 1001840 KB/s
  917318700 bytes / 919694 usecs = 974040 KB/s
  917318700 bytes / 923341 usecs = 970193 KB/s

The test puts out roughly one result per second. Strangely, sending the
output to /dev/null can make results significantly worse.

  bin/lkp run job.yaml > /dev/null

  2019-08-03 19:23:04  ./case-anon-cow-seq-hugetlb
  2019-08-03 19:23:04  ./usemem --runtime 300 -n 4 --prealloc --prefault
    -O -U 815394406
  917318700 bytes / 1207358 usecs = 741966 KB/s
  917318700 bytes / 1210456 usecs = 740067 KB/s
  917318700 bytes / 1216572 usecs = 736346 KB/s
  917318700 bytes / 1239152 usecs = 722929 KB/s

I realized that there's still a blinking cursor on the screen, which I
disabled with

  tput civis

or alternatively

  echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink

Running the test now gives the original or even better results, such as:

  bin/lkp run job.yaml > /dev/null

  2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
  2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc --prefault
    -O -U 815394406
  917318700 bytes / 659419 usecs = 1358497 KB/s
  917318700 bytes / 659658 usecs = 1358005 KB/s
  917318700 bytes / 659916 usecs = 1357474 KB/s
  917318700 bytes / 660168 usecs = 1356956 KB/s

Rong, Feng, could you confirm this by disabling the cursor or its blinking?


The difference between mgag200's original fbdev support and generic
fbdev emulation is generic fbdev's worker task that updates the VRAM
buffer from the shadow buffer. mgag200 does this immediately, but relies
on drm_can_sleep(), which is deprecated.
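
(To illustrate the difference, a simplified, hypothetical sketch; neither
snippet is the actual helper or driver code.)

  #include <linux/workqueue.h>
  #include <drm/drm_util.h>       /* drm_can_sleep(); header may vary by version */

  struct sketch_con {
          struct work_struct dirty_work;  /* copies the shadow FB to VRAM later */
  };

  /* hypothetical direct copy routine of the driver */
  extern void sketch_copy_to_vram(struct sketch_con *con);

  /* generic fbdev emulation: defer the copy to a worker in process context */
  static void sketch_mark_dirty_deferred(struct sketch_con *con)
  {
          schedule_work(&con->dirty_work);
  }

  /* old mgag200-style path: copy right away if the context allows sleeping */
  static void sketch_mark_dirty_immediate(struct sketch_con *con)
  {
          if (drm_can_sleep())
                  sketch_copy_to_vram(con);
          /* otherwise the update is skipped and picked up later */
  }

The fbdev drawing hooks would call one of these after touching the shadow
buffer; the deferred variant is what generic fbdev emulation uses today.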

I think that the worker task only interferes with this particular test
case; after all, the worker has been part of generic fbdev emulation
since forever and no performance regressions have been reported for it
so far.


So unless there's a report where this problem happens in a real-world
use case, I'd like to keep the code as it is. And apparently there's always
the workaround of disabling the cursor blinking.

Best regards
Thomas


Am 30.07.19 um 19:50 schrieb Thomas Zimmermann:
> Am 29.07.19 um 11:51 schrieb kernel test robot:
>> Greeting,
>>
>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
>>
>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
> 
> Daniel, Noralf, we may have to revert this patch.
> 
> I expected some change in display performance, but not in VM. Since it's
> a server chipset, probably no one cares much about display performance.
> So that seemed like a good trade-off for re-using shared code.
> 
> Part of the patch set is that the generic fb emulation now maps and
> unmaps the fbdev BO when updating the screen. I guess that's the cause
> of the performance regression. And it should be visible with other
> drivers as well if they use a shadow FB for fbdev emulation.
> 
> The thing is that we'd need another generic fbdev emulation for ast and
> mgag200 that handles this issue properly.
> 
> Best regards
> Thomas
> 
>>
>> in testcase: vm-scalability
>> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
>> with following parameters:
>>
>> 	runtime: 300s
>> 	size: 8T
>> 	test: anon-cow-seq-hugetlb
>> 	cpufreq_governor: performance
>>
>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>>
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> To reproduce:
>>
>>         git clone https://github.com/intel/lkp-tests.git
>>         cd lkp-tests
>>         bin/lkp install job.yaml  # job file is attached in this email
>>         bin/lkp run     job.yaml
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>>   gcc-7/performance/x86_64-rhel-7.6/debian-x86_64-2019-05-14.cgz/300s/8T/lkp-knm01/anon-cow-seq-hugetlb/vm-scalability
>>
>> commit: 
>>   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
>>   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
>>
>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 
>> ---------------- --------------------------- 
>>        fail:runs  %reproduction    fail:runs
>>            |             |             |    
>>           2:4          -50%            :4     dmesg.WARNING:at#for_ip_interrupt_entry/0x
>>            :4           25%           1:4     dmesg.WARNING:at_ip___perf_sw_event/0x
>>            :4           25%           1:4     dmesg.WARNING:at_ip__fsnotify_parent/0x
>>          %stddev     %change         %stddev
>>              \          |                \  
>>      43955 ±  2%     -18.8%      35691        vm-scalability.median
>>       0.06 ±  7%    +193.0%       0.16 ±  2%  vm-scalability.median_stddev
>>   14906559 ±  2%     -17.9%   12237079        vm-scalability.throughput
>>      87651 ±  2%     -17.4%      72374        vm-scalability.time.involuntary_context_switches
>>    2086168           -23.6%    1594224        vm-scalability.time.minor_page_faults
>>      15082 ±  2%     -10.4%      13517        vm-scalability.time.percent_of_cpu_this_job_got
>>      29987            -8.9%      27327        vm-scalability.time.system_time
>>      15755           -12.4%      13795        vm-scalability.time.user_time
>>     122011           -19.3%      98418        vm-scalability.time.voluntary_context_switches
>>  3.034e+09           -23.6%  2.318e+09        vm-scalability.workload
>>     242478 ± 12%     +68.5%     408518 ± 23%  cpuidle.POLL.time
>>       2788 ± 21%    +117.4%       6062 ± 26%  cpuidle.POLL.usage
>>      56653 ± 10%     +64.4%      93144 ± 20%  meminfo.Mapped
>>     120392 ±  7%     +14.0%     137212 ±  4%  meminfo.Shmem
>>      47221 ± 11%     +77.1%      83634 ± 22%  numa-meminfo.node0.Mapped
>>     120465 ±  7%     +13.9%     137205 ±  4%  numa-meminfo.node0.Shmem
>>    2885513           -16.5%    2409384        numa-numastat.node0.local_node
>>    2885471           -16.5%    2409354        numa-numastat.node0.numa_hit
>>      11813 ± 11%     +76.3%      20824 ± 22%  numa-vmstat.node0.nr_mapped
>>      30096 ±  7%     +13.8%      34238 ±  4%  numa-vmstat.node0.nr_shmem
>>      43.72 ±  2%      +5.5       49.20        mpstat.cpu.all.idle%
>>       0.03 ±  4%      +0.0        0.05 ±  6%  mpstat.cpu.all.soft%
>>      19.51            -2.4       17.08        mpstat.cpu.all.usr%
>>       1012            -7.9%     932.75        turbostat.Avg_MHz
>>      32.38 ± 10%     +25.8%      40.73        turbostat.CPU%c1
>>     145.51            -3.1%     141.01        turbostat.PkgWatt
>>      15.09           -19.2%      12.19        turbostat.RAMWatt
>>      43.50 ±  2%     +13.2%      49.25        vmstat.cpu.id
>>      18.75 ±  2%     -13.3%      16.25 ±  2%  vmstat.cpu.us
>>     152.00 ±  2%      -9.5%     137.50        vmstat.procs.r
>>       4800           -13.1%       4173        vmstat.system.cs
>>     156170           -11.9%     137594        slabinfo.anon_vma.active_objs
>>       3395           -11.9%       2991        slabinfo.anon_vma.active_slabs
>>     156190           -11.9%     137606        slabinfo.anon_vma.num_objs
>>       3395           -11.9%       2991        slabinfo.anon_vma.num_slabs
>>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.active_objs
>>       1716 ±  5%     +11.5%       1913 ±  8%  slabinfo.dmaengine-unmap-16.num_objs
>>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.active_objs
>>       1767 ±  2%     -19.0%       1431 ±  2%  slabinfo.hugetlbfs_inode_cache.num_objs
>>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.active_objs
>>       3597 ±  5%     -16.4%       3006 ±  3%  slabinfo.skbuff_ext_cache.num_objs
>>    1330122           -23.6%    1016557        proc-vmstat.htlb_buddy_alloc_success
>>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_active_anon
>>      67277            +2.9%      69246        proc-vmstat.nr_anon_pages
>>     218.50 ±  3%     -10.6%     195.25        proc-vmstat.nr_dirtied
>>     288628            +1.4%     292755        proc-vmstat.nr_file_pages
>>     360.50            -2.7%     350.75        proc-vmstat.nr_inactive_file
>>      14225 ±  9%     +63.8%      23304 ± 20%  proc-vmstat.nr_mapped
>>      30109 ±  7%     +13.8%      34259 ±  4%  proc-vmstat.nr_shmem
>>      99870            -1.3%      98597        proc-vmstat.nr_slab_unreclaimable
>>     204.00 ±  4%     -12.1%     179.25        proc-vmstat.nr_written
>>      77214 ±  3%      +6.4%      82128 ±  2%  proc-vmstat.nr_zone_active_anon
>>     360.50            -2.7%     350.75        proc-vmstat.nr_zone_inactive_file
>>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults
>>       8810 ± 19%     -66.1%       2987 ± 42%  proc-vmstat.numa_hint_faults_local
>>    2904082           -16.4%    2427026        proc-vmstat.numa_hit
>>    2904081           -16.4%    2427025        proc-vmstat.numa_local
>>  6.828e+08           -23.5%  5.221e+08        proc-vmstat.pgalloc_normal
>>    2900008           -17.2%    2400195        proc-vmstat.pgfault
>>  6.827e+08           -23.5%   5.22e+08        proc-vmstat.pgfree
>>  1.635e+10           -17.0%  1.357e+10        perf-stat.i.branch-instructions
>>       1.53 ±  4%      -0.1        1.45 ±  3%  perf-stat.i.branch-miss-rate%
>>  2.581e+08 ±  3%     -20.5%  2.051e+08 ±  2%  perf-stat.i.branch-misses
>>      12.66            +1.1       13.78        perf-stat.i.cache-miss-rate%
>>   72720849           -12.0%   63958986        perf-stat.i.cache-misses
>>  5.766e+08           -18.6%  4.691e+08        perf-stat.i.cache-references
>>       4674 ±  2%     -13.0%       4064        perf-stat.i.context-switches
>>       4.29           +12.5%       4.83        perf-stat.i.cpi
>>  2.573e+11            -7.4%  2.383e+11        perf-stat.i.cpu-cycles
>>     231.35           -21.5%     181.56        perf-stat.i.cpu-migrations
>>       3522            +4.4%       3677        perf-stat.i.cycles-between-cache-misses
>>       0.09 ± 13%      +0.0        0.12 ±  5%  perf-stat.i.iTLB-load-miss-rate%
>>  5.894e+10           -15.8%  4.961e+10        perf-stat.i.iTLB-loads
>>  5.901e+10           -15.8%  4.967e+10        perf-stat.i.instructions
>>       1291 ± 14%     -21.8%       1010        perf-stat.i.instructions-per-iTLB-miss
>>       0.24           -11.0%       0.21        perf-stat.i.ipc
>>       9476           -17.5%       7821        perf-stat.i.minor-faults
>>       9478           -17.5%       7821        perf-stat.i.page-faults
>>       9.76            -3.6%       9.41        perf-stat.overall.MPKI
>>       1.59 ±  4%      -0.1        1.52        perf-stat.overall.branch-miss-rate%
>>      12.61            +1.1       13.71        perf-stat.overall.cache-miss-rate%
>>       4.38           +10.5%       4.83        perf-stat.overall.cpi
>>       3557            +5.3%       3747        perf-stat.overall.cycles-between-cache-misses
>>       0.08 ± 12%      +0.0        0.10        perf-stat.overall.iTLB-load-miss-rate%
>>       1268 ± 15%     -23.0%     976.22        perf-stat.overall.instructions-per-iTLB-miss
>>       0.23            -9.5%       0.21        perf-stat.overall.ipc
>>       5815            +9.7%       6378        perf-stat.overall.path-length
>>  1.634e+10           -17.5%  1.348e+10        perf-stat.ps.branch-instructions
>>  2.595e+08 ±  3%     -21.2%  2.043e+08 ±  2%  perf-stat.ps.branch-misses
>>   72565205           -12.2%   63706339        perf-stat.ps.cache-misses
>>  5.754e+08           -19.2%  4.646e+08        perf-stat.ps.cache-references
>>       4640 ±  2%     -12.5%       4060        perf-stat.ps.context-switches
>>  2.581e+11            -7.5%  2.387e+11        perf-stat.ps.cpu-cycles
>>     229.91           -22.0%     179.42        perf-stat.ps.cpu-migrations
>>  5.889e+10           -16.3%  4.927e+10        perf-stat.ps.iTLB-loads
>>  5.899e+10           -16.3%  4.938e+10        perf-stat.ps.instructions
>>       9388           -18.2%       7677        perf-stat.ps.minor-faults
>>       9389           -18.2%       7677        perf-stat.ps.page-faults
>>  1.764e+13           -16.2%  1.479e+13        perf-stat.total.instructions
>>      46803 ±  3%     -18.8%      37982 ±  6%  sched_debug.cfs_rq:/.exec_clock.min
>>       5320 ±  3%     +23.7%       6581 ±  3%  sched_debug.cfs_rq:/.exec_clock.stddev
>>       6737 ± 14%     +58.1%      10649 ± 10%  sched_debug.cfs_rq:/.load.avg
>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.load.max
>>      46952 ± 16%     +64.8%      77388 ± 11%  sched_debug.cfs_rq:/.load.stddev
>>       7.12 ±  4%     +49.1%      10.62 ±  6%  sched_debug.cfs_rq:/.load_avg.avg
>>     474.40 ± 23%     +67.5%     794.60 ± 10%  sched_debug.cfs_rq:/.load_avg.max
>>      37.70 ± 11%     +74.8%      65.90 ±  9%  sched_debug.cfs_rq:/.load_avg.stddev
>>   13424269 ±  4%     -15.6%   11328098 ±  2%  sched_debug.cfs_rq:/.min_vruntime.avg
>>   15411275 ±  3%     -12.4%   13505072 ±  2%  sched_debug.cfs_rq:/.min_vruntime.max
>>    7939295 ±  6%     -17.5%    6551322 ±  7%  sched_debug.cfs_rq:/.min_vruntime.min
>>      21.44 ±  7%     -56.1%       9.42 ±  4%  sched_debug.cfs_rq:/.nr_spread_over.avg
>>     117.45 ± 11%     -60.6%      46.30 ± 14%  sched_debug.cfs_rq:/.nr_spread_over.max
>>      19.33 ±  8%     -66.4%       6.49 ±  9%  sched_debug.cfs_rq:/.nr_spread_over.stddev
>>       4.32 ± 15%     +84.4%       7.97 ±  3%  sched_debug.cfs_rq:/.runnable_load_avg.avg
>>     353.85 ± 29%    +118.8%     774.35 ± 11%  sched_debug.cfs_rq:/.runnable_load_avg.max
>>      27.30 ± 24%    +118.5%      59.64 ±  9%  sched_debug.cfs_rq:/.runnable_load_avg.stddev
>>       6729 ± 14%     +58.2%      10644 ± 10%  sched_debug.cfs_rq:/.runnable_weight.avg
>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cfs_rq:/.runnable_weight.max
>>      46950 ± 16%     +64.8%      77387 ± 11%  sched_debug.cfs_rq:/.runnable_weight.stddev
>>    5305069 ±  4%     -17.4%    4380376 ±  7%  sched_debug.cfs_rq:/.spread0.avg
>>    7328745 ±  3%      -9.9%    6600897 ±  3%  sched_debug.cfs_rq:/.spread0.max
>>    2220837 ±  4%     +55.8%    3460596 ±  5%  sched_debug.cpu.avg_idle.avg
>>    4590666 ±  9%     +76.8%    8117037 ± 15%  sched_debug.cpu.avg_idle.max
>>     485052 ±  7%     +80.3%     874679 ± 10%  sched_debug.cpu.avg_idle.stddev
>>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock.stddev
>>     561.50 ± 26%     +37.7%     773.30 ± 15%  sched_debug.cpu.clock_task.stddev
>>       3.20 ± 10%    +109.6%       6.70 ±  3%  sched_debug.cpu.cpu_load[0].avg
>>     309.10 ± 20%    +150.3%     773.75 ± 12%  sched_debug.cpu.cpu_load[0].max
>>      21.02 ± 14%    +160.8%      54.80 ±  9%  sched_debug.cpu.cpu_load[0].stddev
>>       3.19 ±  8%    +109.8%       6.70 ±  3%  sched_debug.cpu.cpu_load[1].avg
>>     299.75 ± 19%    +158.0%     773.30 ± 12%  sched_debug.cpu.cpu_load[1].max
>>      20.32 ± 12%    +168.7%      54.62 ±  9%  sched_debug.cpu.cpu_load[1].stddev
>>       3.20 ±  8%    +109.1%       6.69 ±  4%  sched_debug.cpu.cpu_load[2].avg
>>     288.90 ± 20%    +167.0%     771.40 ± 12%  sched_debug.cpu.cpu_load[2].max
>>      19.70 ± 12%    +175.4%      54.27 ±  9%  sched_debug.cpu.cpu_load[2].stddev
>>       3.16 ±  8%    +110.9%       6.66 ±  6%  sched_debug.cpu.cpu_load[3].avg
>>     275.50 ± 24%    +178.4%     766.95 ± 12%  sched_debug.cpu.cpu_load[3].max
>>      18.92 ± 15%    +184.2%      53.77 ± 10%  sched_debug.cpu.cpu_load[3].stddev
>>       3.08 ±  8%    +115.7%       6.65 ±  7%  sched_debug.cpu.cpu_load[4].avg
>>     263.55 ± 28%    +188.7%     760.85 ± 12%  sched_debug.cpu.cpu_load[4].max
>>      18.03 ± 18%    +196.6%      53.46 ± 11%  sched_debug.cpu.cpu_load[4].stddev
>>      14543            -9.6%      13150        sched_debug.cpu.curr->pid.max
>>       5293 ± 16%     +74.7%       9248 ± 11%  sched_debug.cpu.load.avg
>>     587978 ± 17%     +58.2%     930382 ±  9%  sched_debug.cpu.load.max
>>      40887 ± 19%     +78.3%      72891 ±  9%  sched_debug.cpu.load.stddev
>>    1141679 ±  4%     +56.9%    1790907 ±  5%  sched_debug.cpu.max_idle_balance_cost.avg
>>    2432100 ±  9%     +72.6%    4196779 ± 13%  sched_debug.cpu.max_idle_balance_cost.max
>>     745656           +29.3%     964170 ±  5%  sched_debug.cpu.max_idle_balance_cost.min
>>     239032 ±  9%     +81.9%     434806 ± 10%  sched_debug.cpu.max_idle_balance_cost.stddev
>>       0.00 ± 27%     +92.1%       0.00 ± 31%  sched_debug.cpu.next_balance.stddev
>>       1030 ±  4%     -10.4%     924.00 ±  2%  sched_debug.cpu.nr_switches.min
>>       0.04 ± 26%    +139.0%       0.09 ± 41%  sched_debug.cpu.nr_uninterruptible.avg
>>     830.35 ±  6%     -12.0%     730.50 ±  2%  sched_debug.cpu.sched_count.min
>>     912.00 ±  2%      -9.5%     825.38        sched_debug.cpu.ttwu_count.avg
>>     433.05 ±  3%     -19.2%     350.05 ±  3%  sched_debug.cpu.ttwu_count.min
>>     160.70 ±  3%     -12.5%     140.60 ±  4%  sched_debug.cpu.ttwu_local.min
>>       9072 ± 11%     -36.4%       5767 ±  8%  softirqs.CPU1.RCU
>>      12769 ±  5%     +15.3%      14718 ±  3%  softirqs.CPU101.SCHED
>>      13198           +11.5%      14717 ±  3%  softirqs.CPU102.SCHED
>>      12981 ±  4%     +13.9%      14788 ±  3%  softirqs.CPU105.SCHED
>>      13486 ±  3%     +11.8%      15071 ±  4%  softirqs.CPU111.SCHED
>>      12794 ±  4%     +14.1%      14601 ±  9%  softirqs.CPU112.SCHED
>>      12999 ±  4%     +10.1%      14314 ±  4%  softirqs.CPU115.SCHED
>>      12844 ±  4%     +10.6%      14202 ±  2%  softirqs.CPU120.SCHED
>>      13336 ±  3%      +9.4%      14585 ±  3%  softirqs.CPU122.SCHED
>>      12639 ±  4%     +20.2%      15195        softirqs.CPU123.SCHED
>>      13040 ±  5%     +15.2%      15024 ±  5%  softirqs.CPU126.SCHED
>>      13123           +15.1%      15106 ±  5%  softirqs.CPU127.SCHED
>>       9188 ±  6%     -35.7%       5911 ±  2%  softirqs.CPU13.RCU
>>      13054 ±  3%     +13.1%      14761 ±  5%  softirqs.CPU130.SCHED
>>      13158 ±  2%     +13.9%      14985 ±  5%  softirqs.CPU131.SCHED
>>      12797 ±  6%     +13.5%      14524 ±  3%  softirqs.CPU133.SCHED
>>      12452 ±  5%     +14.8%      14297        softirqs.CPU134.SCHED
>>      13078 ±  3%     +10.4%      14439 ±  3%  softirqs.CPU138.SCHED
>>      12617 ±  2%     +14.5%      14442 ±  5%  softirqs.CPU139.SCHED
>>      12974 ±  3%     +13.7%      14752 ±  4%  softirqs.CPU142.SCHED
>>      12579 ±  4%     +19.1%      14983 ±  3%  softirqs.CPU143.SCHED
>>       9122 ± 24%     -44.6%       5053 ±  5%  softirqs.CPU144.RCU
>>      13366 ±  2%     +11.1%      14848 ±  3%  softirqs.CPU149.SCHED
>>      13246 ±  2%     +22.0%      16162 ±  7%  softirqs.CPU150.SCHED
>>      13452 ±  3%     +20.5%      16210 ±  7%  softirqs.CPU151.SCHED
>>      13507           +10.1%      14869        softirqs.CPU156.SCHED
>>      13808 ±  3%      +9.2%      15079 ±  4%  softirqs.CPU157.SCHED
>>      13442 ±  2%     +13.4%      15248 ±  4%  softirqs.CPU160.SCHED
>>      13311           +12.1%      14920 ±  2%  softirqs.CPU162.SCHED
>>      13544 ±  3%      +8.5%      14695 ±  4%  softirqs.CPU163.SCHED
>>      13648 ±  3%     +11.2%      15179 ±  2%  softirqs.CPU166.SCHED
>>      13404 ±  4%     +12.5%      15079 ±  3%  softirqs.CPU168.SCHED
>>      13421 ±  6%     +16.0%      15568 ±  8%  softirqs.CPU169.SCHED
>>      13115 ±  3%     +23.1%      16139 ± 10%  softirqs.CPU171.SCHED
>>      13424 ±  6%     +10.4%      14822 ±  3%  softirqs.CPU175.SCHED
>>      13274 ±  3%     +13.7%      15087 ±  9%  softirqs.CPU185.SCHED
>>      13409 ±  3%     +12.3%      15063 ±  3%  softirqs.CPU190.SCHED
>>      13181 ±  7%     +13.4%      14946 ±  3%  softirqs.CPU196.SCHED
>>      13578 ±  3%     +10.9%      15061        softirqs.CPU197.SCHED
>>      13323 ±  5%     +24.8%      16627 ±  6%  softirqs.CPU198.SCHED
>>      14072 ±  2%     +12.3%      15798 ±  7%  softirqs.CPU199.SCHED
>>      12604 ± 13%     +17.9%      14865        softirqs.CPU201.SCHED
>>      13380 ±  4%     +14.8%      15356 ±  3%  softirqs.CPU203.SCHED
>>      13481 ±  8%     +14.2%      15390 ±  3%  softirqs.CPU204.SCHED
>>      12921 ±  2%     +13.8%      14710 ±  3%  softirqs.CPU206.SCHED
>>      13468           +13.0%      15218 ±  2%  softirqs.CPU208.SCHED
>>      13253 ±  2%     +13.1%      14992        softirqs.CPU209.SCHED
>>      13319 ±  2%     +14.3%      15225 ±  7%  softirqs.CPU210.SCHED
>>      13673 ±  5%     +16.3%      15895 ±  3%  softirqs.CPU211.SCHED
>>      13290           +17.0%      15556 ±  5%  softirqs.CPU212.SCHED
>>      13455 ±  4%     +14.4%      15392 ±  3%  softirqs.CPU213.SCHED
>>      13454 ±  4%     +14.3%      15377 ±  3%  softirqs.CPU215.SCHED
>>      13872 ±  7%      +9.7%      15221 ±  5%  softirqs.CPU220.SCHED
>>      13555 ±  4%     +17.3%      15896 ±  5%  softirqs.CPU222.SCHED
>>      13411 ±  4%     +20.8%      16197 ±  6%  softirqs.CPU223.SCHED
>>       8472 ± 21%     -44.8%       4680 ±  3%  softirqs.CPU224.RCU
>>      13141 ±  3%     +16.2%      15265 ±  7%  softirqs.CPU225.SCHED
>>      14084 ±  3%      +8.2%      15242 ±  2%  softirqs.CPU226.SCHED
>>      13528 ±  4%     +11.3%      15063 ±  4%  softirqs.CPU228.SCHED
>>      13218 ±  3%     +16.3%      15377 ±  4%  softirqs.CPU229.SCHED
>>      14031 ±  4%     +10.2%      15467 ±  2%  softirqs.CPU231.SCHED
>>      13770 ±  3%     +14.0%      15700 ±  3%  softirqs.CPU232.SCHED
>>      13456 ±  3%     +12.3%      15105 ±  3%  softirqs.CPU233.SCHED
>>      13137 ±  4%     +13.5%      14909 ±  3%  softirqs.CPU234.SCHED
>>      13318 ±  2%     +14.7%      15280 ±  2%  softirqs.CPU235.SCHED
>>      13690 ±  2%     +13.7%      15563 ±  7%  softirqs.CPU238.SCHED
>>      13771 ±  5%     +20.8%      16634 ±  7%  softirqs.CPU241.SCHED
>>      13317 ±  7%     +19.5%      15919 ±  9%  softirqs.CPU243.SCHED
>>       8234 ± 16%     -43.9%       4616 ±  5%  softirqs.CPU244.RCU
>>      13845 ±  6%     +13.0%      15643 ±  3%  softirqs.CPU244.SCHED
>>      13179 ±  3%     +16.3%      15323        softirqs.CPU246.SCHED
>>      13754           +12.2%      15438 ±  3%  softirqs.CPU248.SCHED
>>      13769 ±  4%     +10.9%      15276 ±  2%  softirqs.CPU252.SCHED
>>      13702           +10.5%      15147 ±  2%  softirqs.CPU254.SCHED
>>      13315 ±  2%     +12.5%      14980 ±  3%  softirqs.CPU255.SCHED
>>      13785 ±  3%     +12.9%      15568 ±  5%  softirqs.CPU256.SCHED
>>      13307 ±  3%     +15.0%      15298 ±  3%  softirqs.CPU257.SCHED
>>      13864 ±  3%     +10.5%      15313 ±  2%  softirqs.CPU259.SCHED
>>      13879 ±  2%     +11.4%      15465        softirqs.CPU261.SCHED
>>      13815           +13.6%      15687 ±  5%  softirqs.CPU264.SCHED
>>     119574 ±  2%     +11.8%     133693 ± 11%  softirqs.CPU266.TIMER
>>      13688           +10.9%      15180 ±  6%  softirqs.CPU267.SCHED
>>      11716 ±  4%     +19.3%      13974 ±  8%  softirqs.CPU27.SCHED
>>      13866 ±  3%     +13.7%      15765 ±  4%  softirqs.CPU271.SCHED
>>      13887 ±  5%     +12.5%      15621        softirqs.CPU272.SCHED
>>      13383 ±  3%     +19.8%      16031 ±  2%  softirqs.CPU274.SCHED
>>      13347           +14.1%      15232 ±  3%  softirqs.CPU275.SCHED
>>      12884 ±  2%     +21.0%      15593 ±  4%  softirqs.CPU276.SCHED
>>      13131 ±  5%     +13.4%      14891 ±  5%  softirqs.CPU277.SCHED
>>      12891 ±  2%     +19.2%      15371 ±  4%  softirqs.CPU278.SCHED
>>      13313 ±  4%     +13.0%      15049 ±  2%  softirqs.CPU279.SCHED
>>      13514 ±  3%     +10.2%      14897 ±  2%  softirqs.CPU280.SCHED
>>      13501 ±  3%     +13.7%      15346        softirqs.CPU281.SCHED
>>      13261           +17.5%      15577        softirqs.CPU282.SCHED
>>       8076 ± 15%     -43.7%       4546 ±  5%  softirqs.CPU283.RCU
>>      13686 ±  3%     +12.6%      15413 ±  2%  softirqs.CPU284.SCHED
>>      13439 ±  2%      +9.2%      14670 ±  4%  softirqs.CPU285.SCHED
>>       8878 ±  9%     -35.4%       5735 ±  4%  softirqs.CPU35.RCU
>>      11690 ±  2%     +13.6%      13274 ±  5%  softirqs.CPU40.SCHED
>>      11714 ±  2%     +19.3%      13975 ± 13%  softirqs.CPU41.SCHED
>>      11763           +12.5%      13239 ±  4%  softirqs.CPU45.SCHED
>>      11662 ±  2%      +9.4%      12757 ±  3%  softirqs.CPU46.SCHED
>>      11805 ±  2%      +9.3%      12902 ±  2%  softirqs.CPU50.SCHED
>>      12158 ±  3%     +12.3%      13655 ±  8%  softirqs.CPU55.SCHED
>>      11716 ±  4%      +8.8%      12751 ±  3%  softirqs.CPU58.SCHED
>>      11922 ±  2%      +9.9%      13100 ±  4%  softirqs.CPU64.SCHED
>>       9674 ± 17%     -41.8%       5625 ±  6%  softirqs.CPU66.RCU
>>      11818           +12.0%      13237        softirqs.CPU66.SCHED
>>     124682 ±  7%      -6.1%     117088 ±  5%  softirqs.CPU66.TIMER
>>       8637 ±  9%     -34.0%       5700 ±  7%  softirqs.CPU70.RCU
>>      11624 ±  2%     +11.0%      12901 ±  2%  softirqs.CPU70.SCHED
>>      12372 ±  2%     +13.2%      14003 ±  3%  softirqs.CPU71.SCHED
>>       9949 ± 25%     -33.9%       6574 ± 31%  softirqs.CPU72.RCU
>>      10392 ± 26%     -35.1%       6745 ± 35%  softirqs.CPU73.RCU
>>      12766 ±  3%     +11.1%      14188 ±  3%  softirqs.CPU76.SCHED
>>      12611 ±  2%     +18.8%      14984 ±  5%  softirqs.CPU78.SCHED
>>      12786 ±  3%     +17.9%      15079 ±  7%  softirqs.CPU79.SCHED
>>      11947 ±  4%      +9.7%      13103 ±  4%  softirqs.CPU8.SCHED
>>      13379 ±  7%     +11.8%      14962 ±  4%  softirqs.CPU83.SCHED
>>      13438 ±  5%      +9.7%      14738 ±  2%  softirqs.CPU84.SCHED
>>      12768           +19.4%      15241 ±  6%  softirqs.CPU88.SCHED
>>       8604 ± 13%     -39.3%       5222 ±  3%  softirqs.CPU89.RCU
>>      13077 ±  2%     +17.1%      15308 ±  7%  softirqs.CPU89.SCHED
>>      11887 ±  3%     +20.1%      14272 ±  5%  softirqs.CPU9.SCHED
>>      12723 ±  3%     +11.3%      14165 ±  4%  softirqs.CPU90.SCHED
>>       8439 ± 12%     -38.9%       5153 ±  4%  softirqs.CPU91.RCU
>>      13429 ±  3%     +10.3%      14806 ±  2%  softirqs.CPU95.SCHED
>>      12852 ±  4%     +10.3%      14174 ±  5%  softirqs.CPU96.SCHED
>>      13010 ±  2%     +14.4%      14888 ±  5%  softirqs.CPU97.SCHED
>>    2315644 ±  4%     -36.2%    1477200 ±  4%  softirqs.RCU
>>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.NMI:Non-maskable_interrupts
>>       1572 ± 10%     +63.9%       2578 ± 39%  interrupts.CPU0.PMI:Performance_monitoring_interrupts
>>     252.00 ± 11%     -35.2%     163.25 ± 13%  interrupts.CPU104.RES:Rescheduling_interrupts
>>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.NMI:Non-maskable_interrupts
>>       2738 ± 24%     +52.4%       4173 ± 19%  interrupts.CPU105.PMI:Performance_monitoring_interrupts
>>     245.75 ± 19%     -31.0%     169.50 ±  7%  interrupts.CPU105.RES:Rescheduling_interrupts
>>     228.75 ± 13%     -24.7%     172.25 ± 19%  interrupts.CPU106.RES:Rescheduling_interrupts
>>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.NMI:Non-maskable_interrupts
>>       2243 ± 15%     +66.3%       3730 ± 35%  interrupts.CPU113.PMI:Performance_monitoring_interrupts
>>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.NMI:Non-maskable_interrupts
>>       2703 ± 31%     +67.0%       4514 ± 33%  interrupts.CPU118.PMI:Performance_monitoring_interrupts
>>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.NMI:Non-maskable_interrupts
>>       2613 ± 25%     +42.2%       3715 ± 24%  interrupts.CPU121.PMI:Performance_monitoring_interrupts
>>     311.50 ± 23%     -47.7%     163.00 ±  9%  interrupts.CPU122.RES:Rescheduling_interrupts
>>     266.75 ± 19%     -31.6%     182.50 ± 15%  interrupts.CPU124.RES:Rescheduling_interrupts
>>     293.75 ± 33%     -32.3%     198.75 ± 19%  interrupts.CPU125.RES:Rescheduling_interrupts
>>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.NMI:Non-maskable_interrupts
>>       2601 ± 36%     +43.2%       3724 ± 29%  interrupts.CPU127.PMI:Performance_monitoring_interrupts
>>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.NMI:Non-maskable_interrupts
>>       2258 ± 21%     +68.2%       3797 ± 29%  interrupts.CPU13.PMI:Performance_monitoring_interrupts
>>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.NMI:Non-maskable_interrupts
>>       3338 ± 29%     +54.6%       5160 ±  9%  interrupts.CPU139.PMI:Performance_monitoring_interrupts
>>     219.50 ± 27%     -23.0%     169.00 ± 21%  interrupts.CPU139.RES:Rescheduling_interrupts
>>     290.25 ± 25%     -32.5%     196.00 ± 11%  interrupts.CPU14.RES:Rescheduling_interrupts
>>     243.50 ±  4%     -16.0%     204.50 ± 12%  interrupts.CPU140.RES:Rescheduling_interrupts
>>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.NMI:Non-maskable_interrupts
>>       1797 ± 15%    +135.0%       4223 ± 46%  interrupts.CPU147.PMI:Performance_monitoring_interrupts
>>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.NMI:Non-maskable_interrupts
>>       2537 ± 22%     +89.6%       4812 ± 28%  interrupts.CPU15.PMI:Performance_monitoring_interrupts
>>     292.25 ± 34%     -33.9%     193.25 ±  6%  interrupts.CPU15.RES:Rescheduling_interrupts
>>     424.25 ± 37%     -58.5%     176.25 ± 14%  interrupts.CPU158.RES:Rescheduling_interrupts
>>     312.50 ± 42%     -54.2%     143.00 ± 18%  interrupts.CPU159.RES:Rescheduling_interrupts
>>     725.00 ±118%     -75.7%     176.25 ± 14%  interrupts.CPU163.RES:Rescheduling_interrupts
>>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.NMI:Non-maskable_interrupts
>>       2367 ±  6%     +59.9%       3786 ± 24%  interrupts.CPU177.PMI:Performance_monitoring_interrupts
>>     239.50 ± 30%     -46.6%     128.00 ± 14%  interrupts.CPU179.RES:Rescheduling_interrupts
>>     320.75 ± 15%     -24.0%     243.75 ± 20%  interrupts.CPU20.RES:Rescheduling_interrupts
>>     302.50 ± 17%     -47.2%     159.75 ±  8%  interrupts.CPU200.RES:Rescheduling_interrupts
>>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.NMI:Non-maskable_interrupts
>>       2166 ±  5%     +92.0%       4157 ± 40%  interrupts.CPU207.PMI:Performance_monitoring_interrupts
>>     217.00 ± 11%     -34.6%     142.00 ± 12%  interrupts.CPU214.RES:Rescheduling_interrupts
>>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.NMI:Non-maskable_interrupts
>>       2610 ± 36%     +47.4%       3848 ± 35%  interrupts.CPU215.PMI:Performance_monitoring_interrupts
>>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.NMI:Non-maskable_interrupts
>>       2046 ± 13%    +118.6%       4475 ± 43%  interrupts.CPU22.PMI:Performance_monitoring_interrupts
>>     289.50 ± 28%     -41.1%     170.50 ±  8%  interrupts.CPU22.RES:Rescheduling_interrupts
>>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.NMI:Non-maskable_interrupts
>>       2232 ±  6%     +33.0%       2970 ± 24%  interrupts.CPU221.PMI:Performance_monitoring_interrupts
>>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.NMI:Non-maskable_interrupts
>>       4552 ± 12%     -27.6%       3295 ± 15%  interrupts.CPU222.PMI:Performance_monitoring_interrupts
>>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.NMI:Non-maskable_interrupts
>>       2013 ± 15%     +80.9%       3641 ± 27%  interrupts.CPU226.PMI:Performance_monitoring_interrupts
>>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.NMI:Non-maskable_interrupts
>>       2575 ± 49%     +67.1%       4302 ± 34%  interrupts.CPU227.PMI:Performance_monitoring_interrupts
>>     248.00 ± 36%     -36.3%     158.00 ± 19%  interrupts.CPU228.RES:Rescheduling_interrupts
>>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.NMI:Non-maskable_interrupts
>>       2441 ± 24%     +43.0%       3490 ± 30%  interrupts.CPU23.PMI:Performance_monitoring_interrupts
>>     404.25 ± 69%     -65.5%     139.50 ± 17%  interrupts.CPU236.RES:Rescheduling_interrupts
>>     566.50 ± 40%     -73.6%     149.50 ± 31%  interrupts.CPU237.RES:Rescheduling_interrupts
>>     243.50 ± 26%     -37.1%     153.25 ± 21%  interrupts.CPU248.RES:Rescheduling_interrupts
>>     258.25 ± 12%     -53.5%     120.00 ± 18%  interrupts.CPU249.RES:Rescheduling_interrupts
>>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.NMI:Non-maskable_interrupts
>>       2888 ± 27%     +49.4%       4313 ± 30%  interrupts.CPU253.PMI:Performance_monitoring_interrupts
>>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.NMI:Non-maskable_interrupts
>>       2468 ± 44%     +67.3%       4131 ± 37%  interrupts.CPU256.PMI:Performance_monitoring_interrupts
>>     425.00 ± 59%     -60.3%     168.75 ± 34%  interrupts.CPU258.RES:Rescheduling_interrupts
>>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.NMI:Non-maskable_interrupts
>>       1859 ± 16%    +106.3%       3834 ± 44%  interrupts.CPU268.PMI:Performance_monitoring_interrupts
>>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.NMI:Non-maskable_interrupts
>>       2684 ± 28%     +61.2%       4326 ± 36%  interrupts.CPU269.PMI:Performance_monitoring_interrupts
>>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.NMI:Non-maskable_interrupts
>>       2171 ±  6%    +108.8%       4533 ± 20%  interrupts.CPU270.PMI:Performance_monitoring_interrupts
>>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.NMI:Non-maskable_interrupts
>>       2262 ± 14%     +61.8%       3659 ± 37%  interrupts.CPU273.PMI:Performance_monitoring_interrupts
>>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.NMI:Non-maskable_interrupts
>>       2203 ± 11%     +50.7%       3320 ± 38%  interrupts.CPU279.PMI:Performance_monitoring_interrupts
>>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.NMI:Non-maskable_interrupts
>>       2433 ± 17%     +52.9%       3721 ± 25%  interrupts.CPU280.PMI:Performance_monitoring_interrupts
>>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.NMI:Non-maskable_interrupts
>>       2778 ± 33%     +63.1%       4531 ± 36%  interrupts.CPU283.PMI:Performance_monitoring_interrupts
>>     331.75 ± 32%     -39.8%     199.75 ± 17%  interrupts.CPU29.RES:Rescheduling_interrupts
>>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.NMI:Non-maskable_interrupts
>>       2178 ± 22%     +53.9%       3353 ± 31%  interrupts.CPU3.PMI:Performance_monitoring_interrupts
>>     298.50 ± 30%     -39.7%     180.00 ±  6%  interrupts.CPU34.RES:Rescheduling_interrupts
>>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.NMI:Non-maskable_interrupts
>>       2490 ±  3%     +58.7%       3953 ± 28%  interrupts.CPU35.PMI:Performance_monitoring_interrupts
>>     270.50 ± 24%     -31.1%     186.25 ±  3%  interrupts.CPU36.RES:Rescheduling_interrupts
>>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.NMI:Non-maskable_interrupts
>>       2493 ±  7%     +57.0%       3915 ± 27%  interrupts.CPU43.PMI:Performance_monitoring_interrupts
>>     286.75 ± 36%     -32.4%     193.75 ±  7%  interrupts.CPU45.RES:Rescheduling_interrupts
>>     259.00 ± 12%     -23.6%     197.75 ± 13%  interrupts.CPU46.RES:Rescheduling_interrupts
>>     244.00 ± 21%     -35.6%     157.25 ± 11%  interrupts.CPU47.RES:Rescheduling_interrupts
>>     230.00 ±  7%     -21.3%     181.00 ± 11%  interrupts.CPU48.RES:Rescheduling_interrupts
>>     281.00 ± 13%     -27.4%     204.00 ± 15%  interrupts.CPU53.RES:Rescheduling_interrupts
>>     256.75 ±  5%     -18.4%     209.50 ± 12%  interrupts.CPU54.RES:Rescheduling_interrupts
>>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.NMI:Non-maskable_interrupts
>>       2433 ±  9%     +68.4%       4098 ± 35%  interrupts.CPU58.PMI:Performance_monitoring_interrupts
>>     316.00 ± 25%     -41.4%     185.25 ± 13%  interrupts.CPU59.RES:Rescheduling_interrupts
>>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.NMI:Non-maskable_interrupts
>>       2703 ± 38%     +56.0%       4217 ± 31%  interrupts.CPU60.PMI:Performance_monitoring_interrupts
>>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.NMI:Non-maskable_interrupts
>>       2425 ± 16%     +39.9%       3394 ± 27%  interrupts.CPU61.PMI:Performance_monitoring_interrupts
>>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.NMI:Non-maskable_interrupts
>>       2388 ± 18%     +69.5%       4047 ± 29%  interrupts.CPU66.PMI:Performance_monitoring_interrupts
>>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.NMI:Non-maskable_interrupts
>>       2322 ± 11%     +93.4%       4491 ± 35%  interrupts.CPU67.PMI:Performance_monitoring_interrupts
>>     319.00 ± 40%     -44.7%     176.25 ±  9%  interrupts.CPU67.RES:Rescheduling_interrupts
>>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.NMI:Non-maskable_interrupts
>>       2512 ±  8%     +28.1%       3219 ± 25%  interrupts.CPU70.PMI:Performance_monitoring_interrupts
>>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.NMI:Non-maskable_interrupts
>>       2290 ± 39%     +78.7%       4094 ± 28%  interrupts.CPU74.PMI:Performance_monitoring_interrupts
>>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.NMI:Non-maskable_interrupts
>>       2446 ± 40%     +94.8%       4764 ± 23%  interrupts.CPU75.PMI:Performance_monitoring_interrupts
>>     426.75 ± 61%     -67.7%     138.00 ±  8%  interrupts.CPU75.RES:Rescheduling_interrupts
>>     192.50 ± 13%     +45.6%     280.25 ± 45%  interrupts.CPU76.RES:Rescheduling_interrupts
>>     274.25 ± 34%     -42.2%     158.50 ± 34%  interrupts.CPU77.RES:Rescheduling_interrupts
>>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.NMI:Non-maskable_interrupts
>>       2357 ±  9%     +73.0%       4078 ± 23%  interrupts.CPU78.PMI:Performance_monitoring_interrupts
>>     348.50 ± 53%     -47.3%     183.75 ± 29%  interrupts.CPU80.RES:Rescheduling_interrupts
>>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.NMI:Non-maskable_interrupts
>>       2650 ± 43%     +46.2%       3874 ± 36%  interrupts.CPU84.PMI:Performance_monitoring_interrupts
>>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.NMI:Non-maskable_interrupts
>>       2235 ± 10%    +117.8%       4867 ± 10%  interrupts.CPU90.PMI:Performance_monitoring_interrupts
>>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.NMI:Non-maskable_interrupts
>>       2606 ± 33%     +38.1%       3598 ± 21%  interrupts.CPU92.PMI:Performance_monitoring_interrupts
>>     408.75 ± 58%     -56.8%     176.75 ± 25%  interrupts.CPU92.RES:Rescheduling_interrupts
>>     399.00 ± 64%     -63.6%     145.25 ± 16%  interrupts.CPU93.RES:Rescheduling_interrupts
>>     314.75 ± 36%     -44.2%     175.75 ± 13%  interrupts.CPU94.RES:Rescheduling_interrupts
>>     191.00 ± 15%     -29.1%     135.50 ±  9%  interrupts.CPU97.RES:Rescheduling_interrupts
>>      94.00 ±  8%     +50.0%     141.00 ± 12%  interrupts.IWI:IRQ_work_interrupts
>>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.NMI:Non-maskable_interrupts
>>     841457 ±  7%     +16.6%     980751 ±  3%  interrupts.PMI:Performance_monitoring_interrupts
>>      12.75 ± 11%      -4.1        8.67 ± 31%  perf-profile.calltrace.cycles-pp.do_rw_once
>>       1.02 ± 16%      -0.6        0.47 ± 59%  perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle
>>       1.10 ± 15%      -0.4        0.66 ± 14%  perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
>>       1.05 ± 16%      -0.4        0.61 ± 14%  perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter
>>       1.58 ±  4%      +0.3        1.91 ±  7%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page
>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>       2.11 ±  4%      +0.5        2.60 ±  7%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault
>>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>>       0.83 ± 26%      +0.5        1.32 ± 18%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
>>       1.90 ±  5%      +0.6        2.45 ±  7%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage
>>       0.65 ± 62%      +0.6        1.20 ± 15%  perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
>>       0.60 ± 62%      +0.6        1.16 ± 18%  perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap
>>       0.95 ± 17%      +0.6        1.52 ±  8%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner
>>       0.61 ± 62%      +0.6        1.18 ± 18%  perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput
>>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit
>>       0.61 ± 62%      +0.6        1.19 ± 19%  perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit
>>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
>>       0.64 ± 61%      +0.6        1.23 ± 18%  perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group
>>       1.30 ±  9%      +0.6        1.92 ±  8%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock
>>       0.19 ±173%      +0.7        0.89 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu
>>       0.19 ±173%      +0.7        0.90 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu
>>       0.00            +0.8        0.77 ± 30%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page
>>       0.00            +0.8        0.78 ± 30%  perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page
>>       0.00            +0.8        0.79 ± 29%  perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
>>       0.82 ± 67%      +0.9        1.72 ± 22%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault
>>       0.84 ± 66%      +0.9        1.74 ± 20%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
>>       2.52 ±  6%      +0.9        3.44 ±  9%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page
>>       0.83 ± 67%      +0.9        1.75 ± 21%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
>>       0.84 ± 66%      +0.9        1.77 ± 20%  perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
>>       1.64 ± 12%      +1.0        2.67 ±  7%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault
>>       1.65 ± 45%      +1.3        2.99 ± 18%  perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
>>       1.74 ± 13%      +1.4        3.16 ±  6%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault
>>       2.56 ± 48%      +2.2        4.81 ± 19%  perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault
>>      12.64 ± 14%      +3.6       16.20 ±  8%  perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault
>>       2.97 ±  7%      +3.8        6.74 ±  9%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow
>>      19.99 ±  9%      +4.1       24.05 ±  6%  perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault
>>       1.37 ± 15%      -0.5        0.83 ± 13%  perf-profile.children.cycles-pp.sched_clock_cpu
>>       1.31 ± 16%      -0.5        0.78 ± 13%  perf-profile.children.cycles-pp.sched_clock
>>       1.29 ± 16%      -0.5        0.77 ± 13%  perf-profile.children.cycles-pp.native_sched_clock
>>       1.80 ±  2%      -0.3        1.47 ± 10%  perf-profile.children.cycles-pp.task_tick_fair
>>       0.73 ±  2%      -0.2        0.54 ± 11%  perf-profile.children.cycles-pp.update_curr
>>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.children.cycles-pp.account_process_tick
>>       0.73 ± 10%      -0.2        0.58 ±  9%  perf-profile.children.cycles-pp.rcu_sched_clock_irq
>>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.children.cycles-pp.__acct_update_integrals
>>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs
>>       0.40 ± 12%      -0.1        0.30 ± 14%  perf-profile.children.cycles-pp.__next_timer_interrupt
>>       0.47 ±  7%      -0.1        0.39 ± 13%  perf-profile.children.cycles-pp.update_rq_clock
>>       0.29 ± 12%      -0.1        0.21 ± 15%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
>>       0.21 ±  7%      -0.1        0.14 ± 12%  perf-profile.children.cycles-pp.account_system_index_time
>>       0.38 ±  2%      -0.1        0.31 ± 12%  perf-profile.children.cycles-pp.timerqueue_add
>>       0.26 ± 11%      -0.1        0.20 ± 13%  perf-profile.children.cycles-pp.find_next_bit
>>       0.23 ± 15%      -0.1        0.17 ± 15%  perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit
>>       0.14 ±  8%      -0.1        0.07 ± 14%  perf-profile.children.cycles-pp.account_user_time
>>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.children.cycles-pp.cpuacct_charge
>>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.irq_work_tick
>>       0.11 ± 13%      -0.0        0.07 ± 25%  perf-profile.children.cycles-pp.tick_sched_do_timer
>>       0.12 ± 10%      -0.0        0.08 ± 15%  perf-profile.children.cycles-pp.get_cpu_device
>>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.children.cycles-pp.raise_softirq
>>       0.12 ±  3%      -0.0        0.09 ±  8%  perf-profile.children.cycles-pp.write
>>       0.11 ± 13%      +0.0        0.14 ±  8%  perf-profile.children.cycles-pp.native_write_msr
>>       0.09 ±  9%      +0.0        0.11 ±  7%  perf-profile.children.cycles-pp.finish_task_switch
>>       0.10 ± 10%      +0.0        0.13 ±  5%  perf-profile.children.cycles-pp.schedule_idle
>>       0.07 ±  6%      +0.0        0.10 ± 12%  perf-profile.children.cycles-pp.__read_nocancel
>>       0.04 ± 58%      +0.0        0.07 ± 15%  perf-profile.children.cycles-pp.__free_pages_ok
>>       0.06 ±  7%      +0.0        0.09 ± 13%  perf-profile.children.cycles-pp.perf_read
>>       0.07            +0.0        0.11 ± 14%  perf-profile.children.cycles-pp.perf_evsel__read_counter
>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.cmd_stat
>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.__run_perf_stat
>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.process_interval
>>       0.07            +0.0        0.11 ± 13%  perf-profile.children.cycles-pp.read_counters
>>       0.07 ± 22%      +0.0        0.11 ± 19%  perf-profile.children.cycles-pp.__handle_mm_fault
>>       0.07 ± 19%      +0.1        0.13 ±  8%  perf-profile.children.cycles-pp.rb_erase
>>       0.03 ±100%      +0.1        0.09 ±  9%  perf-profile.children.cycles-pp.smp_call_function_single
>>       0.01 ±173%      +0.1        0.08 ± 11%  perf-profile.children.cycles-pp.perf_event_read
>>       0.00            +0.1        0.07 ± 13%  perf-profile.children.cycles-pp.__perf_event_read_value
>>       0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.__intel_pmu_enable_all
>>       0.08 ± 17%      +0.1        0.15 ±  8%  perf-profile.children.cycles-pp.native_apic_msr_eoi_write
>>       0.04 ±103%      +0.1        0.13 ± 58%  perf-profile.children.cycles-pp.shmem_getpage_gfp
>>       0.38 ± 14%      +0.1        0.51 ±  6%  perf-profile.children.cycles-pp.run_timer_softirq
>>       0.11 ±  4%      +0.3        0.37 ± 32%  perf-profile.children.cycles-pp.worker_thread
>>       0.20 ±  5%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.ret_from_fork
>>       0.20 ±  4%      +0.3        0.48 ± 25%  perf-profile.children.cycles-pp.kthread
>>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.memcpy_erms
>>       0.00            +0.3        0.29 ± 38%  perf-profile.children.cycles-pp.drm_fb_helper_dirty_work
>>       0.00            +0.3        0.31 ± 37%  perf-profile.children.cycles-pp.process_one_work
>>       0.47 ± 48%      +0.4        0.91 ± 19%  perf-profile.children.cycles-pp.prep_new_huge_page
>>       0.70 ± 29%      +0.5        1.16 ± 18%  perf-profile.children.cycles-pp.free_huge_page
>>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_flush_mmu
>>       0.72 ± 29%      +0.5        1.18 ± 18%  perf-profile.children.cycles-pp.release_pages
>>       0.73 ± 29%      +0.5        1.19 ± 18%  perf-profile.children.cycles-pp.tlb_finish_mmu
>>       0.76 ± 27%      +0.5        1.23 ± 18%  perf-profile.children.cycles-pp.exit_mmap
>>       0.77 ± 27%      +0.5        1.24 ± 18%  perf-profile.children.cycles-pp.mmput
>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.__x64_sys_exit_group
>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_group_exit
>>       0.79 ± 26%      +0.5        1.27 ± 18%  perf-profile.children.cycles-pp.do_exit
>>       1.28 ± 29%      +0.5        1.76 ±  9%  perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
>>       0.77 ± 28%      +0.5        1.26 ± 13%  perf-profile.children.cycles-pp.alloc_fresh_huge_page
>>       1.53 ± 15%      +0.7        2.26 ± 14%  perf-profile.children.cycles-pp.do_syscall_64
>>       1.53 ± 15%      +0.7        2.27 ± 14%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
>>       1.13 ±  3%      +0.9        2.07 ± 14%  perf-profile.children.cycles-pp.interrupt_entry
>>       0.79 ±  9%      +1.0        1.76 ±  5%  perf-profile.children.cycles-pp.perf_event_task_tick
>>       1.71 ± 39%      +1.4        3.08 ± 16%  perf-profile.children.cycles-pp.alloc_surplus_huge_page
>>       2.66 ± 42%      +2.3        4.94 ± 17%  perf-profile.children.cycles-pp.alloc_huge_page
>>       2.89 ± 45%      +2.7        5.54 ± 18%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>>       3.34 ± 35%      +2.7        6.02 ± 17%  perf-profile.children.cycles-pp._raw_spin_lock
>>      12.77 ± 14%      +3.9       16.63 ±  7%  perf-profile.children.cycles-pp.mutex_spin_on_owner
>>      20.12 ±  9%      +4.0       24.16 ±  6%  perf-profile.children.cycles-pp.hugetlb_cow
>>      15.40 ± 10%      -3.6       11.84 ± 28%  perf-profile.self.cycles-pp.do_rw_once
>>       4.02 ±  9%      -1.3        2.73 ± 30%  perf-profile.self.cycles-pp.do_access
>>       2.00 ± 14%      -0.6        1.41 ± 13%  perf-profile.self.cycles-pp.cpuidle_enter_state
>>       1.26 ± 16%      -0.5        0.74 ± 13%  perf-profile.self.cycles-pp.native_sched_clock
>>       0.42 ± 17%      -0.2        0.27 ± 16%  perf-profile.self.cycles-pp.account_process_tick
>>       0.27 ± 19%      -0.2        0.12 ± 17%  perf-profile.self.cycles-pp.timerqueue_del
>>       0.53 ±  3%      -0.1        0.38 ± 11%  perf-profile.self.cycles-pp.update_curr
>>       0.27 ±  6%      -0.1        0.14 ± 14%  perf-profile.self.cycles-pp.__acct_update_integrals
>>       0.27 ± 18%      -0.1        0.16 ± 13%  perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs
>>       0.61 ±  4%      -0.1        0.51 ±  8%  perf-profile.self.cycles-pp.task_tick_fair
>>       0.20 ±  8%      -0.1        0.12 ± 14%  perf-profile.self.cycles-pp.account_system_index_time
>>       0.23 ± 15%      -0.1        0.16 ± 17%  perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit
>>       0.25 ± 11%      -0.1        0.18 ± 14%  perf-profile.self.cycles-pp.find_next_bit
>>       0.10 ± 11%      -0.1        0.03 ±100%  perf-profile.self.cycles-pp.tick_sched_do_timer
>>       0.29            -0.1        0.23 ± 11%  perf-profile.self.cycles-pp.timerqueue_add
>>       0.12 ± 10%      -0.1        0.06 ± 17%  perf-profile.self.cycles-pp.account_user_time
>>       0.22 ± 15%      -0.1        0.16 ±  6%  perf-profile.self.cycles-pp.scheduler_tick
>>       0.17 ±  6%      -0.0        0.12 ± 10%  perf-profile.self.cycles-pp.cpuacct_charge
>>       0.18 ± 20%      -0.0        0.13 ±  3%  perf-profile.self.cycles-pp.irq_work_tick
>>       0.07 ± 13%      -0.0        0.03 ±100%  perf-profile.self.cycles-pp.update_process_times
>>       0.12 ±  7%      -0.0        0.08 ± 15%  perf-profile.self.cycles-pp.get_cpu_device
>>       0.07 ± 11%      -0.0        0.04 ± 58%  perf-profile.self.cycles-pp.raise_softirq
>>       0.12 ± 11%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.tick_nohz_get_sleep_length
>>       0.11 ± 11%      +0.0        0.14 ±  6%  perf-profile.self.cycles-pp.native_write_msr
>>       0.10 ±  5%      +0.1        0.15 ±  8%  perf-profile.self.cycles-pp.__remove_hrtimer
>>       0.07 ± 23%      +0.1        0.13 ±  8%  perf-profile.self.cycles-pp.rb_erase
>>       0.08 ± 17%      +0.1        0.15 ±  7%  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
>>       0.00            +0.1        0.08 ± 10%  perf-profile.self.cycles-pp.smp_call_function_single
>>       0.32 ± 17%      +0.1        0.42 ±  7%  perf-profile.self.cycles-pp.run_timer_softirq
>>       0.22 ±  5%      +0.1        0.34 ±  4%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
>>       0.45 ± 15%      +0.2        0.60 ± 12%  perf-profile.self.cycles-pp.rcu_irq_enter
>>       0.31 ±  8%      +0.2        0.46 ± 16%  perf-profile.self.cycles-pp.irq_enter
>>       0.29 ± 10%      +0.2        0.44 ± 16%  perf-profile.self.cycles-pp.apic_timer_interrupt
>>       0.71 ± 30%      +0.2        0.92 ±  8%  perf-profile.self.cycles-pp.perf_mux_hrtimer_handler
>>       0.00            +0.3        0.28 ± 37%  perf-profile.self.cycles-pp.memcpy_erms
>>       1.12 ±  3%      +0.9        2.02 ± 15%  perf-profile.self.cycles-pp.interrupt_entry
>>       0.79 ±  9%      +0.9        1.73 ±  5%  perf-profile.self.cycles-pp.perf_event_task_tick
>>       2.49 ± 45%      +2.1        4.55 ± 20%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>>      10.95 ± 15%      +2.7       13.61 ±  8%  perf-profile.self.cycles-pp.mutex_spin_on_owner
>>
>>
>>                                                                                 
>>                                vm-scalability.throughput                        
>>                                                                                 
>>   1.6e+07 +-+---------------------------------------------------------------+   
>>           |..+.+    +..+.+..+.+.   +.      +..+.+..+.+..+.+..+.+..+    +    |   
>>   1.4e+07 +-+  :    :  O      O    O                           O            |   
>>   1.2e+07 O-+O O  O O    O  O    O    O O  O  O    O    O    O      O  O O  O   
>>           |     :   :                           O    O    O       O         |   
>>     1e+07 +-+   :  :                                                        |   
>>           |     :  :                                                        |   
>>     8e+06 +-+   :  :                                                        |   
>>           |      : :                                                        |   
>>     6e+06 +-+    : :                                                        |   
>>     4e+06 +-+    : :                                                        |   
>>           |      ::                                                         |   
>>     2e+06 +-+     :                                                         |   
>>           |       :                                                         |   
>>         0 +-+---------------------------------------------------------------+   
>>                                                                                 
>>                                                                                                                                                                 
>>                          vm-scalability.time.minor_page_faults                  
>>                                                                                 
>>   2.5e+06 +-+---------------------------------------------------------------+   
>>           |                                                                 |   
>>           |..+.+    +..+.+..+.+..+.+..+.+..  .+.  .+.+..+.+..+.+..+.+..+    |   
>>     2e+06 +-+  :    :                      +.   +.                          |   
>>           O  O O: O O  O O  O O  O O                    O      O            |   
>>           |     :   :                 O O  O  O O  O O    O  O    O O  O O  O   
>>   1.5e+06 +-+   :  :                                                        |   
>>           |     :  :                                                        |   
>>     1e+06 +-+    : :                                                        |   
>>           |      : :                                                        |   
>>           |      : :                                                        |   
>>    500000 +-+    : :                                                        |   
>>           |       :                                                         |   
>>           |       :                                                         |   
>>         0 +-+---------------------------------------------------------------+   
>>                                                                                 
>>                                                                                                                                                                 
>>                                 vm-scalability.workload                         
>>                                                                                 
>>   3.5e+09 +-+---------------------------------------------------------------+   
>>           | .+.                      .+.+..                        .+..     |   
>>     3e+09 +-+  +    +..+.+..+.+..+.+.      +..+.+..+.+..+.+..+.+..+    +    |   
>>           |    :    :       O O                                O            |   
>>   2.5e+09 O-+O O: O O  O O       O O  O    O            O                   |   
>>           |     :   :                   O     O O  O O    O  O    O O  O O  O   
>>     2e+09 +-+   :  :                                                        |   
>>           |     :  :                                                        |   
>>   1.5e+09 +-+    : :                                                        |   
>>           |      : :                                                        |   
>>     1e+09 +-+    : :                                                        |   
>>           |      : :                                                        |   
>>     5e+08 +-+     :                                                         |   
>>           |       :                                                         |   
>>         0 +-+---------------------------------------------------------------+   
>>                                                                                 
>>                                                                                 
>> [*] bisect-good sample
>> [O] bisect-bad  sample
>>
>>
>>
>> Disclaimer:
>> Results have been estimated based on internal Intel analysis and are provided
>> for informational purposes only. Any difference in system hardware or software
>> design or configuration may affect actual performance.
>>
>>
>> Thanks,
>> Rong Chen
>>
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-04 18:39   ` Thomas Zimmermann
@ 2019-08-05  7:02     ` Feng Tang
  2019-08-05 10:22       ` Thomas Zimmermann
       [not found]       ` <c0c3f387-dc93-3146-788c-23258b28a015@intel.com>
  0 siblings, 2 replies; 61+ messages in thread
From: Feng Tang @ 2019-08-05  7:02 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Stephen Rothwell, kernel test robot, michel, dri-devel, ying.huang, lkp

Hi Thomas,

On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
> Hi
> 
> I did some further analysis on this problem and found that the blinking
> cursor affects performance of the vm-scalability test case.
> 
> I only have a 4-core machine, so scalability is not really testable. Yet
> I see the effects of running vm-scalibility against drm-tip, a revert of
> the mgag200 patch and the vmap fixes that I posted a few days ago.
> 
> After reverting the mgag200 patch, running the test as described in the
> report
> 
>   bin/lkp run job.yaml
> 
> gives results like
> 
>   2019-08-02 19:34:37  ./case-anon-cow-seq-hugetlb
>   2019-08-02 19:34:37  ./usemem --runtime 300 -n 4 --prealloc --prefault
>     -O -U 815395225
>   917319627 bytes / 756534 usecs = 1184110 KB/s
>   917319627 bytes / 764675 usecs = 1171504 KB/s
>   917319627 bytes / 766414 usecs = 1168846 KB/s
>   917319627 bytes / 777990 usecs = 1151454 KB/s
> 
> Running the test against current drm-tip gives slightly worse results,
> such as.
> 
>   2019-08-03 19:17:06  ./case-anon-cow-seq-hugetlb
>   2019-08-03 19:17:06  ./usemem --runtime 300 -n 4 --prealloc --prefault
>     -O -U 815394406
>   917318700 bytes / 871607 usecs = 1027778 KB/s
>   917318700 bytes / 894173 usecs = 1001840 KB/s
>   917318700 bytes / 919694 usecs = 974040 KB/s
>   917318700 bytes / 923341 usecs = 970193 KB/s
> 
> The test puts out roughly one result per second. Strangely, sending the
> output to /dev/null can make results significantly worse.
> 
>   bin/lkp run job.yaml > /dev/null
> 
>   2019-08-03 19:23:04  ./case-anon-cow-seq-hugetlb
>   2019-08-03 19:23:04  ./usemem --runtime 300 -n 4 --prealloc --prefault
>     -O -U 815394406
>   917318700 bytes / 1207358 usecs = 741966 KB/s
>   917318700 bytes / 1210456 usecs = 740067 KB/s
>   917318700 bytes / 1216572 usecs = 736346 KB/s
>   917318700 bytes / 1239152 usecs = 722929 KB/s
> 
> I realized that there's still a blinking cursor on the screen, which I
> disabled with
> 
>   tput civis
> 
> or alternatively
> 
>   echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
> 
> Running the test now gives the original or even better results, such as
> 
>   bin/lkp run job.yaml > /dev/null
> 
>   2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>   2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc --prefault
>     -O -U 815394406
>   917318700 bytes / 659419 usecs = 1358497 KB/s
>   917318700 bytes / 659658 usecs = 1358005 KB/s
>   917318700 bytes / 659916 usecs = 1357474 KB/s
>   917318700 bytes / 660168 usecs = 1356956 KB/s
> 
> Rong, Feng, could you confirm this by disabling the cursor or blinking?

Glad to know this workaround recovers the performance drop. Rong is running
the case to confirm.

Meanwhile I have another finding: I noticed your patch changed the bpp from
24 to 32, so I made a patch to change it back to 24 and ran the case over
the weekend; the -18% regression was reduced to about -5%. Could this
be related?

commit: 
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
  01e75fea0d5 mgag200: restore the depth back to 24

f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5 
---------------- --------------------------- --------------------------- 
     43921 ±  2%     -18.3%      35884            -4.8%      41826        vm-scalability.median
  14889337           -17.5%   12291029            -4.1%   14278574        vm-scalability.throughput
 
commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri Aug 2 15:09:19 2019 +0800

    mgag200: restore the depth back to 24
    
    Signed-off-by: Feng Tang <feng.tang@intel.com>

diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
index a977333..ac8f6c9 100644
--- a/drivers/gpu/drm/mgag200/mgag200_main.c
+++ b/drivers/gpu/drm/mgag200/mgag200_main.c
@@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
 	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
 		dev->mode_config.preferred_depth = 16;
 	else
-		dev->mode_config.preferred_depth = 32;
+		dev->mode_config.preferred_depth = 24;
 	dev->mode_config.prefer_shadow = 1;
 
 	r = mgag200_modeset_init(mdev);
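
As a quick sanity check (assuming the console sits on fb0 and fbset is
available), the effective fbdev format can be read back at runtime:

  cat /sys/class/graphics/fb0/bits_per_pixel
  fbset -i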

Thanks,
Feng

> 
> 
> The difference between mgag200's original fbdev support and generic
> fbdev emulation is generic fbdev's worker task that updates the VRAM
> buffer from the shadow buffer. mgag200 does this immediately, but relies
> on drm_can_sleep(), which is deprecated.
> 
> I think that the worker task interferes with the test case, as the
> worker has been in fbdev emulation since forever and no performance
> regressions have been reported so far.
> 
> 
> So unless there's a report where this problem happens in a real-world
> use case, I'd like to keep code as it is. And apparently there's always
> the workaround of disabling the cursor blinking.
> 
> Best regards
> Thomas
> 
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-05  7:02     ` Feng Tang
@ 2019-08-05 10:22       ` Thomas Zimmermann
  2019-08-05 12:52         ` Feng Tang
       [not found]       ` <c0c3f387-dc93-3146-788c-23258b28a015@intel.com>
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-05 10:22 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, kernel test robot, michel, dri-devel, ying.huang, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 6184 bytes --]

Hi

Am 05.08.19 um 09:02 schrieb Feng Tang:
> Hi Thomas,
> 
> On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> I did some further analysis on this problem and found that the blinking
>> cursor affects performance of the vm-scalability test case.
>>
>> I only have a 4-core machine, so scalability is not really testable. Yet
>> I see the effects of running vm-scalibility against drm-tip, a revert of
>> the mgag200 patch and the vmap fixes that I posted a few days ago.
>>
>> After reverting the mgag200 patch, running the test as described in the
>> report
>>
>>   bin/lkp run job.yaml
>>
>> gives results like
>>
>>   2019-08-02 19:34:37  ./case-anon-cow-seq-hugetlb
>>   2019-08-02 19:34:37  ./usemem --runtime 300 -n 4 --prealloc --prefault
>>     -O -U 815395225
>>   917319627 bytes / 756534 usecs = 1184110 KB/s
>>   917319627 bytes / 764675 usecs = 1171504 KB/s
>>   917319627 bytes / 766414 usecs = 1168846 KB/s
>>   917319627 bytes / 777990 usecs = 1151454 KB/s
>>
>> Running the test against current drm-tip gives slightly worse results,
>> such as.
>>
>>   2019-08-03 19:17:06  ./case-anon-cow-seq-hugetlb
>>   2019-08-03 19:17:06  ./usemem --runtime 300 -n 4 --prealloc --prefault
>>     -O -U 815394406
>>   917318700 bytes / 871607 usecs = 1027778 KB/s
>>   917318700 bytes / 894173 usecs = 1001840 KB/s
>>   917318700 bytes / 919694 usecs = 974040 KB/s
>>   917318700 bytes / 923341 usecs = 970193 KB/s
>>
>> The test puts out roughly one result per second. Strangely, sending the
>> output to /dev/null can make results significantly worse.
>>
>>   bin/lkp run job.yaml > /dev/null
>>
>>   2019-08-03 19:23:04  ./case-anon-cow-seq-hugetlb
>>   2019-08-03 19:23:04  ./usemem --runtime 300 -n 4 --prealloc --prefault
>>     -O -U 815394406
>>   917318700 bytes / 1207358 usecs = 741966 KB/s
>>   917318700 bytes / 1210456 usecs = 740067 KB/s
>>   917318700 bytes / 1216572 usecs = 736346 KB/s
>>   917318700 bytes / 1239152 usecs = 722929 KB/s
>>
>> I realized that there's still a blinking cursor on the screen, which I
>> disabled with
>>
>>   tput civis
>>
>> or alternatively
>>
>>   echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>>
>> Running the the test now gives the original or even better results, such as
>>
>>   bin/lkp run job.yaml > /dev/null
>>
>>   2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>   2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc --prefault
>>     -O -U 815394406
>>   917318700 bytes / 659419 usecs = 1358497 KB/s
>>   917318700 bytes / 659658 usecs = 1358005 KB/s
>>   917318700 bytes / 659916 usecs = 1357474 KB/s
>>   917318700 bytes / 660168 usecs = 1356956 KB/s
>>
>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
> 
> Glad to know this method restored the drop. Rong is running the case.
> 
> While I have another finds, as I noticed your patch changed the bpp from
> 24 to 32, I had a patch to change it back to 24, and run the case in
> the weekend, the -18% regrssion was reduced to about -5%. Could this
> be related?

In the original code, the fbdev console already ran with 32 bpp [1] and
16 bpp was selected for low-end devices. [2][3] The patch only set the
same values for userspace; nothing changed for the console.

Best regards
Thomas

[1]
https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n259
[2]
https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n263
[3]
https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n286

> 
> commit: 
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
>   01e75fea0d5 mgag200: restore the depth back to 24
> 
> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5 
> ---------------- --------------------------- --------------------------- 
>      43921 ±  2%     -18.3%      35884            -4.8%      41826        vm-scalability.median
>   14889337           -17.5%   12291029            -4.1%   14278574        vm-scalability.throughput
>  
> commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
> Author: Feng Tang <feng.tang@intel.com>
> Date:   Fri Aug 2 15:09:19 2019 +0800
> 
>     mgag200: restore the depth back to 24
>     
>     Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
> index a977333..ac8f6c9 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>  	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>  		dev->mode_config.preferred_depth = 16;
>  	else
> -		dev->mode_config.preferred_depth = 32;
> +		dev->mode_config.preferred_depth = 24;>  	dev->mode_config.prefer_shadow = 1;
>  
>  	r = mgag200_modeset_init(mdev);
> 
> Thanks,
> Feng
> 
>>
>>
>> The difference between mgag200's original fbdev support and generic
>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>> on drm_can_sleep(), which is deprecated.
>>
>> I think that the worker task interferes with the test case, as the
>> worker has been in fbdev emulation since forever and no performance
>> regressions have been reported so far.
>>
>>
>> So unless there's a report where this problem happens in a real-world
>> use case, I'd like to keep code as it is. And apparently there's always
>> the workaround of disabling the cursor blinking.
>>
>> Best regards
>> Thomas
>>

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
       [not found]       ` <c0c3f387-dc93-3146-788c-23258b28a015@intel.com>
@ 2019-08-05 10:25         ` Thomas Zimmermann
  2019-08-06 12:59           ` [LKP] " Chen, Rong A
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-05 10:25 UTC (permalink / raw)
  To: Rong Chen, Feng Tang; +Cc: Stephen Rothwell, michel, dri-devel, ying.huang, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 7276 bytes --]

Hi

Am 05.08.19 um 09:28 schrieb Rong Chen:
> Hi,
> 
> On 8/5/19 3:02 PM, Feng Tang wrote:
>> Hi Thomas,
>>
>> On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> I did some further analysis on this problem and found that the blinking
>>> cursor affects performance of the vm-scalability test case.
>>>
>>> I only have a 4-core machine, so scalability is not really testable. Yet
>>> I see the effects of running vm-scalibility against drm-tip, a revert of
>>> the mgag200 patch and the vmap fixes that I posted a few days ago.
>>>
>>> After reverting the mgag200 patch, running the test as described in the
>>> report
>>>
>>>    bin/lkp run job.yaml
>>>
>>> gives results like
>>>
>>>    2019-08-02 19:34:37  ./case-anon-cow-seq-hugetlb
>>>    2019-08-02 19:34:37  ./usemem --runtime 300 -n 4 --prealloc
>>> --prefault
>>>      -O -U 815395225
>>>    917319627 bytes / 756534 usecs = 1184110 KB/s
>>>    917319627 bytes / 764675 usecs = 1171504 KB/s
>>>    917319627 bytes / 766414 usecs = 1168846 KB/s
>>>    917319627 bytes / 777990 usecs = 1151454 KB/s
>>>
>>> Running the test against current drm-tip gives slightly worse results,
>>> such as.
>>>
>>>    2019-08-03 19:17:06  ./case-anon-cow-seq-hugetlb
>>>    2019-08-03 19:17:06  ./usemem --runtime 300 -n 4 --prealloc
>>> --prefault
>>>      -O -U 815394406
>>>    917318700 bytes / 871607 usecs = 1027778 KB/s
>>>    917318700 bytes / 894173 usecs = 1001840 KB/s
>>>    917318700 bytes / 919694 usecs = 974040 KB/s
>>>    917318700 bytes / 923341 usecs = 970193 KB/s
>>>
>>> The test puts out roughly one result per second. Strangely sending the
>>> output to /dev/null can make results significantly worse.
>>>
>>>    bin/lkp run job.yaml > /dev/null
>>>
>>>    2019-08-03 19:23:04  ./case-anon-cow-seq-hugetlb
>>>    2019-08-03 19:23:04  ./usemem --runtime 300 -n 4 --prealloc
>>> --prefault
>>>      -O -U 815394406
>>>    917318700 bytes / 1207358 usecs = 741966 KB/s
>>>    917318700 bytes / 1210456 usecs = 740067 KB/s
>>>    917318700 bytes / 1216572 usecs = 736346 KB/s
>>>    917318700 bytes / 1239152 usecs = 722929 KB/s
>>>
>>> I realized that there's still a blinking cursor on the screen, which I
>>> disabled with
>>>
>>>    tput civis
>>>
>>> or alternatively
>>>
>>>    echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>>>
>>> Running the the test now gives the original or even better results,
>>> such as
>>>
>>>    bin/lkp run job.yaml > /dev/null
>>>
>>>    2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>>    2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc
>>> --prefault
>>>      -O -U 815394406
>>>    917318700 bytes / 659419 usecs = 1358497 KB/s
>>>    917318700 bytes / 659658 usecs = 1358005 KB/s
>>>    917318700 bytes / 659916 usecs = 1357474 KB/s
>>>    917318700 bytes / 660168 usecs = 1356956 KB/s
>>>
>>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
>> Glad to know this method restored the drop. Rong is running the case.
> 
> I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for
> both commits,
> and the regression has no obvious change.

Ah, I see. Thank you for testing. There are two questions that come to
my mind: did you send the regular output to /dev/null? And what happens
if you disable the cursor with 'tput civis'?

If there is absolutely nothing changing on the screen, I don't see how
the regression could persist.

Best regards
Thomas


> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
> 
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> ----------------  -------------------------- ---------------------------
>          %stddev      change         %stddev
>              \          |                \
>      43394             -20%      34575 ±  3%
> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      43393             -20%      34575        GEO-MEAN
> vm-scalability.median
> 
> Best Regards,
> Rong Chen
> 
>>
>> While I have another finds, as I noticed your patch changed the bpp from
>> 24 to 32, I had a patch to change it back to 24, and run the case in
>> the weekend, the -18% regrssion was reduced to about -5%. Could this
>> be related?
>>
>> commit:
>>    f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>    90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>> framebuffer emulation
>>    01e75fea0d5 mgag200: restore the depth back to 24
>>
>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
>> ---------------- --------------------------- ---------------------------
>>       43921 ±  2%     -18.3%      35884            -4.8%     
>> 41826        vm-scalability.median
>>    14889337           -17.5%   12291029            -4.1%  
>> 14278574        vm-scalability.throughput
>>   commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
>> Author: Feng Tang <feng.tang@intel.com>
>> Date:   Fri Aug 2 15:09:19 2019 +0800
>>
>>      mgag200: restore the depth back to 24
>>           Signed-off-by: Feng Tang <feng.tang@intel.com>
>>
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c
>> b/drivers/gpu/drm/mgag200/mgag200_main.c
>> index a977333..ac8f6c9 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev,
>> unsigned long flags)
>>       if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>           dev->mode_config.preferred_depth = 16;
>>       else
>> -        dev->mode_config.preferred_depth = 32;
>> +        dev->mode_config.preferred_depth = 24;
>>       dev->mode_config.prefer_shadow = 1;
>>         r = mgag200_modeset_init(mdev);
>>
>> Thanks,
>> Feng
>>
>>>
>>> The difference between mgag200's original fbdev support and generic
>>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>>> on drm_can_sleep(), which is deprecated.
>>>
>>> I think that the worker task interferes with the test case, as the
>>> worker has been in fbdev emulation since forever and no performance
>>> regressions have been reported so far.
>>>
>>>
>>> So unless there's a report where this problem happens in a real-world
>>> use case, I'd like to keep code as it is. And apparently there's always
>>> the workaround of disabling the cursor blinking.
>>>
>>> Best regards
>>> Thomas
>>>
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-05 10:22       ` Thomas Zimmermann
@ 2019-08-05 12:52         ` Feng Tang
  2020-01-06 13:19           ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Feng Tang @ 2019-08-05 12:52 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Stephen Rothwell, kernel test robot, michel, dri-devel, ying.huang, lkp

Hi Thomas,

On Mon, Aug 05, 2019 at 12:22:11PM +0200, Thomas Zimmermann wrote:

	[snip] 

> >>   2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
> >>   2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc --prefault
> >>     -O -U 815394406
> >>   917318700 bytes / 659419 usecs = 1358497 KB/s
> >>   917318700 bytes / 659658 usecs = 1358005 KB/s
> >>   917318700 bytes / 659916 usecs = 1357474 KB/s
> >>   917318700 bytes / 660168 usecs = 1356956 KB/s
> >>
> >> Rong, Feng, could you confirm this by disabling the cursor or blinking?
> > 
> > Glad to know this method restored the drop. Rong is running the case.
> > 
> > While I have another finds, as I noticed your patch changed the bpp from
> > 24 to 32, I had a patch to change it back to 24, and run the case in
> > the weekend, the -18% regrssion was reduced to about -5%. Could this
> > be related?
> 
> In the original code, the fbdev console already ran with 32 bpp [1] and
> 16 bpp was selected for low-end devices. [2][3] The patch only set the
> same values for userspace; nothing changed for the console.

I did the experiment because I checked the commit 

90f479ae51afa4 drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation

in which there is code:

diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
index b10f726..a977333 100644
--- a/drivers/gpu/drm/mgag200/mgag200_main.c
+++ b/drivers/gpu/drm/mgag200/mgag200_main.c
@@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
 	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
 		dev->mode_config.preferred_depth = 16;
 	else
-		dev->mode_config.preferred_depth = 24;
+		dev->mode_config.preferred_depth = 32;
 	dev->mode_config.prefer_shadow = 1;
 
My debug patch was essentially restoring this part.

Thanks,
Feng

> 
> Best regards
> Thomas
> 
> [1]
> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n259
> [2]
> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n263
> [3]
> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n286
> 
> > 
> > commit: 
> >   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> >   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
> >   01e75fea0d5 mgag200: restore the depth back to 24
> > 
> > f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5 
> > ---------------- --------------------------- --------------------------- 
> >      43921 ±  2%     -18.3%      35884            -4.8%      41826        vm-scalability.median
> >   14889337           -17.5%   12291029            -4.1%   14278574        vm-scalability.throughput
> >  
> > commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
> > Author: Feng Tang <feng.tang@intel.com>
> > Date:   Fri Aug 2 15:09:19 2019 +0800
> > 
> >     mgag200: restore the depth back to 24
> >     
> >     Signed-off-by: Feng Tang <feng.tang@intel.com>
> > 
> > diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
> > index a977333..ac8f6c9 100644
> > --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> > +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> > @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
> >  	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
> >  		dev->mode_config.preferred_depth = 16;
> >  	else
> > -		dev->mode_config.preferred_depth = 32;
> > +		dev->mode_config.preferred_depth = 24;
> >  	dev->mode_config.prefer_shadow = 1;
> >  
> >  	r = mgag200_modeset_init(mdev);
> > 
> > Thanks,
> > Feng
> > 
> >>
> >>
> >> The difference between mgag200's original fbdev support and generic
> >> fbdev emulation is generic fbdev's worker task that updates the VRAM
> >> buffer from the shadow buffer. mgag200 does this immediately, but relies
> >> on drm_can_sleep(), which is deprecated.
> >>
> >> I think that the worker task interferes with the test case, as the
> >> worker has been in fbdev emulation since forever and no performance
> >> regressions have been reported so far.
> >>
> >>
> >> So unless there's a report where this problem happens in a real-world
> >> use case, I'd like to keep code as it is. And apparently there's always
> >> the workaround of disabling the cursor blinking.
> >>
> >> Best regards
> >> Thomas
> >>
> 
> -- 
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
> 



_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-05 10:25         ` Thomas Zimmermann
@ 2019-08-06 12:59           ` Chen, Rong A
  2019-08-07 10:42             ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Chen, Rong A @ 2019-08-06 12:59 UTC (permalink / raw)
  To: Thomas Zimmermann, Feng Tang; +Cc: Stephen Rothwell, michel, dri-devel, lkp


[-- Attachment #1.1: Type: text/plain, Size: 7514 bytes --]

Hi,

On 8/5/2019 6:25 PM, Thomas Zimmermann wrote:
> Hi
>
> Am 05.08.19 um 09:28 schrieb Rong Chen:
>> Hi,
>>
>> On 8/5/19 3:02 PM, Feng Tang wrote:
>>> Hi Thomas,
>>>
>>> On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> I did some further analysis on this problem and found that the blinking
>>>> cursor affects performance of the vm-scalability test case.
>>>>
>>>> I only have a 4-core machine, so scalability is not really testable. Yet
>>>> I see the effects of running vm-scalibility against drm-tip, a revert of
>>>> the mgag200 patch and the vmap fixes that I posted a few days ago.
>>>>
>>>> After reverting the mgag200 patch, running the test as described in the
>>>> report
>>>>
>>>>     bin/lkp run job.yaml
>>>>
>>>> gives results like
>>>>
>>>>     2019-08-02 19:34:37  ./case-anon-cow-seq-hugetlb
>>>>     2019-08-02 19:34:37  ./usemem --runtime 300 -n 4 --prealloc
>>>> --prefault
>>>>       -O -U 815395225
>>>>     917319627 bytes / 756534 usecs = 1184110 KB/s
>>>>     917319627 bytes / 764675 usecs = 1171504 KB/s
>>>>     917319627 bytes / 766414 usecs = 1168846 KB/s
>>>>     917319627 bytes / 777990 usecs = 1151454 KB/s
>>>>
>>>> Running the test against current drm-tip gives slightly worse results,
>>>> such as.
>>>>
>>>>     2019-08-03 19:17:06  ./case-anon-cow-seq-hugetlb
>>>>     2019-08-03 19:17:06  ./usemem --runtime 300 -n 4 --prealloc
>>>> --prefault
>>>>       -O -U 815394406
>>>>     917318700 bytes / 871607 usecs = 1027778 KB/s
>>>>     917318700 bytes / 894173 usecs = 1001840 KB/s
>>>>     917318700 bytes / 919694 usecs = 974040 KB/s
>>>>     917318700 bytes / 923341 usecs = 970193 KB/s
>>>>
>>>> The test puts out roughly one result per second. Strangely sending the
>>>> output to /dev/null can make results significantly worse.
>>>>
>>>>     bin/lkp run job.yaml > /dev/null
>>>>
>>>>     2019-08-03 19:23:04  ./case-anon-cow-seq-hugetlb
>>>>     2019-08-03 19:23:04  ./usemem --runtime 300 -n 4 --prealloc
>>>> --prefault
>>>>       -O -U 815394406
>>>>     917318700 bytes / 1207358 usecs = 741966 KB/s
>>>>     917318700 bytes / 1210456 usecs = 740067 KB/s
>>>>     917318700 bytes / 1216572 usecs = 736346 KB/s
>>>>     917318700 bytes / 1239152 usecs = 722929 KB/s
>>>>
>>>> I realized that there's still a blinking cursor on the screen, which I
>>>> disabled with
>>>>
>>>>     tput civis
>>>>
>>>> or alternatively
>>>>
>>>>     echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>>>>
>>>> Running the the test now gives the original or even better results,
>>>> such as
>>>>
>>>>     bin/lkp run job.yaml > /dev/null
>>>>
>>>>     2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>>>     2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc
>>>> --prefault
>>>>       -O -U 815394406
>>>>     917318700 bytes / 659419 usecs = 1358497 KB/s
>>>>     917318700 bytes / 659658 usecs = 1358005 KB/s
>>>>     917318700 bytes / 659916 usecs = 1357474 KB/s
>>>>     917318700 bytes / 660168 usecs = 1356956 KB/s
>>>>
>>>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
>>> Glad to know this method restored the drop. Rong is running the case.
>> I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for
>> both commits,
>> and the regression has no obvious change.
> Ah, I see. Thank you for testing. There are two questions that come to
> my mind: did you send the regular output to /dev/null? And what happens
> if you disable the cursor with 'tput civis'?

I didn't send the output to /dev/null because we need to collect data
from the output.
Actually, we run the benchmark as a background process; do we need to
disable the cursor and test again?

Best Regards,
Rong Chen

>
> If there is absolutely nothing changing on the screen, I don't see how
> the regression could persist.
>
> Best regards
> Thomas
>
>
>> commit:
>>    f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>    90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>> framebuffer emulation
>>
>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>> ----------------  -------------------------- ---------------------------
>>           %stddev      change         %stddev
>>               \          |                \
>>       43394             -20%      34575 ±  3%
>> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>       43393             -20%      34575        GEO-MEAN
>> vm-scalability.median
>>
>> Best Regards,
>> Rong Chen
>>
>>> While I have another finds, as I noticed your patch changed the bpp from
>>> 24 to 32, I had a patch to change it back to 24, and run the case in
>>> the weekend, the -18% regrssion was reduced to about -5%. Could this
>>> be related?
>>>
>>> commit:
>>>     f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>     90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>>> framebuffer emulation
>>>     01e75fea0d5 mgag200: restore the depth back to 24
>>>
>>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
>>> ---------------- --------------------------- ---------------------------
>>>        43921 ±  2%     -18.3%      35884            -4.8%
>>> 41826        vm-scalability.median
>>>     14889337           -17.5%   12291029            -4.1%
>>> 14278574        vm-scalability.throughput
>>>    commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
>>> Author: Feng Tang <feng.tang@intel.com>
>>> Date:   Fri Aug 2 15:09:19 2019 +0800
>>>
>>>       mgag200: restore the depth back to 24
>>>            Signed-off-by: Feng Tang <feng.tang@intel.com>
>>>
>>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c
>>> b/drivers/gpu/drm/mgag200/mgag200_main.c
>>> index a977333..ac8f6c9 100644
>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev,
>>> unsigned long flags)
>>>        if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>>            dev->mode_config.preferred_depth = 16;
>>>        else
>>> -        dev->mode_config.preferred_depth = 32;
>>> +        dev->mode_config.preferred_depth = 24;
>>>        dev->mode_config.prefer_shadow = 1;
>>>          r = mgag200_modeset_init(mdev);
>>>
>>> Thanks,
>>> Feng
>>>
>>>> The difference between mgag200's original fbdev support and generic
>>>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>>>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>>>> on drm_can_sleep(), which is deprecated.
>>>>
>>>> I think that the worker task interferes with the test case, as the
>>>> worker has been in fbdev emulation since forever and no performance
>>>> regressions have been reported so far.
>>>>
>>>>
>>>> So unless there's a report where this problem happens in a real-world
>>>> use case, I'd like to keep code as it is. And apparently there's always
>>>> the workaround of disabling the cursor blinking.
>>>>
>>>> Best regards
>>>> Thomas
>>>>
>
> _______________________________________________
> LKP mailing list
> LKP@lists.01.org
> https://lists.01.org/mailman/listinfo/lkp


[-- Attachment #1.2: Type: text/html, Size: 8738 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-06 12:59           ` [LKP] " Chen, Rong A
@ 2019-08-07 10:42             ` Thomas Zimmermann
  2019-08-09  8:12               ` Rong Chen
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-07 10:42 UTC (permalink / raw)
  To: Chen, Rong A, Feng Tang; +Cc: Stephen Rothwell, michel, dri-devel, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 8974 bytes --]

Hi Rong

Am 06.08.19 um 14:59 schrieb Chen, Rong A:
> Hi,
> 
> On 8/5/2019 6:25 PM, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 05.08.19 um 09:28 schrieb Rong Chen:
>>> Hi,
>>>
>>> On 8/5/19 3:02 PM, Feng Tang wrote:
>>>> Hi Thomas,
>>>>
>>>> On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
>>>>> Hi
>>>>>
>>>>> I did some further analysis on this problem and found that the blinking
>>>>> cursor affects performance of the vm-scalability test case.
>>>>>
>>>>> I only have a 4-core machine, so scalability is not really testable. Yet
>>>>> I see the effects of running vm-scalibility against drm-tip, a revert of
>>>>> the mgag200 patch and the vmap fixes that I posted a few days ago.
>>>>>
>>>>> After reverting the mgag200 patch, running the test as described in the
>>>>> report
>>>>>
>>>>>    bin/lkp run job.yaml
>>>>>
>>>>> gives results like
>>>>>
>>>>>    2019-08-02 19:34:37  ./case-anon-cow-seq-hugetlb
>>>>>    2019-08-02 19:34:37  ./usemem --runtime 300 -n 4 --prealloc
>>>>> --prefault
>>>>>      -O -U 815395225
>>>>>    917319627 bytes / 756534 usecs = 1184110 KB/s
>>>>>    917319627 bytes / 764675 usecs = 1171504 KB/s
>>>>>    917319627 bytes / 766414 usecs = 1168846 KB/s
>>>>>    917319627 bytes / 777990 usecs = 1151454 KB/s
>>>>>
>>>>> Running the test against current drm-tip gives slightly worse results,
>>>>> such as.
>>>>>
>>>>>    2019-08-03 19:17:06  ./case-anon-cow-seq-hugetlb
>>>>>    2019-08-03 19:17:06  ./usemem --runtime 300 -n 4 --prealloc
>>>>> --prefault
>>>>>      -O -U 815394406
>>>>>    917318700 bytes / 871607 usecs = 1027778 KB/s
>>>>>    917318700 bytes / 894173 usecs = 1001840 KB/s
>>>>>    917318700 bytes / 919694 usecs = 974040 KB/s
>>>>>    917318700 bytes / 923341 usecs = 970193 KB/s
>>>>>
>>>>> The test puts out roughly one result per second. Strangely sending the
>>>>> output to /dev/null can make results significantly worse.
>>>>>
>>>>>    bin/lkp run job.yaml > /dev/null
>>>>>
>>>>>    2019-08-03 19:23:04  ./case-anon-cow-seq-hugetlb
>>>>>    2019-08-03 19:23:04  ./usemem --runtime 300 -n 4 --prealloc
>>>>> --prefault
>>>>>      -O -U 815394406
>>>>>    917318700 bytes / 1207358 usecs = 741966 KB/s
>>>>>    917318700 bytes / 1210456 usecs = 740067 KB/s
>>>>>    917318700 bytes / 1216572 usecs = 736346 KB/s
>>>>>    917318700 bytes / 1239152 usecs = 722929 KB/s
>>>>>
>>>>> I realized that there's still a blinking cursor on the screen, which I
>>>>> disabled with
>>>>>
>>>>>    tput civis
>>>>>
>>>>> or alternatively
>>>>>
>>>>>    echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>>>>>
>>>>> Running the the test now gives the original or even better results,
>>>>> such as
>>>>>
>>>>>    bin/lkp run job.yaml > /dev/null
>>>>>
>>>>>    2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>>>>    2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc
>>>>> --prefault
>>>>>      -O -U 815394406
>>>>>    917318700 bytes / 659419 usecs = 1358497 KB/s
>>>>>    917318700 bytes / 659658 usecs = 1358005 KB/s
>>>>>    917318700 bytes / 659916 usecs = 1357474 KB/s
>>>>>    917318700 bytes / 660168 usecs = 1356956 KB/s
>>>>>
>>>>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
>>>> Glad to know this method restored the drop. Rong is running the case.
>>> I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for
>>> both commits,
>>> and the regression has no obvious change.
>> Ah, I see. Thank you for testing. There are two questions that come to
>> my mind: did you send the regular output to /dev/null? And what happens
>> if you disable the cursor with 'tput civis'?
> 
> I didn't send the output to /dev/null because we need to collect data
> from the output,

You can send it to any file, as long as it doesn't show up on the
console. I also found the latest results in the file result/vm-scalability.


> Actually we run the benchmark as a background process, do we need to
> disable the cursor and test again?

There's a worker thread that updates the display from the shadow buffer.
The blinking cursor periodically triggers the worker thread, but the
actual update is just the size of one character.

The point of the test without output is to see if the regression comes
from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
from the worker thread. If the regression goes away after disabling the
blinking cursor, then the worker thread is the problem. If it already
goes away when there's simply no output from the test, the screen update
is the problem. On my machine I have to disable the blinking cursor, so
I think the worker causes the performance drop.
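
For reference, the worker roughly does the following. This is a
simplified sketch from memory, not the exact drm_fb_helper code;
locking and error handling are left out, and the field names follow the
drm_fb_helper/drm_client structures of that time, so details may differ:

/*
 * Simplified sketch of the generic fbdev dirty worker; illustration
 * only, not the actual implementation.
 */
static void dirty_worker(struct work_struct *work)
{
	struct drm_fb_helper *helper =
		container_of(work, struct drm_fb_helper, dirty_work);
	struct drm_clip_rect clip;
	void *vaddr;

	/* grab and reset the accumulated dirty rectangle (normally under a lock) */
	clip = helper->dirty_clip;
	helper->dirty_clip.x1 = helper->dirty_clip.y1 = ~0;
	helper->dirty_clip.x2 = helper->dirty_clip.y2 = 0;

	if (clip.x1 >= clip.x2 || clip.y1 >= clip.y2)
		return;	/* nothing changed */

	/* map the real framebuffer BO and copy the dirty lines from the shadow */
	vaddr = drm_client_buffer_vmap(helper->buffer);
	if (IS_ERR(vaddr))
		return;
	drm_fb_helper_dirty_blit_real(helper, &clip);	/* memcpy per scanline */
	if (helper->fb->funcs->dirty)
		helper->fb->funcs->dirty(helper->fb, NULL, 0, 0, &clip, 1);
	drm_client_buffer_vunmap(helper->buffer);
}

So a blinking cursor only dirties a character-sized rectangle, but each
tick still pays for scheduling the worker and mapping/unmapping the BO.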

Best regards
Thomas

> 
> Best Regards,
> Rong Chen
> 
>> If there is absolutely nothing changing on the screen, I don't see how
>> the regression could persist.
>>
>> Best regards
>> Thomas
>>
>>
>>> commit:
>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>>> framebuffer emulation
>>>
>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>> ----------------  -------------------------- ---------------------------
>>>          %stddev      change         %stddev
>>>              \          |                \
>>>      43394             -20%      34575 ±  3%
>>> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>      43393             -20%      34575        GEO-MEAN
>>> vm-scalability.median
>>>
>>> Best Regards,
>>> Rong Chen
>>>
>>>> While I have another finds, as I noticed your patch changed the bpp from
>>>> 24 to 32, I had a patch to change it back to 24, and run the case in
>>>> the weekend, the -18% regrssion was reduced to about -5%. Could this
>>>> be related?
>>>>
>>>> commit:
>>>>    f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>    90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>>>> framebuffer emulation
>>>>    01e75fea0d5 mgag200: restore the depth back to 24
>>>>
>>>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
>>>> ---------------- --------------------------- ---------------------------
>>>>       43921 ±  2%     -18.3%      35884            -4.8%     
>>>> 41826        vm-scalability.median
>>>>    14889337           -17.5%   12291029            -4.1%  
>>>> 14278574        vm-scalability.throughput
>>>>   commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
>>>> Author: Feng Tang <feng.tang@intel.com>
>>>> Date:   Fri Aug 2 15:09:19 2019 +0800
>>>>
>>>>      mgag200: restore the depth back to 24
>>>>           Signed-off-by: Feng Tang <feng.tang@intel.com>
>>>>
>>>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> index a977333..ac8f6c9 100644
>>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev,
>>>> unsigned long flags)
>>>>       if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>>>           dev->mode_config.preferred_depth = 16;
>>>>       else
>>>> -        dev->mode_config.preferred_depth = 32;
>>>> +        dev->mode_config.preferred_depth = 24;
>>>>       dev->mode_config.prefer_shadow = 1;
>>>>         r = mgag200_modeset_init(mdev);
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>>> The difference between mgag200's original fbdev support and generic
>>>>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>>>>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>>>>> on drm_can_sleep(), which is deprecated.
>>>>>
>>>>> I think that the worker task interferes with the test case, as the
>>>>> worker has been in fbdev emulation since forever and no performance
>>>>> regressions have been reported so far.
>>>>>
>>>>>
>>>>> So unless there's a report where this problem happens in a real-world
>>>>> use case, I'd like to keep code as it is. And apparently there's always
>>>>> the workaround of disabling the cursor blinking.
>>>>>
>>>>> Best regards
>>>>> Thomas
>>>>>
>>
>> _______________________________________________
>> LKP mailing list
>> LKP@lists.01.org
>> https://lists.01.org/mailman/listinfo/lkp
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-07 10:42             ` Thomas Zimmermann
@ 2019-08-09  8:12               ` Rong Chen
  2019-08-12  7:25                 ` Feng Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Rong Chen @ 2019-08-09  8:12 UTC (permalink / raw)
  To: Thomas Zimmermann, Feng Tang; +Cc: Stephen Rothwell, michel, dri-devel, lkp

Hi,

On 8/7/19 6:42 PM, Thomas Zimmermann wrote:
> Hi Rong
>
> Am 06.08.19 um 14:59 schrieb Chen, Rong A:
>> Hi,
>>
>> On 8/5/2019 6:25 PM, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> Am 05.08.19 um 09:28 schrieb Rong Chen:
>>>> Hi,
>>>>
>>>> On 8/5/19 3:02 PM, Feng Tang wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
>>>>>> Hi
>>>>>>
>>>>>> I did some further analysis on this problem and found that the blinking
>>>>>> cursor affects performance of the vm-scalability test case.
>>>>>>
>>>>>> I only have a 4-core machine, so scalability is not really testable. Yet
>>>>>> I see the effects of running vm-scalibility against drm-tip, a revert of
>>>>>> the mgag200 patch and the vmap fixes that I posted a few days ago.
>>>>>>
>>>>>> After reverting the mgag200 patch, running the test as described in the
>>>>>> report
>>>>>>
>>>>>>     bin/lkp run job.yaml
>>>>>>
>>>>>> gives results like
>>>>>>
>>>>>>     2019-08-02 19:34:37  ./case-anon-cow-seq-hugetlb
>>>>>>     2019-08-02 19:34:37  ./usemem --runtime 300 -n 4 --prealloc
>>>>>> --prefault
>>>>>>       -O -U 815395225
>>>>>>     917319627 bytes / 756534 usecs = 1184110 KB/s
>>>>>>     917319627 bytes / 764675 usecs = 1171504 KB/s
>>>>>>     917319627 bytes / 766414 usecs = 1168846 KB/s
>>>>>>     917319627 bytes / 777990 usecs = 1151454 KB/s
>>>>>>
>>>>>> Running the test against current drm-tip gives slightly worse results,
>>>>>> such as.
>>>>>>
>>>>>>     2019-08-03 19:17:06  ./case-anon-cow-seq-hugetlb
>>>>>>     2019-08-03 19:17:06  ./usemem --runtime 300 -n 4 --prealloc
>>>>>> --prefault
>>>>>>       -O -U 815394406
>>>>>>     917318700 bytes / 871607 usecs = 1027778 KB/s
>>>>>>     917318700 bytes / 894173 usecs = 1001840 KB/s
>>>>>>     917318700 bytes / 919694 usecs = 974040 KB/s
>>>>>>     917318700 bytes / 923341 usecs = 970193 KB/s
>>>>>>
>>>>>> The test puts out roughly one result per second. Strangely sending the
>>>>>> output to /dev/null can make results significantly worse.
>>>>>>
>>>>>>     bin/lkp run job.yaml > /dev/null
>>>>>>
>>>>>>     2019-08-03 19:23:04  ./case-anon-cow-seq-hugetlb
>>>>>>     2019-08-03 19:23:04  ./usemem --runtime 300 -n 4 --prealloc
>>>>>> --prefault
>>>>>>       -O -U 815394406
>>>>>>     917318700 bytes / 1207358 usecs = 741966 KB/s
>>>>>>     917318700 bytes / 1210456 usecs = 740067 KB/s
>>>>>>     917318700 bytes / 1216572 usecs = 736346 KB/s
>>>>>>     917318700 bytes / 1239152 usecs = 722929 KB/s
>>>>>>
>>>>>> I realized that there's still a blinking cursor on the screen, which I
>>>>>> disabled with
>>>>>>
>>>>>>     tput civis
>>>>>>
>>>>>> or alternatively
>>>>>>
>>>>>>     echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>>>>>>
>>>>>> Running the the test now gives the original or even better results,
>>>>>> such as
>>>>>>
>>>>>>     bin/lkp run job.yaml > /dev/null
>>>>>>
>>>>>>     2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>>>>>     2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc
>>>>>> --prefault
>>>>>>       -O -U 815394406
>>>>>>     917318700 bytes / 659419 usecs = 1358497 KB/s
>>>>>>     917318700 bytes / 659658 usecs = 1358005 KB/s
>>>>>>     917318700 bytes / 659916 usecs = 1357474 KB/s
>>>>>>     917318700 bytes / 660168 usecs = 1356956 KB/s
>>>>>>
>>>>>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
>>>>> Glad to know this method restored the drop. Rong is running the case.
>>>> I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for
>>>> both commits,
>>>> and the regression has no obvious change.
>>> Ah, I see. Thank you for testing. There are two questions that come to
>>> my mind: did you send the regular output to /dev/null? And what happens
>>> if you disable the cursor with 'tput civis'?
>> I didn't send the output to /dev/null because we need to collect data
>> from the output,
> You can send it to any file, as long as it doesn't show up on the
> console. I also found the latest results in the file result/vm-scalability.
>
>
>> Actually we run the benchmark as a background process, do we need to
>> disable the cursor and test again?
> There's a worker thread that updates the display from the shadow buffer.
> The blinking cursor periodically triggers the worker thread, but the
> actual update is just the size of one character.
>
> The point of the test without output is to see if the regression comes
> from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> from the worker thread. If the regression goes away after disabling the
> blinking cursor, then the worker thread is the problem. If it already
> goes away if there's simply no output from the test, the screen update
> is the problem. On my machine I have to disable the blinking cursor, so
> I think the worker causes the performance drop.

We disabled redirecting stdout/stderr to /dev/kmsg, and the regression 
is gone.

commit:
   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic 
framebuffer emulation

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
----------------  -------------------------- ---------------------------
          %stddev      change         %stddev
              \          |                \
      43785                       44481 
vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
      43785                       44481        GEO-MEAN 
vm-scalability.median

Best Regards,
Rong Chen


>
> Best regards
> Thomas
>
>> Best Regards,
>> Rong Chen
>>
>>> If there is absolutely nothing changing on the screen, I don't see how
>>> the regression could persist.
>>>
>>> Best regards
>>> Thomas
>>>
>>>
>>>> commit:
>>>>    f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>    90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>>>> framebuffer emulation
>>>>
>>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>>> ----------------  -------------------------- ---------------------------
>>>>           %stddev      change         %stddev
>>>>               \          |                \
>>>>       43394             -20%      34575 ±  3%
>>>> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>>       43393             -20%      34575        GEO-MEAN
>>>> vm-scalability.median
>>>>
>>>> Best Regards,
>>>> Rong Chen
>>>>
>>>>> While I have another finds, as I noticed your patch changed the bpp from
>>>>> 24 to 32, I had a patch to change it back to 24, and run the case in
>>>>> the weekend, the -18% regrssion was reduced to about -5%. Could this
>>>>> be related?
>>>>>
>>>>> commit:
>>>>>     f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>>     90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>>>>> framebuffer emulation
>>>>>     01e75fea0d5 mgag200: restore the depth back to 24
>>>>>
>>>>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
>>>>> ---------------- --------------------------- ---------------------------
>>>>>        43921 ±  2%     -18.3%      35884            -4.8%
>>>>> 41826        vm-scalability.median
>>>>>     14889337           -17.5%   12291029            -4.1%
>>>>> 14278574        vm-scalability.throughput
>>>>>    commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
>>>>> Author: Feng Tang <feng.tang@intel.com>
>>>>> Date:   Fri Aug 2 15:09:19 2019 +0800
>>>>>
>>>>>       mgag200: restore the depth back to 24
>>>>>            Signed-off-by: Feng Tang <feng.tang@intel.com>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c
>>>>> b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>>> index a977333..ac8f6c9 100644
>>>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev,
>>>>> unsigned long flags)
>>>>>        if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>>>>            dev->mode_config.preferred_depth = 16;
>>>>>        else
>>>>> -        dev->mode_config.preferred_depth = 32;
>>>>> +        dev->mode_config.preferred_depth = 24;
>>>>>        dev->mode_config.prefer_shadow = 1;
>>>>>          r = mgag200_modeset_init(mdev);
>>>>>
>>>>> Thanks,
>>>>> Feng
>>>>>
>>>>>> The difference between mgag200's original fbdev support and generic
>>>>>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>>>>>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>>>>>> on drm_can_sleep(), which is deprecated.
>>>>>>
>>>>>> I think that the worker task interferes with the test case, as the
>>>>>> worker has been in fbdev emulation since forever and no performance
>>>>>> regressions have been reported so far.
>>>>>>
>>>>>>
>>>>>> So unless there's a report where this problem happens in a real-world
>>>>>> use case, I'd like to keep code as it is. And apparently there's always
>>>>>> the workaround of disabling the cursor blinking.
>>>>>>
>>>>>> Best regards
>>>>>> Thomas
>>>>>>
>>> _______________________________________________
>>> LKP mailing list
>>> LKP@lists.01.org
>>> https://lists.01.org/mailman/listinfo/lkp

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-09  8:12               ` Rong Chen
@ 2019-08-12  7:25                 ` Feng Tang
  2019-08-13  9:36                   ` Feng Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Feng Tang @ 2019-08-12  7:25 UTC (permalink / raw)
  To: Rong Chen; +Cc: Stephen Rothwell, Thomas Zimmermann, michel, dri-devel, lkp

Hi Thomas,

On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> Hi,
> 
> >>Actually we run the benchmark as a background process, do we need to
> >>disable the cursor and test again?
> >There's a worker thread that updates the display from the shadow buffer.
> >The blinking cursor periodically triggers the worker thread, but the
> >actual update is just the size of one character.
> >
> >The point of the test without output is to see if the regression comes
> >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> >from the worker thread. If the regression goes away after disabling the
> >blinking cursor, then the worker thread is the problem. If it already
> >goes away if there's simply no output from the test, the screen update
> >is the problem. On my machine I have to disable the blinking cursor, so
> >I think the worker causes the performance drop.
> 
> We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> gone.
> 
> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> emulation
> 
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> ----------------  -------------------------- ---------------------------
>          %stddev      change         %stddev
>              \          |                \
>      43785                       44481
> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      43785                       44481        GEO-MEAN vm-scalability.median

So far, from Rong's tests:
1. Disabling cursor blinking doesn't cure the regression.
2. Disabling printing of the test results to the console works around
the regression.

Also, if we set prefer_shadow to 0, the regression is gone.

--- a/drivers/gpu/drm/mgag200/mgag200_main.c
+++ b/drivers/gpu/drm/mgag200/mgag200_main.c
@@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
 		dev->mode_config.preferred_depth = 16;
 	else
 		dev->mode_config.preferred_depth = 32;
-	dev->mode_config.prefer_shadow = 1;
+	dev->mode_config.prefer_shadow = 0;

And from the perf data, one obvious difference is that the good case
doesn't call drm_fb_helper_dirty_work(), while the bad case does.

Thanks,
Feng

> Best Regards,
> Rong Chen
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-12  7:25                 ` Feng Tang
@ 2019-08-13  9:36                   ` Feng Tang
  2019-08-16  6:55                     ` Feng Tang
  2019-08-22 17:25                     ` Thomas Zimmermann
  0 siblings, 2 replies; 61+ messages in thread
From: Feng Tang @ 2019-08-13  9:36 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Rong Chen, Stephen Rothwell, michel, dri-devel,
	Noralf Trønnes, Daniel Vetter, lkp, linux-kernel,
	ying.huang

Hi Thomas, 

On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> Hi Thomas,
> 
> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> > Hi,
> > 
> > >>Actually we run the benchmark as a background process, do we need to
> > >>disable the cursor and test again?
> > >There's a worker thread that updates the display from the shadow buffer.
> > >The blinking cursor periodically triggers the worker thread, but the
> > >actual update is just the size of one character.
> > >
> > >The point of the test without output is to see if the regression comes
> > >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> > >from the worker thread. If the regression goes away after disabling the
> > >blinking cursor, then the worker thread is the problem. If it already
> > >goes away if there's simply no output from the test, the screen update
> > >is the problem. On my machine I have to disable the blinking cursor, so
> > >I think the worker causes the performance drop.
> > 
> > We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> > gone.
> > 
> > commit:
> >   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> >   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> > emulation
> > 
> > f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> > ----------------  -------------------------- ---------------------------
> >          %stddev      change         %stddev
> >              \          |                \
> >      43785                       44481
> > vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> >      43785                       44481        GEO-MEAN vm-scalability.median
> 
> Till now, from Rong's tests:
> 1. Disabling cursor blinking doesn't cure the regression.
> 2. Disabling printint test results to console can workaround the
> regression.
> 
> Also if we set the perfer_shadown to 0, the regression is also
> gone.

We also did some further breakdown of the time consumed by the
new code.

drm_fb_helper_dirty_work() sequentially calls:
1. drm_client_buffer_vmap	  (290 us)
2. drm_fb_helper_dirty_blit_real  (19240 us)
3. helper->fb->funcs->dirty()    ---> NULL for mgag200 driver
4. drm_client_buffer_vunmap       (215 us)

The average run time is listed after the function names.

From it, we can see that drm_fb_helper_dirty_blit_real() takes too long
(about 20 ms for each run). I guess this is the root cause
of this regression, as the original code doesn't use this dirty worker.
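
For context, the blit itself is basically one memcpy per scanline from
the shadow buffer into the vmap'ed VRAM buffer, roughly like the sketch
below. This is a simplified illustration, not the exact
drm_fb_helper_dirty_blit_real(); the field names are taken from the
drm_fb_helper/drm_client structures of that time and may not be exact:

/*
 * Simplified sketch of the shadow-to-VRAM blit: one memcpy per
 * scanline of the dirty rectangle, into the vmap'ed BO.
 */
static void dirty_blit(struct drm_fb_helper *fb_helper,
		       struct drm_clip_rect *clip)
{
	struct drm_framebuffer *fb = fb_helper->fb;
	unsigned int cpp = fb->format->cpp[0];
	size_t offset = clip->y1 * fb->pitches[0] + clip->x1 * cpp;
	void *src = fb_helper->fbdev->screen_buffer + offset;	/* shadow (system RAM) */
	void *dst = fb_helper->buffer->vaddr + offset;		/* BO mapping (VRAM) */
	size_t len = (clip->x2 - clip->x1) * cpp;
	unsigned int y;

	for (y = clip->y1; y < clip->y2; y++) {
		memcpy(dst, src, len);	/* write to VRAM, line by line */
		src += fb->pitches[0];
		dst += fb->pitches[0];
	}
}

For a console-sized dirty rectangle this means megabytes of data written
to VRAM over the bus on every run, which could explain the ~20 ms.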

As said in the last email, setting prefer_shadow to 0 avoids
the regression. Could that be an option?

Thanks,
Feng

> 
> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>  		dev->mode_config.preferred_depth = 16;
>  	else
>  		dev->mode_config.preferred_depth = 32;
> -	dev->mode_config.prefer_shadow = 1;
> +	dev->mode_config.prefer_shadow = 0;
> 
> And from the perf data, one obvious difference is good case don't
> call drm_fb_helper_dirty_work(), while bad case calls.
> 
> Thanks,
> Feng
> 
> > Best Regards,
> > Rong Chen

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-13  9:36                   ` Feng Tang
@ 2019-08-16  6:55                     ` Feng Tang
  2019-08-22 17:25                     ` Thomas Zimmermann
  1 sibling, 0 replies; 61+ messages in thread
From: Feng Tang @ 2019-08-16  6:55 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Stephen Rothwell, michel, linux-kernel, dri-devel,
	Noralf Trønnes, Daniel Vetter, lkp

Hi Thomas,

On Tue, Aug 13, 2019 at 05:36:16PM +0800, Feng Tang wrote:
> Hi Thomas, 
> 
> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> > Hi Thomas,
> > 
> > On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> > > Hi,
> > > 
> > > >>Actually we run the benchmark as a background process, do we need to
> > > >>disable the cursor and test again?
> > > >There's a worker thread that updates the display from the shadow buffer.
> > > >The blinking cursor periodically triggers the worker thread, but the
> > > >actual update is just the size of one character.
> > > >
> > > >The point of the test without output is to see if the regression comes
> > > >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> > > >from the worker thread. If the regression goes away after disabling the
> > > >blinking cursor, then the worker thread is the problem. If it already
> > > >goes away if there's simply no output from the test, the screen update
> > > >is the problem. On my machine I have to disable the blinking cursor, so
> > > >I think the worker causes the performance drop.
> > > 
> > > We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> > > gone.
> > > 
> > > commit:
> > >   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> > >   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> > > emulation
> > > 
> > > f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> > > ----------------  -------------------------- ---------------------------
> > >          %stddev      change         %stddev
> > >              \          |                \
> > >      43785                       44481
> > > vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> > >      43785                       44481        GEO-MEAN vm-scalability.median
> > 
> > Till now, from Rong's tests:
> > 1. Disabling cursor blinking doesn't cure the regression.
> > 2. Disabling printint test results to console can workaround the
> > regression.
> > 
> > Also if we set the perfer_shadown to 0, the regression is also
> > gone.
> 
> We also did some further break down for the time consumed by the
> new code.
> 
> The drm_fb_helper_dirty_work() calls sequentially 
> 1. drm_client_buffer_vmap	  (290 us)
> 2. drm_fb_helper_dirty_blit_real  (19240 us)
> 3. helper->fb->funcs->dirty()    ---> NULL for mgag200 driver
> 4. drm_client_buffer_vunmap       (215 us)
> 
> The average run time is listed after the function names.
> 
> From it, we can see drm_fb_helper_dirty_blit_real() takes too long
> time (about 20ms for each run). I guess this is the root cause
> of this regression, as the original code doesn't use this dirty worker.
> 
> As said in last email, setting the prefer_shadow to 0 can avoid
> the regrssion. Could it be an option?

Any comments on this? Thanks.

- Feng

> 
> Thanks,
> Feng
> 
> > 
> > --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> > +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> > @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
> >  		dev->mode_config.preferred_depth = 16;
> >  	else
> >  		dev->mode_config.preferred_depth = 32;
> > -	dev->mode_config.prefer_shadow = 1;
> > +	dev->mode_config.prefer_shadow = 0;
> > 
> > And from the perf data, one obvious difference is good case don't
> > call drm_fb_helper_dirty_work(), while bad case calls.
> > 
> > Thanks,
> > Feng
> > 
> > > Best Regards,
> > > Rong Chen
> _______________________________________________
> LKP mailing list
> LKP@lists.01.org
> https://lists.01.org/mailman/listinfo/lkp

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-13  9:36                   ` Feng Tang
  2019-08-16  6:55                     ` Feng Tang
@ 2019-08-22 17:25                     ` Thomas Zimmermann
  2019-08-22 20:02                       ` Dave Airlie
  2019-08-24  5:16                       ` Feng Tang
  1 sibling, 2 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-22 17:25 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, Rong Chen, michel, linux-kernel, dri-devel,
	ying.huang, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 4618 bytes --]

Hi

I was traveling and couldn't reply earlier. Sorry for taking so long.

Am 13.08.19 um 11:36 schrieb Feng Tang:
> Hi Thomas, 
> 
> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>> Hi Thomas,
>>
>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>> Hi,
>>>
>>>>> Actually we run the benchmark as a background process, do we need to
>>>>> disable the cursor and test again?
>>>> There's a worker thread that updates the display from the shadow buffer.
>>>> The blinking cursor periodically triggers the worker thread, but the
>>>> actual update is just the size of one character.
>>>>
>>>> The point of the test without output is to see if the regression comes
>>> >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
>>> >from the worker thread. If the regression goes away after disabling the
>>>> blinking cursor, then the worker thread is the problem. If it already
>>>> goes away if there's simply no output from the test, the screen update
>>>> is the problem. On my machine I have to disable the blinking cursor, so
>>>> I think the worker causes the performance drop.
>>>
>>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
>>> gone.
>>>
>>> commit:
>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
>>> emulation
>>>
>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>> ----------------  -------------------------- ---------------------------
>>>          %stddev      change         %stddev
>>>              \          |                \
>>>      43785                       44481
>>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>      43785                       44481        GEO-MEAN vm-scalability.median
>>
>> Till now, from Rong's tests:
>> 1. Disabling cursor blinking doesn't cure the regression.
>> 2. Disabling printint test results to console can workaround the
>> regression.
>>
>> Also if we set the perfer_shadown to 0, the regression is also
>> gone.
> 
> We also did some further break down for the time consumed by the
> new code.
> 
> The drm_fb_helper_dirty_work() calls sequentially 
> 1. drm_client_buffer_vmap	  (290 us)
> 2. drm_fb_helper_dirty_blit_real  (19240 us)
> 3. helper->fb->funcs->dirty()    ---> NULL for mgag200 driver
> 4. drm_client_buffer_vunmap       (215 us)
>

It's somewhat different to what I observed, but maybe I just couldn't
reproduce the problem correctly.

> The average run time is listed after the function names.
> 
> From it, we can see drm_fb_helper_dirty_blit_real() takes too long
> time (about 20ms for each run). I guess this is the root cause
> of this regression, as the original code doesn't use this dirty worker.

True, the original code uses a temporary buffer, but updates the display
immediately.

My guess is that this could be a caching problem. The worker runs on a
different CPU, which doesn't have the shadow buffer in cache.

> As said in last email, setting the prefer_shadow to 0 can avoid
> the regrssion. Could it be an option?

Unfortunately not. Without the shadow buffer, the console's display
buffer permanently resides in video memory. It consumes a significant
amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
enough room for anything else.
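
(As a rough, illustrative calculation - the display mode and bit depth
below are assumed, not taken from the report: a 1920x1080 console at
32 bpp needs 1920 * 1080 * 4 bytes = ~7.9 MiB, so a single such
framebuffer already occupies about half of a 16 MiB G200's VRAM.)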

The best option is to not print to the console.

Best regards
Thomas

> Thanks,
> Feng
> 
>>
>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>> @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>>  		dev->mode_config.preferred_depth = 16;
>>  	else
>>  		dev->mode_config.preferred_depth = 32;
>> -	dev->mode_config.prefer_shadow = 1;
>> +	dev->mode_config.prefer_shadow = 0;
>>
>> And from the perf data, one obvious difference is good case don't
>> call drm_fb_helper_dirty_work(), while bad case calls.
>>
>> Thanks,
>> Feng
>>
>>> Best Regards,
>>> Rong Chen
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-22 17:25                     ` Thomas Zimmermann
@ 2019-08-22 20:02                       ` Dave Airlie
  2019-08-23  9:54                         ` Thomas Zimmermann
  2019-08-24  5:16                       ` Feng Tang
  1 sibling, 1 reply; 61+ messages in thread
From: Dave Airlie @ 2019-08-22 20:02 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Feng Tang, Stephen Rothwell, Rong Chen, Michel Dänzer, LKML,
	dri-devel, ying.huang, LKP

On Fri, 23 Aug 2019 at 03:25, Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> I was traveling and could reply earlier. Sorry for taking so long.
>
> Am 13.08.19 um 11:36 schrieb Feng Tang:
> > Hi Thomas,
> >
> > On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> >> Hi Thomas,
> >>
> >> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> >>> Hi,
> >>>
> >>>>> Actually we run the benchmark as a background process, do we need to
> >>>>> disable the cursor and test again?
> >>>> There's a worker thread that updates the display from the shadow buffer.
> >>>> The blinking cursor periodically triggers the worker thread, but the
> >>>> actual update is just the size of one character.
> >>>>
> >>>> The point of the test without output is to see if the regression comes
> >>> >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> >>> >from the worker thread. If the regression goes away after disabling the
> >>>> blinking cursor, then the worker thread is the problem. If it already
> >>>> goes away if there's simply no output from the test, the screen update
> >>>> is the problem. On my machine I have to disable the blinking cursor, so
> >>>> I think the worker causes the performance drop.
> >>>
> >>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> >>> gone.
> >>>
> >>> commit:
> >>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> >>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> >>> emulation
> >>>
> >>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> >>> ----------------  -------------------------- ---------------------------
> >>>          %stddev      change         %stddev
> >>>              \          |                \
> >>>      43785                       44481
> >>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> >>>      43785                       44481        GEO-MEAN vm-scalability.median
> >>
> >> Till now, from Rong's tests:
> >> 1. Disabling cursor blinking doesn't cure the regression.
> >> 2. Disabling printint test results to console can workaround the
> >> regression.
> >>
> >> Also if we set the perfer_shadown to 0, the regression is also
> >> gone.
> >
> > We also did some further break down for the time consumed by the
> > new code.
> >
> > The drm_fb_helper_dirty_work() calls sequentially
> > 1. drm_client_buffer_vmap       (290 us)
> > 2. drm_fb_helper_dirty_blit_real  (19240 us)
> > 3. helper->fb->funcs->dirty()    ---> NULL for mgag200 driver
> > 4. drm_client_buffer_vunmap       (215 us)
> >
>
> It's somewhat different to what I observed, but maybe I just couldn't
> reproduce the problem correctly.
>
> > The average run time is listed after the function names.
> >
> > From it, we can see drm_fb_helper_dirty_blit_real() takes too long
> > time (about 20ms for each run). I guess this is the root cause
> > of this regression, as the original code doesn't use this dirty worker.
>
> True, the original code uses a temporary buffer, but updates the display
> immediately.
>
> My guess is that this could be a caching problem. The worker runs on a
> different CPU, which doesn't have the shadow buffer in cache.
>
> > As said in last email, setting the prefer_shadow to 0 can avoid
> > the regrssion. Could it be an option?
>
> Unfortunately not. Without the shadow buffer, the console's display
> buffer permanently resides in video memory. It consumes significant
> amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
> enough room for anything else.
>
> The best option is to not print to the console.

Wait a second, I thought the driver did an eviction of the scanned-out
object on modeset. That was a deliberate design decision made when
writing those drivers. Has this been removed in favour of GEM and
generic code paths?

Dave.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-22 20:02                       ` Dave Airlie
@ 2019-08-23  9:54                         ` Thomas Zimmermann
  0 siblings, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-23  9:54 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Stephen Rothwell, Feng Tang, Rong Chen, Michel Dänzer, LKML,
	dri-devel, ying.huang, LKP


[-- Attachment #1.1.1: Type: text/plain, Size: 4586 bytes --]

Hi

Am 22.08.19 um 22:02 schrieb Dave Airlie:
> On Fri, 23 Aug 2019 at 03:25, Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>
>> Hi
>>
>> I was traveling and could reply earlier. Sorry for taking so long.
>>
>> Am 13.08.19 um 11:36 schrieb Feng Tang:
>>> Hi Thomas,
>>>
>>> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>>>> Hi Thomas,
>>>>
>>>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>>>> Hi,
>>>>>
>>>>>>> Actually we run the benchmark as a background process, do we need to
>>>>>>> disable the cursor and test again?
>>>>>> There's a worker thread that updates the display from the shadow buffer.
>>>>>> The blinking cursor periodically triggers the worker thread, but the
>>>>>> actual update is just the size of one character.
>>>>>>
>>>>>> The point of the test without output is to see if the regression comes
>>>>> >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
>>>>> >from the worker thread. If the regression goes away after disabling the
>>>>>> blinking cursor, then the worker thread is the problem. If it already
>>>>>> goes away if there's simply no output from the test, the screen update
>>>>>> is the problem. On my machine I have to disable the blinking cursor, so
>>>>>> I think the worker causes the performance drop.
>>>>>
>>>>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
>>>>> gone.
>>>>>
>>>>> commit:
>>>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
>>>>> emulation
>>>>>
>>>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>>>> ----------------  -------------------------- ---------------------------
>>>>>          %stddev      change         %stddev
>>>>>              \          |                \
>>>>>      43785                       44481
>>>>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>>>      43785                       44481        GEO-MEAN vm-scalability.median
>>>>
>>>> Till now, from Rong's tests:
>>>> 1. Disabling cursor blinking doesn't cure the regression.
>>>> 2. Disabling printint test results to console can workaround the
>>>> regression.
>>>>
>>>> Also if we set the perfer_shadown to 0, the regression is also
>>>> gone.
>>>
>>> We also did some further break down for the time consumed by the
>>> new code.
>>>
>>> The drm_fb_helper_dirty_work() calls sequentially
>>> 1. drm_client_buffer_vmap       (290 us)
>>> 2. drm_fb_helper_dirty_blit_real  (19240 us)
>>> 3. helper->fb->funcs->dirty()    ---> NULL for mgag200 driver
>>> 4. drm_client_buffer_vunmap       (215 us)
>>>
>>
>> It's somewhat different to what I observed, but maybe I just couldn't
>> reproduce the problem correctly.
>>
>>> The average run time is listed after the function names.
>>>
>>> From it, we can see drm_fb_helper_dirty_blit_real() takes too long
>>> time (about 20ms for each run). I guess this is the root cause
>>> of this regression, as the original code doesn't use this dirty worker.
>>
>> True, the original code uses a temporary buffer, but updates the display
>> immediately.
>>
>> My guess is that this could be a caching problem. The worker runs on a
>> different CPU, which doesn't have the shadow buffer in cache.
>>
>>> As said in last email, setting the prefer_shadow to 0 can avoid
>>> the regrssion. Could it be an option?
>>
>> Unfortunately not. Without the shadow buffer, the console's display
>> buffer permanently resides in video memory. It consumes significant
>> amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
>> enough room for anything else.
>>
>> The best option is to not print to the console.
> 
> Wait a second, I thought the driver did an eviction on modeset of the
> scanned out object, this was a deliberate design decision made when
> writing those drivers, has this been removed in favour of gem and
> generic code paths?

Yes. We added back this feature for testing in [1]. It was only a ~1%
improvement compared to the original report. I wouldn't mind landing
this patch set, but it probably doesn't make much of a difference either
way.

Best regards
Thomas

[1] https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html

> 
> Dave.
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-22 17:25                     ` Thomas Zimmermann
  2019-08-22 20:02                       ` Dave Airlie
@ 2019-08-24  5:16                       ` Feng Tang
  2019-08-26 10:50                         ` Thomas Zimmermann
  1 sibling, 1 reply; 61+ messages in thread
From: Feng Tang @ 2019-08-24  5:16 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Stephen Rothwell, Rong Chen, michel, linux-kernel, dri-devel,
	ying.huang, lkp

Hi Thomas,

On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
> Hi
> 
> I was traveling and could reply earlier. Sorry for taking so long.

No problem! I guessed so :)

> 
> Am 13.08.19 um 11:36 schrieb Feng Tang:
> > Hi Thomas, 
> > 
> > On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> >> Hi Thomas,
> >>
> >> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> >>> Hi,
> >>>
> >>>>> Actually we run the benchmark as a background process, do we need to
> >>>>> disable the cursor and test again?
> >>>> There's a worker thread that updates the display from the shadow buffer.
> >>>> The blinking cursor periodically triggers the worker thread, but the
> >>>> actual update is just the size of one character.
> >>>>
> >>>> The point of the test without output is to see if the regression comes
> >>> >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> >>> >from the worker thread. If the regression goes away after disabling the
> >>>> blinking cursor, then the worker thread is the problem. If it already
> >>>> goes away if there's simply no output from the test, the screen update
> >>>> is the problem. On my machine I have to disable the blinking cursor, so
> >>>> I think the worker causes the performance drop.
> >>>
> >>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> >>> gone.
> >>>
> >>> commit:
> >>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> >>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> >>> emulation
> >>>
> >>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> >>> ----------------  -------------------------- ---------------------------
> >>>          %stddev      change         %stddev
> >>>              \          |                \
> >>>      43785                       44481
> >>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> >>>      43785                       44481        GEO-MEAN vm-scalability.median
> >>
> >> Till now, from Rong's tests:
> >> 1. Disabling cursor blinking doesn't cure the regression.
> >> 2. Disabling printint test results to console can workaround the
> >> regression.
> >>
> >> Also if we set the perfer_shadown to 0, the regression is also
> >> gone.
> > 
> > We also did some further break down for the time consumed by the
> > new code.
> > 
> > The drm_fb_helper_dirty_work() calls sequentially 
> > 1. drm_client_buffer_vmap	  (290 us)
> > 2. drm_fb_helper_dirty_blit_real  (19240 us)
> > 3. helper->fb->funcs->dirty()    ---> NULL for mgag200 driver
> > 4. drm_client_buffer_vunmap       (215 us)
> >
> 
> It's somewhat different to what I observed, but maybe I just couldn't
> reproduce the problem correctly.
> 
> > The average run time is listed after the function names.
> > 
> > From it, we can see drm_fb_helper_dirty_blit_real() takes too long
> > time (about 20ms for each run). I guess this is the root cause
> > of this regression, as the original code doesn't use this dirty worker.
> 
> True, the original code uses a temporary buffer, but updates the display
> immediately.
> 
> My guess is that this could be a caching problem. The worker runs on a
> different CPU, which doesn't have the shadow buffer in cache.

Yes, that's my thought too. I profiled the working set size: for most
calls of drm_fb_helper_dirty_blit_real(), it updates a buffer of
4096x768 bytes (3 MB), and as it is called 30~40 times per second, it
will surely affect the cache.
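
As a rough back-of-the-envelope check (the cache figure below is an
assumption about the test machine, not something we measured): each
blit reads 4096 bytes/line * 768 lines = 3 MiB from the shadow buffer
and writes about the same amount to the mapped buffer, so 30~40 blits
per second is roughly 90~120 MiB/s of reads plus the same again in
writes. And a 3 MiB working set alone is larger than the per-core L2
on this class of machine, so every blit effectively pushes the
benchmark's data out of the cache.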


> > As said in last email, setting the prefer_shadow to 0 can avoid
> > the regrssion. Could it be an option?
> 
> Unfortunately not. Without the shadow buffer, the console's display
> buffer permanently resides in video memory. It consumes significant
> amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
> enough room for anything else.
> 
> The best option is to not print to the console.

Do we have other options here?

My thought is that this is clearly a regression: the old driver works
fine, while the new version in linux-next doesn't. Also, for a frame
buffer console, writing dozens of lines of messages to it is not a rare
use case. We have many test platforms (servers/desktops/laptops) with
different kinds of GFX hardware, and this model has worked fine for
many years :)

Thanks,
Feng


 
> Best regards
> Thomas
> 
> > Thanks,
> > Feng
> > 
> >>
> >> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> >> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> >> @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
> >>  		dev->mode_config.preferred_depth = 16;
> >>  	else
> >>  		dev->mode_config.preferred_depth = 32;
> >> -	dev->mode_config.prefer_shadow = 1;
> >> +	dev->mode_config.prefer_shadow = 0;
> >>
> >> And from the perf data, one obvious difference is good case don't
> >> call drm_fb_helper_dirty_work(), while bad case calls.
> >>
> >> Thanks,
> >> Feng
> >>
> >>> Best Regards,
> >>> Rong Chen
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > 
> 
> -- 
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
> 



_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-24  5:16                       ` Feng Tang
@ 2019-08-26 10:50                         ` Thomas Zimmermann
  2019-08-27 12:33                           ` Chen, Rong A
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-26 10:50 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, Rong Chen, michel, linux-kernel, dri-devel,
	ying.huang, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 6473 bytes --]

Hi Feng

Am 24.08.19 um 07:16 schrieb Feng Tang:
> Hi Thomas,
> 
> On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
>> Hi
>> 
>> I was traveling and could reply earlier. Sorry for taking so long.
> 
> No problem! I guessed so :)
> 
>> 
>> Am 13.08.19 um 11:36 schrieb Feng Tang:
>>> Hi Thomas,
>>> 
>>> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>>>> Hi Thomas,
>>>> 
>>>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>>>> Hi,
>>>>> 
>>>>>>> Actually we run the benchmark as a background process, do
>>>>>>> we need to disable the cursor and test again?
>>>>>> There's a worker thread that updates the display from the 
>>>>>> shadow buffer. The blinking cursor periodically triggers 
>>>>>> the worker thread, but the actual update is just the size 
>>>>>> of one character.
>>>>>> 
>>>>>> The point of the test without output is to see if the 
>>>>>> regression comes from the buffer update (i.e., the memcpy 
>>>>>> from shadow buffer to VRAM), or from the worker thread. If
>>>>>>  the regression goes away after disabling the blinking 
>>>>>> cursor, then the worker thread is the problem. If it 
>>>>>> already goes away if there's simply no output from the 
>>>>>> test, the screen update is the problem. On my machine I 
>>>>>> have to disable the blinking cursor, so I think the worker
>>>>>>  causes the performance drop.
>>>>> 
>>>>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the
>>>>>  regression is gone.
>>>>> 
>>>>> commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs 
>>>>> framebuffer console 90f479ae51a drm/mgag200: Replace struct 
>>>>> mga_fbdev with generic framebuffer emulation
>>>>> 
>>>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde 
>>>>> testcase/testparams/testbox ---------------- 
>>>>> -------------------------- --------------------------- 
>>>>> %stddev      change         %stddev \          | \ 43785 
>>>>> 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 
>>>>> 43785 44481        GEO-MEAN vm-scalability.median
>>>> 
>>>> Till now, from Rong's tests: 1. Disabling cursor blinking 
>>>> doesn't cure the regression. 2. Disabling printint test results
>>>> to console can workaround the regression.
>>>> 
>>>> Also if we set the perfer_shadown to 0, the regression is also 
>>>> gone.
>>> 
>>> We also did some further break down for the time consumed by the 
>>> new code.
>>> 
>>> The drm_fb_helper_dirty_work() calls sequentially 1. 
>>> drm_client_buffer_vmap	  (290 us) 2. 
>>> drm_fb_helper_dirty_blit_real  (19240 us) 3. 
>>> helper->fb->funcs->dirty()    ---> NULL for mgag200 driver 4. 
>>> drm_client_buffer_vunmap       (215 us)
>>> 
>> 
>> It's somewhat different to what I observed, but maybe I just 
>> couldn't reproduce the problem correctly.
>> 
>>> The average run time is listed after the function names.
>>> 
>>> From it, we can see drm_fb_helper_dirty_blit_real() takes too 
>>> long time (about 20ms for each run). I guess this is the root 
>>> cause of this regression, as the original code doesn't use this 
>>> dirty worker.
>> 
>> True, the original code uses a temporary buffer, but updates the 
>> display immediately.
>> 
>> My guess is that this could be a caching problem. The worker runs 
>> on a different CPU, which doesn't have the shadow buffer in cache.
> 
> Yes, that's my thought too. I profiled the working set size, for most
> of the drm_fb_helper_dirty_blit_real(), it will update a buffer 
> 4096x768(3 MB), and as it is called 30~40 times per second, it surely
> will affect the cache.
> 
> 
>>> As said in last email, setting the prefer_shadow to 0 can avoid 
>>> the regrssion. Could it be an option?
>> 
>> Unfortunately not. Without the shadow buffer, the console's
>> display buffer permanently resides in video memory. It consumes
>> significant amount of that memory (say 8 MiB out of 16 MiB). That
>> doesn't leave enough room for anything else.
>> 
>> The best option is to not print to the console.
> 
> Do we have other options here?

I attached two patches. Both show an improvement, at least in my setup.
Could you please test them independently of each other and report back?

prefetch.patch prefetches the shadow buffer two scanlines ahead during
the blit function. The idea is to have the scanlines in cache when they
are supposed to go to hardware.

schedule.patch schedules the dirty worker on the current CPU core (i.e.,
the one that did the drawing to the shadow buffer). Hopefully the shadow
buffer remains in cache meanwhile.

Best regards
Thomas

> My thought is this is clearly a regression, that the old driver
> works fine, while the new version in linux-next doesn't. Also for a
> frame buffer console, writting dozens line of message to it is not a
> rare user case. We have many test platforms
> (servers/desktops/laptops) with different kinds of GFX hardwares, and
> this model works fine for many years :)
> 
> Thanks, Feng
> 
> 
> 
>> Best regards Thomas
>> 
>>> Thanks, Feng
>>> 
>>>> 
>>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c +++ 
>>>> b/drivers/gpu/drm/mgag200/mgag200_main.c @@ -167,7 +167,7 @@ 
>>>> int mgag200_driver_load(struct drm_device *dev, unsigned long 
>>>> flags) dev->mode_config.preferred_depth = 16; else 
>>>> dev->mode_config.preferred_depth = 32; - 
>>>> dev->mode_config.prefer_shadow = 1; + 
>>>> dev->mode_config.prefer_shadow = 0;
>>>> 
>>>> And from the perf data, one obvious difference is good case 
>>>> don't call drm_fb_helper_dirty_work(), while bad case calls.
>>>> 
>>>> Thanks, Feng
>>>> 
>>>>> Best Regards, Rong Chen
>>> _______________________________________________ dri-devel mailing
>>> list dri-devel@lists.freedesktop.org 
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>> 
>> 
>> -- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, 
>> Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, 
>> Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
>> 
> 
> 
> 
> _______________________________________________ dri-devel mailing 
> list dri-devel@lists.freedesktop.org 
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: prefetch.patch --]
[-- Type: text/x-patch; name="prefetch.patch", Size: 1057 bytes --]

From 7258064b16ab4f44db708670f63c88db8b3f2eea Mon Sep 17 00:00:00 2001
From: Thomas Zimmermann <tzimmermann@suse.de>
Date: Mon, 26 Aug 2019 09:53:38 +0200
Subject: prefetch shadow buffer two lines ahead of blit offset

---
 drivers/gpu/drm/drm_fb_helper.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index a7ba5b4902d6..61cf436840c7 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -33,6 +33,7 @@
 #include <linux/dma-buf.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/prefetch.h>
 #include <linux/slab.h>
 #include <linux/sysrq.h>
 #include <linux/vmalloc.h>
@@ -390,6 +391,8 @@ static void drm_fb_helper_dirty_blit_real(struct drm_fb_helper *fb_helper,
 	unsigned int y;
 
 	for (y = clip->y1; y < clip->y2; y++) {
+		if (y < clip->y2 - 2)
+			prefetch_range(src + 2 * fb->pitches[0], len);
 		memcpy(dst, src, len);
 		src += fb->pitches[0];
 		dst += fb->pitches[0];
-- 
2.22.0


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.3: schedule.patch --]
[-- Type: text/x-patch; name="schedule.patch", Size: 1012 bytes --]

From 60d5322ae3ab2a4c82c1579b37c34abb3b8222f0 Mon Sep 17 00:00:00 2001
From: Thomas Zimmermann <tzimmermann@suse.de>
Date: Mon, 26 Aug 2019 12:17:38 +0200
Subject: schedule dirty worker on local core

---
 drivers/gpu/drm/drm_fb_helper.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index a7ba5b4902d6..9abc950cfae2 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -34,6 +34,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/smp.h>
 #include <linux/sysrq.h>
 #include <linux/vmalloc.h>
 
@@ -642,7 +643,7 @@ static void drm_fb_helper_dirty(struct fb_info *info, u32 x, u32 y,
 	clip->y2 = max_t(u32, clip->y2, y + height);
 	spin_unlock_irqrestore(&helper->dirty_lock, flags);
 
-	schedule_work(&helper->dirty_work);
+	schedule_work_on(smp_processor_id(), &helper->dirty_work);
 }
 
 /**
-- 
2.22.0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-26 10:50                         ` Thomas Zimmermann
@ 2019-08-27 12:33                           ` Chen, Rong A
  2019-08-27 17:16                             ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Chen, Rong A @ 2019-08-27 12:33 UTC (permalink / raw)
  To: Thomas Zimmermann, Feng Tang
  Cc: Stephen Rothwell, michel, lkp, linux-kernel, dri-devel

Hi Thomas,

On 8/26/2019 6:50 PM, Thomas Zimmermann wrote:
> Hi Feng
>
> Am 24.08.19 um 07:16 schrieb Feng Tang:
>> Hi Thomas,
>>
>> On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> I was traveling and could reply earlier. Sorry for taking so long.
>> No problem! I guessed so :)
>>
>>> Am 13.08.19 um 11:36 schrieb Feng Tang:
>>>> Hi Thomas,
>>>>
>>>> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>>>>> Hi,
>>>>>>
>>>>>>>> Actually we run the benchmark as a background process, do
>>>>>>>> we need to disable the cursor and test again?
>>>>>>> There's a worker thread that updates the display from the
>>>>>>> shadow buffer. The blinking cursor periodically triggers
>>>>>>> the worker thread, but the actual update is just the size
>>>>>>> of one character.
>>>>>>>
>>>>>>> The point of the test without output is to see if the
>>>>>>> regression comes from the buffer update (i.e., the memcpy
>>>>>>> from shadow buffer to VRAM), or from the worker thread. If
>>>>>>>   the regression goes away after disabling the blinking
>>>>>>> cursor, then the worker thread is the problem. If it
>>>>>>> already goes away if there's simply no output from the
>>>>>>> test, the screen update is the problem. On my machine I
>>>>>>> have to disable the blinking cursor, so I think the worker
>>>>>>>   causes the performance drop.
>>>>>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the
>>>>>>   regression is gone.
>>>>>>
>>>>>> commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs
>>>>>> framebuffer console 90f479ae51a drm/mgag200: Replace struct
>>>>>> mga_fbdev with generic framebuffer emulation
>>>>>>
>>>>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde
>>>>>> testcase/testparams/testbox ----------------
>>>>>> -------------------------- ---------------------------
>>>>>> %stddev      change         %stddev \          | \ 43785
>>>>>> 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>>>> 43785 44481        GEO-MEAN vm-scalability.median
>>>>> Till now, from Rong's tests: 1. Disabling cursor blinking
>>>>> doesn't cure the regression. 2. Disabling printint test results
>>>>> to console can workaround the regression.
>>>>>
>>>>> Also if we set the perfer_shadown to 0, the regression is also
>>>>> gone.
>>>> We also did some further break down for the time consumed by the
>>>> new code.
>>>>
>>>> The drm_fb_helper_dirty_work() calls sequentially 1.
>>>> drm_client_buffer_vmap	  (290 us) 2.
>>>> drm_fb_helper_dirty_blit_real  (19240 us) 3.
>>>> helper->fb->funcs->dirty()    ---> NULL for mgag200 driver 4.
>>>> drm_client_buffer_vunmap       (215 us)
>>>>
>>> It's somewhat different to what I observed, but maybe I just
>>> couldn't reproduce the problem correctly.
>>>
>>>> The average run time is listed after the function names.
>>>>
>>>>  From it, we can see drm_fb_helper_dirty_blit_real() takes too
>>>> long time (about 20ms for each run). I guess this is the root
>>>> cause of this regression, as the original code doesn't use this
>>>> dirty worker.
>>> True, the original code uses a temporary buffer, but updates the
>>> display immediately.
>>>
>>> My guess is that this could be a caching problem. The worker runs
>>> on a different CPU, which doesn't have the shadow buffer in cache.
>> Yes, that's my thought too. I profiled the working set size, for most
>> of the drm_fb_helper_dirty_blit_real(), it will update a buffer
>> 4096x768(3 MB), and as it is called 30~40 times per second, it surely
>> will affect the cache.
>>
>>
>>>> As said in last email, setting the prefer_shadow to 0 can avoid
>>>> the regrssion. Could it be an option?
>>> Unfortunately not. Without the shadow buffer, the console's
>>> display buffer permanently resides in video memory. It consumes
>>> significant amount of that memory (say 8 MiB out of 16 MiB). That
>>> doesn't leave enough room for anything else.
>>>
>>> The best option is to not print to the console.
>> Do we have other options here?
> I attached two patches. Both show an improvement in my setup at least.
> Could you please test them independently from each other and report back?
>
> prefetch.patch prefetches the shadow buffer two scanlines ahead during
> the blit function. The idea is to have the scanlines in cache when they
> are supposed to go to hardware.
>
> schedule.patch schedules the dirty worker on the current CPU core (i.e.,
> the one that did the drawing to the shadow buffer). Hopefully the shadow
> buffer remains in cache meanwhile.
>
> Best regards
> Thomas

Both patches have little impact on the performance from our side.

prefetch.patch:
commit:
   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
   77459f56994 prefetch shadow buffer two lines ahead of blit offset

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  77459f56994ab87ee5459920b3  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
         %stddev      change         %stddev      change         %stddev
             \          |                \          |                \
     42912             -15%      36517             -17%      35515        vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%      36517             -17%      35515        GEO-MEAN vm-scalability.median

schedule.patch:
commit:
   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
   ccc5f095c61 schedule dirty worker on local core

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  ccc5f095c61ff6eded0f0ab1b7  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
         %stddev      change         %stddev      change         %stddev
             \          |                \          |                \
     42912             -15%      36517             -15%      36556 ±  4%  vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%      36517             -15%      36556        GEO-MEAN vm-scalability.median

Best Regards,
Rong Chen
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-27 12:33                           ` Chen, Rong A
@ 2019-08-27 17:16                             ` Thomas Zimmermann
  2019-08-28  9:37                               ` Rong Chen
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-27 17:16 UTC (permalink / raw)
  To: Chen, Rong A, Feng Tang
  Cc: Stephen Rothwell, michel, lkp, linux-kernel, dri-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 3041 bytes --]

Hi

Am 27.08.19 um 14:33 schrieb Chen, Rong A:
> 
> Both patches have little impact on the performance from our side.

Thanks for testing. Too bad they don't solve the issue.

There's another patch attached. Could you please test this as well?
Thanks a lot!

The patch comes from Daniel Vetter after discussing the problem on IRC.
The idea of the patch is that the old mgag200 code might display many
fewer frames than the generic code, because mgag200 only updates the
screen from non-atomic context. If we simulate this with the generic
code, we should see roughly the original performance.
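
For reference, the attached patch just gates the scheduling of the
worker on drm_can_sleep(). Paraphrasing the helper from memory (so
treat this as an approximation, not a verbatim copy of the kernel
source), it boils down to:

static inline bool drm_can_sleep(void)
{
	/* false wherever we must not block, i.e. atomic/IRQ context */
	if (in_atomic() || in_dbg_master() || irqs_disabled())
		return false;
	return true;
}

So dirty() calls that arrive in atomic context no longer schedule the
worker themselves; their damage is only flushed if a later call from
process context schedules it, which means whole frames can be skipped.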

Best regards
Thomas

> 
> prefetch.patch:
> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
>   77459f56994 prefetch shadow buffer two lines ahead of blit offset
> 
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde 77459f56994ab87ee5459920b3 
> testcase/testparams/testbox
> ----------------  -------------------------- -------------------------- 
> ---------------------------
>          %stddev      change         %stddev      change %stddev
>              \          |                \          | \
>      42912             -15%      36517             -17% 35515
> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      42912             -15%      36517             -17% 35515       
> GEO-MEAN vm-scalability.median
> 
> schedule.patch:
> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
>   ccc5f095c61 schedule dirty worker on local core
> 
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde ccc5f095c61ff6eded0f0ab1b7 
> testcase/testparams/testbox
> ----------------  -------------------------- -------------------------- 
> ---------------------------
>          %stddev      change         %stddev      change %stddev
>              \          |                \          | \
>      42912             -15%      36517             -15%      36556 ±  4%
> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      42912             -15%      36517             -15% 36556       
> GEO-MEAN vm-scalability.median
> 
> Best Regards,
> Rong Chen
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: usecansleep.patch --]
[-- Type: text/x-patch; name="usecansleep.patch", Size: 838 bytes --]

From e6e72031e85e1ad4cbd38fb47f899bab54bf6bdc Mon Sep 17 00:00:00 2001
From: Thomas Zimmermann <tzimmermann@suse.de>
Date: Tue, 27 Aug 2019 19:00:41 +0200
Subject: only schedule worker from non-atomic context

---
 drivers/gpu/drm/drm_fb_helper.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index a7ba5b4902d6..3a3e4784eb28 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -642,7 +642,8 @@ static void drm_fb_helper_dirty(struct fb_info *info, u32 x, u32 y,
 	clip->y2 = max_t(u32, clip->y2, y + height);
 	spin_unlock_irqrestore(&helper->dirty_lock, flags);
 
-	schedule_work(&helper->dirty_work);
+	if (drm_can_sleep())
+		schedule_work(&helper->dirty_work);
 }
 
 /**
-- 
2.22.0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-27 17:16                             ` Thomas Zimmermann
@ 2019-08-28  9:37                               ` Rong Chen
  2019-08-28 10:51                                 ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Rong Chen @ 2019-08-28  9:37 UTC (permalink / raw)
  To: Thomas Zimmermann, Feng Tang
  Cc: Stephen Rothwell, michel, lkp, linux-kernel, dri-devel

Hi Thomas,

On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
> Hi
>
> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
>> Both patches have little impact on the performance from our side.
> Thanks for testing. Too bad they doesn't solve the issue.
>
> There's another patch attached. Could you please tests this as well?
> Thanks a lot!
>
> The patch comes from Daniel Vetter after discussing the problem on IRC.
> The idea of the patch is that the old mgag200 code might display much
> less frames that the generic code, because mgag200 only prints from
> non-atomic context. If we simulate this with the generic code, we should
> see roughly the original performance.
>
>

It's cool, the patch "usecansleep.patch" can fix the issue.

commit:
   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
   b976b04c2bc only schedule worker from non-atomic context

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  b976b04c2bcf33148d6c7bc1a2  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
         %stddev      change         %stddev      change         %stddev
             \          |                \          |                \
     42912             -15%      36517                       44093        vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%      36517                       44093        GEO-MEAN vm-scalability.median

Best Regards,
Rong Chen

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-28  9:37                               ` Rong Chen
@ 2019-08-28 10:51                                 ` Thomas Zimmermann
  2019-09-04  6:27                                   ` Feng Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-08-28 10:51 UTC (permalink / raw)
  To: Rong Chen, Feng Tang
  Cc: Stephen Rothwell, michel, lkp, linux-kernel, dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 2398 bytes --]

Hi

Am 28.08.19 um 11:37 schrieb Rong Chen:
> Hi Thomas,
> 
> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
>>> Both patches have little impact on the performance from our side.
>> Thanks for testing. Too bad they doesn't solve the issue.
>>
>> There's another patch attached. Could you please tests this as well?
>> Thanks a lot!
>>
>> The patch comes from Daniel Vetter after discussing the problem on IRC.
>> The idea of the patch is that the old mgag200 code might display much
>> less frames that the generic code, because mgag200 only prints from
>> non-atomic context. If we simulate this with the generic code, we should
>> see roughly the original performance.
>>
>>
> 
> It's cool, the patch "usecansleep.patch" can fix the issue.

Thank you for testing. But don't get too excited, because the patch
simulates a bug that was present in the original mgag200 code. A
significant number of frames are simply skipped. That is apparently the
reason why it's faster.

Best regards
Thomas

> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
>   b976b04c2bc only schedule worker from non-atomic context
> 
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde b976b04c2bcf33148d6c7bc1a2 
> testcase/testparams/testbox
> ----------------  -------------------------- -------------------------- 
> ---------------------------
>          %stddev      change         %stddev      change %stddev
>              \          |                \          | \
>      42912             -15%      36517 44093
> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      42912             -15%      36517 44093        GEO-MEAN
> vm-scalability.median
> 
> Best Regards,
> Rong Chen
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-28 10:51                                 ` Thomas Zimmermann
@ 2019-09-04  6:27                                   ` Feng Tang
  2019-09-04  6:53                                     ` Thomas Zimmermann
  2019-09-09 14:12                                     ` Thomas Zimmermann
  0 siblings, 2 replies; 61+ messages in thread
From: Feng Tang @ 2019-09-04  6:27 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Rong Chen, Stephen Rothwell, michel, lkp, linux-kernel, dri-devel

Hi Thomas,

On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
> Hi
> 
> Am 28.08.19 um 11:37 schrieb Rong Chen:
> > Hi Thomas,
> > 
> > On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
> >> Hi
> >>
> >> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
> >>> Both patches have little impact on the performance from our side.
> >> Thanks for testing. Too bad they doesn't solve the issue.
> >>
> >> There's another patch attached. Could you please tests this as well?
> >> Thanks a lot!
> >>
> >> The patch comes from Daniel Vetter after discussing the problem on IRC.
> >> The idea of the patch is that the old mgag200 code might display much
> >> less frames that the generic code, because mgag200 only prints from
> >> non-atomic context. If we simulate this with the generic code, we should
> >> see roughly the original performance.
> >>
> >>
> > 
> > It's cool, the patch "usecansleep.patch" can fix the issue.
> 
> Thank you for testing. But don't get too excited, because the patch
> simulates a bug that was present in the original mgag200 code. A
> significant number of frames are simply skipped. That is apparently the
> reason why it's faster.

Thanks for the detailed info. So the original code skips the
time-consuming work inside atomic context on purpose. Is there any room
to optimise it? If two scheduled update workers are handled at almost
the same time, can one be skipped?

Thanks,
Feng

> 
> Best regards
> Thomas

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  6:27                                   ` Feng Tang
@ 2019-09-04  6:53                                     ` Thomas Zimmermann
  2019-09-04  8:11                                       ` Daniel Vetter
  2019-09-09 14:12                                     ` Thomas Zimmermann
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-09-04  6:53 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, Rong Chen, michel, linux-kernel, dri-devel, lkp


[-- Attachment #1.1: Type: text/plain, Size: 1255 bytes --]

Hi

Am 04.09.19 um 08:27 schrieb Feng Tang:
>> Thank you for testing. But don't get too excited, because the patch
>> simulates a bug that was present in the original mgag200 code. A
>> significant number of frames are simply skipped. That is apparently the
>> reason why it's faster.
> 
> Thanks for the detailed info, so the original code skips time-consuming
> work inside atomic context on purpose. Is there any space to optmise it?
> If 2 scheduled update worker are handled at almost same time, can one be
> skipped?

To my knowledge, there's only one instance of the worker. Re-scheduling
the worker before a previous instance has started will not create a
second instance. The worker instance will complete all pending updates.
So in some way, skipping workers already happens.
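
Roughly, the pattern looks like this. This is a stripped-down,
self-contained sketch of the coalescing idea only, not the actual
drm_fb_helper code; the locking and the vmap/blit details are left out:

#include <stdbool.h>
#include <stdio.h>

struct clip { unsigned int x1, y1, x2, y2; };

static struct clip pending = { ~0u, ~0u, 0, 0 };	/* empty rectangle */
static bool worker_scheduled;

/* Called for every screen update: it only merges the damage rectangle. */
static void dirty(unsigned int x1, unsigned int y1,
		  unsigned int x2, unsigned int y2)
{
	pending.x1 = x1 < pending.x1 ? x1 : pending.x1;
	pending.y1 = y1 < pending.y1 ? y1 : pending.y1;
	pending.x2 = x2 > pending.x2 ? x2 : pending.x2;
	pending.y2 = y2 > pending.y2 ? y2 : pending.y2;
	/* schedule_work() is a no-op if the work is already pending */
	worker_scheduled = true;
}

/* The single worker instance: it consumes whatever has accumulated. */
static void dirty_work(void)
{
	struct clip c = pending;

	pending = (struct clip){ ~0u, ~0u, 0, 0 };
	worker_scheduled = false;
	printf("blit %u,%u-%u,%u in one pass\n", c.x1, c.y1, c.x2, c.y2);
}

int main(void)
{
	dirty(0, 0, 80, 16);	/* two updates arrive back to back ... */
	dirty(0, 16, 80, 32);
	if (worker_scheduled)
		dirty_work();	/* ... but the worker only runs once */
	return 0;
}

Two updates that land before the worker has had a chance to run are
folded into a single blit of the merged rectangle.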

Best regards
Thomas

> 
> Thanks,
> Feng
> 
>>
>> Best regards
>> Thomas
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  6:53                                     ` Thomas Zimmermann
@ 2019-09-04  8:11                                       ` Daniel Vetter
  2019-09-04  8:35                                         ` Feng Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Daniel Vetter @ 2019-09-04  8:11 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Feng Tang, Stephen Rothwell, Rong Chen, Michel Dänzer,
	Linux Kernel Mailing List, dri-devel, LKP

On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> Am 04.09.19 um 08:27 schrieb Feng Tang:
> >> Thank you for testing. But don't get too excited, because the patch
> >> simulates a bug that was present in the original mgag200 code. A
> >> significant number of frames are simply skipped. That is apparently the
> >> reason why it's faster.
> >
> > Thanks for the detailed info, so the original code skips time-consuming
> > work inside atomic context on purpose. Is there any space to optmise it?
> > If 2 scheduled update worker are handled at almost same time, can one be
> > skipped?
>
> To my knowledge, there's only one instance of the worker. Re-scheduling
> the worker before a previous instance started, will not create a second
> instance. The worker's instance will complete all pending updates. So in
> some way, skipping workers already happens.

So I think the most frequent fbcon update from atomic context is the
blinking cursor (the blink is driven from a timer). If you disable that
one you should be back to the old performance level, I think, since
just writing to dmesg happens from process context, so that shouldn't
change.

https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking

Bunch of tricks, but tbh I haven't tested them.

In any case, I still strongly advise you not to print anything to dmesg
or fbcon while benchmarking, because dmesg/printk are anything but
fast, especially if a gpu driver is involved. There are some efforts to
make the dmesg/printk side less painful (untangling the console_lock
from printk), but fundamentally printing to the gpu from the kernel
through dmesg/fbcon won't be cheap. It's just not something we
optimize beyond "make sure it works for emergencies".
-Daniel

>
> Best regards
> Thomas
>
> >
> > Thanks,
> > Feng
> >
> >>
> >> Best regards
> >> Thomas
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  8:11                                       ` Daniel Vetter
@ 2019-09-04  8:35                                         ` Feng Tang
  2019-09-04  8:43                                           ` Thomas Zimmermann
  2019-09-04  9:17                                           ` Daniel Vetter
  0 siblings, 2 replies; 61+ messages in thread
From: Feng Tang @ 2019-09-04  8:35 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Thomas Zimmermann, Stephen Rothwell, Rong Chen,
	Michel Dänzer, Linux Kernel Mailing List, dri-devel, LKP

Hi Daniel,

On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >
> > Hi
> >
> > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > >> Thank you for testing. But don't get too excited, because the patch
> > >> simulates a bug that was present in the original mgag200 code. A
> > >> significant number of frames are simply skipped. That is apparently the
> > >> reason why it's faster.
> > >
> > > Thanks for the detailed info, so the original code skips time-consuming
> > > work inside atomic context on purpose. Is there any space to optmise it?
> > > If 2 scheduled update worker are handled at almost same time, can one be
> > > skipped?
> >
> > To my knowledge, there's only one instance of the worker. Re-scheduling
> > the worker before a previous instance started, will not create a second
> > instance. The worker's instance will complete all pending updates. So in
> > some way, skipping workers already happens.
> 
> So I think that the most often fbcon update from atomic context is the
> blinking cursor. If you disable that one you should be back to the old
> performance level I think, since just writing to dmesg is from process
> context, so shouldn't change.

Hmm, then the old driver should also do most of its updates in
non-atomic context?

One other thing: I profiled that updating a 3 MB shadow buffer needs
20 ms, which translates to 150 MB/s of bandwidth. Could it be related
to the cache settings of the DRM shadow buffer? Say, the original code
used a cacheable buffer?


> 
> https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking
> 
> Bunch of tricks, but tbh I haven't tested them.

Thomas has suggested disabling the cursor with
	echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink

We tried that, and there was no change in the performance data.

Thanks,
Feng

> 
> In any case, I still strongly advice you don't print anything to dmesg
> or fbcon while benchmarking, because dmesg/printf are anything but
> fast, especially if a gpu driver is involved. There's some efforts to
> make the dmesg/printk side less painful (untangling the console_lock
> from printk), but fundamentally printing to the gpu from the kernel
> through dmesg/fbcon won't be cheap. It's just not something we
> optimize beyond "make sure it works for emergencies".
> -Daniel
> 
> >
> > Best regards
> > Thomas
> >
> > >
> > > Thanks,
> > > Feng
> > >
> > >>
> > >> Best regards
> > >> Thomas
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > >
> >
> > --
> > Thomas Zimmermann
> > Graphics Driver Developer
> > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> > HRB 21284 (AG Nürnberg)
> >
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
> 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  8:35                                         ` Feng Tang
@ 2019-09-04  8:43                                           ` Thomas Zimmermann
  2019-09-04 14:30                                             ` Chen, Rong A
  2019-09-04  9:17                                           ` Daniel Vetter
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-09-04  8:43 UTC (permalink / raw)
  To: Feng Tang, Daniel Vetter
  Cc: Stephen Rothwell, Rong Chen, Michel Dänzer,
	Linux Kernel Mailing List, dri-devel, LKP


[-- Attachment #1.1: Type: text/plain, Size: 3766 bytes --]

Hi

Am 04.09.19 um 10:35 schrieb Feng Tang:
> Hi Daniel,
> 
> On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
>> On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>
>>> Hi
>>>
>>> Am 04.09.19 um 08:27 schrieb Feng Tang:
>>>>> Thank you for testing. But don't get too excited, because the patch
>>>>> simulates a bug that was present in the original mgag200 code. A
>>>>> significant number of frames are simply skipped. That is apparently the
>>>>> reason why it's faster.
>>>>
>>>> Thanks for the detailed info, so the original code skips time-consuming
>>>> work inside atomic context on purpose. Is there any space to optmise it?
>>>> If 2 scheduled update worker are handled at almost same time, can one be
>>>> skipped?
>>>
>>> To my knowledge, there's only one instance of the worker. Re-scheduling
>>> the worker before a previous instance started, will not create a second
>>> instance. The worker's instance will complete all pending updates. So in
>>> some way, skipping workers already happens.
>>
>> So I think that the most often fbcon update from atomic context is the
>> blinking cursor. If you disable that one you should be back to the old
>> performance level I think, since just writing to dmesg is from process
>> context, so shouldn't change.
> 
> Hmm, then for the old driver, it should also do the most update in
> non-atomic context? 
> 
> One other thing is, I profiled that updating a 3MB shadow buffer needs
> 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> the cache setting of DRM shadow buffer? say the orginal code use a
> cachable buffer?
> 
> 
>>
>> https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking
>>
>> Bunch of tricks, but tbh I haven't tested them.
> 
> Thomas has suggested to disable curson by
> 	echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
> 
> We tried that way, and no change for the performance data.

There are several ways of disabling the cursor. On my test system, I entered

  tput civis

before the test and got better performance. Did you try this as well?

Best regards
Thomas

> 
> Thanks,
> Feng
> 
>>
>> In any case, I still strongly advice you don't print anything to dmesg
>> or fbcon while benchmarking, because dmesg/printf are anything but
>> fast, especially if a gpu driver is involved. There's some efforts to
>> make the dmesg/printk side less painful (untangling the console_lock
>> from printk), but fundamentally printing to the gpu from the kernel
>> through dmesg/fbcon won't be cheap. It's just not something we
>> optimize beyond "make sure it works for emergencies".
>> -Daniel
>>
>>>
>>> Best regards
>>> Thomas
>>>
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>>>
>>>>> Best regards
>>>>> Thomas
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>
>>>
>>> --
>>> Thomas Zimmermann
>>> Graphics Driver Developer
>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>> HRB 21284 (AG Nürnberg)
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>>
>>
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  8:35                                         ` Feng Tang
  2019-09-04  8:43                                           ` Thomas Zimmermann
@ 2019-09-04  9:17                                           ` Daniel Vetter
  2019-09-04 11:15                                             ` Dave Airlie
  1 sibling, 1 reply; 61+ messages in thread
From: Daniel Vetter @ 2019-09-04  9:17 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Zimmermann, Stephen Rothwell, Rong Chen,
	Michel Dänzer, Linux Kernel Mailing List, dri-devel, LKP

On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Daniel,
>
> On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > >
> > > Hi
> > >
> > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > >> Thank you for testing. But don't get too excited, because the patch
> > > >> simulates a bug that was present in the original mgag200 code. A
> > > >> significant number of frames are simply skipped. That is apparently the
> > > >> reason why it's faster.
> > > >
> > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > skipped?
> > >
> > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > the worker before a previous instance started, will not create a second
> > > instance. The worker's instance will complete all pending updates. So in
> > > some way, skipping workers already happens.
> >
> > So I think that the most often fbcon update from atomic context is the
> > blinking cursor. If you disable that one you should be back to the old
> > performance level I think, since just writing to dmesg is from process
> > context, so shouldn't change.
>
> Hmm, then for the old driver, it should also do the most update in
> non-atomic context?
>
> One other thing is, I profiled that updating a 3MB shadow buffer needs
> 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> the cache setting of DRM shadow buffer? say the orginal code use a
> cachable buffer?

Hm, that would indicate the write-combining got broken somewhere. This
should definitely be faster. Also, we shouldn't be transferring the whole
buffer, except when scrolling ...
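
Roughly what I mean by not transferring the whole thing (a simplified
sketch, not the actual drm_fb_helper code; blit_damage is a made-up
name):

/* Copy only the damaged scanline region from the shadow buffer to VRAM.
 * Simplified for a linear 32-bpp framebuffer; illustrative only. */
#include <drm/drm_rect.h>
#include <linux/io.h>

static void blit_damage(void __iomem *vram, const void *shadow,
			unsigned int pitch, const struct drm_rect *clip)
{
	size_t offset = clip->y1 * pitch + clip->x1 * 4;
	size_t len = drm_rect_width(clip) * 4;
	int y;

	for (y = clip->y1; y < clip->y2; y++) {
		/* one memcpy_toio per damaged line, nothing outside the clip */
		memcpy_toio(vram + offset, shadow + offset, len);
		offset += pitch;
	}
}

Of course, once the console scrolls, the clip tends to cover most of the
screen anyway, so clipping mainly helps for small updates like the cursor.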


> > https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking
> >
> > Bunch of tricks, but tbh I haven't tested them.
>
> Thomas has suggested to disable curson by
>         echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>
> We tried that way, and no change for the performance data.

Huh, if there are other atomic contexts for fbcon updates then I'm not
aware of them ... and if it's all the updates, then you wouldn't see a
whole lot on your screen with either the old or the new fbdev support in
mgag200. I'm a bit confused ...
-Daniel

>
> Thanks,
> Feng
>
> >
> > In any case, I still strongly advice you don't print anything to dmesg
> > or fbcon while benchmarking, because dmesg/printf are anything but
> > fast, especially if a gpu driver is involved. There's some efforts to
> > make the dmesg/printk side less painful (untangling the console_lock
> > from printk), but fundamentally printing to the gpu from the kernel
> > through dmesg/fbcon won't be cheap. It's just not something we
> > optimize beyond "make sure it works for emergencies".
> > -Daniel
> >
> > >
> > > Best regards
> > > Thomas
> > >
> > > >
> > > > Thanks,
> > > > Feng
> > > >
> > > >>
> > > >> Best regards
> > > >> Thomas
> > > > _______________________________________________
> > > > dri-devel mailing list
> > > > dri-devel@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > >
> > >
> > > --
> > > Thomas Zimmermann
> > > Graphics Driver Developer
> > > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> > > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> > > HRB 21284 (AG Nürnberg)
> > >
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  9:17                                           ` Daniel Vetter
@ 2019-09-04 11:15                                             ` Dave Airlie
  2019-09-04 11:20                                               ` Daniel Vetter
  0 siblings, 1 reply; 61+ messages in thread
From: Dave Airlie @ 2019-09-04 11:15 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Feng Tang, Stephen Rothwell, Rong Chen, Michel Dänzer,
	Linux Kernel Mailing List, dri-devel, Thomas Zimmermann, LKP

On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Hi Daniel,
> >
> > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > > >
> > > > Hi
> > > >
> > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > >> significant number of frames are simply skipped. That is apparently the
> > > > >> reason why it's faster.
> > > > >
> > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > skipped?
> > > >
> > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > the worker before a previous instance started, will not create a second
> > > > instance. The worker's instance will complete all pending updates. So in
> > > > some way, skipping workers already happens.
> > >
> > > So I think that the most often fbcon update from atomic context is the
> > > blinking cursor. If you disable that one you should be back to the old
> > > performance level I think, since just writing to dmesg is from process
> > > context, so shouldn't change.
> >
> > Hmm, then for the old driver, it should also do the most update in
> > non-atomic context?
> >
> > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > the cache setting of DRM shadow buffer? say the orginal code use a
> > cachable buffer?
>
> Hm, that would indicate the write-combining got broken somewhere. This
> should definitely be faster. Also we shouldn't transfer the hole
> thing, except when scrolling ...

First rule of fbcon usage: you are always effectively scrolling.

Also, these devices might be on a PCIe x1 piece of wet string; not sure
if the numbers reflect that.

Dave.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04 11:15                                             ` Dave Airlie
@ 2019-09-04 11:20                                               ` Daniel Vetter
  2019-09-05  6:59                                                 ` Feng Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Daniel Vetter @ 2019-09-04 11:20 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Stephen Rothwell, Feng Tang, Rong Chen, Michel Dänzer,
	Linux Kernel Mailing List, dri-devel, Thomas Zimmermann, LKP

On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > Hi Daniel,
> > >
> > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > > > >
> > > > > Hi
> > > > >
> > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > >> reason why it's faster.
> > > > > >
> > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > skipped?
> > > > >
> > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > the worker before a previous instance started, will not create a second
> > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > some way, skipping workers already happens.
> > > >
> > > > So I think that the most often fbcon update from atomic context is the
> > > > blinking cursor. If you disable that one you should be back to the old
> > > > performance level I think, since just writing to dmesg is from process
> > > > context, so shouldn't change.
> > >
> > > Hmm, then for the old driver, it should also do the most update in
> > > non-atomic context?
> > >
> > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > cachable buffer?
> >
> > Hm, that would indicate the write-combining got broken somewhere. This
> > should definitely be faster. Also we shouldn't transfer the hole
> > thing, except when scrolling ...
>
> First rule of fbcon usage, you are always effectively scrolling.
>
> Also these devices might be on a PCIE 1x piece of wet string, not sure
> if the numbers reflect that.

PCIe 1.0 x1 is 250 MB/s, so yeah, with a bit of inefficiency and
overhead it's not entirely out of the question that 150 MB/s is actually
the hw limit. If it's really PCIe 1.0 x1, though, I have no idea where
to check that. Also, it might be worth double-checking that the gpu pci
bar is listed as wc in debugfs/x86/pat_memtype_list.
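
Back-of-the-envelope, just to put the numbers side by side (standard
PCIe 1.0 figures; the overhead estimate is approximate):

  2.5 GT/s per lane x 8/10 (8b/10b encoding) = 250 MB/s raw per direction
  minus TLP header/framing overhead on 128-byte writes, call it 15-20%,
  which leaves roughly 200-210 MB/s of payload

So 150 MB/s of CPU-driven writes through a WC mapping is not far off
what such a link can do.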
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  8:43                                           ` Thomas Zimmermann
@ 2019-09-04 14:30                                             ` Chen, Rong A
  0 siblings, 0 replies; 61+ messages in thread
From: Chen, Rong A @ 2019-09-04 14:30 UTC (permalink / raw)
  To: Thomas Zimmermann, Feng Tang, Daniel Vetter
  Cc: Stephen Rothwell, Michel Dänzer, LKP,
	Linux Kernel Mailing List, dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 3767 bytes --]

Hi Thomas,

On 9/4/2019 4:43 PM, Thomas Zimmermann wrote:
> Hi
>
> Am 04.09.19 um 10:35 schrieb Feng Tang:
>> Hi Daniel,
>>
>> On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
>>> On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>> Hi
>>>>
>>>> Am 04.09.19 um 08:27 schrieb Feng Tang:
>>>>>> Thank you for testing. But don't get too excited, because the patch
>>>>>> simulates a bug that was present in the original mgag200 code. A
>>>>>> significant number of frames are simply skipped. That is apparently the
>>>>>> reason why it's faster.
>>>>> Thanks for the detailed info, so the original code skips time-consuming
>>>>> work inside atomic context on purpose. Is there any space to optmise it?
>>>>> If 2 scheduled update worker are handled at almost same time, can one be
>>>>> skipped?
>>>> To my knowledge, there's only one instance of the worker. Re-scheduling
>>>> the worker before a previous instance started, will not create a second
>>>> instance. The worker's instance will complete all pending updates. So in
>>>> some way, skipping workers already happens.
>>> So I think that the most often fbcon update from atomic context is the
>>> blinking cursor. If you disable that one you should be back to the old
>>> performance level I think, since just writing to dmesg is from process
>>> context, so shouldn't change.
>> Hmm, then for the old driver, it should also do the most update in
>> non-atomic context?
>>
>> One other thing is, I profiled that updating a 3MB shadow buffer needs
>> 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
>> the cache setting of DRM shadow buffer? say the orginal code use a
>> cachable buffer?
>>
>>
>>> https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking
>>>
>>> Bunch of tricks, but tbh I haven't tested them.
>> Thomas has suggested to disable curson by
>> 	echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>>
>> We tried that way, and no change for the performance data.
> There are several ways of disabling the cursor. On my test system, I entered
>
>    tput civis
>
> before the test and got better performance. Did you try this as well?

There's no obvious change on our system.

Best Regards,
Rong Chen

>
> Best regards
> Thomas
>
>> Thanks,
>> Feng
>>
>>> In any case, I still strongly advice you don't print anything to dmesg
>>> or fbcon while benchmarking, because dmesg/printf are anything but
>>> fast, especially if a gpu driver is involved. There's some efforts to
>>> make the dmesg/printk side less painful (untangling the console_lock
>>> from printk), but fundamentally printing to the gpu from the kernel
>>> through dmesg/fbcon won't be cheap. It's just not something we
>>> optimize beyond "make sure it works for emergencies".
>>> -Daniel
>>>
>>>> Best regards
>>>> Thomas
>>>>
>>>>> Thanks,
>>>>> Feng
>>>>>
>>>>>> Best regards
>>>>>> Thomas
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>
>>>> --
>>>> Thomas Zimmermann
>>>> Graphics Driver Developer
>>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>>> HRB 21284 (AG Nürnberg)
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>>>
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
> _______________________________________________
> LKP mailing list
> LKP@lists.01.org
> https://lists.01.org/mailman/listinfo/lkp


[-- Attachment #1.2: Type: text/html, Size: 6615 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04 11:20                                               ` Daniel Vetter
@ 2019-09-05  6:59                                                 ` Feng Tang
  2019-09-05 10:37                                                   ` Daniel Vetter
  0 siblings, 1 reply; 61+ messages in thread
From: Feng Tang @ 2019-09-05  6:59 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Stephen Rothwell, Rong Chen, Michel Dänzer,
	Linux Kernel Mailing List, dri-devel, Thomas Zimmermann, LKP

Hi Vetter,

On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
> On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <feng.tang@intel.com> wrote:
> > > >
> > > > Hi Daniel,
> > > >
> > > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > > > > >
> > > > > > Hi
> > > > > >
> > > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > > >> reason why it's faster.
> > > > > > >
> > > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > > skipped?
> > > > > >
> > > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > > the worker before a previous instance started, will not create a second
> > > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > > some way, skipping workers already happens.
> > > > >
> > > > > So I think that the most often fbcon update from atomic context is the
> > > > > blinking cursor. If you disable that one you should be back to the old
> > > > > performance level I think, since just writing to dmesg is from process
> > > > > context, so shouldn't change.
> > > >
> > > > Hmm, then for the old driver, it should also do the most update in
> > > > non-atomic context?
> > > >
> > > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > > cachable buffer?
> > >
> > > Hm, that would indicate the write-combining got broken somewhere. This
> > > should definitely be faster. Also we shouldn't transfer the hole
> > > thing, except when scrolling ...
> >
> > First rule of fbcon usage, you are always effectively scrolling.
> >
> > Also these devices might be on a PCIE 1x piece of wet string, not sure
> > if the numbers reflect that.
> 
> pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and
> overhead not entirely out of the question that 150MB/s is actually the
> hw limit. If it's really pcie 1x 1.0, no idea where to check that.
> Also might be worth to double-check that the gpu pci bar is listed as
> wc in debugfs/x86/pat_memtype_list.

Here is a dump of the device info and the pat_memtype_list, taken while
the machine was running another 0day task:

controller info
=================
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller])
	Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1)
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 16
	NUMA node: 0
	Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M]
	Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K]
	Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Kernel driver in use: mgag200
	Kernel modules: mgag200


Related pat setting
===================
uncached-minus @ 0xc0000000-0xc0001000
uncached-minus @ 0xc0000000-0xd0000000
uncached-minus @ 0xc0008000-0xc0009000
uncached-minus @ 0xc0009000-0xc000a000
uncached-minus @ 0xc0010000-0xc0011000
uncached-minus @ 0xc0011000-0xc0012000
uncached-minus @ 0xc0012000-0xc0013000
uncached-minus @ 0xc0013000-0xc0014000
uncached-minus @ 0xc0018000-0xc0019000
uncached-minus @ 0xc0019000-0xc001a000
uncached-minus @ 0xc001a000-0xc001b000
write-combining @ 0xd0000000-0xd0300000
write-combining @ 0xd0000000-0xd1000000
uncached-minus @ 0xd1800000-0xd1804000
uncached-minus @ 0xd1900000-0xd1980000
uncached-minus @ 0xd1980000-0xd1981000
uncached-minus @ 0xd1a00000-0xd1a80000
uncached-minus @ 0xd1a80000-0xd1a81000
uncached-minus @ 0xd1f10000-0xd1f11000
uncached-minus @ 0xd1f11000-0xd1f12000
uncached-minus @ 0xd1f12000-0xd1f13000

Host bridge info
================
00:00.0 Host bridge: Intel Corporation Device 7853
	Subsystem: Intel Corporation Device 0000
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 0
	NUMA node: 0
	Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us
			ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [e0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
	Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
	Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
	Capabilities: [250 v1] #19
	Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
	Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>


Thanks,
Feng


>
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-05  6:59                                                 ` Feng Tang
@ 2019-09-05 10:37                                                   ` Daniel Vetter
  2019-09-05 10:48                                                     ` Feng Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Daniel Vetter @ 2019-09-05 10:37 UTC (permalink / raw)
  To: Feng Tang
  Cc: Dave Airlie, Stephen Rothwell, Rong Chen, Michel Dänzer,
	Linux Kernel Mailing List, dri-devel, Thomas Zimmermann, LKP

On Thu, Sep 5, 2019 at 8:58 AM Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Vetter,
>
> On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
> > On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >
> > > > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <feng.tang@intel.com> wrote:
> > > > >
> > > > > Hi Daniel,
> > > > >
> > > > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > > > > > >
> > > > > > > Hi
> > > > > > >
> > > > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > > > >> reason why it's faster.
> > > > > > > >
> > > > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > > > skipped?
> > > > > > >
> > > > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > > > the worker before a previous instance started, will not create a second
> > > > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > > > some way, skipping workers already happens.
> > > > > >
> > > > > > So I think that the most often fbcon update from atomic context is the
> > > > > > blinking cursor. If you disable that one you should be back to the old
> > > > > > performance level I think, since just writing to dmesg is from process
> > > > > > context, so shouldn't change.
> > > > >
> > > > > Hmm, then for the old driver, it should also do the most update in
> > > > > non-atomic context?
> > > > >
> > > > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > > > cachable buffer?
> > > >
> > > > Hm, that would indicate the write-combining got broken somewhere. This
> > > > should definitely be faster. Also we shouldn't transfer the hole
> > > > thing, except when scrolling ...
> > >
> > > First rule of fbcon usage, you are always effectively scrolling.
> > >
> > > Also these devices might be on a PCIE 1x piece of wet string, not sure
> > > if the numbers reflect that.
> >
> > pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and
> > overhead not entirely out of the question that 150MB/s is actually the
> > hw limit. If it's really pcie 1x 1.0, no idea where to check that.
> > Also might be worth to double-check that the gpu pci bar is listed as
> > wc in debugfs/x86/pat_memtype_list.
>
> Here is some dump of the device info and the pat_memtype_list, while it is
> running other 0day task:

Looks all good, so I guess Dave is right that this is probably just a
really slow, really old pcie link, plus maybe some inefficiencies in the
mapping. Your 150 MB/s, was that just the copy, or did your measurement
in the trace also include all the setup/map/unmap/teardown?

>
> controller info
> =================
> 03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller])
>         Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1)
>         Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 16
>         NUMA node: 0
>         Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M]
>         Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K]
>         Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M]
>         Expansion ROM at 000c0000 [disabled] [size=128K]
>         Capabilities: [dc] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                         MaxPayload 128 bytes, MaxReadReq 128 bytes
>                 DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>         Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit-
>                 Address: 00000000  Data: 0000
>         Kernel driver in use: mgag200
>         Kernel modules: mgag200
>
>
> Related pat setting
> ===================
> uncached-minus @ 0xc0000000-0xc0001000
> uncached-minus @ 0xc0000000-0xd0000000
> uncached-minus @ 0xc0008000-0xc0009000
> uncached-minus @ 0xc0009000-0xc000a000
> uncached-minus @ 0xc0010000-0xc0011000
> uncached-minus @ 0xc0011000-0xc0012000
> uncached-minus @ 0xc0012000-0xc0013000
> uncached-minus @ 0xc0013000-0xc0014000
> uncached-minus @ 0xc0018000-0xc0019000
> uncached-minus @ 0xc0019000-0xc001a000
> uncached-minus @ 0xc001a000-0xc001b000
> write-combining @ 0xd0000000-0xd0300000
> write-combining @ 0xd0000000-0xd1000000
> uncached-minus @ 0xd1800000-0xd1804000
> uncached-minus @ 0xd1900000-0xd1980000
> uncached-minus @ 0xd1980000-0xd1981000
> uncached-minus @ 0xd1a00000-0xd1a80000
> uncached-minus @ 0xd1a80000-0xd1a81000
> uncached-minus @ 0xd1f10000-0xd1f11000
> uncached-minus @ 0xd1f11000-0xd1f12000
> uncached-minus @ 0xd1f12000-0xd1f13000
>
> Host bridge info
> ================
> 00:00.0 Host bridge: Intel Corporation Device 7853
>         Subsystem: Intel Corporation Device 0000
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 0
>         NUMA node: 0
>         Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
>                 DevCap: MaxPayload 128 bytes, PhantFunc 0
>                         ExtTag- RBE+
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                         MaxPayload 128 bytes, MaxReadReq 128 bytes
>                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us
>                         ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
>                 RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
>                 RootCap: CRSVisible-
>                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>                 DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
>                 LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [e0] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
>         Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
>         Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
>         Capabilities: [250 v1] #19
>         Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
>         Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
>
>
> Thanks,
> Feng
>
>
> >
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-05 10:37                                                   ` Daniel Vetter
@ 2019-09-05 10:48                                                     ` Feng Tang
  0 siblings, 0 replies; 61+ messages in thread
From: Feng Tang @ 2019-09-05 10:48 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Stephen Rothwell, Chen, Rong A, LKP, Michel Dänzer,
	Linux Kernel Mailing List, dri-devel, Thomas Zimmermann

On Thu, Sep 05, 2019 at 06:37:47PM +0800, Daniel Vetter wrote:
> On Thu, Sep 5, 2019 at 8:58 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Hi Vetter,
> >
> > On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
> > > On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >
> > > > > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <feng.tang@intel.com> wrote:
> > > > > >
> > > > > > Hi Daniel,
> > > > > >
> > > > > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > > > > > > >
> > > > > > > > Hi
> > > > > > > >
> > > > > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > > > > >> reason why it's faster.
> > > > > > > > >
> > > > > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > > > > skipped?
> > > > > > > >
> > > > > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > > > > the worker before a previous instance started, will not create a second
> > > > > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > > > > some way, skipping workers already happens.
> > > > > > >
> > > > > > > So I think that the most often fbcon update from atomic context is the
> > > > > > > blinking cursor. If you disable that one you should be back to the old
> > > > > > > performance level I think, since just writing to dmesg is from process
> > > > > > > context, so shouldn't change.
> > > > > >
> > > > > > Hmm, then for the old driver, it should also do the most update in
> > > > > > non-atomic context?
> > > > > >
> > > > > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > > > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > > > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > > > > cachable buffer?
> > > > >
> > > > > Hm, that would indicate the write-combining got broken somewhere. This
> > > > > should definitely be faster. Also we shouldn't transfer the hole
> > > > > thing, except when scrolling ...
> > > >
> > > > First rule of fbcon usage, you are always effectively scrolling.
> > > >
> > > > Also these devices might be on a PCIE 1x piece of wet string, not sure
> > > > if the numbers reflect that.
> > >
> > > pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and
> > > overhead not entirely out of the question that 150MB/s is actually the
> > > hw limit. If it's really pcie 1x 1.0, no idea where to check that.
> > > Also might be worth to double-check that the gpu pci bar is listed as
> > > wc in debugfs/x86/pat_memtype_list.
> >
> > Here is some dump of the device info and the pat_memtype_list, while it is
> > running other 0day task:
> 
> Looks all good, I guess Dave is right with this probably only being a
> real slow, real old pcie link, plus maybe some inefficiencies in the
> mapping. Your 150MB/s, was that just the copy, or did you include all
> the setup/map/unmap/teardown too in your measurement in the trace?


Here is the breakdown; the 19240 us is the memory-copy time alone.

drm_fb_helper_dirty_work() calls, in sequence:
1. drm_client_buffer_vmap	  (290 us)
2. drm_fb_helper_dirty_blit_real  (19240 us)
3. helper->fb->funcs->dirty()     ---> NULL for the mgag200 driver
4. drm_client_buffer_vunmap       (215 us)

So the vmap/vunmap overhead is small; nearly all of the ~20 ms is the
blit itself, which matches the ~150 MB/s figure above.

Thanks,
Feng


> -Daniel
> 
> >
> > controller info
> > =================
> > 03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller])
> >         Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1)
> >         Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >         Interrupt: pin A routed to IRQ 16
> >         NUMA node: 0
> >         Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M]
> >         Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K]
> >         Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M]
> >         Expansion ROM at 000c0000 [disabled] [size=128K]
> >         Capabilities: [dc] Power Management version 2
> >                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> >                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> >         Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00
> >                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
> >                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> >                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> >                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >                         MaxPayload 128 bytes, MaxReadReq 128 bytes
> >                 DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
> >                 LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
> >                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
> >                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> >                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> >                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >         Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit-
> >                 Address: 00000000  Data: 0000
> >         Kernel driver in use: mgag200
> >         Kernel modules: mgag200
> >
> >
> > Related pat setting
> > ===================
> > uncached-minus @ 0xc0000000-0xc0001000
> > uncached-minus @ 0xc0000000-0xd0000000
> > uncached-minus @ 0xc0008000-0xc0009000
> > uncached-minus @ 0xc0009000-0xc000a000
> > uncached-minus @ 0xc0010000-0xc0011000
> > uncached-minus @ 0xc0011000-0xc0012000
> > uncached-minus @ 0xc0012000-0xc0013000
> > uncached-minus @ 0xc0013000-0xc0014000
> > uncached-minus @ 0xc0018000-0xc0019000
> > uncached-minus @ 0xc0019000-0xc001a000
> > uncached-minus @ 0xc001a000-0xc001b000
> > write-combining @ 0xd0000000-0xd0300000
> > write-combining @ 0xd0000000-0xd1000000
> > uncached-minus @ 0xd1800000-0xd1804000
> > uncached-minus @ 0xd1900000-0xd1980000
> > uncached-minus @ 0xd1980000-0xd1981000
> > uncached-minus @ 0xd1a00000-0xd1a80000
> > uncached-minus @ 0xd1a80000-0xd1a81000
> > uncached-minus @ 0xd1f10000-0xd1f11000
> > uncached-minus @ 0xd1f11000-0xd1f12000
> > uncached-minus @ 0xd1f12000-0xd1f13000
> >
> > Host bridge info
> > ================
> > 00:00.0 Host bridge: Intel Corporation Device 7853
> >         Subsystem: Intel Corporation Device 0000
> >         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
> >         Interrupt: pin A routed to IRQ 0
> >         NUMA node: 0
> >         Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
> >                 DevCap: MaxPayload 128 bytes, PhantFunc 0
> >                         ExtTag- RBE+
> >                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> >                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >                         MaxPayload 128 bytes, MaxReadReq 128 bytes
> >                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> >                 LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us
> >                         ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
> >                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> >                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> >                 LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
> >                 RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
> >                 RootCap: CRSVisible-
> >                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> >                 DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
> >                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
> >                 LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> >                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> >                          Compliance De-emphasis: -6dB
> >                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> >                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> >         Capabilities: [e0] Power Management version 3
> >                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> >                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> >         Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> >         Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
> >         Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
> >         Capabilities: [250 v1] #19
> >         Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
> >         Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
> >
> >
> > Thanks,
> > Feng
> >
> >
> > >
> > > -Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> 
> 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-04  6:27                                   ` Feng Tang
  2019-09-04  6:53                                     ` Thomas Zimmermann
@ 2019-09-09 14:12                                     ` Thomas Zimmermann
  2019-09-16  9:06                                       ` Feng Tang
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2019-09-09 14:12 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, Rong Chen, michel, linux-kernel, dri-devel, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 2116 bytes --]

Hi

Am 04.09.19 um 08:27 schrieb Feng Tang:
> Hi Thomas,
> 
> On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 28.08.19 um 11:37 schrieb Rong Chen:
>>> Hi Thomas,
>>>
>>> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
>>>>> Both patches have little impact on the performance from our side.
>>>> Thanks for testing. Too bad they doesn't solve the issue.
>>>>
>>>> There's another patch attached. Could you please tests this as well?
>>>> Thanks a lot!
>>>>
>>>> The patch comes from Daniel Vetter after discussing the problem on IRC.
>>>> The idea of the patch is that the old mgag200 code might display much
>>>> less frames that the generic code, because mgag200 only prints from
>>>> non-atomic context. If we simulate this with the generic code, we should
>>>> see roughly the original performance.
>>>>
>>>>
>>>
>>> It's cool, the patch "usecansleep.patch" can fix the issue.
>>
>> Thank you for testing. But don't get too excited, because the patch
>> simulates a bug that was present in the original mgag200 code. A
>> significant number of frames are simply skipped. That is apparently the
>> reason why it's faster.
> 
> Thanks for the detailed info, so the original code skips time-consuming
> work inside atomic context on purpose. Is there any space to optmise it?
> If 2 scheduled update worker are handled at almost same time, can one be
> skipped?

We discussed ideas on IRC and decided that screen updates could be
synchronized with vblank intervals. This may give some rate limiting to
the output.
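
The idea is roughly the following (an illustrative sketch only, not the
code of the actual patch set linked below; it also uses a fixed ~60 Hz
period instead of real vblank events):

/* Coalesce console damage and flush it at most once per frame period.
 * All names (struct fb_flush, fb_add_damage, ...) are illustrative. */
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <drm/drm_rect.h>

struct fb_flush {
	spinlock_t lock;
	struct drm_rect pending;	/* union of all damage since last flush */
	struct delayed_work work;	/* worker does vmap + blit + vunmap */
};

static void fb_add_damage(struct fb_flush *f, const struct drm_rect *clip)
{
	unsigned long flags;

	spin_lock_irqsave(&f->lock, flags);
	f->pending.x1 = min(f->pending.x1, clip->x1);
	f->pending.y1 = min(f->pending.y1, clip->y1);
	f->pending.x2 = max(f->pending.x2, clip->x2);
	f->pending.y2 = max(f->pending.y2, clip->y2);
	spin_unlock_irqrestore(&f->lock, flags);

	/* no-op if already queued, so bursts of updates collapse into at
	 * most one blit per ~16.7 ms period */
	schedule_delayed_work(&f->work, usecs_to_jiffies(16667));
}

The damage side only takes a spinlock and queues work, so it stays safe
for the atomic paths that fbcon uses; the expensive vmap/blit/vunmap
remains in the worker.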

If you like, you could try the patch set at [1]. It adds the respective
code to console and mgag200.

Best regards
Thomas

[1]
https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html

> 
> Thanks,
> Feng
> 
>>
>> Best regards
>> Thomas

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-09 14:12                                     ` Thomas Zimmermann
@ 2019-09-16  9:06                                       ` Feng Tang
  2019-09-17  8:48                                         ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Feng Tang @ 2019-09-16  9:06 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Rong Chen, Stephen Rothwell, michel, lkp, linux-kernel, dri-devel

Hi Thomas,

On Mon, Sep 09, 2019 at 04:12:37PM +0200, Thomas Zimmermann wrote:
> Hi
> 
> Am 04.09.19 um 08:27 schrieb Feng Tang:
> > Hi Thomas,
> > 
> > On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
> >> Hi
> >>
> >> Am 28.08.19 um 11:37 schrieb Rong Chen:
> >>> Hi Thomas,
> >>>
> >>> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
> >>>> Hi
> >>>>
> >>>> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
> >>>>> Both patches have little impact on the performance from our side.
> >>>> Thanks for testing. Too bad they doesn't solve the issue.
> >>>>
> >>>> There's another patch attached. Could you please tests this as well?
> >>>> Thanks a lot!
> >>>>
> >>>> The patch comes from Daniel Vetter after discussing the problem on IRC.
> >>>> The idea of the patch is that the old mgag200 code might display much
> >>>> less frames that the generic code, because mgag200 only prints from
> >>>> non-atomic context. If we simulate this with the generic code, we should
> >>>> see roughly the original performance.
> >>>>
> >>>>
> >>>
> >>> It's cool, the patch "usecansleep.patch" can fix the issue.
> >>
> >> Thank you for testing. But don't get too excited, because the patch
> >> simulates a bug that was present in the original mgag200 code. A
> >> significant number of frames are simply skipped. That is apparently the
> >> reason why it's faster.
> > 
> > Thanks for the detailed info, so the original code skips time-consuming
> > work inside atomic context on purpose. Is there any space to optmise it?
> > If 2 scheduled update worker are handled at almost same time, can one be
> > skipped?
> 
> We discussed ideas on IRC and decided that screen updates could be
> synchronized with vblank intervals. This may give some rate limiting to
> the output.
> 
> If you like, you could try the patch set at [1]. It adds the respective
> code to console and mgag200.

I just tried the 2 patches, and there is no obvious change (compared to
the 18.8% regression), either in the overall benchmark or in the
micro-profiling.

90f479ae51afa45e 04a0983095feaee022cdd65e3e4 
---------------- --------------------------- 
     37236 ±  3%      +2.5%      38167 ±  3%  vm-scalability.median
      0.15 ± 24%     -25.1%       0.11 ± 23%  vm-scalability.median_stddev
      0.15 ± 23%     -25.1%       0.11 ± 22%  vm-scalability.stddev
  12767318 ±  4%      +2.5%   13089177 ±  3%  vm-scalability.throughput
 
Thanks,
Feng

> 
> Best regards
> Thomas
> 
> [1]
> https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html
> 
> > 
> > Thanks,
> > Feng
> > 
> >>
> >> Best regards
> >> Thomas
> 
> -- 
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-09-16  9:06                                       ` Feng Tang
@ 2019-09-17  8:48                                         ` Thomas Zimmermann
  0 siblings, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2019-09-17  8:48 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, Rong Chen, michel, linux-kernel, dri-devel, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 3446 bytes --]

Hi

Am 16.09.19 um 11:06 schrieb Feng Tang:
> Hi Thomas,
> 
> On Mon, Sep 09, 2019 at 04:12:37PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 04.09.19 um 08:27 schrieb Feng Tang:
>>> Hi Thomas,
>>>
>>> On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> Am 28.08.19 um 11:37 schrieb Rong Chen:
>>>>> Hi Thomas,
>>>>>
>>>>> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
>>>>>> Hi
>>>>>>
>>>>>> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
>>>>>>> Both patches have little impact on the performance from our side.
> >>>>>> Thanks for testing. Too bad they don't solve the issue.
>>>>>>
> >>>>>> There's another patch attached. Could you please test this as well?
>>>>>> Thanks a lot!
>>>>>>
>>>>>> The patch comes from Daniel Vetter after discussing the problem on IRC.
> >>>>>> The idea of the patch is that the old mgag200 code might display far
> >>>>>> fewer frames than the generic code, because mgag200 only prints from
>>>>>> non-atomic context. If we simulate this with the generic code, we should
>>>>>> see roughly the original performance.
>>>>>>
>>>>>>
>>>>>
>>>>> It's cool, the patch "usecansleep.patch" can fix the issue.
>>>>
>>>> Thank you for testing. But don't get too excited, because the patch
>>>> simulates a bug that was present in the original mgag200 code. A
>>>> significant number of frames are simply skipped. That is apparently the
>>>> reason why it's faster.
>>>
> >>> Thanks for the detailed info. So the original code skips time-consuming
> >>> work inside atomic context on purpose. Is there any space to optimise it?
> >>> If 2 scheduled update workers are handled at almost the same time, can one be
>>> skipped?
>>
>> We discussed ideas on IRC and decided that screen updates could be
>> synchronized with vblank intervals. This may give some rate limiting to
>> the output.
>>
>> If you like, you could try the patch set at [1]. It adds the respective
>> code to console and mgag200.
> 
> I just tried the 2 patches: no obvious change (compared to the
> 18.8% regression), both in the overall benchmark and in micro-profiling.
> 
> 90f479ae51afa45e 04a0983095feaee022cdd65e3e4 
> ---------------- --------------------------- 
>      37236 ±  3%      +2.5%      38167 ±  3%  vm-scalability.median
>       0.15 ± 24%     -25.1%       0.11 ± 23%  vm-scalability.median_stddev
>       0.15 ± 23%     -25.1%       0.11 ± 22%  vm-scalability.stddev
>   12767318 ±  4%      +2.5%   13089177 ±  3%  vm-scalability.throughput

Thank you for testing. I wish we'd seen at least some improvement.

Best regards
Thomas

> Thanks,
> Feng
> 
>>
>> Best regards
>> Thomas
>>
>> [1]
>> https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html
>>
>>>
>>> Thanks,
>>> Feng
>>>
>>>>
>>>> Best regards
>>>> Thomas
>>
>> -- 
>> Thomas Zimmermann
>> Graphics Driver Developer
>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>> HRB 21284 (AG Nürnberg)
>>
> 
> 
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2019-08-05 12:52         ` Feng Tang
@ 2020-01-06 13:19           ` Thomas Zimmermann
  2020-01-08  2:25             ` Rong Chen
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Zimmermann @ 2020-01-06 13:19 UTC (permalink / raw)
  To: Feng Tang
  Cc: Stephen Rothwell, kernel test robot, michel, dri-devel, ying.huang, lkp


[-- Attachment #1.1.1: Type: text/plain, Size: 5740 bytes --]

Hi Feng,

do you still have the test setup that produced the performance penalty?

If so, could you give a try to the patchset at [1]? I think I've fixed
the remaining issues in earlier versions and I'd like to see if it
actually improves performance.

Best regards
Thomas

[1]
https://lists.freedesktop.org/archives/dri-devel/2019-December/247771.html

Am 05.08.19 um 14:52 schrieb Feng Tang:
> Hi Thomas,
> 
> On Mon, Aug 05, 2019 at 12:22:11PM +0200, Thomas Zimmermann wrote:
> 
> 	[snip] 
> 
>>>>   2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>>>   2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc --prefault
>>>>     -O -U 815394406
>>>>   917318700 bytes / 659419 usecs = 1358497 KB/s
>>>>   917318700 bytes / 659658 usecs = 1358005 KB/s
>>>>   917318700 bytes / 659916 usecs = 1357474 KB/s
>>>>   917318700 bytes / 660168 usecs = 1356956 KB/s
>>>>
>>>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
>>>
>>> Glad to know this method restored the drop. Rong is running the case.
>>>
>>> While I have another finding: as I noticed your patch changed the bpp from
>>> 24 to 32, I had a patch to change it back to 24, and ran the case over
>>> the weekend; the -18% regression was reduced to about -5%. Could this
>>> be related?
>>
>> In the original code, the fbdev console already ran with 32 bpp [1] and
>> 16 bpp was selected for low-end devices. [2][3] The patch only set the
>> same values for userspace; nothing changed for the console.
> 
> I did the experiment because I checked the commit 
> 
> 90f479ae51afa4 drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
> 
> in which there is code:
> 
> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
> index b10f726..a977333 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>  	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>  		dev->mode_config.preferred_depth = 16;
>  	else
> -		dev->mode_config.preferred_depth = 24;
> +		dev->mode_config.preferred_depth = 32;
>  	dev->mode_config.prefer_shadow = 1;
>  
> My debug patch was kind of restoring of this part.
> 
> Thanks,
> Feng
> 
>>
>> Best regards
>> Thomas
>>
>> [1]
>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n259
>> [2]
>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n263
>> [3]
>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n286
>>
>>>
>>> commit: 
>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
>>>   01e75fea0d5 mgag200: restore the depth back to 24
>>>
>>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5 
>>> ---------------- --------------------------- --------------------------- 
>>>      43921 ±  2%     -18.3%      35884            -4.8%      41826        vm-scalability.median
>>>   14889337           -17.5%   12291029            -4.1%   14278574        vm-scalability.throughput
>>>  
>>> commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
>>> Author: Feng Tang <feng.tang@intel.com>
>>> Date:   Fri Aug 2 15:09:19 2019 +0800
>>>
>>>     mgag200: restore the depth back to 24
>>>     
>>>     Signed-off-by: Feng Tang <feng.tang@intel.com>
>>>
>>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
>>> index a977333..ac8f6c9 100644
>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>>>  	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>>  		dev->mode_config.preferred_depth = 16;
>>>  	else
>>> -		dev->mode_config.preferred_depth = 32;
>>> +		dev->mode_config.preferred_depth = 24;
>>>  	dev->mode_config.prefer_shadow = 1;
>>>  
>>>  	r = mgag200_modeset_init(mdev);
>>>
>>> Thanks,
>>> Feng
>>>
>>>>
>>>>
>>>> The difference between mgag200's original fbdev support and generic
>>>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>>>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>>>> on drm_can_sleep(), which is deprecated.
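
(In other words, sketched below with made-up struct and field names rather than the real drm_fb_helper internals: the generic emulation never drops an update, it records the damage and defers the copy to a worker running in process context. The price is the extra worker activity that shows up in this workload; the benefit is that console output from atomic context is no longer lost.)

#include <linux/workqueue.h>

/* Sketch of the generic-emulation approach. The fbdev write path only
 * records the damage and kicks a worker; the worker later maps the VRAM
 * buffer, copies the dirty region from the shadow buffer and unmaps it.
 * Struct and field names are placeholders, not the real helpers.
 */
struct shadow_flush_sketch {
	struct work_struct work;	/* copies shadow to VRAM in process context */
	/* dirty rectangle, shadow buffer pointer, VRAM mapping, ... */
};

static void handle_damage_sketch(struct shadow_flush_sketch *s)
{
	schedule_work(&s->work);	/* update is never dropped, only deferred */
}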
>>>>
>>>> I think that the worker task interferes with the test case, as the
>>>> worker has been in fbdev emulation since forever and no performance
>>>> regressions have been reported so far.
>>>>
>>>>
>>>> So unless there's a report where this problem happens in a real-world
>>>> use case, I'd like to keep code as it is. And apparently there's always
>>>> the workaround of disabling the cursor blinking.
>>>>
>>>> Best regards
>>>> Thomas
>>>>
>>
>> -- 
>> Thomas Zimmermann
>> Graphics Driver Developer
>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>> HRB 21284 (AG Nürnberg)
>>
> 
> 
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2020-01-06 13:19           ` Thomas Zimmermann
@ 2020-01-08  2:25             ` Rong Chen
  2020-01-08  5:20               ` Thomas Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Rong Chen @ 2020-01-08  2:25 UTC (permalink / raw)
  To: Thomas Zimmermann, Feng Tang
  Cc: Stephen Rothwell, michel, lkp, dri-devel, ying.huang


[-- Attachment #1.1: Type: text/plain, Size: 6625 bytes --]

Hi Thomas,

The previous throughput dropped from 43955 to 35691. There is a slight increase in next-20200106,
but no obvious change from the patchset:
  
commit:
   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")

f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9
---------------- ---------------------------
          %stddev     %change         %stddev
              \          |                \
      43955 ±  2%     -18.8%      35691        vm-scalability.median

commit:

   9eb1b48ca4 ("Add linux-next specific files for 20200106")
   5f20199bac ("drm/fb-helper: Synchronize dirty worker with vblank")

  next-20200106  5f20199bac9b2de71fd2158b90
----------------  --------------------------
          %stddev      change         %stddev
              \          |                \
      38550                       38744
      38549                       38744        vm-scalability.median


Best Regards,
Rong Chen

On 1/6/20 9:19 PM, Thomas Zimmermann wrote:
> Hi Feng,
>
> do you still have the test setup that produced the performance penalty?
>
> If so, could you give a try to the patchset at [1]? I think I've fixed
> the remaining issues in earlier versions and I'd like to see if it
> actually improves performance.
>
> Best regards
> Thomas
>
> [1]
> https://lists.freedesktop.org/archives/dri-devel/2019-December/247771.html
>
> Am 05.08.19 um 14:52 schrieb Feng Tang:
>> Hi Thomas,
>>
>> On Mon, Aug 05, 2019 at 12:22:11PM +0200, Thomas Zimmermann wrote:
>>
>> 	[snip]
>>
>>>>>    2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>>>>    2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc --prefault
>>>>>      -O -U 815394406
>>>>>    917318700 bytes / 659419 usecs = 1358497 KB/s
>>>>>    917318700 bytes / 659658 usecs = 1358005 KB/s
>>>>>    917318700 bytes / 659916 usecs = 1357474 KB/s
>>>>>    917318700 bytes / 660168 usecs = 1356956 KB/s
>>>>>
>>>>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
>>>> Glad to know this method restored the drop. Rong is running the case.
>>>>
>>>> While I have another finding: as I noticed your patch changed the bpp from
>>>> 24 to 32, I had a patch to change it back to 24, and ran the case over
>>>> the weekend; the -18% regression was reduced to about -5%. Could this
>>>> be related?
>>> In the original code, the fbdev console already ran with 32 bpp [1] and
>>> 16 bpp was selected for low-end devices. [2][3] The patch only set the
>>> same values for userspace; nothing changed for the console.
>> I did the experiment because I checked the commit
>>
>> 90f479ae51afa4 drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
>>
>> in which there is code:
>>
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
>> index b10f726..a977333 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>>   	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>   		dev->mode_config.preferred_depth = 16;
>>   	else
>> -		dev->mode_config.preferred_depth = 24;
>> +		dev->mode_config.preferred_depth = 32;
>>   	dev->mode_config.prefer_shadow = 1;
>>   
>> My debug patch was kind of restoring of this part.
>>
>> Thanks,
>> Feng
>>
>>> Best regards
>>> Thomas
>>>
>>> [1]
>>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n259
>>> [2]
>>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n263
>>> [3]
>>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n286
>>>
>>>> commit:
>>>>    f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>    90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
>>>>    01e75fea0d5 mgag200: restore the depth back to 24
>>>>
>>>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
>>>> ---------------- --------------------------- ---------------------------
>>>>       43921 ±  2%     -18.3%      35884            -4.8%      41826        vm-scalability.median
>>>>    14889337           -17.5%   12291029            -4.1%   14278574        vm-scalability.throughput
>>>>   
>>>> commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
>>>> Author: Feng Tang <feng.tang@intel.com>
>>>> Date:   Fri Aug 2 15:09:19 2019 +0800
>>>>
>>>>      mgag200: restore the depth back to 24
>>>>      
>>>>      Signed-off-by: Feng Tang <feng.tang@intel.com>
>>>>
>>>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> index a977333..ac8f6c9 100644
>>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>>>>   	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>>>   		dev->mode_config.preferred_depth = 16;
>>>>   	else
>>>> -		dev->mode_config.preferred_depth = 32;
>>>> +		dev->mode_config.preferred_depth = 24;
>>>>   	dev->mode_config.prefer_shadow = 1;
>>>>   
>>>>   	r = mgag200_modeset_init(mdev);
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>>>
>>>>> The difference between mgag200's original fbdev support and generic
>>>>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>>>>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>>>>> on drm_can_sleep(), which is deprecated.
>>>>>
>>>>> I think that the worker task interferes with the test case, as the
>>>>> worker has been in fbdev emulation since forever and no performance
>>>>> regressions have been reported so far.
>>>>>
>>>>>
>>>>> So unless there's a report where this problem happens in a real-world
>>>>> use case, I'd like to keep code as it is. And apparently there's always
>>>>> the workaround of disabling the cursor blinking.
>>>>>
>>>>> Best regards
>>>>> Thomas
>>>>>
>>> -- 
>>> Thomas Zimmermann
>>> Graphics Driver Developer
>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>> HRB 21284 (AG Nürnberg)
>>>
>>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>


[-- Attachment #1.2: Type: text/html, Size: 8925 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression
  2020-01-08  2:25             ` Rong Chen
@ 2020-01-08  5:20               ` Thomas Zimmermann
  0 siblings, 0 replies; 61+ messages in thread
From: Thomas Zimmermann @ 2020-01-08  5:20 UTC (permalink / raw)
  To: Rong Chen, Feng Tang; +Cc: Stephen Rothwell, michel, lkp, dri-devel, ying.huang


[-- Attachment #1.1.1: Type: text/plain, Size: 7319 bytes --]

Hi

Am 08.01.20 um 03:25 schrieb Rong Chen:
> Hi Thomas,
> 
> The previous throughput dropped from 43955 to 35691. There is a slight increase in next-20200106,
> but no obvious change from the patchset:

OK, I would have hoped for some improvements. Anyway, thanks for testing.

Best regards
Thomas

>  
> commit: 
>   f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
>   90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
> 
> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 
> ---------------- --------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>      43955 ±  2%     -18.8%      35691        vm-scalability.median
> 
> commit: 
> 
>   9eb1b48ca4 ("Add linux-next specific files for 20200106")
>   5f20199bac ("drm/fb-helper: Synchronize dirty worker with vblank")
> 
>  next-20200106  5f20199bac9b2de71fd2158b90
> ----------------  --------------------------
>          %stddev      change         %stddev
>              \          |                \  
>      38550                       38744       
>      38549                       38744        vm-scalability.median
> 
> 
> Best Regards,
> Rong Chen
> 
> On 1/6/20 9:19 PM, Thomas Zimmermann wrote:
>> Hi Feng,
>>
>> do you still have the test setup that produced the performance penalty?
>>
>> If so, could you give a try to the patchset at [1]? I think I've fixed
>> the remaining issues in earlier versions and I'd like to see if it
>> actually improves performance.
>>
>> Best regards
>> Thomas
>>
>> [1]
>> https://lists.freedesktop.org/archives/dri-devel/2019-December/247771.html
>>
>> Am 05.08.19 um 14:52 schrieb Feng Tang:
>>> Hi Thomas,
>>>
>>> On Mon, Aug 05, 2019 at 12:22:11PM +0200, Thomas Zimmermann wrote:
>>>
>>> 	[snip] 
>>>
>>>>>>   2019-08-03 19:29:17  ./case-anon-cow-seq-hugetlb
>>>>>>   2019-08-03 19:29:17  ./usemem --runtime 300 -n 4 --prealloc --prefault
>>>>>>     -O -U 815394406
>>>>>>   917318700 bytes / 659419 usecs = 1358497 KB/s
>>>>>>   917318700 bytes / 659658 usecs = 1358005 KB/s
>>>>>>   917318700 bytes / 659916 usecs = 1357474 KB/s
>>>>>>   917318700 bytes / 660168 usecs = 1356956 KB/s
>>>>>>
>>>>>> Rong, Feng, could you confirm this by disabling the cursor or blinking?
>>>>> Glad to know this method restored the drop. Rong is running the case.
>>>>>
>>>>> While I have another finding: as I noticed your patch changed the bpp from
>>>>> 24 to 32, I had a patch to change it back to 24, and ran the case over
>>>>> the weekend; the -18% regression was reduced to about -5%. Could this
>>>>> be related?
>>>> In the original code, the fbdev console already ran with 32 bpp [1] and
>>>> 16 bpp was selected for low-end devices. [2][3] The patch only set the
>>>> same values for userspace; nothing changed for the console.
>>> I did the experiment because I checked the commit 
>>>
>>> 90f479ae51afa4 drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
>>>
>>> in which there is code:
>>>
>>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
>>> index b10f726..a977333 100644
>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>>>  	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>>  		dev->mode_config.preferred_depth = 16;
>>>  	else
>>> -		dev->mode_config.preferred_depth = 24;
>>> +		dev->mode_config.preferred_depth = 32;
>>>  	dev->mode_config.prefer_shadow = 1;
>>>  
>>> My debug patch was kind of restoring of this part.
>>>
>>> Thanks,
>>> Feng
>>>
>>>> Best regards
>>>> Thomas
>>>>
>>>> [1]
>>>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n259
>>>> [2]
>>>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n263
>>>> [3]
>>>> https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag200_fb.c?id=5d17718997367c435dbe5341a8e270d9b19478d3#n286
>>>>
>>>>> commit: 
>>>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
>>>>>   01e75fea0d5 mgag200: restore the depth back to 24
>>>>>
>>>>> f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5 
>>>>> ---------------- --------------------------- --------------------------- 
>>>>>      43921 ±  2%     -18.3%      35884            -4.8%      41826        vm-scalability.median
>>>>>   14889337           -17.5%   12291029            -4.1%   14278574        vm-scalability.throughput
>>>>>  
>>>>> commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
>>>>> Author: Feng Tang <feng.tang@intel.com>
>>>>> Date:   Fri Aug 2 15:09:19 2019 +0800
>>>>>
>>>>>     mgag200: restore the depth back to 24
>>>>>     
>>>>>     Signed-off-by: Feng Tang <feng.tang@intel.com>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>>> index a977333..ac8f6c9 100644
>>>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>>> @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>>>>>  	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
>>>>>  		dev->mode_config.preferred_depth = 16;
>>>>>  	else
>>>>> -		dev->mode_config.preferred_depth = 32;
>>>>> +		dev->mode_config.preferred_depth = 24;
>>>>>  	dev->mode_config.prefer_shadow = 1;
>>>>>  
>>>>>  	r = mgag200_modeset_init(mdev);
>>>>>
>>>>> Thanks,
>>>>> Feng
>>>>>
>>>>>> The difference between mgag200's original fbdev support and generic
>>>>>> fbdev emulation is generic fbdev's worker task that updates the VRAM
>>>>>> buffer from the shadow buffer. mgag200 does this immediately, but relies
>>>>>> on drm_can_sleep(), which is deprecated.
>>>>>>
>>>>>> I think that the worker task interferes with the test case, as the
>>>>>> worker has been in fbdev emulation since forever and no performance
>>>>>> regressions have been reported so far.
>>>>>>
>>>>>>
>>>>>> So unless there's a report where this problem happens in a real-world
>>>>>> use case, I'd like to keep code as it is. And apparently there's always
>>>>>> the workaround of disabling the cursor blinking.
>>>>>>
>>>>>> Best regards
>>>>>> Thomas
>>>>>>
>>>> -- 
>>>> Thomas Zimmermann
>>>> Graphics Driver Developer
>>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>>> HRB 21284 (AG Nürnberg)
>>>>
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2020-01-08  5:20 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190729095155.GP22106@shao2-debian>
2019-07-30 17:50 ` [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression Thomas Zimmermann
2019-07-30 18:12   ` Daniel Vetter
2019-07-30 18:50     ` Thomas Zimmermann
2019-07-30 18:59       ` Daniel Vetter
2019-07-30 20:26         ` Dave Airlie
2019-07-31  8:13           ` Daniel Vetter
2019-07-31  9:25             ` [LKP] " Huang, Ying
2019-07-31 10:12               ` Thomas Zimmermann
2019-07-31 10:21               ` Michel Dänzer
2019-08-01  6:19                 ` Rong Chen
2019-08-01  8:37                   ` Feng Tang
2019-08-01  9:59                     ` Thomas Zimmermann
2019-08-01 11:25                       ` Feng Tang
2019-08-01 11:58                         ` Thomas Zimmermann
2019-08-02  7:11                           ` Rong Chen
2019-08-02  8:23                             ` Thomas Zimmermann
2019-08-02  9:20                             ` Thomas Zimmermann
2019-08-01  9:57                   ` Thomas Zimmermann
2019-08-01 13:30                   ` Michel Dänzer
2019-08-02  8:17                     ` Thomas Zimmermann
2019-07-31 10:10             ` Thomas Zimmermann
2019-08-02  9:11               ` Daniel Vetter
2019-08-02  9:26                 ` Thomas Zimmermann
2019-08-04 18:39   ` Thomas Zimmermann
2019-08-05  7:02     ` Feng Tang
2019-08-05 10:22       ` Thomas Zimmermann
2019-08-05 12:52         ` Feng Tang
2020-01-06 13:19           ` Thomas Zimmermann
2020-01-08  2:25             ` Rong Chen
2020-01-08  5:20               ` Thomas Zimmermann
     [not found]       ` <c0c3f387-dc93-3146-788c-23258b28a015@intel.com>
2019-08-05 10:25         ` Thomas Zimmermann
2019-08-06 12:59           ` [LKP] " Chen, Rong A
2019-08-07 10:42             ` Thomas Zimmermann
2019-08-09  8:12               ` Rong Chen
2019-08-12  7:25                 ` Feng Tang
2019-08-13  9:36                   ` Feng Tang
2019-08-16  6:55                     ` Feng Tang
2019-08-22 17:25                     ` Thomas Zimmermann
2019-08-22 20:02                       ` Dave Airlie
2019-08-23  9:54                         ` Thomas Zimmermann
2019-08-24  5:16                       ` Feng Tang
2019-08-26 10:50                         ` Thomas Zimmermann
2019-08-27 12:33                           ` Chen, Rong A
2019-08-27 17:16                             ` Thomas Zimmermann
2019-08-28  9:37                               ` Rong Chen
2019-08-28 10:51                                 ` Thomas Zimmermann
2019-09-04  6:27                                   ` Feng Tang
2019-09-04  6:53                                     ` Thomas Zimmermann
2019-09-04  8:11                                       ` Daniel Vetter
2019-09-04  8:35                                         ` Feng Tang
2019-09-04  8:43                                           ` Thomas Zimmermann
2019-09-04 14:30                                             ` Chen, Rong A
2019-09-04  9:17                                           ` Daniel Vetter
2019-09-04 11:15                                             ` Dave Airlie
2019-09-04 11:20                                               ` Daniel Vetter
2019-09-05  6:59                                                 ` Feng Tang
2019-09-05 10:37                                                   ` Daniel Vetter
2019-09-05 10:48                                                     ` Feng Tang
2019-09-09 14:12                                     ` Thomas Zimmermann
2019-09-16  9:06                                       ` Feng Tang
2019-09-17  8:48                                         ` Thomas Zimmermann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).