* [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression @ 2016-06-06 2:27 kernel test robot 2016-06-06 9:51 ` Kirill A. Shutemov 0 siblings, 1 reply; 23+ messages in thread From: kernel test robot @ 2016-06-06 2:27 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linus Torvalds, Michal Hocko, Minchan Kim, Rik van Riel, Mel Gorman, Michal Hocko, Vinayak Menon, Andrew Morton, LKML, lkp [-- Attachment #1: Type: text/plain, Size: 4496 bytes --] FYI, we noticed a -6.3% regression of unixbench.score due to commit: commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master in testcase: unixbench on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 Details are as below: --------------------------------------------------------------------------------------------------> ========================================================================================= compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench commit: 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 5c0a85fad949212b3e059692deecdeed74ae7ec7 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de ---------------- -------------------------- fail:runs %reproduction fail:runs | | | 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] %stddev %change %stddev \ | \ 14321 ± 0% -6.3% 13425 ± 0% unixbench.score 1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches 1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults 758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time 387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time 5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches 1960642 ± 0% -11.4% 1737753 ± 0% cpuidle.C1-HSW.usage 5851 ± 0% -43.8% 
3286 ± 1% proc-vmstat.nr_active_file 46185 ± 0% -21.2% 36385 ± 2% meminfo.Active 23404 ± 0% -43.8% 13147 ± 1% meminfo.Active(file) 4109 ± 5% -19.6% 3302 ± 4% slabinfo.pid.active_objs 4109 ± 5% -19.6% 3302 ± 4% slabinfo.pid.num_objs 94603 ± 0% -5.7% 89247 ± 0% vmstat.system.cs 8976 ± 0% -2.5% 8754 ± 0% vmstat.system.in 3.38 ± 2% +11.8% 3.77 ± 0% turbostat.CPU%c3 0.24 ±101% -86.3% 0.03 ± 54% turbostat.Pkg%pc3 66.53 ± 0% -1.7% 65.41 ± 0% turbostat.PkgWatt 2061 ± 1% -8.5% 1886 ± 0% sched_debug.cfs_rq:/.exec_clock.stddev 737154 ± 5% +10.8% 817107 ± 3% sched_debug.cpu.avg_idle.max 133057 ± 5% -33.2% 88864 ± 11% sched_debug.cpu.avg_idle.min 181562 ± 8% +15.9% 210434 ± 3% sched_debug.cpu.avg_idle.stddev 0.97 ± 7% +19.0% 1.16 ± 8% sched_debug.cpu.clock.stddev 0.97 ± 7% +19.0% 1.16 ± 8% sched_debug.cpu.clock_task.stddev 248.06 ± 11% +31.0% 324.94 ± 8% sched_debug.cpu.cpu_load[1].max 55.65 ± 14% +28.1% 71.30 ± 8% sched_debug.cpu.cpu_load[1].stddev 233.38 ± 10% +34.4% 313.56 ± 8% sched_debug.cpu.cpu_load[2].max 49.79 ± 15% +35.6% 67.50 ± 9% sched_debug.cpu.cpu_load[2].stddev 233.25 ± 12% +29.9% 302.94 ± 6% sched_debug.cpu.cpu_load[3].max 46.56 ± 8% +12.2% 52.25 ± 6% sched_debug.cpu.cpu_load[3].min 48.51 ± 15% +31.4% 63.76 ± 7% sched_debug.cpu.cpu_load[3].stddev 238.44 ± 12% +19.0% 283.69 ± 3% sched_debug.cpu.cpu_load[4].max 49.56 ± 9% +13.4% 56.19 ± 4% sched_debug.cpu.cpu_load[4].min 48.22 ± 13% +20.1% 57.93 ± 5% sched_debug.cpu.cpu_load[4].stddev 14792 ± 30% +71.9% 25424 ± 17% sched_debug.cpu.curr->pid.avg 42862 ± 1% +42.6% 61121 ± 0% sched_debug.cpu.curr->pid.max 19466 ± 10% +35.4% 26351 ± 9% sched_debug.cpu.curr->pid.stddev 1067 ± 6% -14.9% 909.35 ± 4% sched_debug.cpu.ttwu_local.stddev To reproduce: git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git cd lkp-tests bin/lkp install job.yaml # job file is attached in this email bin/lkp run job.yaml Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational 
purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Thanks, Xiaolong [-- Attachment #2: job.yaml --] [-- Type: text/plain, Size: 3497 bytes --] --- LKP_SERVER: inn LKP_CGI_PORT: 80 LKP_CIFS_PORT: 139 testcase: unixbench default-monitors: wait: activate-monitor kmsg: uptime: iostat: heartbeat: vmstat: numa-numastat: numa-vmstat: numa-meminfo: proc-vmstat: proc-stat: interval: 10 meminfo: slabinfo: interrupts: lock_stat: latency_stats: softirqs: bdi_dev_mapping: diskstats: nfsstat: cpuidle: cpufreq-stats: turbostat: pmeter: sched_debug: interval: 60 cpufreq_governor: performance NFS_HANG_DF_TIMEOUT: 200 NFS_HANG_CHECK_INTERVAL: 900 default-watchdogs: oom-killer: watchdog: nfs-hang: commit: 5c0a85fad949212b3e059692deecdeed74ae7ec7 model: Haswell High-end Desktop nr_cpu: 16 memory: 16G hdd_partitions: swap_partitions: rootfs_partition: description: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory category: benchmark nr_task: 1 unixbench: test: shell8 queue: bisect testbox: lituya tbox_group: lituya kconfig: x86_64-rhel enqueue_time: 2016-06-04 03:26:52.444586006 +08:00 compiler: gcc-4.9 rootfs: debian-x86_64-2015-02-07.cgz id: 101932ca34f6ff20613b88f6bed66fbc4afdfb95 user: lkp head_commit: 73aa85b30706f742655a10c967c033b56c731aff base_commit: 1a695a905c18548062509178b98bc91e67510864 branch: internal-devel/devel-hourly-2016060108-internal result_root: "/result/unixbench/performance-1-shell8/lituya/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/1" job_file: "/lkp/scheduled/lituya/bisect_unixbench-performance-1-shell8-debian-x86_64-2015-02-07.cgz-x86_64-rhel-5c0a85fad949212b3e059692deecdeed74ae7ec7-20160604-57400-1fovod8-1.yaml" max_uptime: 1032.28 initrd: "/osimage/debian/debian-x86_64-2015-02-07.cgz" bootloader_append: - root=/dev/ram0 - user=lkp - 
job=/lkp/scheduled/lituya/bisect_unixbench-performance-1-shell8-debian-x86_64-2015-02-07.cgz-x86_64-rhel-5c0a85fad949212b3e059692deecdeed74ae7ec7-20160604-57400-1fovod8-1.yaml - ARCH=x86_64 - kconfig=x86_64-rhel - branch=internal-devel/devel-hourly-2016060108-internal - commit=5c0a85fad949212b3e059692deecdeed74ae7ec7 - BOOT_IMAGE=/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/vmlinuz-4.6.0-06629-g5c0a85f - max_uptime=1032 - RESULT_ROOT=/result/unixbench/performance-1-shell8/lituya/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/1 - LKP_SERVER=inn - |2- earlyprintk=ttyS0,115200 systemd.log_level=err debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 console=ttyS0,115200 console=tty0 vga=normal rw lkp_initrd: "/lkp/lkp/lkp-x86_64.cgz" modules_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/modules.cgz" bm_initrd: "/osimage/deps/debian-x86_64-2015-02-07.cgz/lkp.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/run-ipconfig.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/turbostat.cgz,/lkp/benchmarks/turbostat.cgz,/lkp/benchmarks/unixbench.cgz" linux_headers_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/linux-headers.cgz" repeat_to: 2 kernel: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/vmlinuz-4.6.0-06629-g5c0a85f" dequeue_time: 2016-06-04 03:40:50.807274201 +08:00 job_state: finished loadavg: 5.70 2.70 1.05 1/257 3744 start_time: '1465010834' end_time: '1465011023' version: "/lkp/lkp/.src-20160603-214427" [-- Attachment #3: reproduce --] [-- Type: text/plain, Size: 1532 bytes --] 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > 
/sys/devices/system/cpu/cpu10/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu12/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu13/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu14/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu15/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor 2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor 2016-06-04 11:23:33 ./Run shell8 -c 1 ^ permalink raw reply [flat|nested] 23+ messages in thread
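[Editor's note: the sixteen per-CPU commands in the reproduce script above can be condensed into a loop. The sketch below is a dry run that only prints the commands, since writing to cpufreq sysfs requires root; pipe its output through "sudo sh" to actually apply it.]

```shell
# Dry run: emit the governor-setting command for each of the 16 CPUs.
# Paths assume the standard Linux cpufreq sysfs layout used by the script.
for cpu in $(seq 0 15); do
    echo "echo performance > /sys/devices/system/cpu/cpu${cpu}/cpufreq/scaling_governor"
done
```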
* Re: [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-06 2:27 [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression kernel test robot @ 2016-06-06 9:51 ` Kirill A. Shutemov 2016-06-08 7:21 ` [LKP] " Huang, Ying 0 siblings, 1 reply; 23+ messages in thread From: Kirill A. Shutemov @ 2016-06-06 9:51 UTC (permalink / raw) To: kernel test robot Cc: Linus Torvalds, Michal Hocko, Minchan Kim, Rik van Riel, Mel Gorman, Michal Hocko, Vinayak Menon, Andrew Morton, LKML, lkp On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: > > FYI, we noticed a -6.3% regression of unixbench.score due to commit: > > commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > > in testcase: unixbench > on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory > with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 > > > Details are as below: > --------------------------------------------------------------------------------------------------> > > > ========================================================================================= > compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: > gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench > > commit: > 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 > 5c0a85fad949212b3e059692deecdeed74ae7ec7 > > 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de > ---------------- -------------------------- > fail:runs %reproduction fail:runs > | | | > 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] > %stddev %change %stddev > \ | \ > 14321 ± 0% -6.3% 13425 ± 0% unixbench.score > 1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches > 1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults > 758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time > 387.66 ± 0% +5.4% 
408.49 ± 0% unixbench.time.user_time
> 5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches

That's weird.

I don't understand why the change would reduce the number of minor faults. It should stay the same on x86-64. The rise in user_time is puzzling too.

Hm. Is it reproducible? Across reboots?

--
Kirill A. Shutemov

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-06 9:51 ` Kirill A. Shutemov @ 2016-06-08 7:21 ` Huang, Ying 2016-06-08 8:41 ` Huang, Ying 0 siblings, 1 reply; 23+ messages in thread From: Huang, Ying @ 2016-06-08 7:21 UTC (permalink / raw) To: Kirill A. Shutemov Cc: kernel test robot, Rik van Riel, Michal Hocko, lkp, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, Linus Torvalds "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: >> >> FYI, we noticed a -6.3% regression of unixbench.score due to commit: >> >> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master >> >> in testcase: unixbench >> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory >> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 >> >> >> Details are as below: >> --------------------------------------------------------------------------------------------------> >> >> >> ========================================================================================= >> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: >> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench >> >> commit: >> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 >> 5c0a85fad949212b3e059692deecdeed74ae7ec7 >> >> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de >> ---------------- -------------------------- >> fail:runs %reproduction fail:runs >> | | | >> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] >> %stddev %change %stddev >> \ | \ >> 14321 . 0% -6.3% 13425 . 0% unixbench.score >> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches >> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults >> 758.65 . 0% -3.0% 735.86 . 
0% unixbench.time.system_time >> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time >> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches > > That's weird. > > I don't understand why the change would reduce number or minor faults. > It should stay the same on x86-64. Rise of user_time is puzzling too. unixbench runs in fixed time mode. That is, the total time to run unixbench is fixed, but the work done varies. So the minor_page_faults change may reflect only the work done. > Hm. Is reproducible? Across reboot? Yes. LKP will run every benchmark after reboot via kexec. We run 3 times for both the commit and its parent. The result is quite stable. You can find the standard deviation in percent is near 0 across different runs. Here is another comparison with profile data. ========================================================================================= compiler/cpufreq_governor/debug-setup/kconfig/nr_task/rootfs/tbox_group/test/testcase: gcc-4.9/performance/profile/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench commit: 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 5c0a85fad949212b3e059692deecdeed74ae7ec7 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de ---------------- -------------------------- %stddev %change %stddev \ | \ 14056 ± 0% -6.3% 13172 ± 0% unixbench.score 6464046 ± 0% -6.1% 6071922 ± 0% unixbench.time.involuntary_context_switches 5.555e+08 ± 0% -6.2% 5.211e+08 ± 0% unixbench.time.minor_page_faults 2537 ± 0% -3.2% 2455 ± 0% unixbench.time.system_time 1284 ± 0% +5.8% 1359 ± 0% unixbench.time.user_time 19192611 ± 0% -6.2% 18010830 ± 0% unixbench.time.voluntary_context_switches 7709931 ± 0% -11.0% 6860574 ± 0% cpuidle.C1-HSW.usage 6900 ± 1% -43.9% 3871 ± 0% proc-vmstat.nr_active_file 40813 ± 1% -77.9% 9015 ±114% softirqs.NET_RX 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file) 93169 ± 0% -5.8% 87766 ± 0% vmstat.system.cs 19768 ± 0% -1.7% 19437 ± 0% vmstat.system.in 6.22 
± 0% +10.3% 6.86 ± 0% turbostat.CPU%c3 0.02 ± 20% -85.7% 0.00 ±141% turbostat.Pkg%pc3 68.99 ± 0% -1.7% 67.84 ± 0% turbostat.PkgWatt 1.38 ± 5% -42.0% 0.80 ± 5% perf-profile.cycles-pp.page_remove_rmap.unmap_page_range.unmap_single_vma.unmap_vmas.exit_mmap 0.83 ± 4% +28.8% 1.07 ± 21% perf-profile.cycles-pp.release_pages.free_pages_and_swap_cache.tlb_flush_mmu_free.tlb_finish_mmu.exit_mmap 1.55 ± 3% -10.6% 1.38 ± 2% perf-profile.cycles-pp.unmap_single_vma.unmap_vmas.exit_mmap.mmput.flush_old_exec 1.59 ± 3% -9.8% 1.44 ± 3% perf-profile.cycles-pp.unmap_vmas.exit_mmap.mmput.flush_old_exec.load_elf_binary 389.00 ± 0% +32.1% 514.00 ± 8% slabinfo.file_lock_cache.active_objs 389.00 ± 0% +32.1% 514.00 ± 8% slabinfo.file_lock_cache.num_objs 7075 ± 3% -17.7% 5823 ± 7% slabinfo.pid.active_objs 7075 ± 3% -17.7% 5823 ± 7% slabinfo.pid.num_objs 0.67 ± 34% +86.4% 1.24 ± 30% sched_debug.cfs_rq:/.runnable_load_avg.min -9013 ± -1% +14.4% -10315 ± -9% sched_debug.cfs_rq:/.spread0.avg 83127 ± 5% +16.9% 97163 ± 8% sched_debug.cpu.avg_idle.min 17777 ± 16% +66.6% 29608 ± 22% sched_debug.cpu.curr->pid.avg 50223 ± 10% +49.3% 74974 ± 0% sched_debug.cpu.curr->pid.max 22281 ± 13% +51.8% 33816 ± 6% sched_debug.cpu.curr->pid.stddev 251.79 ± 5% -13.8% 217.15 ± 5% sched_debug.cpu.nr_uninterruptible.max -261.12 ± -2% -13.4% -226.03 ± -1% sched_debug.cpu.nr_uninterruptible.min 221.14 ± 3% -14.7% 188.60 ± 1% sched_debug.cpu.nr_uninterruptible.stddev 1.94e+11 ± 0% -5.8% 1.827e+11 ± 0% perf-stat.L1-dcache-load-misses 3.496e+12 ± 0% -6.5% 3.268e+12 ± 0% perf-stat.L1-dcache-loads 2.262e+12 ± 1% -5.5% 2.137e+12 ± 0% perf-stat.L1-dcache-stores 9.711e+10 ± 0% -3.7% 9.353e+10 ± 0% perf-stat.L1-icache-load-misses 8.051e+08 ± 0% -8.8% 7.343e+08 ± 1% perf-stat.LLC-load-misses 7.184e+10 ± 1% -5.6% 6.78e+10 ± 0% perf-stat.LLC-loads 5.867e+08 ± 2% -7.0% 5.456e+08 ± 0% perf-stat.LLC-store-misses 1.524e+10 ± 1% -5.6% 1.438e+10 ± 0% perf-stat.LLC-stores 2.711e+12 ± 0% -6.3% 2.539e+12 ± 0% perf-stat.branch-instructions 
5.948e+10 ± 0% -3.9% 5.715e+10 ± 0% perf-stat.branch-load-misses 2.715e+12 ± 0% -6.4% 2.542e+12 ± 0% perf-stat.branch-loads 5.947e+10 ± 0% -3.9% 5.713e+10 ± 0% perf-stat.branch-misses 1.448e+09 ± 0% -9.3% 1.313e+09 ± 1% perf-stat.cache-misses 1.931e+11 ± 0% -5.8% 1.818e+11 ± 0% perf-stat.cache-references 58882705 ± 0% -5.8% 55467522 ± 0% perf-stat.context-switches 17037466 ± 0% -6.1% 15999111 ± 0% perf-stat.cpu-migrations 6.732e+09 ± 1% +90.7% 1.284e+10 ± 0% perf-stat.dTLB-load-misses 3.474e+12 ± 0% -6.6% 3.245e+12 ± 0% perf-stat.dTLB-loads 1.215e+09 ± 0% -5.5% 1.149e+09 ± 0% perf-stat.dTLB-store-misses 2.286e+12 ± 0% -5.8% 2.153e+12 ± 0% perf-stat.dTLB-stores 3.511e+09 ± 0% +20.4% 4.226e+09 ± 0% perf-stat.iTLB-load-misses 2.317e+09 ± 0% -6.8% 2.16e+09 ± 0% perf-stat.iTLB-loads 1.343e+13 ± 0% -6.0% 1.263e+13 ± 0% perf-stat.instructions 5.504e+08 ± 0% -6.2% 5.163e+08 ± 0% perf-stat.minor-faults 8.09e+08 ± 1% -9.0% 7.36e+08 ± 1% perf-stat.node-loads 5.932e+08 ± 0% -8.7% 5.417e+08 ± 1% perf-stat.node-stores 5.504e+08 ± 0% -6.2% 5.163e+08 ± 0% perf-stat.page-faults Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-08 7:21 ` [LKP] " Huang, Ying @ 2016-06-08 8:41 ` Huang, Ying 2016-06-08 8:58 ` Kirill A. Shutemov 0 siblings, 1 reply; 23+ messages in thread From: Huang, Ying @ 2016-06-08 8:41 UTC (permalink / raw) To: Huang, Ying Cc: Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, lkp "Huang, Ying" <ying.huang@intel.com> writes: > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: >>> >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: >>> >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master >>> >>> in testcase: unixbench >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 >>> >>> >>> Details are as below: >>> --------------------------------------------------------------------------------------------------> >>> >>> >>> ========================================================================================= >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench >>> >>> commit: >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 >>> >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de >>> ---------------- -------------------------- >>> fail:runs %reproduction fail:runs >>> | | | >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] >>> %stddev %change %stddev >>> \ | \ >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches >>> 1.721e+08 . 
0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>>
>> That's weird.
>>
>> I don't understand why the change would reduce number or minor faults.
>> It should stay the same on x86-64. Rise of user_time is puzzling too.
>
> unixbench runs in fixed time mode. That is, the total time to run
> unixbench is fixed, but the work done varies. So the minor_page_faults
> change may reflect only the work done.
>
>> Hm. Is reproducible? Across reboot?

And FYI, there is no swap set up for the test; the whole root file system, including the benchmark files, is in tmpfs, so no real page reclaim will be triggered. But it appears that the active file cache shrank after the commit:

111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)

I think this is the expected behavior of the commit?

Best Regards,
Huang, Ying

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-08 8:41 ` Huang, Ying @ 2016-06-08 8:58 ` Kirill A. Shutemov 2016-06-12 0:49 ` Huang, Ying 2016-06-14 8:57 ` Minchan Kim 0 siblings, 2 replies; 23+ messages in thread From: Kirill A. Shutemov @ 2016-06-08 8:58 UTC (permalink / raw) To: Huang, Ying Cc: Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, lkp On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: > "Huang, Ying" <ying.huang@intel.com> writes: > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: > >>> > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: > >>> > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > >>> > >>> in testcase: unixbench > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 > >>> > >>> > >>> Details are as below: > >>> --------------------------------------------------------------------------------------------------> > >>> > >>> > >>> ========================================================================================= > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench > >>> > >>> commit: > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 > >>> > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de > >>> ---------------- -------------------------- > >>> fail:runs %reproduction fail:runs > >>> | | | > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] > >>> %stddev %change %stddev > >>> \ | \ > >>> 
14321 . 0% -6.3% 13425 . 0% unixbench.score
> >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
> >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
> >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
> >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
> >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
> >>
> >> That's weird.
> >>
> >> I don't understand why the change would reduce number or minor faults.
> >> It should stay the same on x86-64. Rise of user_time is puzzling too.
> >
> > unixbench runs in fixed time mode. That is, the total time to run
> > unixbench is fixed, but the work done varies. So the minor_page_faults
> > change may reflect only the work done.
> >
> >> Hm. Is reproducible? Across reboot?
>
> And FYI, there is no swap setup for test, all root file system including
> benchmark files are in tmpfs, so no real page reclaim will be
> triggered. But it appears that active file cache reduced after the
> commit.
>
> 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
> 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
>
> I think this is the expected behavior of the commit?

Yes, it's expected.

After the change, faultaround produces old PTEs. That means there's more chance for these pages to end up on the inactive LRU, unless somebody actually touches them and flips the accessed bit.

I wonder if this regression can be attributed to the cost of setting the accessed bit. That looks too high, but who knows.

I don't have time to do the testing myself right now; I'll put it on my todo list.

--
Kirill A. Shutemov

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-08 8:58 ` Kirill A. Shutemov @ 2016-06-12 0:49 ` Huang, Ying 2016-06-12 1:02 ` Linus Torvalds 2016-06-14 8:57 ` Minchan Kim 1 sibling, 1 reply; 23+ messages in thread From: Huang, Ying @ 2016-06-12 0:49 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, lkp "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: >> "Huang, Ying" <ying.huang@intel.com> writes: >> >> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: >> > >> >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: >> >>> >> >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: >> >>> >> >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") >> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master >> >>> >> >>> in testcase: unixbench >> >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory >> >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 >> >>> >> >>> >> >>> Details are as below: >> >>> --------------------------------------------------------------------------------------------------> >> >>> >> >>> >> >>> ========================================================================================= >> >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: >> >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench >> >>> >> >>> commit: >> >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 >> >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 >> >>> >> >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de >> >>> ---------------- -------------------------- >> >>> fail:runs %reproduction fail:runs >> >>> | | | >> 
>>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >>> %stddev %change %stddev
>> >>> \ | \
>> >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
>> >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
>> >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>> >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>> >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>> >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>> >>
>> >> That's weird.
>> >>
>> >> I don't understand why the change would reduce number or minor faults.
>> >> It should stay the same on x86-64. Rise of user_time is puzzling too.
>> >
>> > unixbench runs in fixed time mode. That is, the total time to run
>> > unixbench is fixed, but the work done varies. So the minor_page_faults
>> > change may reflect only the work done.
>> >
>> >> Hm. Is reproducible? Across reboot?
>>
>> And FYI, there is no swap setup for test, all root file system including
>> benchmark files are in tmpfs, so no real page reclaim will be
>> triggered. But it appears that active file cache reduced after the
>> commit.
>>
>> 111331 . 1% -13.3% 96503 . 0% meminfo.Active
>> 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
>>
>> I think this is the expected behavior of the commit?
>
> Yes, it's expected.
>
> After the change faularound would produce old pte. It means there's more
> chance for these pages to be on inactive lru, unless somebody actually
> touch them and flip accessed bit.
>
> I wounder if this regression can attributed to cost of setting accessed
> bit. It looks too high, but who knows.

From the perf profile, the time spent in page_fault and its child functions is almost the same (7.85% vs 7.81%), so the time spent in page fault handling and the page table operations themselves hasn't changed much. So, you mean the CPU may be slower to load the page table entry into the TLB if the accessed bit is not set?
> I don't have time to do testing myself right now. I will put this on todo
> list.

What kind of test do you want to do? I want to check whether I can help.

Best Regards,
Huang, Ying

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-12 0:49 ` Huang, Ying @ 2016-06-12 1:02 ` Linus Torvalds 2016-06-13 9:02 ` Huang, Ying 2016-06-13 12:52 ` Kirill A. Shutemov 0 siblings, 2 replies; 23+ messages in thread From: Linus Torvalds @ 2016-06-12 1:02 UTC (permalink / raw) To: Huang, Ying Cc: Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote: > > From perf profile, the time spent in page_fault and its children > functions are almost same (7.85% vs 7.81%). So the time spent in page > fault and page table operation itself doesn't changed much. So, you > mean CPU may be slower to load the page table entry to TLB if accessed > bit is not set? So the CPU does take a microfault internally when it needs to set the accessed/dirty bit. It's not architecturally visible, but you can see it when you do timing loops. I've timed it at over a thousand cycles on at least some CPU's, but that's still peanuts compared to a real page fault. It shouldn't be *that* noticeable, ie no way it's a 6% regression on its own. Linus ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-12 1:02 ` Linus Torvalds @ 2016-06-13 9:02 ` Huang, Ying 2016-06-14 13:38 ` Minchan Kim 2016-06-13 12:52 ` Kirill A. Shutemov 1 sibling, 1 reply; 23+ messages in thread From: Huang, Ying @ 2016-06-13 9:02 UTC (permalink / raw) To: Linus Torvalds Cc: Huang, Ying, Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> From perf profile, the time spent in page_fault and its children
>> functions are almost same (7.85% vs 7.81%). So the time spent in page
>> fault and page table operation itself doesn't changed much. So, you
>> mean CPU may be slower to load the page table entry to TLB if accessed
>> bit is not set?
>
> So the CPU does take a microfault internally when it needs to set the
> accessed/dirty bit. It's not architecturally visible, but you can see
> it when you do timing loops.
>
> I've timed it at over a thousand cycles on at least some CPU's, but
> that's still peanuts compared to a real page fault. It shouldn't be
> *that* noticeable, ie no way it's a 6% regression on its own.

I did some simple counting and found that about 3.15e9 PTEs are set to old during the test after the commit. This may explain the user_time increase below, because these accessed-bit microfaults are accounted as user time.

387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time

I also made a one-line debug patch, as below, on top of the commit to set the PTE to young unconditionally, which recovers the regression.
modified mm/filemap.c @@ -2193,7 +2193,7 @@ repeat: if (file->f_ra.mmap_miss > 0) file->f_ra.mmap_miss--; addr = address + (page->index - vmf->pgoff) * PAGE_SIZE; - do_set_pte(vma, addr, page, pte, false, false, true); + do_set_pte(vma, addr, page, pte, false, false, false); unlock_page(page); atomic64_inc(&old_pte_count); goto next; Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-13 9:02 ` Huang, Ying @ 2016-06-14 13:38 ` Minchan Kim 2016-06-15 23:42 ` Huang, Ying 0 siblings, 1 reply; 23+ messages in thread From: Minchan Kim @ 2016-06-14 13:38 UTC (permalink / raw) To: Huang, Ying Cc: Linus Torvalds, Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, LKP On Mon, Jun 13, 2016 at 05:02:15PM +0800, Huang, Ying wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote: > >> > >> From perf profile, the time spent in page_fault and its children > >> functions are almost same (7.85% vs 7.81%). So the time spent in page > >> fault and page table operation itself doesn't changed much. So, you > >> mean CPU may be slower to load the page table entry to TLB if accessed > >> bit is not set? > > > > So the CPU does take a microfault internally when it needs to set the > > accessed/dirty bit. It's not architecturally visible, but you can see > > it when you do timing loops. > > > > I've timed it at over a thousand cycles on at least some CPU's, but > > that's still peanuts compared to a real page fault. It shouldn't be > > *that* noticeable, ie no way it's a 6% regression on its own. > > I done some simple counting, and found that about 3.15e9 PTE are set to > old during the test after the commit. This may interpret the user_time > increase as below, because these accessed bit microfault is accounted as > user time. > > 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time > > I also make a one line debug patch as below on top of the commit to set > the PTE to young unconditionally, which recover the regression. With this patch, meminfo.Active(file) is almost same unlike previous experiment? 
> > modified mm/filemap.c > @@ -2193,7 +2193,7 @@ repeat: > if (file->f_ra.mmap_miss > 0) > file->f_ra.mmap_miss--; > addr = address + (page->index - vmf->pgoff) * PAGE_SIZE; > - do_set_pte(vma, addr, page, pte, false, false, true); > + do_set_pte(vma, addr, page, pte, false, false, false); > unlock_page(page); > atomic64_inc(&old_pte_count); > goto next; > > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-14 13:38 ` Minchan Kim @ 2016-06-15 23:42 ` Huang, Ying 0 siblings, 0 replies; 23+ messages in thread From: Huang, Ying @ 2016-06-15 23:42 UTC (permalink / raw) To: Minchan Kim Cc: Huang, Ying, Linus Torvalds, Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, LKP Minchan Kim <minchan@kernel.org> writes: > On Mon, Jun 13, 2016 at 05:02:15PM +0800, Huang, Ying wrote: >> Linus Torvalds <torvalds@linux-foundation.org> writes: >> >> > On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> From perf profile, the time spent in page_fault and its children >> >> functions are almost same (7.85% vs 7.81%). So the time spent in page >> >> fault and page table operation itself doesn't changed much. So, you >> >> mean CPU may be slower to load the page table entry to TLB if accessed >> >> bit is not set? >> > >> > So the CPU does take a microfault internally when it needs to set the >> > accessed/dirty bit. It's not architecturally visible, but you can see >> > it when you do timing loops. >> > >> > I've timed it at over a thousand cycles on at least some CPU's, but >> > that's still peanuts compared to a real page fault. It shouldn't be >> > *that* noticeable, ie no way it's a 6% regression on its own. >> >> I done some simple counting, and found that about 3.15e9 PTE are set to >> old during the test after the commit. This may interpret the user_time >> increase as below, because these accessed bit microfault is accounted as >> user time. >> >> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time >> >> I also make a one line debug patch as below on top of the commit to set >> the PTE to young unconditionally, which recover the regression. > > With this patch, meminfo.Active(file) is almost same unlike previous > experiment? Yes. 
meminfo.Active(file) is almost the same as that of the parent commit of the first bad commit. Best Regards, Huang, Ying >> >> modified mm/filemap.c >> @@ -2193,7 +2193,7 @@ repeat: >> if (file->f_ra.mmap_miss > 0) >> file->f_ra.mmap_miss--; >> addr = address + (page->index - vmf->pgoff) * PAGE_SIZE; >> - do_set_pte(vma, addr, page, pte, false, false, true); >> + do_set_pte(vma, addr, page, pte, false, false, false); >> unlock_page(page); >> atomic64_inc(&old_pte_count); >> goto next; >> >> Best Regards, >> Huang, Ying ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-12 1:02 ` Linus Torvalds 2016-06-13 9:02 ` Huang, Ying @ 2016-06-13 12:52 ` Kirill A. Shutemov 2016-06-14 6:11 ` Linus Torvalds 1 sibling, 1 reply; 23+ messages in thread From: Kirill A. Shutemov @ 2016-06-13 12:52 UTC (permalink / raw) To: Linus Torvalds Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP, Dave Hansen On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote: > > > > From perf profile, the time spent in page_fault and its children > > functions are almost same (7.85% vs 7.81%). So the time spent in page > > fault and page table operation itself doesn't changed much. So, you > > mean CPU may be slower to load the page table entry to TLB if accessed > > bit is not set? > > So the CPU does take a microfault internally when it needs to set the > accessed/dirty bit. It's not architecturally visible, but you can see > it when you do timing loops. > > I've timed it at over a thousand cycles on at least some CPU's, but > that's still peanuts compared to a real page fault. It shouldn't be > *that* noticeable, ie no way it's a 6% regression on its own. Looks like setting accessed bit is the problem. 
Without mkold: Score: 1952.9 Performance counter stats for './Run shell8 -c 1' (3 runs): 468,562,316,621 cycles:u ( +- 0.02% ) 4,596,299,472 dtlb_load_misses_walk_duration:u ( +- 0.07% ) 5,245,488,559 itlb_misses_walk_duration:u ( +- 0.10% ) 189.336404566 seconds time elapsed ( +- 0.01% ) With mkold: Score: 1885.5 Performance counter stats for './Run shell8 -c 1' (3 runs): 503,185,676,256 cycles:u ( +- 0.06% ) 8,137,007,894 dtlb_load_misses_walk_duration:u ( +- 0.85% ) 7,220,632,283 itlb_misses_walk_duration:u ( +- 1.40% ) 189.363223499 seconds time elapsed ( +- 0.01% ) We spend 36% more time in page walks alone, about 1% of total userspace time. Combining this with the page-walk footprint on caches, I guess we can get to the 3.5% score difference I see. I'm not sure there's anything we can do to solve the issue without screwing up the reclaim logic again. :( -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-13 12:52 ` Kirill A. Shutemov @ 2016-06-14 6:11 ` Linus Torvalds 2016-06-14 8:26 ` Kirill A. Shutemov 2016-06-14 14:03 ` Christian Borntraeger 0 siblings, 2 replies; 23+ messages in thread From: Linus Torvalds @ 2016-06-14 6:11 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP, Dave Hansen On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote: > On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote: >> >> I've timed it at over a thousand cycles on at least some CPU's, but >> that's still peanuts compared to a real page fault. It shouldn't be >> *that* noticeable, ie no way it's a 6% regression on its own. > > Looks like setting accessed bit is the problem. Ok. I've definitely seen it as an issue, but never to the point of several percent on a real benchmark that wasn't explicitly testing that cost. I reported the excessive dirty/accessed bit cost to Intel back in the P4 days, but it's apparently not been high enough for anybody to care. > We spend 36% more time in page walk only, about 1% of total userspace time. > Combining this with page walk footprint on caches, I guess we can get to > this 3.5% score difference I see. > > I'm not sure if there's anything we can do to solve the issue without > screwing relacim logic again. :( I think we should say "screw the reclaim logic" for now, and revert commit 5c0a85fad949 for now. Considering how much trouble the accessed bit is on some other architectures too, I wonder if we should strive to simply not care about it, and always leaving it set. And then rely entirely on just unmapping the pages and making the "we took a page fault after unmapping" be the real activity tester. So get rid of the "if the page is young, mark it old but leave it in the page tables" logic entirely. 
When we unmap a page, it will always either be in the swap cache or the page cache anyway, so faulting it in again should be just a minor fault with no actual IO happening. That might be less of an impact in the end - yes, the unmap and re-fault is much more expensive, but it presumably happens to much fewer pages. What do you think? Linus ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-14 6:11 ` Linus Torvalds @ 2016-06-14 8:26 ` Kirill A. Shutemov 2016-06-14 16:07 ` Rik van Riel 2016-06-14 14:03 ` Christian Borntraeger 1 sibling, 1 reply; 23+ messages in thread From: Kirill A. Shutemov @ 2016-06-14 8:26 UTC (permalink / raw) To: Linus Torvalds, Rik van Riel, Mel Gorman Cc: Kirill A. Shutemov, Huang, Ying, Michal Hocko, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Andrew Morton, LKP, Dave Hansen, Vladimir Davydov On Mon, Jun 13, 2016 at 11:11:05PM -0700, Linus Torvalds wrote: > On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov > <kirill.shutemov@linux.intel.com> wrote: > > On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote: > >> > >> I've timed it at over a thousand cycles on at least some CPU's, but > >> that's still peanuts compared to a real page fault. It shouldn't be > >> *that* noticeable, ie no way it's a 6% regression on its own. > > > > Looks like setting accessed bit is the problem. > > Ok. I've definitely seen it as an issue, but never to the point of > several percent on a real benchmark that wasn't explicitly testing > that cost. > > I reported the excessive dirty/accessed bit cost to Intel back in the > P4 days, but it's apparently not been high enough for anybody to care. > > > We spend 36% more time in page walk only, about 1% of total userspace time. > > Combining this with page walk footprint on caches, I guess we can get to > > this 3.5% score difference I see. > > > > I'm not sure if there's anything we can do to solve the issue without > > screwing relacim logic again. :( > > I think we should say "screw the reclaim logic" for now, and revert > commit 5c0a85fad949 for now. Okay. I'll prepare the patch. > Considering how much trouble the accessed bit is on some other > architectures too, I wonder if we should strive to simply not care > about it, and always leaving it set. 
And then rely entirely on just > unmapping the pages and making the "we took a page fault after > unmapping" be the real activity tester. > > So get rid of the "if the page is young, mark it old but leave it in > the page tables" logic entirely. When we unmap a page, it will always > either be in the swap cache or the page cache anyway, so faulting it > in again should be just a minor fault with no actual IO happening. > > That might be less of an impact in the end - yes, the unmap and > re-fault is much more expensive, but it presumably happens to much > fewer pages. > > What do you think? Well, we cannot do this for anonymous memory. No swap -- no swap cache, if I read the code correctly. I guess it's doable for file mappings, although I would expect regressions in other benchmarks. IIUC, it would require page unmapping to propagate a page to the active list, which is suboptimal. And the implications for page_idle are not clear to me. Rik, Mel, any comments? -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-14 8:26 ` Kirill A. Shutemov @ 2016-06-14 16:07 ` Rik van Riel 0 siblings, 0 replies; 23+ messages in thread From: Rik van Riel @ 2016-06-14 16:07 UTC (permalink / raw) To: Kirill A. Shutemov, Linus Torvalds, Mel Gorman Cc: Kirill A. Shutemov, Huang, Ying, Michal Hocko, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Andrew Morton, LKP, Dave Hansen, Vladimir Davydov [-- Attachment #1: Type: text/plain, Size: 3579 bytes --] On Tue, 2016-06-14 at 11:26 +0300, Kirill A. Shutemov wrote: > On Mon, Jun 13, 2016 at 11:11:05PM -0700, Linus Torvalds wrote: > > > > On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov > > <kirill.shutemov@linux.intel.com> wrote: > > > > > > On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote: > > > > > > > > > > > > I've timed it at over a thousand cycles on at least some CPU's, > > > > but > > > > that's still peanuts compared to a real page fault. It > > > > shouldn't be > > > > *that* noticeable, ie no way it's a 6% regression on its own. > > > Looks like setting accessed bit is the problem. > > Ok. I've definitely seen it as an issue, but never to the point of > > several percent on a real benchmark that wasn't explicitly testing > > that cost. > > > > I reported the excessive dirty/accessed bit cost to Intel back in > > the > > P4 days, but it's apparently not been high enough for anybody to > > care. > > > > > > > > We spend 36% more time in page walk only, about 1% of total > > > userspace time. > > > Combining this with page walk footprint on caches, I guess we can > > > get to > > > this 3.5% score difference I see. > > > > > > I'm not sure if there's anything we can do to solve the issue > > > without > > > screwing relacim logic again. :( > > I think we should say "screw the reclaim logic" for now, and revert > > commit 5c0a85fad949 for now. > Okay. I'll prepare the patch. 
> > > > > Considering how much trouble the accessed bit is on some other > > architectures too, I wonder if we should strive to simply not care > > about it, and always leaving it set. And then rely entirely on just > > unmapping the pages and making the "we took a page fault after > > unmapping" be the real activity tester. > > > > So get rid of the "if the page is young, mark it old but leave it > > in > > the page tables" logic entirely. When we unmap a page, it will > > always > > either be in the swap cache or the page cache anyway, so faulting > > it > > in again should be just a minor fault with no actual IO happening. > > > > That might be less of an impact in the end - yes, the unmap and > > re-fault is much more expensive, but it presumably happens to much > > fewer pages. > > > > What do you think? > Well, we cannot do this for anonymous memory. No swap -- no swap > cache, if > I read code correctly. > > I guess it's doable for file mappings. Although I would expect > regressions > in other benchmarks. IIUC, it would require page unmapping to > propogate > page to active list, which is suboptimal. > > And implications for page_idle is not clear to me. > > Rik, Mel, any comments? We can clear the accessed/young bit when anon pages are moved from the active to the inactive list. Reclaim does not care about the young bit on active anon pages at all. For anon pages it uses a two hand clock algorithm, with only pages on the inactive list being cared about. For file pages, I believe we do look at the young bit on mapped pages when they reach the end of the inactive list. Again, we only care about the young bit on inactive pages. One option may be to count on actively used file pages actually being on the active list, and always set the young bit on ptes when the page is already active. Then we can let reclaim do its thing with the smaller number of pages that are on the inactive list, while doing the faster thing for pages that are on the active list. 
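On top of the faultaround code this thread has been discussing, Rik's last suggestion might look roughly like the following, in the style of Ying's debug patch above. This is a hypothetical, untested sketch: the last argument of do_set_pte() is the "make the PTE old" flag introduced by the commit under discussion, and PageActive() stands in for the "page is already on the active list" check.

```
modified mm/filemap.c
@@ -2193,7 +2193,7 @@ repeat:
 		if (file->f_ra.mmap_miss > 0)
 			file->f_ra.mmap_miss--;
 		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
-		do_set_pte(vma, addr, page, pte, false, false, true);
+		/* sketch: young PTE only for pages already on the active list */
+		do_set_pte(vma, addr, page, pte, false, false, !PageActive(page));
 		unlock_page(page);
```

With this, pages on the active list avoid the accessed-bit microfault cost, while inactive pages still get old PTEs so reclaim can observe real activity.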
Does that make sense? -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-14 6:11 ` Linus Torvalds 2016-06-14 8:26 ` Kirill A. Shutemov @ 2016-06-14 14:03 ` Christian Borntraeger 1 sibling, 0 replies; 23+ messages in thread From: Christian Borntraeger @ 2016-06-14 14:03 UTC (permalink / raw) To: Linus Torvalds, Kirill A. Shutemov Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP, Dave Hansen, Martin Schwidefsky, linux-s390 On 06/14/2016 08:11 AM, Linus Torvalds wrote: > On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov > <kirill.shutemov@linux.intel.com> wrote: >> On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote: >>> >>> I've timed it at over a thousand cycles on at least some CPU's, but >>> that's still peanuts compared to a real page fault. It shouldn't be >>> *that* noticeable, ie no way it's a 6% regression on its own. >> >> Looks like setting accessed bit is the problem. > > Ok. I've definitely seen it as an issue, but never to the point of > several percent on a real benchmark that wasn't explicitly testing > that cost. > > I reported the excessive dirty/accessed bit cost to Intel back in the > P4 days, but it's apparently not been high enough for anybody to care. > >> We spend 36% more time in page walk only, about 1% of total userspace time. >> Combining this with page walk footprint on caches, I guess we can get to >> this 3.5% score difference I see. >> >> I'm not sure if there's anything we can do to solve the issue without >> screwing relacim logic again. :( > > I think we should say "screw the reclaim logic" for now, and revert > commit 5c0a85fad949 for now. > > Considering how much trouble the accessed bit is on some other > architectures too, I wonder if we should strive to simply not care > about it, and always leaving it set. 
And then rely entirely on just > unmapping the pages and making the "we took a page fault after > unmapping" be the real activity tester. > > So get rid of the "if the page is young, mark it old but leave it in > the page tables" logic entirely. When we unmap a page, it will always > either be in the swap cache or the page cache anyway, so faulting it > in again should be just a minor fault with no actual IO happening. > > That might be less of an impact in the end - yes, the unmap and > re-fault is much more expensive, but it presumably happens to much > fewer pages. FWIW, something like that is what Martin did for s390 3 years ago. We now use invalidation and page faults to implement the *young functions in pgtable.h (basically using a SW young bit). This helped us to get rid of the storage keys (which contain the HW reference bit). The performance did not seem to suffer. See commit 0944fe3f4a323f436180d39402cae7f9c46ead17 s390/mm: implement software referenced bits > > What do you think? Your proposal would be to do the software tracking via invalidation/fault part of the generic mm code and not to hide it in the architecture backend. Correct? > > Linus > ^ permalink raw reply [flat|nested] 23+ messages in thread
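The s390 scheme Christian refers to can be sketched in rough pseudocode. The helper names below (pte_sw_young(), pte_mkold_invalid_sw()) are invented for illustration; this is not the actual s390 implementation from commit 0944fe3f4a32, just the shape of the idea: track "young" in a software bit, and use invalidation plus a fault to set it again.

```c
/* Hypothetical pseudocode for software referenced-bit tracking, loosely
 * modeled on the s390 approach mentioned above.  Not real kernel code:
 * pte_sw_young() and pte_mkold_invalid_sw() are invented helpers. */
static int ptep_test_and_clear_young_sw(struct vm_area_struct *vma,
					unsigned long addr, pte_t *ptep)
{
	int young = pte_sw_young(*ptep);	/* read the SW young bit */

	if (young) {
		pte_t pte = *ptep;

		/* Re-install the PTE as hardware-invalid with the SW
		 * young bit cleared: the next access takes a fault, and
		 * the fault handler revalidates the PTE and sets the SW
		 * young bit again. */
		set_pte_at(vma->vm_mm, addr, ptep,
			   pte_mkold_invalid_sw(pte));
		flush_tlb_page(vma, addr);
	}
	return young;
}
```

This trades a hardware accessed-bit update per access for an occasional full fault per clear/re-reference cycle, which is the same trade-off Linus proposes making generic.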
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-08 8:58 ` Kirill A. Shutemov 2016-06-12 0:49 ` Huang, Ying @ 2016-06-14 8:57 ` Minchan Kim 2016-06-14 14:34 ` Kirill A. Shutemov 1 sibling, 1 reply; 23+ messages in thread From: Minchan Kim @ 2016-06-14 8:57 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote: > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: > > "Huang, Ying" <ying.huang@intel.com> writes: > > > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > > > > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: > > >>> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: > > >>> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > > >>> > > >>> in testcase: unixbench > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 > > >>> > > >>> > > >>> Details are as below: > > >>> --------------------------------------------------------------------------------------------------> > > >>> > > >>> > > >>> ========================================================================================= > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench > > >>> > > >>> commit: > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 > > >>> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de > > >>> ---------------- -------------------------- > > >>> fail:runs %reproduction 
fail:runs > > >>> | | | > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] > > >>> %stddev %change %stddev > > >>> \ | \ > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches > > >> > > >> That's weird. > > >> > > >> I don't understand why the change would reduce number or minor faults. > > >> It should stay the same on x86-64. Rise of user_time is puzzling too. > > > > > > unixbench runs in fixed time mode. That is, the total time to run > > > unixbench is fixed, but the work done varies. So the minor_page_faults > > > change may reflect only the work done. > > > > > >> Hm. Is reproducible? Across reboot? > > > > > > > And FYI, there is no swap setup for test, all root file system including > > benchmark files are in tmpfs, so no real page reclaim will be > > triggered. But it appears that active file cache reduced after the > > commit. > > > > 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active > > 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file) > > > > I think this is the expected behavior of the commit? > > Yes, it's expected. > > After the change faularound would produce old pte. It means there's more > chance for these pages to be on inactive lru, unless somebody actually > touch them and flip accessed bit. Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan anonymous LRU list on swapless system so I really wonder why active file LRU is shrunk. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-14 8:57 ` Minchan Kim @ 2016-06-14 14:34 ` Kirill A. Shutemov 2016-06-15 23:52 ` Huang, Ying 0 siblings, 1 reply; 23+ messages in thread From: Kirill A. Shutemov @ 2016-06-14 14:34 UTC (permalink / raw) To: Minchan Kim Cc: Kirill A. Shutemov, Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote: > On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote: > > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: > > > "Huang, Ying" <ying.huang@intel.com> writes: > > > > > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > > > > > > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: > > > >>> > > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: > > > >>> > > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") > > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > > > >>> > > > >>> in testcase: unixbench > > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory > > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 > > > >>> > > > >>> > > > >>> Details are as below: > > > >>> --------------------------------------------------------------------------------------------------> > > > >>> > > > >>> > > > >>> ========================================================================================= > > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: > > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench > > > >>> > > > >>> commit: > > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 > > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 > > > >>> > > > >>> 4b50bcc7eda4d3cc 
5c0a85fad949212b3e059692de > > > >>> ---------------- -------------------------- > > > >>> fail:runs %reproduction fail:runs > > > >>> | | | > > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] > > > >>> %stddev %change %stddev > > > >>> \ | \ > > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score > > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches > > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults > > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time > > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time > > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches > > > >> > > > >> That's weird. > > > >> > > > >> I don't understand why the change would reduce number or minor faults. > > > >> It should stay the same on x86-64. Rise of user_time is puzzling too. > > > > > > > > unixbench runs in fixed time mode. That is, the total time to run > > > > unixbench is fixed, but the work done varies. So the minor_page_faults > > > > change may reflect only the work done. > > > > > > > >> Hm. Is reproducible? Across reboot? > > > > > > > > > > And FYI, there is no swap setup for test, all root file system including > > > benchmark files are in tmpfs, so no real page reclaim will be > > > triggered. But it appears that active file cache reduced after the > > > commit. > > > > > > 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active > > > 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file) > > > > > > I think this is the expected behavior of the commit? > > > > Yes, it's expected. > > > > After the change faularound would produce old pte. It means there's more > > chance for these pages to be on inactive lru, unless somebody actually > > touch them and flip accessed bit. > > Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan > anonymous LRU list on swapless system so I really wonder why active file > LRU is shrunk. Hm. Good point. 
I don't know why we have anything on the file lru if there are no filesystems except tmpfs. Ying, how do you get stuff to the tmpfs? -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-14 14:34 ` Kirill A. Shutemov @ 2016-06-15 23:52 ` Huang, Ying 2016-06-16 0:13 ` Minchan Kim 0 siblings, 1 reply; 23+ messages in thread From: Huang, Ying @ 2016-06-15 23:52 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Minchan Kim, Kirill A. Shutemov, Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp "Kirill A. Shutemov" <kirill@shutemov.name> writes: > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote: >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote: >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: >> > > "Huang, Ying" <ying.huang@intel.com> writes: >> > > >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: >> > > > >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: >> > > >>> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: >> > > >>> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master >> > > >>> >> > > >>> in testcase: unixbench >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 >> > > >>> >> > > >>> >> > > >>> Details are as below: >> > > >>> --------------------------------------------------------------------------------------------------> >> > > >>> >> > > >>> >> > > >>> ========================================================================================= >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench >> > > >>> >> > > >>> commit: >> > > >>> 
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 >> > > >>> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de >> > > >>> ---------------- -------------------------- >> > > >>> fail:runs %reproduction fail:runs >> > > >>> | | | >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] >> > > >>> %stddev %change %stddev >> > > >>> \ | \ >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches >> > > >> >> > > >> That's weird. >> > > >> >> > > >> I don't understand why the change would reduce number or minor faults. >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too. >> > > > >> > > > unixbench runs in fixed time mode. That is, the total time to run >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults >> > > > change may reflect only the work done. >> > > > >> > > >> Hm. Is reproducible? Across reboot? >> > > > >> > > >> > > And FYI, there is no swap setup for test, all root file system including >> > > benchmark files are in tmpfs, so no real page reclaim will be >> > > triggered. But it appears that active file cache reduced after the >> > > commit. >> > > >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file) >> > > >> > > I think this is the expected behavior of the commit? >> > >> > Yes, it's expected. >> > >> > After the change faularound would produce old pte. It means there's more >> > chance for these pages to be on inactive lru, unless somebody actually >> > touch them and flip accessed bit. 
>> >>
>> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
>> >> anonymous LRU list on swapless system so I really wonder why active file
>> >> LRU is shrunk.
>
> Hm. Good point. I don't know why we have anything on file lru if there's no
> filesystems except tmpfs.
>
> Ying, how do you get stuff to the tmpfs?

We put the root file system and benchmark into a set of compressed cpio
archives, then concatenate them into one initrd, and finally the kernel uses
that initrd as initramfs.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread
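Ying's initrd construction relies on the kernel accepting a concatenation of compressed cpio archives as its initramfs and unpacking them in order. A minimal sketch of that build, with hypothetical directory and file names:

```shell
# Hypothetical layout: two trees that should both end up in the initramfs.
mkdir -p build/rootfs/etc build/bench/opt
echo lituya > build/rootfs/etc/hostname
echo unixbench > build/bench/opt/marker

# Pack each tree as a newc-format cpio archive, gzip-compressed.
(cd build/rootfs && find . | cpio -o -H newc | gzip > ../rootfs.cgz)
(cd build/bench && find . | cpio -o -H newc | gzip > ../bench.cgz)

# Concatenating the pieces yields a single initrd the kernel can unpack:
# rootfs.cgz is extracted first, bench.cgz on top of it.
cat build/rootfs.cgz build/bench.cgz > build/initrd.img
```

Booted with such an image, everything lands in the rootfs tmpfs, which matches the setup described above.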
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-15 23:52 ` Huang, Ying @ 2016-06-16 0:13 ` Minchan Kim 2016-06-16 22:27 ` Huang, Ying 0 siblings, 1 reply; 23+ messages in thread From: Minchan Kim @ 2016-06-16 0:13 UTC (permalink / raw) To: Huang, Ying Cc: Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote: > "Kirill A. Shutemov" <kirill@shutemov.name> writes: > > > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote: > >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote: > >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: > >> > > "Huang, Ying" <ying.huang@intel.com> writes: > >> > > > >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > >> > > > > >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: > >> > > >>> > >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: > >> > > >>> > >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") > >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > >> > > >>> > >> > > >>> in testcase: unixbench > >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory > >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 > >> > > >>> > >> > > >>> > >> > > >>> Details are as below: > >> > > >>> --------------------------------------------------------------------------------------------------> > >> > > >>> > >> > > >>> > >> > > >>> ========================================================================================= > >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: > >> > > >>> 
gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench > >> > > >>> > >> > > >>> commit: > >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 > >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 > >> > > >>> > >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de > >> > > >>> ---------------- -------------------------- > >> > > >>> fail:runs %reproduction fail:runs > >> > > >>> | | | > >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] > >> > > >>> %stddev %change %stddev > >> > > >>> \ | \ > >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score > >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches > >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults > >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time > >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time > >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches > >> > > >> > >> > > >> That's weird. > >> > > >> > >> > > >> I don't understand why the change would reduce number or minor faults. > >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too. > >> > > > > >> > > > unixbench runs in fixed time mode. That is, the total time to run > >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults > >> > > > change may reflect only the work done. > >> > > > > >> > > >> Hm. Is reproducible? Across reboot? > >> > > > > >> > > > >> > > And FYI, there is no swap setup for test, all root file system including > >> > > benchmark files are in tmpfs, so no real page reclaim will be > >> > > triggered. But it appears that active file cache reduced after the > >> > > commit. > >> > > > >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active > >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file) > >> > > > >> > > I think this is the expected behavior of the commit? > >> > > >> > Yes, it's expected. 
> >> >
> >> > After the change faultaround would produce old pte. It means there's more
> >> > chance for these pages to be on inactive lru, unless somebody actually
> >> > touch them and flip accessed bit.
> >>
> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
> >> anonymous LRU list on swapless system so I really wonder why active file
> >> LRU is shrunk.
> >
> > Hm. Good point. I don't know why we have anything on file lru if there's no
> > filesystems except tmpfs.
> >
> > Ying, how do you get stuff to the tmpfs?
>
> We put the root file system and benchmark into a set of compressed cpio
> archives, then concatenate them into one initrd, and finally the kernel uses
> that initrd as initramfs.

I see.

Could you share your 4 full vmstat(/proc/vmstat) files?

old:

cat /proc/vmstat > before.old.vmstat
do benchmark
cat /proc/vmstat > after.old.vmstat

new:

cat /proc/vmstat > before.new.vmstat
do benchmark
cat /proc/vmstat > after.new.vmstat

IOW, I want to see stats related to reclaim.

Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread
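Minchan's capture steps can be wrapped in one small script per kernel. In this sketch, `do_benchmark` is a placeholder for the actual unixbench invocation (here it is stubbed out so the script is runnable as-is); the snapshot file names are the ones he proposes:

```shell
#!/bin/sh
# Snapshot /proc/vmstat around a benchmark run, tagged "old" or "new".
# do_benchmark is a stand-in; replace it with the real unixbench run.
do_benchmark() { :; }

tag="${1:-old}"   # "old" for the parent commit's kernel, "new" for the suspect one
cat /proc/vmstat > "before.${tag}.vmstat"
do_benchmark
cat /proc/vmstat > "after.${tag}.vmstat"
```

Running it once per kernel produces the four files requested above.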
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-16 0:13 ` Minchan Kim @ 2016-06-16 22:27 ` Huang, Ying 2016-06-17 5:41 ` Minchan Kim 0 siblings, 1 reply; 23+ messages in thread From: Huang, Ying @ 2016-06-16 22:27 UTC (permalink / raw) To: Minchan Kim Cc: Huang, Ying, Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp [-- Attachment #1: Type: text/plain, Size: 5440 bytes --] Minchan Kim <minchan@kernel.org> writes: > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote: >> "Kirill A. Shutemov" <kirill@shutemov.name> writes: >> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote: >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote: >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: >> >> > > "Huang, Ying" <ying.huang@intel.com> writes: >> >> > > >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: >> >> > > > >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: >> >> > > >>> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: >> >> > > >>> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master >> >> > > >>> >> >> > > >>> in testcase: unixbench >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 >> >> > > >>> >> >> > > >>> >> >> > > >>> Details are as below: >> >> > > >>> --------------------------------------------------------------------------------------------------> >> >> > > >>> >> >> > > >>> >> >> > > >>> ========================================================================================= >> >> > > >>> 
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench >> >> > > >>> >> >> > > >>> commit: >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 >> >> > > >>> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de >> >> > > >>> ---------------- -------------------------- >> >> > > >>> fail:runs %reproduction fail:runs >> >> > > >>> | | | >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] >> >> > > >>> %stddev %change %stddev >> >> > > >>> \ | \ >> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score >> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults >> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time >> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time >> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches >> >> > > >> >> >> > > >> That's weird. >> >> > > >> >> >> > > >> I don't understand why the change would reduce number or minor faults. >> >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too. >> >> > > > >> >> > > > unixbench runs in fixed time mode. That is, the total time to run >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults >> >> > > > change may reflect only the work done. >> >> > > > >> >> > > >> Hm. Is reproducible? Across reboot? >> >> > > > >> >> > > >> >> > > And FYI, there is no swap setup for test, all root file system including >> >> > > benchmark files are in tmpfs, so no real page reclaim will be >> >> > > triggered. But it appears that active file cache reduced after the >> >> > > commit. >> >> > > >> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active >> >> > > 27603 . 1% -43.9% 15486 . 
0% meminfo.Active(file) >> >> > > >> >> > > I think this is the expected behavior of the commit? >> >> > >> >> > Yes, it's expected. >> >> > >> >> > After the change faularound would produce old pte. It means there's more >> >> > chance for these pages to be on inactive lru, unless somebody actually >> >> > touch them and flip accessed bit. >> >> >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan >> >> anonymous LRU list on swapless system so I really wonder why active file >> >> LRU is shrunk. >> > >> > Hm. Good point. I don't why we have anything on file lru if there's no >> > filesystems except tmpfs. >> > >> > Ying, how do you get stuff to the tmpfs? >> >> We put root file system and benchmark into a set of compressed cpio >> archive, then concatenate them into one initrd, and finally kernel use >> that initrd as initramfs. > > I see. > > Could you share your 4 full vmstat(/proc/vmstat) files? > > old: > > cat /proc/vmstat > before.old.vmstat > do benchmark > cat /proc/vmstat > after.old.vmstat > > new: > > cat /proc/vmstat > before.new.vmstat > do benchmark > cat /proc/vmstat > after.new.vmstat > > IOW, I want to see stats related to reclaim. Hi, The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first bad commit (fbc-proc-vmstat.gz) are attached with the email. The contents of the file is more than the vmstat before and after benchmark running, but are sampled every 1 seconds. Every sample begin with "time: <time>". You can check the first and last samples. The first /proc/vmstat capturing is started at the same time of the benchmark, so it is not exactly the vmstat before the benchmark running. [-- Attachment #2: parent-proc-vmstat.gz --] [-- Type: application/gzip, Size: 78486 bytes --] [-- Attachment #3: fbc-proc-vmstat.gz --] [-- Type: application/gzip, Size: 77915 bytes --] [-- Attachment #4: Type: text/plain, Size: 27 bytes --] Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-16 22:27 ` Huang, Ying @ 2016-06-17 5:41 ` Minchan Kim 2016-06-17 19:26 ` Huang, Ying 0 siblings, 1 reply; 23+ messages in thread From: Minchan Kim @ 2016-06-17 5:41 UTC (permalink / raw) To: Huang, Ying Cc: Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote: > Minchan Kim <minchan@kernel.org> writes: > > > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote: > >> "Kirill A. Shutemov" <kirill@shutemov.name> writes: > >> > >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote: > >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote: > >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: > >> >> > > "Huang, Ying" <ying.huang@intel.com> writes: > >> >> > > > >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes: > >> >> > > > > >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: > >> >> > > >>> > >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: > >> >> > > >>> > >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") > >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > >> >> > > >>> > >> >> > > >>> in testcase: unixbench > >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory > >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 > >> >> > > >>> > >> >> > > >>> > >> >> > > >>> Details are as below: > >> >> > > >>> --------------------------------------------------------------------------------------------------> > >> >> > > >>> > >> >> > > >>> > >> >> > > >>> 
========================================================================================= > >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: > >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench > >> >> > > >>> > >> >> > > >>> commit: > >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 > >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 > >> >> > > >>> > >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de > >> >> > > >>> ---------------- -------------------------- > >> >> > > >>> fail:runs %reproduction fail:runs > >> >> > > >>> | | | > >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] > >> >> > > >>> %stddev %change %stddev > >> >> > > >>> \ | \ > >> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score > >> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches > >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults > >> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time > >> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time > >> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches > >> >> > > >> > >> >> > > >> That's weird. > >> >> > > >> > >> >> > > >> I don't understand why the change would reduce number or minor faults. > >> >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too. > >> >> > > > > >> >> > > > unixbench runs in fixed time mode. That is, the total time to run > >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults > >> >> > > > change may reflect only the work done. > >> >> > > > > >> >> > > >> Hm. Is reproducible? Across reboot? > >> >> > > > > >> >> > > > >> >> > > And FYI, there is no swap setup for test, all root file system including > >> >> > > benchmark files are in tmpfs, so no real page reclaim will be > >> >> > > triggered. 
But it appears that active file cache reduced after the > >> >> > > commit. > >> >> > > > >> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active > >> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file) > >> >> > > > >> >> > > I think this is the expected behavior of the commit? > >> >> > > >> >> > Yes, it's expected. > >> >> > > >> >> > After the change faularound would produce old pte. It means there's more > >> >> > chance for these pages to be on inactive lru, unless somebody actually > >> >> > touch them and flip accessed bit. > >> >> > >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan > >> >> anonymous LRU list on swapless system so I really wonder why active file > >> >> LRU is shrunk. > >> > > >> > Hm. Good point. I don't why we have anything on file lru if there's no > >> > filesystems except tmpfs. > >> > > >> > Ying, how do you get stuff to the tmpfs? > >> > >> We put root file system and benchmark into a set of compressed cpio > >> archive, then concatenate them into one initrd, and finally kernel use > >> that initrd as initramfs. > > > > I see. > > > > Could you share your 4 full vmstat(/proc/vmstat) files? > > > > old: > > > > cat /proc/vmstat > before.old.vmstat > > do benchmark > > cat /proc/vmstat > after.old.vmstat > > > > new: > > > > cat /proc/vmstat > before.new.vmstat > > do benchmark > > cat /proc/vmstat > after.new.vmstat > > > > IOW, I want to see stats related to reclaim. > > Hi, > > The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first > bad commit (fbc-proc-vmstat.gz) are attached with the email. > > The contents of the file is more than the vmstat before and after > benchmark running, but are sampled every 1 seconds. Every sample begin > with "time: <time>". You can check the first and last samples. The > first /proc/vmstat capturing is started at the same time of the > benchmark, so it is not exactly the vmstat before the benchmark running. > Thanks for the testing! 
nr_active_file was shrunk 48% but the value itself is not huge so
I don't think it affects performance a lot.

There was no reclaim activity for testing. :(

pgfault, 6% reduced. Given that, pgalloc/free reduced 6%, too
because unixbench was time fixed mode and 6% regressed so no
doubt.

No interesting data.

It seems you tested it with THP, maybe always mode?
I'm so sorry but could you test it with CONFIG_TRANSPARENT_HUGEPAGE=n
again? Maybe you already did.
Is it still 6% regressed with THP disabled?

nr_free_pages -6663 -6461 96.97%
nr_alloc_batch 2594 4013 154.70%
nr_inactive_anon 112 112 100.00%
nr_active_anon 2536 2159 85.13%
nr_inactive_file -567 -227 40.04%
nr_active_file 648 315 48.61%
nr_unevictable 0 0 0.00%
nr_mlock 0 0 0.00%
nr_anon_pages 2634 2161 82.04%
nr_mapped 511 530 103.72%
nr_file_pages 207 215 103.86%
nr_dirty -7 -6 85.71%
nr_writeback 0 0 0.00%
nr_slab_reclaimable 158 328 207.59%
nr_slab_unreclaimable 2208 2115 95.79%
nr_page_table_pages 268 247 92.16%
nr_kernel_stack 143 80 55.94%
nr_unstable 1 1 100.00%
nr_bounce 0 0 0.00%
nr_vmscan_write 0 0 0.00%
nr_vmscan_immediate_reclaim 0 0 0.00%
nr_writeback_temp 0 0 0.00%
nr_isolated_anon 0 0 0.00%
nr_isolated_file 0 0 0.00%
nr_shmem 131 131 100.00%
nr_dirtied 67 78 116.42%
nr_written 74 84 113.51%
nr_pages_scanned 0 0 0.00%
numa_hit 483752446 453696304 93.79%
numa_miss 0 0 0.00%
numa_foreign 0 0 0.00%
numa_interleave 0 0 0.00%
numa_local 483752445 453696304 93.79%
numa_other 1 0 0.00%
workingset_refault 0 0 0.00%
workingset_activate 0 0 0.00%
workingset_nodereclaim 0 0 0.00%
nr_anon_transparent_hugepages 1 0 0.00%
nr_free_cma 0 0 0.00%
nr_dirty_threshold -1316 -1274 96.81%
nr_dirty_background_threshold -658 -637 96.81%
pgpgin 0 0 0.00%
pgpgout 0 0 0.00%
pswpin 0 0 0.00%
pswpout 0 0 0.00%
pgalloc_dma 0 0 0.00%
pgalloc_dma32 60130977 56323630 93.67%
pgalloc_normal 457203182 428863437 93.80%
pgalloc_movable 0 0 0.00%
pgfree 517327743 485181251 93.79%
pgactivate 2059556 1930950 93.76%
pgdeactivate 0 0 0.00%
pgfault 572723351 537107146 93.78%
pgmajfault 0 0 0.00%
pglazyfreed 0 0 0.00%
pgrefill_dma 0 0 0.00%
pgrefill_dma32 0 0 0.00%
pgrefill_normal 0 0 0.00%
pgrefill_movable 0 0 0.00%
pgsteal_kswapd_dma 0 0 0.00%
pgsteal_kswapd_dma32 0 0 0.00%
pgsteal_kswapd_normal 0 0 0.00%
pgsteal_kswapd_movable 0 0 0.00%
pgsteal_direct_dma 0 0 0.00%
pgsteal_direct_dma32 0 0 0.00%
pgsteal_direct_normal 0 0 0.00%
pgsteal_direct_movable 0 0 0.00%
pgscan_kswapd_dma 0 0 0.00%
pgscan_kswapd_dma32 0 0 0.00%
pgscan_kswapd_normal 0 0 0.00%
pgscan_kswapd_movable 0 0 0.00%
pgscan_direct_dma 0 0 0.00%
pgscan_direct_dma32 0 0 0.00%
pgscan_direct_normal 0 0 0.00%
pgscan_direct_movable 0 0 0.00%
pgscan_direct_throttle 0 0 0.00%
zone_reclaim_failed 0 0 0.00%
pginodesteal 0 0 0.00%
slabs_scanned 0 0 0.00%
kswapd_inodesteal 0 0 0.00%
kswapd_low_wmark_hit_quickly 0 0 0.00%
kswapd_high_wmark_hit_quickly 0 0 0.00%
pageoutrun 0 0 0.00%
allocstall 0 0 0.00%
pgrotated 0 0 0.00%
drop_pagecache 0 0 0.00%
drop_slab 0 0 0.00%
numa_pte_updates 0 0 0.00%
numa_huge_pte_updates 0 0 0.00%
numa_hint_faults 0 0 0.00%
numa_hint_faults_local 0 0 0.00%
numa_pages_migrated 0 0 0.00%
pgmigrate_success 0 0 0.00%
pgmigrate_fail 0 0 0.00%
compact_migrate_scanned 0 0 0.00%
compact_free_scanned 0 0 0.00%
compact_isolated 0 0 0.00%
compact_stall 0 0 0.00%
compact_fail 0 0 0.00%
compact_success 0 0 0.00%
compact_daemon_wake 0 0 0.00%
htlb_buddy_alloc_success 0 0 0.00%
htlb_buddy_alloc_fail 0 0 0.00%
unevictable_pgs_culled 0 0 0.00%
unevictable_pgs_scanned 0 0 0.00%
unevictable_pgs_rescued 0 0 0.00%
unevictable_pgs_mlocked 0 0 0.00%
unevictable_pgs_munlocked 0 0 0.00%
unevictable_pgs_cleared 0 0 0.00%
unevictable_pgs_stranded 0 0 0.00%
thp_fault_alloc 22731 21604 95.04%
thp_fault_fallback 0 0 0.00%
thp_collapse_alloc 1 0 0.00%
thp_collapse_alloc_failed 0 0 0.00%
thp_split_page 0 0 0.00%
thp_split_page_failed 0 0 0.00%
thp_deferred_split_page 22731 21604 95.04%
thp_split_pmd 0 0 0.00%
thp_zero_page_alloc 0 0 0.00%
thp_zero_page_alloc_failed 0 0 0.00%
balloon_inflate 0 0 0.00%
balloon_deflate 0 0 0.00%
balloon_migrate 0 0 0.00%

^ permalink raw reply	[flat|nested] 23+ messages in thread
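The per-counter comparison Minchan posted can be generated mechanically from the two pairs of snapshots. A sketch in POSIX shell/awk, assuming the before/after file naming proposed earlier in the thread:

```shell
# Per-counter increase across one benchmark run: after minus before.
vmstat_delta() {
    awk 'NR==FNR { before[$1] = $2; next }
         { print $1, $2 - before[$1] }' "$1" "$2"
}

# Print each counter's new-kernel delta as a percentage of the old-kernel
# delta, the same shape as the table above.
compare_deltas() {
    awk 'NR==FNR { old[$1] = $2; next }
         ($1 in old) { pct = old[$1] ? 100 * $2 / old[$1] : 0
                       printf "%-30s %12d %12d %8.2f%%\n", $1, old[$1], $2, pct }' "$1" "$2"
}
```

For example, `compare_deltas <(vmstat_delta before.old.vmstat after.old.vmstat) <(vmstat_delta before.new.vmstat after.new.vmstat)` style plumbing (or intermediate files in plain sh) reproduces the "93.79%"-type columns.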
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression 2016-06-17 5:41 ` Minchan Kim @ 2016-06-17 19:26 ` Huang, Ying 2016-06-20 0:06 ` Minchan Kim 0 siblings, 1 reply; 23+ messages in thread From: Huang, Ying @ 2016-06-17 19:26 UTC (permalink / raw) To: Minchan Kim Cc: Huang, Ying, Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp Minchan Kim <minchan@kernel.org> writes: > On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote: >> Minchan Kim <minchan@kernel.org> writes: >> >> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote: >> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes: >> >> >> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote: >> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote: >> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote: >> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes: >> >> >> > > >> >> >> > > > "Kirill A. 
Shutemov" <kirill.shutemov@linux.intel.com> writes: >> >> >> > > > >> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote: >> >> >> > > >>> >> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit: >> >> >> > > >>> >> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes") >> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master >> >> >> > > >>> >> >> >> > > >>> in testcase: unixbench >> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory >> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8 >> >> >> > > >>> >> >> >> > > >>> >> >> >> > > >>> Details are as below: >> >> >> > > >>> --------------------------------------------------------------------------------------------------> >> >> >> > > >>> >> >> >> > > >>> >> >> >> > > >>> ========================================================================================= >> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase: >> >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench >> >> >> > > >>> >> >> >> > > >>> commit: >> >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 >> >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7 >> >> >> > > >>> >> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de >> >> >> > > >>> ---------------- -------------------------- >> >> >> > > >>> fail:runs %reproduction fail:runs >> >> >> > > >>> | | | >> >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#] >> >> >> > > >>> %stddev %change %stddev >> >> >> > > >>> \ | \ >> >> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score >> >> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches >> >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 
0% unixbench.time.minor_page_faults >> >> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time >> >> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time >> >> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches >> >> >> > > >> >> >> >> > > >> That's weird. >> >> >> > > >> >> >> >> > > >> I don't understand why the change would reduce number or minor faults. >> >> >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too. >> >> >> > > > >> >> >> > > > unixbench runs in fixed time mode. That is, the total time to run >> >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults >> >> >> > > > change may reflect only the work done. >> >> >> > > > >> >> >> > > >> Hm. Is reproducible? Across reboot? >> >> >> > > > >> >> >> > > >> >> >> > > And FYI, there is no swap setup for test, all root file system including >> >> >> > > benchmark files are in tmpfs, so no real page reclaim will be >> >> >> > > triggered. But it appears that active file cache reduced after the >> >> >> > > commit. >> >> >> > > >> >> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active >> >> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file) >> >> >> > > >> >> >> > > I think this is the expected behavior of the commit? >> >> >> > >> >> >> > Yes, it's expected. >> >> >> > >> >> >> > After the change faularound would produce old pte. It means there's more >> >> >> > chance for these pages to be on inactive lru, unless somebody actually >> >> >> > touch them and flip accessed bit. >> >> >> >> >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan >> >> >> anonymous LRU list on swapless system so I really wonder why active file >> >> >> LRU is shrunk. >> >> > >> >> > Hm. Good point. I don't why we have anything on file lru if there's no >> >> > filesystems except tmpfs. >> >> > >> >> > Ying, how do you get stuff to the tmpfs? 
>> >>
>> >> We put the root file system and benchmark into a set of compressed cpio
>> >> archives, then concatenate them into one initrd, and finally the kernel uses
>> >> that initrd as initramfs.
>> >
>> > I see.
>> >
>> > Could you share your 4 full vmstat(/proc/vmstat) files?
>> >
>> > old:
>> >
>> > cat /proc/vmstat > before.old.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.old.vmstat
>> >
>> > new:
>> >
>> > cat /proc/vmstat > before.new.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.new.vmstat
>> >
>> > IOW, I want to see stats related to reclaim.
>>
>> Hi,
>>
>> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
>> bad commit (fbc-proc-vmstat.gz) are attached with the email.
>>
>> The contents of the file is more than the vmstat before and after
>> benchmark running, but are sampled every 1 seconds. Every sample begin
>> with "time: <time>". You can check the first and last samples. The
>> first /proc/vmstat capturing is started at the same time of the
>> benchmark, so it is not exactly the vmstat before the benchmark running.
>>
>
> Thanks for the testing!
>
> nr_active_file was shrunk 48% but the value itself is not huge so
> I don't think it affects performance a lot.
>
> There was no reclaim activity for testing. :(
>
> pgfault, 6% reduced. Given that, pgalloc/free reduced 6%, too
> because unixbench was time fixed mode and 6% regressed so no
> doubt.
>
> No interesting data.
>
> It seems you tested it with THP, maybe always mode?

Yes, with the following in kconfig:

CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y

> I'm so sorry but could you test it with CONFIG_TRANSPARENT_HUGEPAGE=n
> again? Maybe you already did.
> Is it still 6% regressed with THP disabled?

Yes. I disabled THP via

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

The regression is the same as before.
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
  gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never

commit:
  4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
  5c0a85fad949212b3e059692deecdeed74ae7ec7

4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
     14332 ±  0%      -6.2%      13438 ±  0%  unixbench.score
   6662206 ±  0%      -6.2%    6252260 ±  0%  unixbench.time.involuntary_context_switches
 5.734e+08 ±  0%      -6.2%  5.376e+08 ±  0%  unixbench.time.minor_page_faults
      2527 ±  0%      -3.2%       2446 ±  0%  unixbench.time.system_time
      1291 ±  0%      +5.4%       1361 ±  0%  unixbench.time.user_time
  19875455 ±  0%      -6.3%   18622488 ±  0%  unixbench.time.voluntary_context_switches
   6570355 ±  0%     -11.9%    5787517 ±  0%  cpuidle.C1-HSW.usage
     17257 ± 34%     -59.1%       7055 ±  7%  latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
      5976 ±  0%     -43.0%       3404 ±  0%  proc-vmstat.nr_active_file
     45729 ±  1%     -22.5%      35439 ±  1%  meminfo.Active
     23905 ±  0%     -43.0%      13619 ±  0%  meminfo.Active(file)
      8465 ±  3%     -29.8%       5940 ±  3%  slabinfo.pid.active_objs
      8476 ±  3%     -29.9%       5940 ±  3%  slabinfo.pid.num_objs
      3.46 ±  0%     +12.5%       3.89 ±  0%  turbostat.CPU%c3
     67.09 ±  0%      -2.1%      65.65 ±  0%  turbostat.PkgWatt
     96090 ±  0%      -5.8%      90479 ±  0%  vmstat.system.cs
      9083 ±  0%      -2.7%       8833 ±  0%  vmstat.system.in
    467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.MIN_vruntime.avg
      7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.MIN_vruntime.max
      1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.MIN_vruntime.stddev
    467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.max_vruntime.avg
      7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.max_vruntime.max
      1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.max_vruntime.stddev
    -10724 ± -7%     -12.0%      -9433 ± -3%  sched_debug.cfs_rq:/.spread0.avg
    -17721 ± -4%      -9.8%     -15978 ± -2%  sched_debug.cfs_rq:/.spread0.min
     90355 ±  9%     +14.1%     103099 ±  5%  sched_debug.cpu.avg_idle.min
      0.12 ± 35%    +325.0%       0.52 ± 46%  sched_debug.cpu.cpu_load[0].min
     21913 ±  2%     +29.1%      28288 ± 14%  sched_debug.cpu.curr->pid.avg
     49953 ±  3%     +30.2%      65038 ±  0%  sched_debug.cpu.curr->pid.max
     23062 ±  2%     +30.1%      29996 ±  4%  sched_debug.cpu.curr->pid.stddev
    274.39 ±  5%     -10.2%     246.27 ±  3%  sched_debug.cpu.nr_uninterruptible.max
    242.73 ±  4%     -13.5%     209.90 ±  2%  sched_debug.cpu.nr_uninterruptible.stddev

Best Regards,
Huang, Ying

> nr_free_pages                        -6663      -6461     96.97%
> nr_alloc_batch                        2594       4013    154.70%
> nr_inactive_anon                       112        112    100.00%
> nr_active_anon                        2536       2159     85.13%
> nr_inactive_file                      -567       -227     40.04%
> nr_active_file                         648        315     48.61%
> nr_unevictable                           0          0      0.00%
> nr_mlock                                 0          0      0.00%
> nr_anon_pages                         2634       2161     82.04%
> nr_mapped                              511        530    103.72%
> nr_file_pages                          207        215    103.86%
> nr_dirty                                -7         -6     85.71%
> nr_writeback                             0          0      0.00%
> nr_slab_reclaimable                    158        328    207.59%
> nr_slab_unreclaimable                 2208       2115     95.79%
> nr_page_table_pages                    268        247     92.16%
> nr_kernel_stack                        143         80     55.94%
> nr_unstable                              1          1    100.00%
> nr_bounce                                0          0      0.00%
> nr_vmscan_write                          0          0      0.00%
> nr_vmscan_immediate_reclaim              0          0      0.00%
> nr_writeback_temp                        0          0      0.00%
> nr_isolated_anon                         0          0      0.00%
> nr_isolated_file                         0          0      0.00%
> nr_shmem                               131        131    100.00%
> nr_dirtied                              67         78    116.42%
> nr_written                              74         84    113.51%
> nr_pages_scanned                         0          0      0.00%
> numa_hit                         483752446  453696304     93.79%
> numa_miss                                0          0      0.00%
> numa_foreign                             0          0      0.00%
> numa_interleave                          0          0      0.00%
> numa_local                       483752445  453696304     93.79%
> numa_other                               1          0      0.00%
> workingset_refault                       0          0      0.00%
> workingset_activate                      0          0      0.00%
> workingset_nodereclaim                   0          0      0.00%
> nr_anon_transparent_hugepages            1          0      0.00%
> nr_free_cma                              0          0      0.00%
> nr_dirty_threshold                   -1316      -1274     96.81%
> nr_dirty_background_threshold         -658       -637     96.81%
> pgpgin                                   0          0      0.00%
> pgpgout                                  0          0      0.00%
> pswpin                                   0          0      0.00%
> pswpout                                  0          0      0.00%
> pgalloc_dma                              0          0      0.00%
> pgalloc_dma32                     60130977   56323630     93.67%
> pgalloc_normal                   457203182  428863437     93.80%
> pgalloc_movable                          0          0      0.00%
> pgfree                           517327743  485181251     93.79%
> pgactivate                         2059556    1930950     93.76%
> pgdeactivate                             0          0      0.00%
> pgfault                          572723351  537107146     93.78%
> pgmajfault                               0          0      0.00%
> pglazyfreed                              0          0      0.00%
> pgrefill_dma                             0          0      0.00%
> pgrefill_dma32                           0          0      0.00%
> pgrefill_normal                          0          0      0.00%
> pgrefill_movable                         0          0      0.00%
> pgsteal_kswapd_dma                       0          0      0.00%
> pgsteal_kswapd_dma32                     0          0      0.00%
> pgsteal_kswapd_normal                    0          0      0.00%
> pgsteal_kswapd_movable                   0          0      0.00%
> pgsteal_direct_dma                       0          0      0.00%
> pgsteal_direct_dma32                     0          0      0.00%
> pgsteal_direct_normal                    0          0      0.00%
> pgsteal_direct_movable                   0          0      0.00%
> pgscan_kswapd_dma                        0          0      0.00%
> pgscan_kswapd_dma32                      0          0      0.00%
> pgscan_kswapd_normal                     0          0      0.00%
> pgscan_kswapd_movable                    0          0      0.00%
> pgscan_direct_dma                        0          0      0.00%
> pgscan_direct_dma32                      0          0      0.00%
> pgscan_direct_normal                     0          0      0.00%
> pgscan_direct_movable                    0          0      0.00%
> pgscan_direct_throttle                   0          0      0.00%
> zone_reclaim_failed                      0          0      0.00%
> pginodesteal                             0          0      0.00%
> slabs_scanned                            0          0      0.00%
> kswapd_inodesteal                        0          0      0.00%
> kswapd_low_wmark_hit_quickly             0          0      0.00%
> kswapd_high_wmark_hit_quickly            0          0      0.00%
> pageoutrun                               0          0      0.00%
> allocstall                               0          0      0.00%
> pgrotated                                0          0      0.00%
> drop_pagecache                           0          0      0.00%
> drop_slab                                0          0      0.00%
> numa_pte_updates                         0          0      0.00%
> numa_huge_pte_updates                    0          0      0.00%
> numa_hint_faults                         0          0      0.00%
> numa_hint_faults_local                   0          0      0.00%
> numa_pages_migrated                      0          0      0.00%
> pgmigrate_success                        0          0      0.00%
> pgmigrate_fail                           0          0      0.00%
> compact_migrate_scanned                  0          0      0.00%
> compact_free_scanned                     0          0      0.00%
> compact_isolated                         0          0      0.00%
> compact_stall                            0          0      0.00%
> compact_fail                             0          0      0.00%
> compact_success                          0          0      0.00%
> compact_daemon_wake                      0          0      0.00%
> htlb_buddy_alloc_success                 0          0      0.00%
> htlb_buddy_alloc_fail                    0          0      0.00%
> unevictable_pgs_culled                   0          0      0.00%
> unevictable_pgs_scanned                  0          0      0.00%
> unevictable_pgs_rescued                  0          0      0.00%
> unevictable_pgs_mlocked                  0          0      0.00%
> unevictable_pgs_munlocked                0          0      0.00%
> unevictable_pgs_cleared                  0          0      0.00%
> unevictable_pgs_stranded                 0          0      0.00%
> thp_fault_alloc                      22731      21604     95.04%
> thp_fault_fallback                       0          0      0.00%
> thp_collapse_alloc                       1          0      0.00%
> thp_collapse_alloc_failed                0          0      0.00%
> thp_split_page                           0          0      0.00%
> thp_split_page_failed                    0          0      0.00%
> thp_deferred_split_page              22731      21604     95.04%
> thp_split_pmd                            0          0      0.00%
> thp_zero_page_alloc                      0          0      0.00%
> thp_zero_page_alloc_failed               0          0      0.00%
> balloon_inflate                          0          0      0.00%
> balloon_deflate                          0          0      0.00%
> balloon_migrate                          0          0      0.00%

^ permalink raw reply	[flat|nested] 23+ messages in thread
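The counter comparison above (delta with the old commit, delta with the first bad commit, and the new/old ratio in percent) can be rebuilt mechanically from before/after /proc/vmstat snapshots like the ones exchanged in this thread. A minimal sketch, not part of the original thread; the helper names are illustrative:

```python
# Sketch: rebuild the three-column comparison above (old delta, new delta,
# new/old ratio) from /proc/vmstat snapshots taken before/after each run.
def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict of ints."""
    return {name: int(val) for name, val in
            (line.split() for line in text.strip().splitlines())}

def delta(before, after):
    """Per-counter change over one benchmark run."""
    b, a = parse_vmstat(before), parse_vmstat(after)
    return {k: a[k] - b[k] for k in b if k in a}

def compare(old_delta, new_delta):
    """(old delta, new delta, new/old ratio in percent; 0 if old is 0)."""
    rows = {}
    for k in old_delta:
        o, n = old_delta[k], new_delta.get(k, 0)
        rows[k] = (o, n, round(n / o * 100, 2) if o else 0.0)
    return rows

# Worked example using the pgfault totals from the table above:
old = delta("pgfault 0", "pgfault 572723351")
new = delta("pgfault 0", "pgfault 537107146")
print(compare(old, new)["pgfault"])  # (572723351, 537107146, 93.78)
```

This reproduces the 93.78% pgfault ratio reported above, consistent with the roughly 6% less work done in the fixed-time unixbench run.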
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-17 19:26 ` Huang, Ying
@ 2016-06-20  0:06 ` Minchan Kim
  0 siblings, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2016-06-20 0:06 UTC (permalink / raw)
To: Huang, Ying
Cc: Minchan Kim, Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel,
    Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon,
    Mel Gorman, Andrew Morton, lkp, Thomas Gleixner, Ingo Molnar,
    H. Peter Anvin, x86

On Fri, Jun 17, 2016 at 12:26:51PM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >>
> >> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
> >> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> >> >>
> >> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> >> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> >> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> >> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
> >> >> >> > >
> >> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >> >> >> > > >
> >> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >> >> >> > > >>>
> >> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >> >> >> > > >>>
> >> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> >> >> > > >>>
> >> >> >> > > >>> in testcase: unixbench
> >> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >> >> >> > > >>>
> >> >> >> > > >>> Details are as below:
> >> >> >> > > >>> -------------------------------------------------------------------------------------------------->
> >> >> >> > > >>>
> >> >> >> > > >>> =========================================================================================
> >> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >> >> >> > > >>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >> >> >> > > >>>
> >> >> >> > > >>> commit:
> >> >> >> > > >>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >> >> >> > > >>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
> >> >> >> > > >>>
> >> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> >> >> >> > > >>> ---------------- --------------------------
> >> >> >> > > >>> fail:runs  %reproduction  fail:runs
> >> >> >> > > >>>     |           |             |
> >> >> >> > > >>>     3:4        -75%           :4  kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >> >> >> > > >>>          %stddev    %change        %stddev
> >> >> >> > > >>>              \         |              \
> >> >> >> > > >>>     14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
> >> >> >> > > >>>   1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
> >> >> >> > > >>> 1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
> >> >> >> > > >>>    758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
> >> >> >> > > >>>    387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
> >> >> >> > > >>>   5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
> >> >> >> > > >>
> >> >> >> > > >> That's weird.
> >> >> >> > > >>
> >> >> >> > > >> I don't understand why the change would reduce the number of minor faults.
> >> >> >> > > >> It should stay the same on x86-64. The rise of user_time is puzzling too.
> >> >> >> > > >
> >> >> >> > > > unixbench runs in fixed-time mode. That is, the total time to run
> >> >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
> >> >> >> > > > change may reflect only the work done.
> >> >> >> > > >
> >> >> >> > > >> Hm. Is it reproducible? Across reboots?
> >> >> >> > >
> >> >> >> > > And FYI, there is no swap set up for the test; the whole root file system,
> >> >> >> > > including the benchmark files, is in tmpfs, so no real page reclaim will be
> >> >> >> > > triggered. But it appears that the active file cache is reduced after the
> >> >> >> > > commit.
> >> >> >> > >
> >> >> >> > >   111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
> >> >> >> > >    27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)
> >> >> >> > >
> >> >> >> > > I think this is the expected behavior of the commit?
> >> >> >> >
> >> >> >> > Yes, it's expected.
> >> >> >> >
> >> >> >> > After the change faultaround produces old ptes. That means there's more
> >> >> >> > chance for these pages to be on the inactive lru, unless somebody actually
> >> >> >> > touches them and flips the accessed bit.
> >> >> >>
> >> >> >> Hmm, tmpfs pages should be on the anonymous LRU list, and the VM shouldn't
> >> >> >> scan the anonymous LRU list on a swapless system, so I really wonder why
> >> >> >> the active file LRU is shrunk.
> >> >> >
> >> >> > Hm. Good point. I don't know why we have anything on the file lru if there
> >> >> > are no filesystems except tmpfs.
> >> >> >
> >> >> > Ying, how do you get stuff into the tmpfs?
> >> >>
> >> >> We put the root file system and the benchmark into a set of compressed cpio
> >> >> archives, concatenate them into one initrd, and finally the kernel uses
> >> >> that initrd as the initramfs.
> >> >
> >> > I see.
> >> >
> >> > Could you share your 4 full vmstat (/proc/vmstat) files?
> >> >
> >> > old:
> >> >
> >> > cat /proc/vmstat > before.old.vmstat
> >> > do benchmark
> >> > cat /proc/vmstat > after.old.vmstat
> >> >
> >> > new:
> >> >
> >> > cat /proc/vmstat > before.new.vmstat
> >> > do benchmark
> >> > cat /proc/vmstat > after.new.vmstat
> >> >
> >> > IOW, I want to see the stats related to reclaim.
> >>
> >> Hi,
> >>
> >> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and the first
> >> bad commit (fbc-proc-vmstat.gz) are attached to this email.
> >>
> >> The files contain more than the vmstat before and after the benchmark run:
> >> /proc/vmstat is sampled every second, and every sample begins with
> >> "time: <time>". You can check the first and last samples. The first
> >> /proc/vmstat capture is started at the same time as the benchmark, so it is
> >> not exactly the vmstat from before the benchmark run.
> >
> > Thanks for the testing!
> >
> > nr_active_file shrank 48%, but the value itself is not huge, so I don't
> > think it affects performance a lot.
> >
> > There was no reclaim activity during the test. :(
> >
> > pgfault is reduced 6%. Given that, pgalloc/pgfree are reduced 6% too,
> > because unixbench ran in fixed-time mode and regressed 6%, so no
> > surprise there.
> >
> > No interesting data.
> >
> > It seems you tested it with THP, maybe in always mode?
>
> Yes. With the following in kconfig.
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
>
> > I'm sorry, but could you test it again with CONFIG_TRANSPARENT_HUGEPAGE=n?
> > Maybe you already did.
> >
> > Is it still 6% regressed with THP disabled?
>
> Yes. I disabled THP via
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
>
> The regression is the same as before.

Still a 6% user_time regression, and there is no difference from the previous
experiment with THP enabled, so I agree the regression is caused just by the
accessed-bit setting on the CPU side, which is rather surprising to me. I don't
know how the unixbench shell-script test touches the memory, but it should be a
per-page overhead, and a 6% regression for that is too heavy.

Anyway, at least it would be better to bring this to the attention of the x86
maintainers.

Thanks for the test!

> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never
>
> commit:
>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>   5c0a85fad949212b3e059692deecdeed74ae7ec7
>
> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> ---------------- --------------------------
>          %stddev     %change         %stddev
>              \          |                \
>      14332 ±  0%      -6.2%      13438 ±  0%  unixbench.score
>    6662206 ±  0%      -6.2%    6252260 ±  0%  unixbench.time.involuntary_context_switches
>  5.734e+08 ±  0%      -6.2%  5.376e+08 ±  0%  unixbench.time.minor_page_faults
>       2527 ±  0%      -3.2%       2446 ±  0%  unixbench.time.system_time
>       1291 ±  0%      +5.4%       1361 ±  0%  unixbench.time.user_time
>   19875455 ±  0%      -6.3%   18622488 ±  0%  unixbench.time.voluntary_context_switches
>    6570355 ±  0%     -11.9%    5787517 ±  0%  cpuidle.C1-HSW.usage
>      17257 ± 34%     -59.1%       7055 ±  7%  latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
>       5976 ±  0%     -43.0%       3404 ±  0%  proc-vmstat.nr_active_file
>      45729 ±  1%     -22.5%      35439 ±  1%  meminfo.Active
>      23905 ±  0%     -43.0%      13619 ±  0%  meminfo.Active(file)
>       8465 ±  3%     -29.8%       5940 ±  3%  slabinfo.pid.active_objs
>       8476 ±  3%     -29.9%       5940 ±  3%  slabinfo.pid.num_objs
>       3.46 ±  0%     +12.5%       3.89 ±  0%  turbostat.CPU%c3
>      67.09 ±  0%      -2.1%      65.65 ±  0%  turbostat.PkgWatt
>      96090 ±  0%      -5.8%      90479 ±  0%  vmstat.system.cs
>       9083 ±  0%      -2.7%       8833 ±  0%  vmstat.system.in
>     467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.MIN_vruntime.avg
>       7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.MIN_vruntime.max
>       1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.MIN_vruntime.stddev
>     467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.max_vruntime.avg
>       7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.max_vruntime.max
>       1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.max_vruntime.stddev
>     -10724 ± -7%     -12.0%      -9433 ± -3%  sched_debug.cfs_rq:/.spread0.avg
>     -17721 ± -4%      -9.8%     -15978 ± -2%  sched_debug.cfs_rq:/.spread0.min
>      90355 ±  9%     +14.1%     103099 ±  5%  sched_debug.cpu.avg_idle.min
>       0.12 ± 35%    +325.0%       0.52 ± 46%  sched_debug.cpu.cpu_load[0].min
>      21913 ±  2%     +29.1%      28288 ± 14%  sched_debug.cpu.curr->pid.avg
>      49953 ±  3%     +30.2%      65038 ±  0%  sched_debug.cpu.curr->pid.max
>      23062 ±  2%     +30.1%      29996 ±  4%  sched_debug.cpu.curr->pid.stddev
>     274.39 ±  5%     -10.2%     246.27 ±  3%  sched_debug.cpu.nr_uninterruptible.max
>     242.73 ±  4%     -13.5%     209.90 ±  2%  sched_debug.cpu.nr_uninterruptible.stddev
>
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread
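The %change column in the lkp comparison tables quoted above is the relative difference between the mean values measured for the two commits. A quick sketch of that arithmetic (the helper is illustrative, not an lkp script):

```python
def pct_change(base, new):
    """Relative change of `new` vs `base` in percent, rounded to one
    decimal as the lkp comparison tables print it."""
    return round((new - base) / base * 100, 1)

# Check against two rows of the table above:
print(pct_change(14332, 13438))  # unixbench.score: -6.2
print(pct_change(1291, 1361))    # unixbench.time.user_time: 5.4
```

These match the -6.2% score and +5.4% user_time figures reported for the THP-disabled run.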
end of thread, other threads: [~2016-06-20 0:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-06  2:27 [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression kernel test robot
2016-06-06  9:51 ` Kirill A. Shutemov
2016-06-08  7:21 ` [LKP] " Huang, Ying
2016-06-08  8:41 ` Huang, Ying
2016-06-08  8:58 ` Kirill A. Shutemov
2016-06-12  0:49 ` Huang, Ying
2016-06-12  1:02 ` Linus Torvalds
2016-06-13  9:02 ` Huang, Ying
2016-06-14 13:38 ` Minchan Kim
2016-06-15 23:42 ` Huang, Ying
2016-06-13 12:52 ` Kirill A. Shutemov
2016-06-14  6:11 ` Linus Torvalds
2016-06-14  8:26 ` Kirill A. Shutemov
2016-06-14 16:07 ` Rik van Riel
2016-06-14 14:03 ` Christian Borntraeger
2016-06-14  8:57 ` Minchan Kim
2016-06-14 14:34 ` Kirill A. Shutemov
2016-06-15 23:52 ` Huang, Ying
2016-06-16  0:13 ` Minchan Kim
2016-06-16 22:27 ` Huang, Ying
2016-06-17  5:41 ` Minchan Kim
2016-06-17 19:26 ` Huang, Ying
2016-06-20  0:06 ` Minchan Kim