* [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
@ 2016-06-06 2:27 ` kernel test robot
0 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2016-06-06 2:27 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Linus Torvalds, Michal Hocko, Minchan Kim, Rik van Riel,
Mel Gorman, Michal Hocko, Vinayak Menon, Andrew Morton, LKML,
lkp
[-- Attachment #1: Type: text/plain, Size: 4496 bytes --]
FYI, we noticed a -6.3% regression of unixbench.score due to commit:
commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
in testcase: unixbench
on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
Details are as below:
-------------------------------------------------------------------------------------------------->
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
commit:
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
5c0a85fad949212b3e059692deecdeed74ae7ec7
4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
---------------- --------------------------
fail:runs %reproduction fail:runs
| | |
3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
%stddev %change %stddev
\ | \
14321 ± 0% -6.3% 13425 ± 0% unixbench.score
1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches
1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults
758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time
387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time
5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches
1960642 ± 0% -11.4% 1737753 ± 0% cpuidle.C1-HSW.usage
5851 ± 0% -43.8% 3286 ± 1% proc-vmstat.nr_active_file
46185 ± 0% -21.2% 36385 ± 2% meminfo.Active
23404 ± 0% -43.8% 13147 ± 1% meminfo.Active(file)
4109 ± 5% -19.6% 3302 ± 4% slabinfo.pid.active_objs
4109 ± 5% -19.6% 3302 ± 4% slabinfo.pid.num_objs
94603 ± 0% -5.7% 89247 ± 0% vmstat.system.cs
8976 ± 0% -2.5% 8754 ± 0% vmstat.system.in
3.38 ± 2% +11.8% 3.77 ± 0% turbostat.CPU%c3
0.24 ±101% -86.3% 0.03 ± 54% turbostat.Pkg%pc3
66.53 ± 0% -1.7% 65.41 ± 0% turbostat.PkgWatt
2061 ± 1% -8.5% 1886 ± 0% sched_debug.cfs_rq:/.exec_clock.stddev
737154 ± 5% +10.8% 817107 ± 3% sched_debug.cpu.avg_idle.max
133057 ± 5% -33.2% 88864 ± 11% sched_debug.cpu.avg_idle.min
181562 ± 8% +15.9% 210434 ± 3% sched_debug.cpu.avg_idle.stddev
0.97 ± 7% +19.0% 1.16 ± 8% sched_debug.cpu.clock.stddev
0.97 ± 7% +19.0% 1.16 ± 8% sched_debug.cpu.clock_task.stddev
248.06 ± 11% +31.0% 324.94 ± 8% sched_debug.cpu.cpu_load[1].max
55.65 ± 14% +28.1% 71.30 ± 8% sched_debug.cpu.cpu_load[1].stddev
233.38 ± 10% +34.4% 313.56 ± 8% sched_debug.cpu.cpu_load[2].max
49.79 ± 15% +35.6% 67.50 ± 9% sched_debug.cpu.cpu_load[2].stddev
233.25 ± 12% +29.9% 302.94 ± 6% sched_debug.cpu.cpu_load[3].max
46.56 ± 8% +12.2% 52.25 ± 6% sched_debug.cpu.cpu_load[3].min
48.51 ± 15% +31.4% 63.76 ± 7% sched_debug.cpu.cpu_load[3].stddev
238.44 ± 12% +19.0% 283.69 ± 3% sched_debug.cpu.cpu_load[4].max
49.56 ± 9% +13.4% 56.19 ± 4% sched_debug.cpu.cpu_load[4].min
48.22 ± 13% +20.1% 57.93 ± 5% sched_debug.cpu.cpu_load[4].stddev
14792 ± 30% +71.9% 25424 ± 17% sched_debug.cpu.curr->pid.avg
42862 ± 1% +42.6% 61121 ± 0% sched_debug.cpu.curr->pid.max
19466 ± 10% +35.4% 26351 ± 9% sched_debug.cpu.curr->pid.stddev
1067 ± 6% -14.9% 909.35 ± 4% sched_debug.cpu.ttwu_local.stddev
To reproduce:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Xiaolong
[-- Attachment #2: job.yaml --]
[-- Type: text/plain, Size: 3497 bytes --]
---
LKP_SERVER: inn
LKP_CGI_PORT: 80
LKP_CIFS_PORT: 139
testcase: unixbench
default-monitors:
wait: activate-monitor
kmsg:
uptime:
iostat:
heartbeat:
vmstat:
numa-numastat:
numa-vmstat:
numa-meminfo:
proc-vmstat:
proc-stat:
interval: 10
meminfo:
slabinfo:
interrupts:
lock_stat:
latency_stats:
softirqs:
bdi_dev_mapping:
diskstats:
nfsstat:
cpuidle:
cpufreq-stats:
turbostat:
pmeter:
sched_debug:
interval: 60
cpufreq_governor: performance
NFS_HANG_DF_TIMEOUT: 200
NFS_HANG_CHECK_INTERVAL: 900
default-watchdogs:
oom-killer:
watchdog:
nfs-hang:
commit: 5c0a85fad949212b3e059692deecdeed74ae7ec7
model: Haswell High-end Desktop
nr_cpu: 16
memory: 16G
hdd_partitions:
swap_partitions:
rootfs_partition:
description: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
category: benchmark
nr_task: 1
unixbench:
test: shell8
queue: bisect
testbox: lituya
tbox_group: lituya
kconfig: x86_64-rhel
enqueue_time: 2016-06-04 03:26:52.444586006 +08:00
compiler: gcc-4.9
rootfs: debian-x86_64-2015-02-07.cgz
id: 101932ca34f6ff20613b88f6bed66fbc4afdfb95
user: lkp
head_commit: 73aa85b30706f742655a10c967c033b56c731aff
base_commit: 1a695a905c18548062509178b98bc91e67510864
branch: internal-devel/devel-hourly-2016060108-internal
result_root: "/result/unixbench/performance-1-shell8/lituya/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/1"
job_file: "/lkp/scheduled/lituya/bisect_unixbench-performance-1-shell8-debian-x86_64-2015-02-07.cgz-x86_64-rhel-5c0a85fad949212b3e059692deecdeed74ae7ec7-20160604-57400-1fovod8-1.yaml"
max_uptime: 1032.28
initrd: "/osimage/debian/debian-x86_64-2015-02-07.cgz"
bootloader_append:
- root=/dev/ram0
- user=lkp
- job=/lkp/scheduled/lituya/bisect_unixbench-performance-1-shell8-debian-x86_64-2015-02-07.cgz-x86_64-rhel-5c0a85fad949212b3e059692deecdeed74ae7ec7-20160604-57400-1fovod8-1.yaml
- ARCH=x86_64
- kconfig=x86_64-rhel
- branch=internal-devel/devel-hourly-2016060108-internal
- commit=5c0a85fad949212b3e059692deecdeed74ae7ec7
- BOOT_IMAGE=/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/vmlinuz-4.6.0-06629-g5c0a85f
- max_uptime=1032
- RESULT_ROOT=/result/unixbench/performance-1-shell8/lituya/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/1
- LKP_SERVER=inn
- |2-
earlyprintk=ttyS0,115200 systemd.log_level=err
debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100
panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0
console=ttyS0,115200 console=tty0 vga=normal
rw
lkp_initrd: "/lkp/lkp/lkp-x86_64.cgz"
modules_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/modules.cgz"
bm_initrd: "/osimage/deps/debian-x86_64-2015-02-07.cgz/lkp.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/run-ipconfig.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/turbostat.cgz,/lkp/benchmarks/turbostat.cgz,/lkp/benchmarks/unixbench.cgz"
linux_headers_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/linux-headers.cgz"
repeat_to: 2
kernel: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/vmlinuz-4.6.0-06629-g5c0a85f"
dequeue_time: 2016-06-04 03:40:50.807274201 +08:00
job_state: finished
loadavg: 5.70 2.70 1.05 1/257 3744
start_time: '1465010834'
end_time: '1465011023'
version: "/lkp/lkp/.src-20160603-214427"
[-- Attachment #3: reproduce --]
[-- Type: text/plain, Size: 1532 bytes --]
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu12/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu13/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu14/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu15/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor
2016-06-04 11:23:33 ./Run shell8 -c 1
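The sixteen per-CPU governor writes above can be condensed into a single loop. This is a sketch, not part of the attached script; it only prints the commands (drop the outer `echo` and run as root to actually apply them):

```shell
#!/bin/sh
# Print the governor-setting command for CPUs 0..15.
# Remove the outer "echo" (keeping the redirection) to apply for real.
for i in $(seq 0 15); do
    echo "echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor"
done
```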
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-06 2:27 ` kernel test robot
@ 2016-06-06 9:51 ` Kirill A. Shutemov
-1 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2016-06-06 9:51 UTC (permalink / raw)
To: kernel test robot
Cc: Linus Torvalds, Michal Hocko, Minchan Kim, Rik van Riel,
Mel Gorman, Michal Hocko, Vinayak Menon, Andrew Morton, LKML,
lkp
On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>
> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>
> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>
> in testcase: unixbench
> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>
> commit:
> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>
> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> ---------------- --------------------------
> fail:runs %reproduction fail:runs
> | | |
> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> %stddev %change %stddev
> \ | \
> 14321 ± 0% -6.3% 13425 ± 0% unixbench.score
> 1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches
> 1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults
> 758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time
> 387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time
> 5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches
That's weird.
I don't understand why the change would reduce the number of minor faults.
It should stay the same on x86-64. The rise in user_time is puzzling too.
Hm. Is it reproducible? Across reboots?
--
Kirill A. Shutemov
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-06 9:51 ` Kirill A. Shutemov
@ 2016-06-08 7:21 ` Huang, Ying
-1 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-08 7:21 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: kernel test robot, Rik van Riel, Michal Hocko, lkp, LKML,
Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
Andrew Morton, Linus Torvalds
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>>
>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>>
>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: unixbench
>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>>
>> commit:
>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>>
>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>> ---------------- --------------------------
>> fail:runs %reproduction fail:runs
>> | | |
>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> %stddev %change %stddev
>> \ | \
>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>
> That's weird.
>
> I don't understand why the change would reduce the number of minor faults.
> It should stay the same on x86-64. The rise in user_time is puzzling too.
unixbench runs in fixed-time mode. That is, the total time to run
unixbench is fixed, but the amount of work done varies. So the
minor_page_faults change may simply reflect the amount of work done.
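The fixed-time argument can be sanity-checked with the numbers reported above: if minor faults track work done, faults per unit of score should be nearly constant across the two commits. A quick check (using the score and minor_page_faults figures from the first comparison):

```python
# Scores and minor page faults from the two commits above.
base_score, base_faults = 14321, 1.721e8   # parent 4b50bcc7eda4d3cc
new_score, new_faults = 13425, 1.613e8     # commit 5c0a85fad9

# Faults per unit of unixbench score should barely move if faults
# scale with the amount of work done in the fixed time window.
per_unit_base = base_faults / base_score
per_unit_new = new_faults / new_score
rel_change = (per_unit_new - per_unit_base) / per_unit_base

print(f"faults per score unit: {per_unit_base:.0f} vs {per_unit_new:.0f} "
      f"({rel_change:+.2%})")
# → faults per score unit: 12017 vs 12015 (-0.02%)
```

The per-work-unit fault rate changes by only ~0.02%, consistent with the fault count tracking throughput rather than per-fault behavior.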
> Hm. Is it reproducible? Across reboots?
Yes. LKP reboots the machine (via kexec) before every benchmark run. We
ran the test 3 times for both the commit and its parent, and the result
is quite stable: the standard deviation in percent is near 0 across
runs. Here is another comparison, this time with profile data.
=========================================================================================
compiler/cpufreq_governor/debug-setup/kconfig/nr_task/rootfs/tbox_group/test/testcase:
gcc-4.9/performance/profile/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
commit:
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
5c0a85fad949212b3e059692deecdeed74ae7ec7
4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
---------------- --------------------------
%stddev %change %stddev
\ | \
14056 ± 0% -6.3% 13172 ± 0% unixbench.score
6464046 ± 0% -6.1% 6071922 ± 0% unixbench.time.involuntary_context_switches
5.555e+08 ± 0% -6.2% 5.211e+08 ± 0% unixbench.time.minor_page_faults
2537 ± 0% -3.2% 2455 ± 0% unixbench.time.system_time
1284 ± 0% +5.8% 1359 ± 0% unixbench.time.user_time
19192611 ± 0% -6.2% 18010830 ± 0% unixbench.time.voluntary_context_switches
7709931 ± 0% -11.0% 6860574 ± 0% cpuidle.C1-HSW.usage
6900 ± 1% -43.9% 3871 ± 0% proc-vmstat.nr_active_file
40813 ± 1% -77.9% 9015 ±114% softirqs.NET_RX
111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
93169 ± 0% -5.8% 87766 ± 0% vmstat.system.cs
19768 ± 0% -1.7% 19437 ± 0% vmstat.system.in
6.22 ± 0% +10.3% 6.86 ± 0% turbostat.CPU%c3
0.02 ± 20% -85.7% 0.00 ±141% turbostat.Pkg%pc3
68.99 ± 0% -1.7% 67.84 ± 0% turbostat.PkgWatt
1.38 ± 5% -42.0% 0.80 ± 5% perf-profile.cycles-pp.page_remove_rmap.unmap_page_range.unmap_single_vma.unmap_vmas.exit_mmap
0.83 ± 4% +28.8% 1.07 ± 21% perf-profile.cycles-pp.release_pages.free_pages_and_swap_cache.tlb_flush_mmu_free.tlb_finish_mmu.exit_mmap
1.55 ± 3% -10.6% 1.38 ± 2% perf-profile.cycles-pp.unmap_single_vma.unmap_vmas.exit_mmap.mmput.flush_old_exec
1.59 ± 3% -9.8% 1.44 ± 3% perf-profile.cycles-pp.unmap_vmas.exit_mmap.mmput.flush_old_exec.load_elf_binary
389.00 ± 0% +32.1% 514.00 ± 8% slabinfo.file_lock_cache.active_objs
389.00 ± 0% +32.1% 514.00 ± 8% slabinfo.file_lock_cache.num_objs
7075 ± 3% -17.7% 5823 ± 7% slabinfo.pid.active_objs
7075 ± 3% -17.7% 5823 ± 7% slabinfo.pid.num_objs
0.67 ± 34% +86.4% 1.24 ± 30% sched_debug.cfs_rq:/.runnable_load_avg.min
-9013 ± -1% +14.4% -10315 ± -9% sched_debug.cfs_rq:/.spread0.avg
83127 ± 5% +16.9% 97163 ± 8% sched_debug.cpu.avg_idle.min
17777 ± 16% +66.6% 29608 ± 22% sched_debug.cpu.curr->pid.avg
50223 ± 10% +49.3% 74974 ± 0% sched_debug.cpu.curr->pid.max
22281 ± 13% +51.8% 33816 ± 6% sched_debug.cpu.curr->pid.stddev
251.79 ± 5% -13.8% 217.15 ± 5% sched_debug.cpu.nr_uninterruptible.max
-261.12 ± -2% -13.4% -226.03 ± -1% sched_debug.cpu.nr_uninterruptible.min
221.14 ± 3% -14.7% 188.60 ± 1% sched_debug.cpu.nr_uninterruptible.stddev
1.94e+11 ± 0% -5.8% 1.827e+11 ± 0% perf-stat.L1-dcache-load-misses
3.496e+12 ± 0% -6.5% 3.268e+12 ± 0% perf-stat.L1-dcache-loads
2.262e+12 ± 1% -5.5% 2.137e+12 ± 0% perf-stat.L1-dcache-stores
9.711e+10 ± 0% -3.7% 9.353e+10 ± 0% perf-stat.L1-icache-load-misses
8.051e+08 ± 0% -8.8% 7.343e+08 ± 1% perf-stat.LLC-load-misses
7.184e+10 ± 1% -5.6% 6.78e+10 ± 0% perf-stat.LLC-loads
5.867e+08 ± 2% -7.0% 5.456e+08 ± 0% perf-stat.LLC-store-misses
1.524e+10 ± 1% -5.6% 1.438e+10 ± 0% perf-stat.LLC-stores
2.711e+12 ± 0% -6.3% 2.539e+12 ± 0% perf-stat.branch-instructions
5.948e+10 ± 0% -3.9% 5.715e+10 ± 0% perf-stat.branch-load-misses
2.715e+12 ± 0% -6.4% 2.542e+12 ± 0% perf-stat.branch-loads
5.947e+10 ± 0% -3.9% 5.713e+10 ± 0% perf-stat.branch-misses
1.448e+09 ± 0% -9.3% 1.313e+09 ± 1% perf-stat.cache-misses
1.931e+11 ± 0% -5.8% 1.818e+11 ± 0% perf-stat.cache-references
58882705 ± 0% -5.8% 55467522 ± 0% perf-stat.context-switches
17037466 ± 0% -6.1% 15999111 ± 0% perf-stat.cpu-migrations
6.732e+09 ± 1% +90.7% 1.284e+10 ± 0% perf-stat.dTLB-load-misses
3.474e+12 ± 0% -6.6% 3.245e+12 ± 0% perf-stat.dTLB-loads
1.215e+09 ± 0% -5.5% 1.149e+09 ± 0% perf-stat.dTLB-store-misses
2.286e+12 ± 0% -5.8% 2.153e+12 ± 0% perf-stat.dTLB-stores
3.511e+09 ± 0% +20.4% 4.226e+09 ± 0% perf-stat.iTLB-load-misses
2.317e+09 ± 0% -6.8% 2.16e+09 ± 0% perf-stat.iTLB-loads
1.343e+13 ± 0% -6.0% 1.263e+13 ± 0% perf-stat.instructions
5.504e+08 ± 0% -6.2% 5.163e+08 ± 0% perf-stat.minor-faults
8.09e+08 ± 1% -9.0% 7.36e+08 ± 1% perf-stat.node-loads
5.932e+08 ± 0% -8.7% 5.417e+08 ± 1% perf-stat.node-stores
5.504e+08 ± 0% -6.2% 5.163e+08 ± 0% perf-stat.page-faults
Best Regards,
Huang, Ying
* Re: [mm] 5c0a85fad9: unixbench.score -6.3% regression
@ 2016-06-08 7:21 ` Huang, Ying
0 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-08 7:21 UTC (permalink / raw)
To: lkp
[-- Attachment #1: Type: text/plain, Size: 8064 bytes --]
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>>
>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>>
>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: unixbench
>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>>
>> commit:
>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>>
>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>> ---------------- --------------------------
>> fail:runs %reproduction fail:runs
>> | | |
>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> %stddev %change %stddev
>> \ | \
>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>
> That's weird.
>
> I don't understand why the change would reduce number or minor faults.
> It should stay the same on x86-64. Rise of user_time is puzzling too.
unixbench runs in fixed-time mode. That is, the total time to run
unixbench is fixed, but the amount of work done varies. So the
minor_page_faults change may simply reflect the amount of work done.
> Hm. Is it reproducible? Across reboots?
Yes. LKP runs every benchmark after a reboot via kexec. We ran it 3
times for both the commit and its parent, and the result is quite
stable: the standard deviation in percent is near 0 across different
runs. Here is another comparison, this time with profile data.
=========================================================================================
compiler/cpufreq_governor/debug-setup/kconfig/nr_task/rootfs/tbox_group/test/testcase:
gcc-4.9/performance/profile/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
commit:
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
5c0a85fad949212b3e059692deecdeed74ae7ec7
4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
---------------- --------------------------
%stddev %change %stddev
\ | \
14056 ± 0% -6.3% 13172 ± 0% unixbench.score
6464046 ± 0% -6.1% 6071922 ± 0% unixbench.time.involuntary_context_switches
5.555e+08 ± 0% -6.2% 5.211e+08 ± 0% unixbench.time.minor_page_faults
2537 ± 0% -3.2% 2455 ± 0% unixbench.time.system_time
1284 ± 0% +5.8% 1359 ± 0% unixbench.time.user_time
19192611 ± 0% -6.2% 18010830 ± 0% unixbench.time.voluntary_context_switches
7709931 ± 0% -11.0% 6860574 ± 0% cpuidle.C1-HSW.usage
6900 ± 1% -43.9% 3871 ± 0% proc-vmstat.nr_active_file
40813 ± 1% -77.9% 9015 ±114% softirqs.NET_RX
111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
93169 ± 0% -5.8% 87766 ± 0% vmstat.system.cs
19768 ± 0% -1.7% 19437 ± 0% vmstat.system.in
6.22 ± 0% +10.3% 6.86 ± 0% turbostat.CPU%c3
0.02 ± 20% -85.7% 0.00 ±141% turbostat.Pkg%pc3
68.99 ± 0% -1.7% 67.84 ± 0% turbostat.PkgWatt
1.38 ± 5% -42.0% 0.80 ± 5% perf-profile.cycles-pp.page_remove_rmap.unmap_page_range.unmap_single_vma.unmap_vmas.exit_mmap
0.83 ± 4% +28.8% 1.07 ± 21% perf-profile.cycles-pp.release_pages.free_pages_and_swap_cache.tlb_flush_mmu_free.tlb_finish_mmu.exit_mmap
1.55 ± 3% -10.6% 1.38 ± 2% perf-profile.cycles-pp.unmap_single_vma.unmap_vmas.exit_mmap.mmput.flush_old_exec
1.59 ± 3% -9.8% 1.44 ± 3% perf-profile.cycles-pp.unmap_vmas.exit_mmap.mmput.flush_old_exec.load_elf_binary
389.00 ± 0% +32.1% 514.00 ± 8% slabinfo.file_lock_cache.active_objs
389.00 ± 0% +32.1% 514.00 ± 8% slabinfo.file_lock_cache.num_objs
7075 ± 3% -17.7% 5823 ± 7% slabinfo.pid.active_objs
7075 ± 3% -17.7% 5823 ± 7% slabinfo.pid.num_objs
0.67 ± 34% +86.4% 1.24 ± 30% sched_debug.cfs_rq:/.runnable_load_avg.min
-9013 ± -1% +14.4% -10315 ± -9% sched_debug.cfs_rq:/.spread0.avg
83127 ± 5% +16.9% 97163 ± 8% sched_debug.cpu.avg_idle.min
17777 ± 16% +66.6% 29608 ± 22% sched_debug.cpu.curr->pid.avg
50223 ± 10% +49.3% 74974 ± 0% sched_debug.cpu.curr->pid.max
22281 ± 13% +51.8% 33816 ± 6% sched_debug.cpu.curr->pid.stddev
251.79 ± 5% -13.8% 217.15 ± 5% sched_debug.cpu.nr_uninterruptible.max
-261.12 ± -2% -13.4% -226.03 ± -1% sched_debug.cpu.nr_uninterruptible.min
221.14 ± 3% -14.7% 188.60 ± 1% sched_debug.cpu.nr_uninterruptible.stddev
1.94e+11 ± 0% -5.8% 1.827e+11 ± 0% perf-stat.L1-dcache-load-misses
3.496e+12 ± 0% -6.5% 3.268e+12 ± 0% perf-stat.L1-dcache-loads
2.262e+12 ± 1% -5.5% 2.137e+12 ± 0% perf-stat.L1-dcache-stores
9.711e+10 ± 0% -3.7% 9.353e+10 ± 0% perf-stat.L1-icache-load-misses
8.051e+08 ± 0% -8.8% 7.343e+08 ± 1% perf-stat.LLC-load-misses
7.184e+10 ± 1% -5.6% 6.78e+10 ± 0% perf-stat.LLC-loads
5.867e+08 ± 2% -7.0% 5.456e+08 ± 0% perf-stat.LLC-store-misses
1.524e+10 ± 1% -5.6% 1.438e+10 ± 0% perf-stat.LLC-stores
2.711e+12 ± 0% -6.3% 2.539e+12 ± 0% perf-stat.branch-instructions
5.948e+10 ± 0% -3.9% 5.715e+10 ± 0% perf-stat.branch-load-misses
2.715e+12 ± 0% -6.4% 2.542e+12 ± 0% perf-stat.branch-loads
5.947e+10 ± 0% -3.9% 5.713e+10 ± 0% perf-stat.branch-misses
1.448e+09 ± 0% -9.3% 1.313e+09 ± 1% perf-stat.cache-misses
1.931e+11 ± 0% -5.8% 1.818e+11 ± 0% perf-stat.cache-references
58882705 ± 0% -5.8% 55467522 ± 0% perf-stat.context-switches
17037466 ± 0% -6.1% 15999111 ± 0% perf-stat.cpu-migrations
6.732e+09 ± 1% +90.7% 1.284e+10 ± 0% perf-stat.dTLB-load-misses
3.474e+12 ± 0% -6.6% 3.245e+12 ± 0% perf-stat.dTLB-loads
1.215e+09 ± 0% -5.5% 1.149e+09 ± 0% perf-stat.dTLB-store-misses
2.286e+12 ± 0% -5.8% 2.153e+12 ± 0% perf-stat.dTLB-stores
3.511e+09 ± 0% +20.4% 4.226e+09 ± 0% perf-stat.iTLB-load-misses
2.317e+09 ± 0% -6.8% 2.16e+09 ± 0% perf-stat.iTLB-loads
1.343e+13 ± 0% -6.0% 1.263e+13 ± 0% perf-stat.instructions
5.504e+08 ± 0% -6.2% 5.163e+08 ± 0% perf-stat.minor-faults
8.09e+08 ± 1% -9.0% 7.36e+08 ± 1% perf-stat.node-loads
5.932e+08 ± 0% -8.7% 5.417e+08 ± 1% perf-stat.node-stores
5.504e+08 ± 0% -6.2% 5.163e+08 ± 0% perf-stat.page-faults
Best Regards,
Huang, Ying
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-08 7:21 ` Huang, Ying
@ 2016-06-08 8:41 ` Huang, Ying
-1 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-08 8:41 UTC (permalink / raw)
To: Huang, Ying
Cc: Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML,
Linus Torvalds, Michal Hocko, Minchan Kim, Vinayak Menon,
Mel Gorman, Andrew Morton, lkp
"Huang, Ying" <ying.huang@intel.com> writes:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>
>> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>>>
>>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>>>
>>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>>
>>> in testcase: unixbench
>>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>>>
>>>
>>> Details are as below:
>>> -------------------------------------------------------------------------------------------------->
>>>
>>>
>>> =========================================================================================
>>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>>>
>>> commit:
>>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>>>
>>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>>> ---------------- --------------------------
>>> fail:runs %reproduction fail:runs
>>> | | |
>>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>>> %stddev %change %stddev
>>> \ | \
>>> 14321 ± 0% -6.3% 13425 ± 0% unixbench.score
>>> 1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches
>>> 1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults
>>> 758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time
>>> 387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time
>>> 5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches
>>
>> That's weird.
>>
>> I don't understand why the change would reduce the number of minor faults.
>> It should stay the same on x86-64. The rise of user_time is puzzling too.
>
> unixbench runs in fixed time mode. That is, the total time to run
> unixbench is fixed, but the work done varies. So the minor_page_faults
> change may reflect only the work done.
>
>> Hm. Is it reproducible? Across reboots?
>
And FYI, there is no swap set up for the test; the whole root file
system, including the benchmark files, is in tmpfs, so no real page
reclaim will be triggered. But it appears that the active file cache
shrank after the commit.
111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
I think this is the expected behavior of the commit?
Best Regards,
Huang, Ying
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-08 8:41 ` Huang, Ying
@ 2016-06-08 8:58 ` Kirill A. Shutemov
-1 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2016-06-08 8:58 UTC (permalink / raw)
To: Huang, Ying
Cc: Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, lkp
On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> "Huang, Ying" <ying.huang@intel.com> writes:
>
> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >
> >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >>>
> >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >>>
> >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >>>
> >>> in testcase: unixbench
> >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >>>
> >>>
> >>> Details are as below:
> >>> -------------------------------------------------------------------------------------------------->
> >>>
> >>>
> >>> =========================================================================================
> >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >>>
> >>> commit:
> >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
> >>>
> >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> >>> ---------------- --------------------------
> >>> fail:runs %reproduction fail:runs
> >>> | | |
> >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >>> %stddev %change %stddev
> >>> \ | \
> >>> 14321 ± 0% -6.3% 13425 ± 0% unixbench.score
> >>> 1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches
> >>> 1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults
> >>> 758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time
> >>> 387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time
> >>> 5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches
> >>
> >> That's weird.
> >>
> >> I don't understand why the change would reduce the number of minor faults.
> >> It should stay the same on x86-64. The rise of user_time is puzzling too.
> >
> > unixbench runs in fixed time mode. That is, the total time to run
> > unixbench is fixed, but the work done varies. So the minor_page_faults
> > change may reflect only the work done.
> >
> >> Hm. Is it reproducible? Across reboots?
> >
>
> And FYI, there is no swap setup for test, all root file system including
> benchmark files are in tmpfs, so no real page reclaim will be
> triggered. But it appears that active file cache reduced after the
> commit.
>
> 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
> 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
>
> I think this is the expected behavior of the commit?
Yes, it's expected.
After the change, faultaround produces old PTEs. That means there's a
better chance for these pages to end up on the inactive LRU, unless
somebody actually touches them and flips the accessed bit.
I wonder whether this regression can be attributed to the cost of
setting the accessed bit. That looks too high, but who knows.
I don't have time to do the testing myself right now. I will put this
on my todo list.
--
Kirill A. Shutemov
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-08 8:58 ` Kirill A. Shutemov
@ 2016-06-12 0:49 ` Huang, Ying
-1 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-12 0:49 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds,
Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
Andrew Morton, lkp
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> >
>> >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> >>>
>> >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> >>>
>> >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >>>
>> >>> in testcase: unixbench
>> >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> >>>
>> >>>
>> >>> Details are as below:
>> >>> -------------------------------------------------------------------------------------------------->
>> >>>
>> >>>
>> >>> =========================================================================================
>> >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> >>>
>> >>> commit:
>> >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>> >>>
>> >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>> >>> ---------------- --------------------------
>> >>> fail:runs %reproduction fail:runs
>> >>> | | |
>> >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >>> %stddev %change %stddev
>> >>> \ | \
>> >>> 14321 ± 0% -6.3% 13425 ± 0% unixbench.score
>> >>> 1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches
>> >>> 1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults
>> >>> 758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time
>> >>> 387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time
>> >>> 5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches
>> >>
>> >> That's weird.
>> >>
>> >> I don't understand why the change would reduce the number of minor faults.
>> >> It should stay the same on x86-64. The rise of user_time is puzzling too.
>> >
>> > unixbench runs in fixed time mode. That is, the total time to run
>> > unixbench is fixed, but the work done varies. So the minor_page_faults
>> > change may reflect only the work done.
>> >
>> >> Hm. Is it reproducible? Across reboots?
>> >
>>
>> And FYI, there is no swap setup for test, all root file system including
>> benchmark files are in tmpfs, so no real page reclaim will be
>> triggered. But it appears that active file cache reduced after the
>> commit.
>>
>> 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
>> 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
>>
>> I think this is the expected behavior of the commit?
>
> Yes, it's expected.
>
> After the change faularound would produce old pte. It means there's more
> chance for these pages to be on inactive lru, unless somebody actually
> touch them and flip accessed bit.
>
> I wounder if this regression can attributed to cost of setting accessed
> bit. It looks too high, but who knows.
From the perf profile, the time spent in page_fault and its child
functions is almost the same (7.85% vs 7.81%), so the time spent in the
page fault and page table operations themselves didn't change much. So,
do you mean the CPU may be slower to load a page table entry into the
TLB if the accessed bit is not set?
> I don't have time to do testing myself right now. I will put this on todo
> list.
Which kind of test do you want to do? I want to check whether I can help.
Best Regards,
Huang, Ying
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-12 0:49 ` Huang, Ying
@ 2016-06-12 1:02 ` Linus Torvalds
-1 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2016-06-12 1:02 UTC (permalink / raw)
To: Huang, Ying
Cc: Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML,
Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
Andrew Morton, LKP
On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
>
> From perf profile, the time spent in page_fault and its children
> functions are almost same (7.85% vs 7.81%). So the time spent in page
> fault and page table operation itself doesn't changed much. So, you
> mean CPU may be slower to load the page table entry to TLB if accessed
> bit is not set?
So the CPU does take a microfault internally when it needs to set the
accessed/dirty bit. It's not architecturally visible, but you can see
it when you do timing loops.
I've timed it at over a thousand cycles on at least some CPU's, but
that's still peanuts compared to a real page fault. It shouldn't be
*that* noticeable, ie no way it's a 6% regression on its own.
Linus
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-12 1:02 ` Linus Torvalds
@ 2016-06-13 9:02 ` Huang, Ying
-1 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-13 9:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Huang, Ying, Kirill A. Shutemov, Rik van Riel, Michal Hocko,
LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
Andrew Morton, LKP
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> From perf profile, the time spent in page_fault and its children
>> functions are almost same (7.85% vs 7.81%). So the time spent in page
>> fault and page table operation itself doesn't changed much. So, you
>> mean CPU may be slower to load the page table entry to TLB if accessed
>> bit is not set?
>
> So the CPU does take a microfault internally when it needs to set the
> accessed/dirty bit. It's not architecturally visible, but you can see
> it when you do timing loops.
>
> I've timed it at over a thousand cycles on at least some CPU's, but
> that's still peanuts compared to a real page fault. It shouldn't be
> *that* noticeable, ie no way it's a 6% regression on its own.
I did some simple counting and found that about 3.15e9 PTEs are set old
during the test after the commit. This may explain the user_time
increase below, because these accessed-bit microfaults are accounted as
user time.
387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time
I also made a one-line debug patch, below, on top of the commit to set
the PTEs young unconditionally; it recovers the regression.
modified   mm/filemap.c
@@ -2193,7 +2193,7 @@ repeat:
 		if (file->f_ra.mmap_miss > 0)
 			file->f_ra.mmap_miss--;
 		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
-		do_set_pte(vma, addr, page, pte, false, false, true);
+		do_set_pte(vma, addr, page, pte, false, false, false);
 		unlock_page(page);
 		atomic64_inc(&old_pte_count);
 		goto next;
Best Regards,
Huang, Ying
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-12 1:02 ` Linus Torvalds
@ 2016-06-13 12:52 ` Kirill A. Shutemov
-1 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2016-06-13 12:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko,
Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP,
Dave Hansen
On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
> >
> > From the perf profile, the time spent in page_fault and its child
> > functions is almost the same (7.85% vs 7.81%), so the time spent in the
> > page fault and page table operations themselves hasn't changed much. So,
> > you mean the CPU may be slower to load the page table entry into the TLB
> > if the accessed bit is not set?
>
> So the CPU does take a microfault internally when it needs to set the
> accessed/dirty bit. It's not architecturally visible, but you can see
> it when you do timing loops.
>
> I've timed it at over a thousand cycles on at least some CPU's, but
> that's still peanuts compared to a real page fault. It shouldn't be
> *that* noticeable, ie no way it's a 6% regression on its own.
Looks like setting the accessed bit is the problem.
Without mkold:
Score: 1952.9
Performance counter stats for './Run shell8 -c 1' (3 runs):
468,562,316,621 cycles:u ( +- 0.02% )
4,596,299,472 dtlb_load_misses_walk_duration:u ( +- 0.07% )
5,245,488,559 itlb_misses_walk_duration:u ( +- 0.10% )
189.336404566 seconds time elapsed ( +- 0.01% )
With mkold:
Score: 1885.5
Performance counter stats for './Run shell8 -c 1' (3 runs):
503,185,676,256 cycles:u ( +- 0.06% )
8,137,007,894 dtlb_load_misses_walk_duration:u ( +- 0.85% )
7,220,632,283 itlb_misses_walk_duration:u ( +- 1.40% )
189.363223499 seconds time elapsed ( +- 0.01% )
We spend 36% more time in page walks alone, about 1% of total userspace
time. Combining this with the page walk footprint on caches, I guess we
can get to the 3.5% score difference I see.
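The 36% and ~1% figures can be recomputed directly from the perf counters above; this is an editorial sanity check (assuming the walk_duration counters count cycles, as they do on Haswell), not part of the original thread:

```python
# Recompute Kirill's figures from the perf counter stats quoted above.
walk_without = 4_596_299_472 + 5_245_488_559   # dtlb + itlb walk cycles, no mkold
walk_with    = 8_137_007_894 + 7_220_632_283   # dtlb + itlb walk cycles, with mkold
cycles_with  = 503_185_676_256                 # total cycles:u, with mkold

extra = walk_with - walk_without
print(f"extra walk cycles as share of walk time: {extra / walk_with:.0%}")
print(f"extra walk cycles as share of all user cycles: {extra / cycles_with:.1%}")
```

The extra walk cycles are 36% of the with-mkold walk time and about 1.1% of all user cycles, matching the figures in the message.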
I'm not sure if there's anything we can do to solve the issue without
screwing the reclaim logic again. :(
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-13 12:52 ` Kirill A. Shutemov
@ 2016-06-14 6:11 ` Linus Torvalds
-1 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2016-06-14 6:11 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko,
Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP,
Dave Hansen
On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
>>
>> I've timed it at over a thousand cycles on at least some CPU's, but
>> that's still peanuts compared to a real page fault. It shouldn't be
>> *that* noticeable, ie no way it's a 6% regression on its own.
>
> Looks like setting accessed bit is the problem.
Ok. I've definitely seen it as an issue, but never to the point of
several percent on a real benchmark that wasn't explicitly testing
that cost.
I reported the excessive dirty/accessed bit cost to Intel back in the
P4 days, but it's apparently not been high enough for anybody to care.
> We spend 36% more time in page walk only, about 1% of total userspace time.
> Combining this with page walk footprint on caches, I guess we can get to
> this 3.5% score difference I see.
>
> I'm not sure if there's anything we can do to solve the issue without
> screwing relacim logic again. :(
I think we should say "screw the reclaim logic" and revert
commit 5c0a85fad949 for now.
Considering how much trouble the accessed bit is on some other
architectures too, I wonder if we should strive to simply not care
about it and always leave it set, and then rely entirely on just
unmapping the pages, making "we took a page fault after unmapping"
the real activity tester.
So get rid of the "if the page is young, mark it old but leave it in
the page tables" logic entirely. When we unmap a page, it will always
either be in the swap cache or the page cache anyway, so faulting it
in again should be just a minor fault with no actual IO happening.
That might be less of an impact in the end - yes, the unmap and
re-fault is much more expensive, but it presumably happens to much
fewer pages.
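The scheme described here can be illustrated with a toy model (plain Python, not kernel code; the Page class and functions are invented for illustration): reclaim unmaps inactive pages instead of clearing an accessed bit, and the minor fault on the next touch is the activity signal.

```python
# Toy model of "unmap and refault" activity tracking. Not kernel code:
# a real page stays in the page/swap cache after unmap, so the refault
# is a minor fault with no I/O, which is the premise of the proposal.

class Page:
    def __init__(self):
        self.mapped = True
        self.active = False

def reclaim_pass(pages):
    """Unmap every inactive page instead of just marking it old."""
    for p in pages:
        if not p.active:
            p.mapped = False
        p.active = False   # aging: every page starts the next round cold

def touch(p):
    """A user access: remapping via minor fault is what proves activity."""
    if not p.mapped:
        p.mapped = True    # minor fault: page is still in page/swap cache
        p.active = True    # the refault is the activity signal

pages = [Page() for _ in range(4)]
reclaim_pass(pages)
touch(pages[0])            # only page 0 is used after the pass
print([p.active for p in pages])
```

As Linus notes, the refault is much more expensive than an accessed-bit microfault, but in this scheme it is taken only by pages that are actually reused after a reclaim pass.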
What do you think?
Linus
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-14 6:11 ` Linus Torvalds
@ 2016-06-14 8:26 ` Kirill A. Shutemov
-1 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2016-06-14 8:26 UTC (permalink / raw)
To: Linus Torvalds, Rik van Riel, Mel Gorman
Cc: Kirill A. Shutemov, Huang, Ying, Michal Hocko, LKML,
Michal Hocko, Minchan Kim, Vinayak Menon, Andrew Morton, LKP,
Dave Hansen, Vladimir Davydov
On Mon, Jun 13, 2016 at 11:11:05PM -0700, Linus Torvalds wrote:
> On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
> >>
> >> I've timed it at over a thousand cycles on at least some CPU's, but
> >> that's still peanuts compared to a real page fault. It shouldn't be
> >> *that* noticeable, ie no way it's a 6% regression on its own.
> >
> > Looks like setting accessed bit is the problem.
>
> Ok. I've definitely seen it as an issue, but never to the point of
> several percent on a real benchmark that wasn't explicitly testing
> that cost.
>
> I reported the excessive dirty/accessed bit cost to Intel back in the
> P4 days, but it's apparently not been high enough for anybody to care.
>
> > We spend 36% more time in page walk only, about 1% of total userspace time.
> > Combining this with page walk footprint on caches, I guess we can get to
> > this 3.5% score difference I see.
> >
> > I'm not sure if there's anything we can do to solve the issue without
> > screwing relacim logic again. :(
>
> I think we should say "screw the reclaim logic" for now, and revert
> commit 5c0a85fad949 for now.
Okay. I'll prepare the patch.
> Considering how much trouble the accessed bit is on some other
> architectures too, I wonder if we should strive to simply not care
> about it, and always leaving it set. And then rely entirely on just
> unmapping the pages and making the "we took a page fault after
> unmapping" be the real activity tester.
>
> So get rid of the "if the page is young, mark it old but leave it in
> the page tables" logic entirely. When we unmap a page, it will always
> either be in the swap cache or the page cache anyway, so faulting it
> in again should be just a minor fault with no actual IO happening.
>
> That might be less of an impact in the end - yes, the unmap and
> re-fault is much more expensive, but it presumably happens to much
> fewer pages.
>
> What do you think?
Well, we cannot do this for anonymous memory. No swap -- no swap cache, if
I read the code correctly.
I guess it's doable for file mappings, although I would expect regressions
in other benchmarks. IIUC, it would require page unmapping to propagate a
page to the active list, which is suboptimal.
And the implications for page_idle are not clear to me.
Rik, Mel, any comments?
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-08 8:58 ` Kirill A. Shutemov
@ 2016-06-14 8:57 ` Minchan Kim
-1 siblings, 0 replies; 46+ messages in thread
From: Minchan Kim @ 2016-06-14 8:57 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds,
Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp
On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> > "Huang, Ying" <ying.huang@intel.com> writes:
> >
> > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> > >
> > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> > >>>
> > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> > >>>
> > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > >>>
> > >>> in testcase: unixbench
> > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> > >>>
> > >>>
> > >>> Details are as below:
> > >>> -------------------------------------------------------------------------------------------------->
> > >>>
> > >>>
> > >>> =========================================================================================
> > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> > >>>
> > >>> commit:
> > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
> > >>>
> > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> > >>> ---------------- --------------------------
> > >>> fail:runs %reproduction fail:runs
> > >>> | | |
> > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> > >>> %stddev %change %stddev
> > >>> \ | \
> > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
> > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
> > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
> > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
> > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
> > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
> > >>
> > >> That's weird.
> > >>
> > >> I don't understand why the change would reduce the number of minor faults.
> > >> It should stay the same on x86-64. The rise of user_time is puzzling too.
> > >
> > > unixbench runs in fixed time mode. That is, the total time to run
> > > unixbench is fixed, but the work done varies. So the minor_page_faults
> > > change may reflect only the work done.
> > >
> > >> Hm. Is reproducible? Across reboot?
> > >
> >
> > And FYI, there is no swap set up for the test; the whole root file
> > system, including the benchmark files, is in tmpfs, so no real page
> > reclaim will be triggered. But it appears that the active file cache
> > shrank after the commit.
> >
> > 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
> > 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
> >
> > I think this is the expected behavior of the commit?
>
> Yes, it's expected.
>
> After the change faultaround would produce old PTEs. It means there's more
> chance for these pages to be on the inactive LRU, unless somebody actually
> touches them and flips the accessed bit.
Hmm, tmpfs pages should be on the anonymous LRU list, and the VM shouldn't
scan the anonymous LRU list on a swapless system, so I really wonder why
the active file LRU shrank.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-13 9:02 ` Huang, Ying
@ 2016-06-14 13:38 ` Minchan Kim
-1 siblings, 0 replies; 46+ messages in thread
From: Minchan Kim @ 2016-06-14 13:38 UTC (permalink / raw)
To: Huang, Ying
Cc: Linus Torvalds, Kirill A. Shutemov, Rik van Riel, Michal Hocko,
LKML, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton,
LKP
On Mon, Jun 13, 2016 at 05:02:15PM +0800, Huang, Ying wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> From the perf profile, the time spent in page_fault and its child
> >> functions is almost the same (7.85% vs 7.81%), so the time spent in the
> >> page fault and page table operations themselves hasn't changed much. So,
> >> you mean the CPU may be slower to load the page table entry into the TLB
> >> if the accessed bit is not set?
> >
> > So the CPU does take a microfault internally when it needs to set the
> > accessed/dirty bit. It's not architecturally visible, but you can see
> > it when you do timing loops.
> >
> > I've timed it at over a thousand cycles on at least some CPU's, but
> > that's still peanuts compared to a real page fault. It shouldn't be
> > *that* noticeable, ie no way it's a 6% regression on its own.
>
> I did some simple counting, and found that about 3.15e9 PTEs are set to
> old during the test after the commit. This may explain the user_time
> increase below, because the accessed-bit microfaults are accounted as
> user time.
>
> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>
> I also made a one-line debug patch, below, on top of the commit to set
> the PTE to young unconditionally, which recovers the regression.
With this patch, meminfo.Active(file) is almost same unlike previous
experiment?
>
> modified mm/filemap.c
> @@ -2193,7 +2193,7 @@ repeat:
> if (file->f_ra.mmap_miss > 0)
> file->f_ra.mmap_miss--;
> addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> - do_set_pte(vma, addr, page, pte, false, false, true);
> + do_set_pte(vma, addr, page, pte, false, false, false);
> unlock_page(page);
> atomic64_inc(&old_pte_count);
> goto next;
>
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-14 6:11 ` Linus Torvalds
@ 2016-06-14 14:03 ` Christian Borntraeger
-1 siblings, 0 replies; 46+ messages in thread
From: Christian Borntraeger @ 2016-06-14 14:03 UTC (permalink / raw)
To: Linus Torvalds, Kirill A. Shutemov
Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko,
Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP,
Dave Hansen, Martin Schwidefsky, linux-s390
On 06/14/2016 08:11 AM, Linus Torvalds wrote:
> On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
>>>
>>> I've timed it at over a thousand cycles on at least some CPU's, but
>>> that's still peanuts compared to a real page fault. It shouldn't be
>>> *that* noticeable, ie no way it's a 6% regression on its own.
>>
>> Looks like setting accessed bit is the problem.
>
> Ok. I've definitely seen it as an issue, but never to the point of
> several percent on a real benchmark that wasn't explicitly testing
> that cost.
>
> I reported the excessive dirty/accessed bit cost to Intel back in the
> P4 days, but it's apparently not been high enough for anybody to care.
>
>> We spend 36% more time in page walk only, about 1% of total userspace time.
>> Combining this with page walk footprint on caches, I guess we can get to
>> this 3.5% score difference I see.
>>
>> I'm not sure if there's anything we can do to solve the issue without
>> screwing relacim logic again. :(
>
> I think we should say "screw the reclaim logic" for now, and revert
> commit 5c0a85fad949 for now.
>
> Considering how much trouble the accessed bit is on some other
> architectures too, I wonder if we should strive to simply not care
> about it, and always leaving it set. And then rely entirely on just
> unmapping the pages and making the "we took a page fault after
> unmapping" be the real activity tester.
>
> So get rid of the "if the page is young, mark it old but leave it in
> the page tables" logic entirely. When we unmap a page, it will always
> either be in the swap cache or the page cache anyway, so faulting it
> in again should be just a minor fault with no actual IO happening.
>
> That might be less of an impact in the end - yes, the unmap and
> re-fault is much more expensive, but it presumably happens to much
> fewer pages.
FWIW, something like that is what Martin did for s390 3 years ago.
We now use invalidation and page faults to implement the *young
functions in pgtable.h (basically using a SW young bit). This
helped us to get rid of the storage keys (which contain the HW
reference bit). The performance did not seem to suffer.
See commit 0944fe3f4a323f436180d39402cae7f9c46ead17
s390/mm: implement software referenced bits
>
> What do you think?
Your proposal would be to do the software tracking via invalidation/fault
as part of the generic mm code, and not hide it in the architecture
backend. Correct?
>
> Linus
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [mm] 5c0a85fad9: unixbench.score -6.3% regression
@ 2016-06-14 14:03 ` Christian Borntraeger
0 siblings, 0 replies; 46+ messages in thread
From: Christian Borntraeger @ 2016-06-14 14:03 UTC (permalink / raw)
To: lkp
[-- Attachment #1: Type: text/plain, Size: 2634 bytes --]
On 06/14/2016 08:11 AM, Linus Torvalds wrote:
> On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
>>>
>>> I've timed it at over a thousand cycles on at least some CPU's, but
>>> that's still peanuts compared to a real page fault. It shouldn't be
>>> *that* noticeable, ie no way it's a 6% regression on its own.
>>
>> Looks like setting accessed bit is the problem.
>
> Ok. I've definitely seen it as an issue, but never to the point of
> several percent on a real benchmark that wasn't explicitly testing
> that cost.
>
> I reported the excessive dirty/accessed bit cost to Intel back in the
> P4 days, but it's apparently not been high enough for anybody to care.
>
>> We spend 36% more time in page walk only, about 1% of total userspace time.
>> Combining this with page walk footprint on caches, I guess we can get to
>> this 3.5% score difference I see.
>>
>> I'm not sure if there's anything we can do to solve the issue without
>> screwing relacim logic again. :(
>
> I think we should say "screw the reclaim logic" for now, and revert
> commit 5c0a85fad949 for now.
>
> Considering how much trouble the accessed bit is on some other
> architectures too, I wonder if we should strive to simply not care
> about it, and always leaving it set. And then rely entirely on just
> unmapping the pages and making the "we took a page fault after
> unmapping" be the real activity tester.
>
> So get rid of the "if the page is young, mark it old but leave it in
> the page tables" logic entirely. When we unmap a page, it will always
> either be in the swap cache or the page cache anyway, so faulting it
> in again should be just a minor fault with no actual IO happening.
>
> That might be less of an impact in the end - yes, the unmap and
> re-fault is much more expensive, but it presumably happens to much
> fewer pages.
FWIW, something like that is what Martin did for s390 3 years ago.
We now use invalidation and page faults to implement the *young
functions in pgtable.h (basically using a SW young bit). This
helped us to get rid of the storage keys (which contain the HW
reference bit). The performance did not seem to suffer.
See commit 0944fe3f4a323f436180d39402cae7f9c46ead17
s390/mm: implement software referenced bits
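The scheme Martin describes can be sketched as a toy user-space model (all names here are invented for illustration, not the actual s390 code): the reference state is tracked purely in software by keeping the translation invalid, so the next access faults and the fault handler records the reference while installing the mapping.

```c
#include <stdbool.h>

/* Toy model of a software referenced bit: no hardware accessed bit is
 * consulted. The page is left "invalid" so the next access faults, and
 * the fault handler records the reference while installing the mapping. */
struct toy_page {
	bool mapped;	/* valid translation installed? */
	bool sw_young;	/* software referenced bit */
};

/* Models an access; returns true if a (minor) fault was taken. */
static bool toy_access(struct toy_page *p)
{
	if (p->mapped)
		return false;	/* hot path: no fault, no bookkeeping */
	p->mapped = true;	/* fault handler installs the mapping... */
	p->sw_young = true;	/* ...and records the reference */
	return true;
}

/* Models a test-and-clear-young scan: report and reset the bit, and
 * invalidate the mapping so the next access faults (and is seen) again. */
static bool toy_test_and_clear_young(struct toy_page *p)
{
	bool young = p->sw_young;

	p->sw_young = false;
	p->mapped = false;
	return young;
}
```

Any number of accesses between two scans costs only one fault; the price is one extra minor fault per scan interval for every page that is actually touched.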
>
> What do you think?
Your proposal would be to make the software tracking via
invalidation/fault part of the generic mm code, rather than hiding it
in the architecture backend. Correct?
>
> Linus
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-14 8:57 ` Minchan Kim
@ 2016-06-14 14:34 ` Kirill A. Shutemov
-1 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2016-06-14 14:34 UTC (permalink / raw)
To: Minchan Kim
Cc: Kirill A. Shutemov, Huang, Ying, Rik van Riel, Michal Hocko,
LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman,
Andrew Morton, lkp
On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> > > "Huang, Ying" <ying.huang@intel.com> writes:
> > >
> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> > > >
> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> > > >>>
> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> > > >>>
> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > > >>>
> > > >>> in testcase: unixbench
> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> > > >>>
> > > >>>
> > > >>> Details are as below:
> > > >>> -------------------------------------------------------------------------------------------------->
> > > >>>
> > > >>>
> > > >>> =========================================================================================
> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> > > >>>
> > > >>> commit:
> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
> > > >>>
> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> > > >>> ---------------- --------------------------
> > > >>> fail:runs %reproduction fail:runs
> > > >>> | | |
> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> > > >>> %stddev %change %stddev
> > > >>> \ | \
> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
> > > >>
> > > >> That's weird.
> > > >>
> > > >> I don't understand why the change would reduce the number of minor faults.
> > > >> It should stay the same on x86-64. The rise of user_time is puzzling too.
> > > >
> > > > unixbench runs in fixed time mode. That is, the total time to run
> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
> > > > change may reflect only the work done.
> > > >
> > > >> Hm. Is it reproducible? Across reboots?
> > > >
> > >
> > > And FYI, there is no swap setup for test, all root file system including
> > > benchmark files are in tmpfs, so no real page reclaim will be
> > > triggered. But it appears that active file cache reduced after the
> > > commit.
> > >
> > > 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
> > > 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
> > >
> > > I think this is the expected behavior of the commit?
> >
> > Yes, it's expected.
> >
> > After the change, faultaround produces old ptes. It means there's more
> > chance for these pages to be on the inactive lru, unless somebody actually
> > touches them and flips the accessed bit.
>
> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
> anonymous LRU list on swapless system so I really wonder why active file
> LRU is shrunk.
Hm. Good point. I don't see why we have anything on the file lru if there are no
filesystems except tmpfs.
Ying, how do you get stuff to the tmpfs?
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-14 8:26 ` Kirill A. Shutemov
@ 2016-06-14 16:07 ` Rik van Riel
-1 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2016-06-14 16:07 UTC (permalink / raw)
To: Kirill A. Shutemov, Linus Torvalds, Mel Gorman
Cc: Kirill A. Shutemov, Huang, Ying, Michal Hocko, LKML,
Michal Hocko, Minchan Kim, Vinayak Menon, Andrew Morton, LKP,
Dave Hansen, Vladimir Davydov
[-- Attachment #1: Type: text/plain, Size: 3579 bytes --]
On Tue, 2016-06-14 at 11:26 +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 13, 2016 at 11:11:05PM -0700, Linus Torvalds wrote:
> >
> > On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
> > > >
> > > >
> > > > I've timed it at over a thousand cycles on at least some CPU's, but
> > > > that's still peanuts compared to a real page fault. It shouldn't be
> > > > *that* noticeable, ie no way it's a 6% regression on its own.
> > > Looks like setting accessed bit is the problem.
> > Ok. I've definitely seen it as an issue, but never to the point of
> > several percent on a real benchmark that wasn't explicitly testing
> > that cost.
> >
> > I reported the excessive dirty/accessed bit cost to Intel back in the
> > P4 days, but it's apparently not been high enough for anybody to care.
> >
> > >
> > > We spend 36% more time in page walks alone, about 1% of total
> > > userspace time. Combining this with the page walk footprint on
> > > caches, I guess we can get to the 3.5% score difference I see.
> > >
> > > I'm not sure if there's anything we can do to solve the issue
> > > without screwing the reclaim logic again. :(
> > I think we should say "screw the reclaim logic" for now, and revert
> > commit 5c0a85fad949 for now.
> Okay. I'll prepare the patch.
>
> >
> > Considering how much trouble the accessed bit is on some other
> > architectures too, I wonder if we should strive to simply not care
> > about it, and always leaving it set. And then rely entirely on just
> > unmapping the pages and making the "we took a page fault after
> > unmapping" be the real activity tester.
> >
> > So get rid of the "if the page is young, mark it old but leave it in
> > the page tables" logic entirely. When we unmap a page, it will always
> > either be in the swap cache or the page cache anyway, so faulting it
> > in again should be just a minor fault with no actual IO happening.
> >
> > That might be less of an impact in the end - yes, the unmap and
> > re-fault is much more expensive, but it presumably happens to much
> > fewer pages.
> >
> > What do you think?
> Well, we cannot do this for anonymous memory. No swap -- no swap cache,
> if I read the code correctly.
>
> I guess it's doable for file mappings. Although I would expect
> regressions in other benchmarks. IIUC, it would require page unmapping
> to propagate the page to the active list, which is suboptimal.
>
> And the implications for page_idle are not clear to me.
>
> Rik, Mel, any comments?
We can clear the accessed/young bit when anon pages are moved
from the active to the inactive list.
Reclaim does not care about the young bit on active anon pages
at all. For anon pages it uses a two-hand clock algorithm, with
only pages on the inactive list being cared about.
For file pages, I believe we do look at the young bit on mapped
pages when they reach the end of the inactive list. Again, we
only care about the young bit on inactive pages.
One option may be to count on actively used file pages actually
being on the active list, and always set the young bit on ptes
when the page is already active.
Then we can let reclaim do its thing with the smaller number of
pages that are on the inactive list, while doing the faster thing
for pages that are on the active list.
Does that make sense?
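The policy above can be sketched as a toy model (invented names, not kernel code): pre-set the young bit at fault-around time only for pages that are already on the active list, and let reclaim consult the bit only at the tail of the inactive list.

```c
#include <stdbool.h>

enum toy_lru { TOY_INACTIVE, TOY_ACTIVE };

struct toy_fpage {
	enum toy_lru lru;
	bool pte_young;
};

/* Fault-around installs a mapping: pre-set the young bit only when the
 * page already sits on the active list, so the CPU never has to
 * micro-fault to set it for hot pages, while cold pages keep the
 * old-pte treatment that lets reclaim observe real activity. */
static void toy_faultaround_map(struct toy_fpage *p)
{
	p->pte_young = (p->lru == TOY_ACTIVE);
}

/* Reclaim check at the tail of the inactive list: the young bit only
 * matters there; it is never consulted for active pages. */
static bool toy_referenced_for_reclaim(struct toy_fpage *p)
{
	if (p->lru == TOY_ACTIVE)
		return true;		/* young bit ignored on active list */
	if (p->pte_young) {
		p->pte_young = false;	/* one more trip around the list */
		return true;
	}
	return false;
}
```

The design point is that the expensive accessed-bit bookkeeping is skipped exactly for the pages where reclaim would ignore the bit anyway.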
--
All Rights Reversed.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-14 13:38 ` Minchan Kim
@ 2016-06-15 23:42 ` Huang, Ying
-1 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-15 23:42 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Linus Torvalds, Kirill A. Shutemov, Rik van Riel,
Michal Hocko, LKML, Michal Hocko, Vinayak Menon, Mel Gorman,
Andrew Morton, LKP
Minchan Kim <minchan@kernel.org> writes:
> On Mon, Jun 13, 2016 at 05:02:15PM +0800, Huang, Ying wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>
>> > On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> From perf profile, the time spent in page_fault and its children
>> >> functions are almost same (7.85% vs 7.81%). So the time spent in page
>> >> fault and page table operation itself doesn't changed much. So, you
>> >> mean CPU may be slower to load the page table entry to TLB if accessed
>> >> bit is not set?
>> >
>> > So the CPU does take a microfault internally when it needs to set the
>> > accessed/dirty bit. It's not architecturally visible, but you can see
>> > it when you do timing loops.
>> >
>> > I've timed it at over a thousand cycles on at least some CPU's, but
>> > that's still peanuts compared to a real page fault. It shouldn't be
>> > *that* noticeable, ie no way it's a 6% regression on its own.
>>
>> I did some simple counting, and found that about 3.15e9 PTEs are set to
>> old during the test after the commit. This may explain the user_time
>> increase below, because these accessed-bit microfaults are accounted as
>> user time.
>>
>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>>
>> I also made a one-line debug patch, below, on top of the commit to set
>> the PTE to young unconditionally, which recovers the regression.
>
> With this patch, meminfo.Active(file) is almost same unlike previous
> experiment?
Yes. meminfo.Active(file) is almost the same as that of the parent commit
of the first bad commit.
Best Regards,
Huang, Ying
>>
>> modified mm/filemap.c
>> @@ -2193,7 +2193,7 @@ repeat:
>> if (file->f_ra.mmap_miss > 0)
>> file->f_ra.mmap_miss--;
>> addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
>> - do_set_pte(vma, addr, page, pte, false, false, true);
>> + do_set_pte(vma, addr, page, pte, false, false, false);
>> unlock_page(page);
>> atomic64_inc(&old_pte_count);
>> goto next;
>>
>> Best Regards,
>> Huang, Ying
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-14 14:34 ` Kirill A. Shutemov
@ 2016-06-15 23:52 ` Huang, Ying
-1 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-15 23:52 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Minchan Kim, Kirill A. Shutemov, Huang, Ying, Rik van Riel,
Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon,
Mel Gorman, Andrew Morton, lkp
"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
>> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
>> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> > > "Huang, Ying" <ying.huang@intel.com> writes:
>> > >
>> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> > > >
>> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> > > >>>
>> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> > > >>>
>> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> > > >>>
>> > > >>> in testcase: unixbench
>> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> > > >>>
>> > > >>>
>> > > >>> Details are as below:
>> > > >>> -------------------------------------------------------------------------------------------------->
>> > > >>>
>> > > >>>
>> > > >>> =========================================================================================
>> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> > > >>>
>> > > >>> commit:
>> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>> > > >>>
>> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>> > > >>> ---------------- --------------------------
>> > > >>> fail:runs %reproduction fail:runs
>> > > >>> | | |
>> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> > > >>> %stddev %change %stddev
>> > > >>> \ | \
>> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
>> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
>> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>> > > >>
>> > > >> That's weird.
>> > > >>
>> > > >> I don't understand why the change would reduce the number of minor faults.
>> > > >> It should stay the same on x86-64. The rise of user_time is puzzling too.
>> > > >
>> > > > unixbench runs in fixed time mode. That is, the total time to run
>> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
>> > > > change may reflect only the work done.
>> > > >
>> > > >> Hm. Is it reproducible? Across reboots?
>> > > >
>> > >
>> > > And FYI, there is no swap setup for test, all root file system including
>> > > benchmark files are in tmpfs, so no real page reclaim will be
>> > > triggered. But it appears that active file cache reduced after the
>> > > commit.
>> > >
>> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active
>> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
>> > >
>> > > I think this is the expected behavior of the commit?
>> >
>> > Yes, it's expected.
>> >
>> > After the change, faultaround produces old ptes. It means there's more
>> > chance for these pages to be on the inactive lru, unless somebody actually
>> > touches them and flips the accessed bit.
>>
>> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
>> anonymous LRU list on swapless system so I really wonder why active file
>> LRU is shrunk.
>
> Hm. Good point. I don't see why we have anything on the file lru if there are no
> filesystems except tmpfs.
>
> Ying, how do you get stuff to the tmpfs?
We put the root file system and the benchmark into a set of compressed cpio
archives, then concatenate them into one initrd, and finally the kernel uses
that initrd as the initramfs.
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-15 23:52 ` Huang, Ying
@ 2016-06-16 0:13 ` Minchan Kim
-1 siblings, 0 replies; 46+ messages in thread
From: Minchan Kim @ 2016-06-16 0:13 UTC (permalink / raw)
To: Huang, Ying
Cc: Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel,
Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon,
Mel Gorman, Andrew Morton, lkp
On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>
> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
> >> > >
> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >> > > >
> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >> > > >>>
> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >> > > >>>
> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> > > >>>
> >> > > >>> in testcase: unixbench
> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >> > > >>>
> >> > > >>>
> >> > > >>> Details are as below:
> >> > > >>> -------------------------------------------------------------------------------------------------->
> >> > > >>>
> >> > > >>>
> >> > > >>> =========================================================================================
> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >> > > >>>
> >> > > >>> commit:
> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
> >> > > >>>
> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> >> > > >>> ---------------- --------------------------
> >> > > >>> fail:runs %reproduction fail:runs
> >> > > >>> | | |
> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >> > > >>> %stddev %change %stddev
> >> > > >>> \ | \
> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
> >> > > >>
> >> > > >> That's weird.
> >> > > >>
> >> > > >> I don't understand why the change would reduce the number of minor faults.
> >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too.
> >> > > >
> >> > > > unixbench runs in fixed time mode. That is, the total time to run
> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
> >> > > > change may reflect only the work done.
> >> > > >
> >> > > >> Hm. Is reproducible? Across reboot?
> >> > > >
> >> > >
> >> > > And FYI, there is no swap setup for test, all root file system including
> >> > > benchmark files are in tmpfs, so no real page reclaim will be
> >> > > triggered. But it appears that active file cache reduced after the
> >> > > commit.
> >> > >
> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active
> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
> >> > >
> >> > > I think this is the expected behavior of the commit?
> >> >
> >> > Yes, it's expected.
> >> >
> >> > After the change faultaround would produce old pte. It means there's more
> >> > chance for these pages to be on inactive lru, unless somebody actually
> >> > touch them and flip accessed bit.
> >>
> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
> >> anonymous LRU list on swapless system so I really wonder why active file
> >> LRU is shrunk.
> >
> > Hm. Good point. I don't know why we have anything on file lru if there's no
> > filesystems except tmpfs.
> >
> > Ying, how do you get stuff to the tmpfs?
>
> We put root file system and benchmark into a set of compressed cpio
> archive, then concatenate them into one initrd, and finally kernel use
> that initrd as initramfs.
I see.
Could you share your 4 full vmstat(/proc/vmstat) files?
old:
cat /proc/vmstat > before.old.vmstat
do benchmark
cat /proc/vmstat > after.old.vmstat
new:
cat /proc/vmstat > before.new.vmstat
do benchmark
cat /proc/vmstat > after.new.vmstat
IOW, I want to see stats related to reclaim.
Thanks.
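[Editor's note: the before/after capture procedure above can be reduced to a small delta script. A sketch only — the `vmstat_delta` helper name is invented here — using awk to join two "name value" snapshots:]

```shell
#!/bin/sh
# Sketch: compute per-counter deltas between two vmstat-style snapshots
# ("name value" per line), as in the before/after capture procedure above.
set -e

vmstat_delta() {
    # $1 = before snapshot, $2 = after snapshot
    awk 'NR == FNR { before[$1] = $2; next }
         $1 in before { print $1, $2 - before[$1] }' "$1" "$2"
}

# Tiny stand-in snapshots; on a real run these would come from
# "cat /proc/vmstat > before.vmstat" etc. around the benchmark.
cat > before.vmstat <<'EOF'
pgfault 1000
pgalloc_normal 500
pgsteal_direct_normal 0
EOF
cat > after.vmstat <<'EOF'
pgfault 1600
pgalloc_normal 900
pgsteal_direct_normal 0
EOF

vmstat_delta before.vmstat after.vmstat
```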
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-16 0:13 ` Minchan Kim
@ 2016-06-16 22:27 ` Huang, Ying
0 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-16 22:27 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Kirill A. Shutemov, Kirill A. Shutemov,
Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
Vinayak Menon, Mel Gorman, Andrew Morton, lkp
[-- Attachment #1: Type: text/plain, Size: 5440 bytes --]
Minchan Kim <minchan@kernel.org> writes:
> On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
>> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>>
>> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
>> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
>> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
>> >> > >
>> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> >> > > >
>> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> >> > > >>>
>> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> >> > > >>>
>> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >> > > >>>
>> >> > > >>> in testcase: unixbench
>> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> >> > > >>>
>> >> > > >>>
>> >> > > >>> Details are as below:
>> >> > > >>> -------------------------------------------------------------------------------------------------->
>> >> > > >>>
>> >> > > >>>
>> >> > > >>> =========================================================================================
>> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> >> > > >>>
>> >> > > >>> commit:
>> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>> >> > > >>>
>> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>> >> > > >>> ---------------- --------------------------
>> >> > > >>> fail:runs %reproduction fail:runs
>> >> > > >>> | | |
>> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >> > > >>> %stddev %change %stddev
>> >> > > >>> \ | \
>> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
>> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
>> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>> >> > > >>
>> >> > > >> That's weird.
>> >> > > >>
>> >> > > >> I don't understand why the change would reduce the number of minor faults.
>> >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too.
>> >> > > >
>> >> > > > unixbench runs in fixed time mode. That is, the total time to run
>> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
>> >> > > > change may reflect only the work done.
>> >> > > >
>> >> > > >> Hm. Is reproducible? Across reboot?
>> >> > > >
>> >> > >
>> >> > > And FYI, there is no swap setup for test, all root file system including
>> >> > > benchmark files are in tmpfs, so no real page reclaim will be
>> >> > > triggered. But it appears that active file cache reduced after the
>> >> > > commit.
>> >> > >
>> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active
>> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
>> >> > >
>> >> > > I think this is the expected behavior of the commit?
>> >> >
>> >> > Yes, it's expected.
>> >> >
>> >> > After the change faultaround would produce old pte. It means there's more
>> >> > chance for these pages to be on inactive lru, unless somebody actually
>> >> > touch them and flip accessed bit.
>> >>
>> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
>> >> anonymous LRU list on swapless system so I really wonder why active file
>> >> LRU is shrunk.
>> >
>> > Hm. Good point. I don't know why we have anything on file lru if there's no
>> > filesystems except tmpfs.
>> >
>> > Ying, how do you get stuff to the tmpfs?
>>
>> We put root file system and benchmark into a set of compressed cpio
>> archive, then concatenate them into one initrd, and finally kernel use
>> that initrd as initramfs.
>
> I see.
>
> Could you share your 4 full vmstat(/proc/vmstat) files?
>
> old:
>
> cat /proc/vmstat > before.old.vmstat
> do benchmark
> cat /proc/vmstat > after.old.vmstat
>
> new:
>
> cat /proc/vmstat > before.new.vmstat
> do benchmark
> cat /proc/vmstat > after.new.vmstat
>
> IOW, I want to see stats related to reclaim.
Hi,
The /proc/vmstat logs for the parent commit (parent-proc-vmstat.gz) and the
first bad commit (fbc-proc-vmstat.gz) are attached to this email.
The files contain more than just the vmstat before and after the benchmark
run: /proc/vmstat is sampled every second, and every sample begins with
"time: <time>". You can check the first and last samples. The first
/proc/vmstat capture starts at the same time as the benchmark, so it is not
exactly the vmstat from before the benchmark run.
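[Editor's note: since the attached logs are one-second samples rather than single before/after snapshots, the first and last samples have to be cut out before computing deltas. A sketch that follows the "time: <time>" format described above (the helper names and sample counters are made up):]

```shell
#!/bin/sh
# Sketch: extract the first and last samples from a periodically sampled
# /proc/vmstat log in which every sample begins with a "time: <time>" line.
set -e

first_sample() {
    awk '/^time: / { n++; if (n == 2) exit; next } n == 1' "$1"
}

last_sample() {
    awk '/^time: / { split("", s); n = 0; next }
         { s[++n] = $0 }
         END { for (i = 1; i <= n; i++) print s[i] }' "$1"
}

# Tiny stand-in log with two samples; the real logs are the attachments.
cat > vmstat.log <<'EOF'
time: 100
pgfault 1000
pgalloc_normal 500
time: 101
pgfault 1600
pgalloc_normal 900
EOF

first_sample vmstat.log
echo '---'
last_sample vmstat.log
```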
[-- Attachment #2: parent-proc-vmstat.gz --]
[-- Type: application/gzip, Size: 78486 bytes --]
[-- Attachment #3: fbc-proc-vmstat.gz --]
[-- Type: application/gzip, Size: 77915 bytes --]
[-- Attachment #4: Type: text/plain, Size: 27 bytes --]
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-16 22:27 ` Huang, Ying
@ 2016-06-17 5:41 ` Minchan Kim
0 siblings, 0 replies; 46+ messages in thread
From: Minchan Kim @ 2016-06-17 5:41 UTC (permalink / raw)
To: Huang, Ying
Cc: Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel,
Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon,
Mel Gorman, Andrew Morton, lkp
On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> >>
> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
> >> >> > >
> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >> >> > > >
> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >> >> > > >>>
> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >> >> > > >>>
> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> >> > > >>>
> >> >> > > >>> in testcase: unixbench
> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >> >> > > >>>
> >> >> > > >>>
> >> >> > > >>> Details are as below:
> >> >> > > >>> -------------------------------------------------------------------------------------------------->
> >> >> > > >>>
> >> >> > > >>>
> >> >> > > >>> =========================================================================================
> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >> >> > > >>>
> >> >> > > >>> commit:
> >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
> >> >> > > >>>
> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> >> >> > > >>> ---------------- --------------------------
> >> >> > > >>> fail:runs %reproduction fail:runs
> >> >> > > >>> | | |
> >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >> >> > > >>> %stddev %change %stddev
> >> >> > > >>> \ | \
> >> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
> >> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
> >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
> >> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
> >> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
> >> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
> >> >> > > >>
> >> >> > > >> That's weird.
> >> >> > > >>
> >> >> > > >> I don't understand why the change would reduce the number of minor faults.
> >> >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too.
> >> >> > > >
> >> >> > > > unixbench runs in fixed time mode. That is, the total time to run
> >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
> >> >> > > > change may reflect only the work done.
> >> >> > > >
> >> >> > > >> Hm. Is reproducible? Across reboot?
> >> >> > > >
> >> >> > >
> >> >> > > And FYI, there is no swap setup for test, all root file system including
> >> >> > > benchmark files are in tmpfs, so no real page reclaim will be
> >> >> > > triggered. But it appears that active file cache reduced after the
> >> >> > > commit.
> >> >> > >
> >> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active
> >> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
> >> >> > >
> >> >> > > I think this is the expected behavior of the commit?
> >> >> >
> >> >> > Yes, it's expected.
> >> >> >
> >> >> > After the change faultaround would produce old pte. It means there's more
> >> >> > chance for these pages to be on inactive lru, unless somebody actually
> >> >> > touch them and flip accessed bit.
> >> >>
> >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
> >> >> anonymous LRU list on swapless system so I really wonder why active file
> >> >> LRU is shrunk.
> >> >
> >> > Hm. Good point. I don't know why we have anything on file lru if there's no
> >> > filesystems except tmpfs.
> >> >
> >> > Ying, how do you get stuff to the tmpfs?
> >>
> >> We put root file system and benchmark into a set of compressed cpio
> >> archive, then concatenate them into one initrd, and finally kernel use
> >> that initrd as initramfs.
> >
> > I see.
> >
> > Could you share your 4 full vmstat(/proc/vmstat) files?
> >
> > old:
> >
> > cat /proc/vmstat > before.old.vmstat
> > do benchmark
> > cat /proc/vmstat > after.old.vmstat
> >
> > new:
> >
> > cat /proc/vmstat > before.new.vmstat
> > do benchmark
> > cat /proc/vmstat > after.new.vmstat
> >
> > IOW, I want to see stats related to reclaim.
>
> Hi,
>
> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
> bad commit (fbc-proc-vmstat.gz) are attached with the email.
>
> The contents of the file is more than the vmstat before and after
> benchmark running, but are sampled every 1 seconds. Every sample begin
> with "time: <time>". You can check the first and last samples. The
> first /proc/vmstat capturing is started at the same time of the
> benchmark, so it is not exactly the vmstat before the benchmark running.
>
Thanks for the testing!
nr_active_file shrank 48%, but the value itself is not huge, so I don't
think it affects performance much.
There was no reclaim activity during the test. :(
pgfault is reduced by 6%, and accordingly pgalloc/pgfree are reduced by 6%
too; since unixbench runs in fixed-time mode and the score regressed 6%,
that is no surprise.
No interesting data.
It seems you tested with THP enabled, maybe in always mode?
Sorry to ask, but could you test again with CONFIG_TRANSPARENT_HUGEPAGE
disabled? Maybe you already did.
Is it still a 6% regression with THP disabled?
counter                          parent-delta    fbc-delta   fbc/parent
nr_free_pages -6663 -6461 96.97%
nr_alloc_batch 2594 4013 154.70%
nr_inactive_anon 112 112 100.00%
nr_active_anon 2536 2159 85.13%
nr_inactive_file -567 -227 40.04%
nr_active_file 648 315 48.61%
nr_unevictable 0 0 0.00%
nr_mlock 0 0 0.00%
nr_anon_pages 2634 2161 82.04%
nr_mapped 511 530 103.72%
nr_file_pages 207 215 103.86%
nr_dirty -7 -6 85.71%
nr_writeback 0 0 0.00%
nr_slab_reclaimable 158 328 207.59%
nr_slab_unreclaimable 2208 2115 95.79%
nr_page_table_pages 268 247 92.16%
nr_kernel_stack 143 80 55.94%
nr_unstable 1 1 100.00%
nr_bounce 0 0 0.00%
nr_vmscan_write 0 0 0.00%
nr_vmscan_immediate_reclaim 0 0 0.00%
nr_writeback_temp 0 0 0.00%
nr_isolated_anon 0 0 0.00%
nr_isolated_file 0 0 0.00%
nr_shmem 131 131 100.00%
nr_dirtied 67 78 116.42%
nr_written 74 84 113.51%
nr_pages_scanned 0 0 0.00%
numa_hit 483752446 453696304 93.79%
numa_miss 0 0 0.00%
numa_foreign 0 0 0.00%
numa_interleave 0 0 0.00%
numa_local 483752445 453696304 93.79%
numa_other 1 0 0.00%
workingset_refault 0 0 0.00%
workingset_activate 0 0 0.00%
workingset_nodereclaim 0 0 0.00%
nr_anon_transparent_hugepages 1 0 0.00%
nr_free_cma 0 0 0.00%
nr_dirty_threshold -1316 -1274 96.81%
nr_dirty_background_threshold -658 -637 96.81%
pgpgin 0 0 0.00%
pgpgout 0 0 0.00%
pswpin 0 0 0.00%
pswpout 0 0 0.00%
pgalloc_dma 0 0 0.00%
pgalloc_dma32 60130977 56323630 93.67%
pgalloc_normal 457203182 428863437 93.80%
pgalloc_movable 0 0 0.00%
pgfree 517327743 485181251 93.79%
pgactivate 2059556 1930950 93.76%
pgdeactivate 0 0 0.00%
pgfault 572723351 537107146 93.78%
pgmajfault 0 0 0.00%
pglazyfreed 0 0 0.00%
pgrefill_dma 0 0 0.00%
pgrefill_dma32 0 0 0.00%
pgrefill_normal 0 0 0.00%
pgrefill_movable 0 0 0.00%
pgsteal_kswapd_dma 0 0 0.00%
pgsteal_kswapd_dma32 0 0 0.00%
pgsteal_kswapd_normal 0 0 0.00%
pgsteal_kswapd_movable 0 0 0.00%
pgsteal_direct_dma 0 0 0.00%
pgsteal_direct_dma32 0 0 0.00%
pgsteal_direct_normal 0 0 0.00%
pgsteal_direct_movable 0 0 0.00%
pgscan_kswapd_dma 0 0 0.00%
pgscan_kswapd_dma32 0 0 0.00%
pgscan_kswapd_normal 0 0 0.00%
pgscan_kswapd_movable 0 0 0.00%
pgscan_direct_dma 0 0 0.00%
pgscan_direct_dma32 0 0 0.00%
pgscan_direct_normal 0 0 0.00%
pgscan_direct_movable 0 0 0.00%
pgscan_direct_throttle 0 0 0.00%
zone_reclaim_failed 0 0 0.00%
pginodesteal 0 0 0.00%
slabs_scanned 0 0 0.00%
kswapd_inodesteal 0 0 0.00%
kswapd_low_wmark_hit_quickly 0 0 0.00%
kswapd_high_wmark_hit_quickly 0 0 0.00%
pageoutrun 0 0 0.00%
allocstall 0 0 0.00%
pgrotated 0 0 0.00%
drop_pagecache 0 0 0.00%
drop_slab 0 0 0.00%
numa_pte_updates 0 0 0.00%
numa_huge_pte_updates 0 0 0.00%
numa_hint_faults 0 0 0.00%
numa_hint_faults_local 0 0 0.00%
numa_pages_migrated 0 0 0.00%
pgmigrate_success 0 0 0.00%
pgmigrate_fail 0 0 0.00%
compact_migrate_scanned 0 0 0.00%
compact_free_scanned 0 0 0.00%
compact_isolated 0 0 0.00%
compact_stall 0 0 0.00%
compact_fail 0 0 0.00%
compact_success 0 0 0.00%
compact_daemon_wake 0 0 0.00%
htlb_buddy_alloc_success 0 0 0.00%
htlb_buddy_alloc_fail 0 0 0.00%
unevictable_pgs_culled 0 0 0.00%
unevictable_pgs_scanned 0 0 0.00%
unevictable_pgs_rescued 0 0 0.00%
unevictable_pgs_mlocked 0 0 0.00%
unevictable_pgs_munlocked 0 0 0.00%
unevictable_pgs_cleared 0 0 0.00%
unevictable_pgs_stranded 0 0 0.00%
thp_fault_alloc 22731 21604 95.04%
thp_fault_fallback 0 0 0.00%
thp_collapse_alloc 1 0 0.00%
thp_collapse_alloc_failed 0 0 0.00%
thp_split_page 0 0 0.00%
thp_split_page_failed 0 0 0.00%
thp_deferred_split_page 22731 21604 95.04%
thp_split_pmd 0 0 0.00%
thp_zero_page_alloc 0 0 0.00%
thp_zero_page_alloc_failed 0 0 0.00%
balloon_inflate 0 0 0.00%
balloon_deflate 0 0 0.00%
balloon_migrate 0 0 0.00%
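[Editor's note: the percentage column in a comparison like the one above can be generated mechanically from the two per-kernel delta files. A sketch only — the `compare_deltas` helper and the delta file names are invented, and the three sample counters are taken from the table above:]

```shell
#!/bin/sh
# Sketch: given per-counter deltas for the parent commit and the first bad
# commit (fbc), print them side by side with the fbc/parent ratio.
set -e

compare_deltas() {
    # $1 = parent deltas, $2 = first-bad-commit deltas ("name value" lines)
    awk 'NR == FNR { parent[$1] = $2; next }
         {
             p = parent[$1]; f = $2
             ratio = (p == 0) ? 0 : 100 * f / p
             printf "%-28s %12d %12d %8.2f%%\n", $1, p, f, ratio
         }' "$1" "$2"
}

# Stand-in delta files using three counters from the comparison above.
cat > parent.delta <<'EOF'
pgfault 572723351
pgactivate 2059556
pgsteal_direct_normal 0
EOF
cat > fbc.delta <<'EOF'
pgfault 537107146
pgactivate 1930950
pgsteal_direct_normal 0
EOF

compare_deltas parent.delta fbc.delta
```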
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [mm] 5c0a85fad9: unixbench.score -6.3% regression
@ 2016-06-17 5:41 ` Minchan Kim
0 siblings, 0 replies; 46+ messages in thread
From: Minchan Kim @ 2016-06-17 5:41 UTC (permalink / raw)
To: lkp
[-- Attachment #1: Type: text/plain, Size: 14301 bytes --]
On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> >>
> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
> >> >> > >
> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >> >> > > >
> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >> >> > > >>>
> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >> >> > > >>>
> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> >> > > >>>
> >> >> > > >>> in testcase: unixbench
> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >> >> > > >>>
> >> >> > > >>>
> >> >> > > >>> Details are as below:
> >> >> > > >>> -------------------------------------------------------------------------------------------------->
> >> >> > > >>>
> >> >> > > >>>
> >> >> > > >>> =========================================================================================
> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >> >> > > >>>
> >> >> > > >>> commit:
> >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
> >> >> > > >>>
> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> >> >> > > >>> ---------------- --------------------------
> >> >> > > >>> fail:runs %reproduction fail:runs
> >> >> > > >>> | | |
> >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >> >> > > >>> %stddev %change %stddev
> >> >> > > >>> \ | \
> >> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
> >> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
> >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
> >> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
> >> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
> >> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
> >> >> > > >>
> >> >> > > >> That's weird.
> >> >> > > >>
> >> >> > > >> I don't understand why the change would reduce the number of minor faults.
> >> >> > > >> It should stay the same on x86-64. The rise of user_time is puzzling too.
> >> >> > > >
> >> >> > > > unixbench runs in fixed time mode. That is, the total time to run
> >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
> >> >> > > > change may reflect only the work done.
> >> >> > > >
> >> >> > > >> Hm. Is it reproducible? Across reboots?
> >> >> > > >
> >> >> > >
> >> >> > > And FYI, there is no swap setup for test, all root file system including
> >> >> > > benchmark files are in tmpfs, so no real page reclaim will be
> >> >> > > triggered. But it appears that active file cache reduced after the
> >> >> > > commit.
> >> >> > >
> >> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active
> >> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
> >> >> > >
> >> >> > > I think this is the expected behavior of the commit?
> >> >> >
> >> >> > Yes, it's expected.
> >> >> >
> >> >> > After the change, faultaround would produce old ptes. It means there's more
> >> >> > chance for these pages to be on the inactive lru, unless somebody actually
> >> >> > touches them and flips the accessed bit.
> >> >>
> >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
> >> >> anonymous LRU list on swapless system so I really wonder why active file
> >> >> LRU is shrunk.
> >> >
> >> > Hm. Good point. I don't know why we have anything on the file lru if there's no
> >> > filesystem except tmpfs.
> >> >
> >> > Ying, how do you get stuff to the tmpfs?
> >>
> >> We put the root file system and the benchmark into a set of compressed cpio
> >> archives, then concatenate them into one initrd, and finally the kernel
> >> uses that initrd as the initramfs.
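[Editor's note: concatenating independently compressed cpio archives works because the kernel's initramfs unpacker consumes multi-member compressed streams in sequence. A minimal demonstration with tiny stand-in gzip members (file names here are hypothetical, not the actual archives used in the test):]

```shell
# Two independently gzip-compressed segments, stand-ins for the
# rootfs archive and the benchmark archive.
printf 'rootfs' | gzip > part1.cgz
printf '+bench' | gzip > part2.cgz

# Concatenating them yields a multi-member gzip stream; zcat (like the
# kernel's initramfs unpacker) decompresses all members as one stream.
cat part1.cgz part2.cgz > initrd.img
zcat initrd.img
```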
> >
> > I see.
> >
> > Could you share your 4 full vmstat(/proc/vmstat) files?
> >
> > old:
> >
> > cat /proc/vmstat > before.old.vmstat
> > do benchmark
> > cat /proc/vmstat > after.old.vmstat
> >
> > new:
> >
> > cat /proc/vmstat > before.new.vmstat
> > do benchmark
> > cat /proc/vmstat > after.new.vmstat
> >
> > IOW, I want to see stats related to reclaim.
>
> Hi,
>
> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
> bad commit (fbc-proc-vmstat.gz) are attached with the email.
>
> The contents of the files are more than just the vmstat before and after
> the benchmark run; they are sampled every second. Every sample begins
> with "time: <time>". You can check the first and last samples. The
> first /proc/vmstat capture is started at the same time as the
> benchmark, so it is not exactly the vmstat before the benchmark run.
>
Thanks for the testing!
nr_active_file shrank to ~48% of its previous value, but the value itself is
not huge, so I don't think it affects performance much.
There was no reclaim activity during the test. :(
pgfault is reduced by 6%. Given that, pgalloc/pgfree are reduced by 6%
too; since unixbench runs in fixed-time mode and the score regressed by
6%, that is no surprise.
No interesting data.
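[Editor's note: the fixed-time scaling can be sanity-checked against the robot's own figures; if counters scale with work done, the minor-fault count on the bad commit should scale with the score. Values below are taken from the report above; the arithmetic is the editor's, not from the thread:]

```shell
# unixbench.score and minor_page_faults from the robot's report.
score_old=14321
score_new=13425
faults_old=172100000    # 1.721e8, parent commit

# Expected fault count on the bad commit if faults scale with the score.
faults_expected=$(( faults_old * score_new / score_old ))
echo "$faults_expected"  # lands close to the observed 1.613e8
```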
It seems you tested it with THP, maybe always mode?
I'm sorry, but could you test again with THP disabled
(CONFIG_TRANSPARENT_HUGEPAGE=n)? Perhaps you already did.
Is the 6% regression still present with THP disabled?
counter  parent-delta  fbc-delta  fbc/parent
nr_free_pages -6663 -6461 96.97%
nr_alloc_batch 2594 4013 154.70%
nr_inactive_anon 112 112 100.00%
nr_active_anon 2536 2159 85.13%
nr_inactive_file -567 -227 40.04%
nr_active_file 648 315 48.61%
nr_unevictable 0 0 0.00%
nr_mlock 0 0 0.00%
nr_anon_pages 2634 2161 82.04%
nr_mapped 511 530 103.72%
nr_file_pages 207 215 103.86%
nr_dirty -7 -6 85.71%
nr_writeback 0 0 0.00%
nr_slab_reclaimable 158 328 207.59%
nr_slab_unreclaimable 2208 2115 95.79%
nr_page_table_pages 268 247 92.16%
nr_kernel_stack 143 80 55.94%
nr_unstable 1 1 100.00%
nr_bounce 0 0 0.00%
nr_vmscan_write 0 0 0.00%
nr_vmscan_immediate_reclaim 0 0 0.00%
nr_writeback_temp 0 0 0.00%
nr_isolated_anon 0 0 0.00%
nr_isolated_file 0 0 0.00%
nr_shmem 131 131 100.00%
nr_dirtied 67 78 116.42%
nr_written 74 84 113.51%
nr_pages_scanned 0 0 0.00%
numa_hit 483752446 453696304 93.79%
numa_miss 0 0 0.00%
numa_foreign 0 0 0.00%
numa_interleave 0 0 0.00%
numa_local 483752445 453696304 93.79%
numa_other 1 0 0.00%
workingset_refault 0 0 0.00%
workingset_activate 0 0 0.00%
workingset_nodereclaim 0 0 0.00%
nr_anon_transparent_hugepages 1 0 0.00%
nr_free_cma 0 0 0.00%
nr_dirty_threshold -1316 -1274 96.81%
nr_dirty_background_threshold -658 -637 96.81%
pgpgin 0 0 0.00%
pgpgout 0 0 0.00%
pswpin 0 0 0.00%
pswpout 0 0 0.00%
pgalloc_dma 0 0 0.00%
pgalloc_dma32 60130977 56323630 93.67%
pgalloc_normal 457203182 428863437 93.80%
pgalloc_movable 0 0 0.00%
pgfree 517327743 485181251 93.79%
pgactivate 2059556 1930950 93.76%
pgdeactivate 0 0 0.00%
pgfault 572723351 537107146 93.78%
pgmajfault 0 0 0.00%
pglazyfreed 0 0 0.00%
pgrefill_dma 0 0 0.00%
pgrefill_dma32 0 0 0.00%
pgrefill_normal 0 0 0.00%
pgrefill_movable 0 0 0.00%
pgsteal_kswapd_dma 0 0 0.00%
pgsteal_kswapd_dma32 0 0 0.00%
pgsteal_kswapd_normal 0 0 0.00%
pgsteal_kswapd_movable 0 0 0.00%
pgsteal_direct_dma 0 0 0.00%
pgsteal_direct_dma32 0 0 0.00%
pgsteal_direct_normal 0 0 0.00%
pgsteal_direct_movable 0 0 0.00%
pgscan_kswapd_dma 0 0 0.00%
pgscan_kswapd_dma32 0 0 0.00%
pgscan_kswapd_normal 0 0 0.00%
pgscan_kswapd_movable 0 0 0.00%
pgscan_direct_dma 0 0 0.00%
pgscan_direct_dma32 0 0 0.00%
pgscan_direct_normal 0 0 0.00%
pgscan_direct_movable 0 0 0.00%
pgscan_direct_throttle 0 0 0.00%
zone_reclaim_failed 0 0 0.00%
pginodesteal 0 0 0.00%
slabs_scanned 0 0 0.00%
kswapd_inodesteal 0 0 0.00%
kswapd_low_wmark_hit_quickly 0 0 0.00%
kswapd_high_wmark_hit_quickly 0 0 0.00%
pageoutrun 0 0 0.00%
allocstall 0 0 0.00%
pgrotated 0 0 0.00%
drop_pagecache 0 0 0.00%
drop_slab 0 0 0.00%
numa_pte_updates 0 0 0.00%
numa_huge_pte_updates 0 0 0.00%
numa_hint_faults 0 0 0.00%
numa_hint_faults_local 0 0 0.00%
numa_pages_migrated 0 0 0.00%
pgmigrate_success 0 0 0.00%
pgmigrate_fail 0 0 0.00%
compact_migrate_scanned 0 0 0.00%
compact_free_scanned 0 0 0.00%
compact_isolated 0 0 0.00%
compact_stall 0 0 0.00%
compact_fail 0 0 0.00%
compact_success 0 0 0.00%
compact_daemon_wake 0 0 0.00%
htlb_buddy_alloc_success 0 0 0.00%
htlb_buddy_alloc_fail 0 0 0.00%
unevictable_pgs_culled 0 0 0.00%
unevictable_pgs_scanned 0 0 0.00%
unevictable_pgs_rescued 0 0 0.00%
unevictable_pgs_mlocked 0 0 0.00%
unevictable_pgs_munlocked 0 0 0.00%
unevictable_pgs_cleared 0 0 0.00%
unevictable_pgs_stranded 0 0 0.00%
thp_fault_alloc 22731 21604 95.04%
thp_fault_fallback 0 0 0.00%
thp_collapse_alloc 1 0 0.00%
thp_collapse_alloc_failed 0 0 0.00%
thp_split_page 0 0 0.00%
thp_split_page_failed 0 0 0.00%
thp_deferred_split_page 22731 21604 95.04%
thp_split_pmd 0 0 0.00%
thp_zero_page_alloc 0 0 0.00%
thp_zero_page_alloc_failed 0 0 0.00%
balloon_inflate 0 0 0.00%
balloon_deflate 0 0 0.00%
balloon_migrate 0 0 0.00%
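[Editor's note: a delta table like the one above can be produced by diffing the before/after /proc/vmstat snapshots for each kernel. A sketch of the computation — a hypothetical helper, not the script actually used in the thread; snapshot file names are assumptions:]

```shell
# Compute per-counter deltas for two kernels plus the new/old ratio,
# from four /proc/vmstat snapshots (before/after for each kernel).
vmstat_delta() {
    # $1=before.old  $2=after.old  $3=before.new  $4=after.new
    # paste lines up the four files: name v1 name v2 name v3 name v4
    paste "$1" "$2" "$3" "$4" | awk '{
        d_old = $4 - $2          # delta on the parent kernel
        d_new = $8 - $6          # delta on the first bad commit
        ratio = (d_old != 0) ? d_new / d_old * 100 : 0
        printf "%s %d %d %.2f%%\n", $1, d_old, d_new, ratio
    }'
}
```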
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-17 5:41 ` Minchan Kim
@ 2016-06-17 19:26 ` Huang, Ying
-1 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-17 19:26 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Kirill A. Shutemov, Kirill A. Shutemov,
Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
Vinayak Menon, Mel Gorman, Andrew Morton, lkp
Minchan Kim <minchan@kernel.org> writes:
> On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
>> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> >>
>> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
>> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
>> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
>> >> >> > >
>> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> >> >> > > >
>> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> >> >> > > >>>
>> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> >> >> > > >>>
>> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >> >> > > >>>
>> >> >> > > >>> in testcase: unixbench
>> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>> Details are as below:
>> >> >> > > >>> -------------------------------------------------------------------------------------------------->
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>> =========================================================================================
>> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> >> >> > > >>>
>> >> >> > > >>> commit:
>> >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>> >> >> > > >>>
>> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>> >> >> > > >>> ---------------- --------------------------
>> >> >> > > >>> fail:runs %reproduction fail:runs
>> >> >> > > >>> | | |
>> >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >> >> > > >>> %stddev %change %stddev
>> >> >> > > >>> \ | \
>> >> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
>> >> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
>> >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>> >> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>> >> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>> >> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>> >> >> > > >>
>> >> >> > > >> That's weird.
>> >> >> > > >>
>> >> >> > > >> I don't understand why the change would reduce the number of minor faults.
>> >> >> > > >> It should stay the same on x86-64. The rise of user_time is puzzling too.
>> >> >> > > >
>> >> >> > > > unixbench runs in fixed time mode. That is, the total time to run
>> >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
>> >> >> > > > change may reflect only the work done.
>> >> >> > > >
>> >> >> > > >> Hm. Is it reproducible? Across reboots?
>> >> >> > > >
>> >> >> > >
>> >> >> > > And FYI, there is no swap setup for test, all root file system including
>> >> >> > > benchmark files are in tmpfs, so no real page reclaim will be
>> >> >> > > triggered. But it appears that active file cache reduced after the
>> >> >> > > commit.
>> >> >> > >
>> >> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active
>> >> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
>> >> >> > >
>> >> >> > > I think this is the expected behavior of the commit?
>> >> >> >
>> >> >> > Yes, it's expected.
>> >> >> >
>> >> >> > After the change, faultaround would produce old ptes. It means there's more
>> >> >> > chance for these pages to be on the inactive lru, unless somebody actually
>> >> >> > touches them and flips the accessed bit.
>> >> >>
>> >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
>> >> >> anonymous LRU list on swapless system so I really wonder why active file
>> >> >> LRU is shrunk.
>> >> >
>> >> > Hm. Good point. I don't know why we have anything on the file lru if there's no
>> >> > filesystem except tmpfs.
>> >> >
>> >> > Ying, how do you get stuff to the tmpfs?
>> >>
>> >> We put the root file system and the benchmark into a set of compressed cpio
>> >> archives, then concatenate them into one initrd, and finally the kernel
>> >> uses that initrd as the initramfs.
>> >
>> > I see.
>> >
>> > Could you share your 4 full vmstat(/proc/vmstat) files?
>> >
>> > old:
>> >
>> > cat /proc/vmstat > before.old.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.old.vmstat
>> >
>> > new:
>> >
>> > cat /proc/vmstat > before.new.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.new.vmstat
>> >
>> > IOW, I want to see stats related to reclaim.
>>
>> Hi,
>>
>> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
>> bad commit (fbc-proc-vmstat.gz) are attached with the email.
>>
>> The contents of the files are more than just the vmstat before and after
>> the benchmark run; they are sampled every second. Every sample begins
>> with "time: <time>". You can check the first and last samples. The
>> first /proc/vmstat capture is started at the same time as the
>> benchmark, so it is not exactly the vmstat before the benchmark run.
>>
>
> Thanks for the testing!
>
> nr_active_file shrank to ~48% of its previous value, but the value itself is
> not huge, so I don't think it affects performance much.
>
> There was no reclaim activity during the test. :(
>
> pgfault is reduced by 6%. Given that, pgalloc/pgfree are reduced by 6%
> too; since unixbench runs in fixed-time mode and the score regressed by
> 6%, that is no surprise.
>
> No interesting data.
>
> It seems you tested it with THP, maybe always mode?
Yes, with the following in the kconfig:
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
> I'm sorry, but could you test again with THP disabled
> (CONFIG_TRANSPARENT_HUGEPAGE=n)? Perhaps you already did.
> Is the 6% regression still present with THP disabled?
Yes. I disabled THP via:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
The regression is the same as before.
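[Editor's note: the sysfs knobs used above report the active mode in brackets, e.g. `always madvise [never]` after the `echo`s. A small helper to extract it, useful for scripted pre-flight checks — hypothetical, not part of the original test setup:]

```shell
# Extract the bracketed active mode from the contents of a THP sysfs
# knob, e.g. "always madvise [never]" -> "never".
thp_mode() {
    echo "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Typical use (the path exists only on THP-enabled kernels):
#   thp_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
```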
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never
commit:
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
5c0a85fad949212b3e059692deecdeed74ae7ec7
4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
---------------- --------------------------
%stddev %change %stddev
\ | \
14332 ± 0% -6.2% 13438 ± 0% unixbench.score
6662206 ± 0% -6.2% 6252260 ± 0% unixbench.time.involuntary_context_switches
5.734e+08 ± 0% -6.2% 5.376e+08 ± 0% unixbench.time.minor_page_faults
2527 ± 0% -3.2% 2446 ± 0% unixbench.time.system_time
1291 ± 0% +5.4% 1361 ± 0% unixbench.time.user_time
19875455 ± 0% -6.3% 18622488 ± 0% unixbench.time.voluntary_context_switches
6570355 ± 0% -11.9% 5787517 ± 0% cpuidle.C1-HSW.usage
17257 ± 34% -59.1% 7055 ± 7% latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
5976 ± 0% -43.0% 3404 ± 0% proc-vmstat.nr_active_file
45729 ± 1% -22.5% 35439 ± 1% meminfo.Active
23905 ± 0% -43.0% 13619 ± 0% meminfo.Active(file)
8465 ± 3% -29.8% 5940 ± 3% slabinfo.pid.active_objs
8476 ± 3% -29.9% 5940 ± 3% slabinfo.pid.num_objs
3.46 ± 0% +12.5% 3.89 ± 0% turbostat.CPU%c3
67.09 ± 0% -2.1% 65.65 ± 0% turbostat.PkgWatt
96090 ± 0% -5.8% 90479 ± 0% vmstat.system.cs
9083 ± 0% -2.7% 8833 ± 0% vmstat.system.in
467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.MIN_vruntime.avg
7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.MIN_vruntime.max
1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.MIN_vruntime.stddev
467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.max_vruntime.avg
7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.max_vruntime.max
1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.max_vruntime.stddev
-10724 ± -7% -12.0% -9433 ± -3% sched_debug.cfs_rq:/.spread0.avg
-17721 ± -4% -9.8% -15978 ± -2% sched_debug.cfs_rq:/.spread0.min
90355 ± 9% +14.1% 103099 ± 5% sched_debug.cpu.avg_idle.min
0.12 ± 35% +325.0% 0.52 ± 46% sched_debug.cpu.cpu_load[0].min
21913 ± 2% +29.1% 28288 ± 14% sched_debug.cpu.curr->pid.avg
49953 ± 3% +30.2% 65038 ± 0% sched_debug.cpu.curr->pid.max
23062 ± 2% +30.1% 29996 ± 4% sched_debug.cpu.curr->pid.stddev
274.39 ± 5% -10.2% 246.27 ± 3% sched_debug.cpu.nr_uninterruptible.max
242.73 ± 4% -13.5% 209.90 ± 2% sched_debug.cpu.nr_uninterruptible.stddev
Best Regards,
Huang, Ying
> nr_free_pages -6663 -6461 96.97%
> nr_alloc_batch 2594 4013 154.70%
> nr_inactive_anon 112 112 100.00%
> nr_active_anon 2536 2159 85.13%
> nr_inactive_file -567 -227 40.04%
> nr_active_file 648 315 48.61%
> nr_unevictable 0 0 0.00%
> nr_mlock 0 0 0.00%
> nr_anon_pages 2634 2161 82.04%
> nr_mapped 511 530 103.72%
> nr_file_pages 207 215 103.86%
> nr_dirty -7 -6 85.71%
> nr_writeback 0 0 0.00%
> nr_slab_reclaimable 158 328 207.59%
> nr_slab_unreclaimable 2208 2115 95.79%
> nr_page_table_pages 268 247 92.16%
> nr_kernel_stack 143 80 55.94%
> nr_unstable 1 1 100.00%
> nr_bounce 0 0 0.00%
> nr_vmscan_write 0 0 0.00%
> nr_vmscan_immediate_reclaim 0 0 0.00%
> nr_writeback_temp 0 0 0.00%
> nr_isolated_anon 0 0 0.00%
> nr_isolated_file 0 0 0.00%
> nr_shmem 131 131 100.00%
> nr_dirtied 67 78 116.42%
> nr_written 74 84 113.51%
> nr_pages_scanned 0 0 0.00%
> numa_hit 483752446 453696304 93.79%
> numa_miss 0 0 0.00%
> numa_foreign 0 0 0.00%
> numa_interleave 0 0 0.00%
> numa_local 483752445 453696304 93.79%
> numa_other 1 0 0.00%
> workingset_refault 0 0 0.00%
> workingset_activate 0 0 0.00%
> workingset_nodereclaim 0 0 0.00%
> nr_anon_transparent_hugepages 1 0 0.00%
> nr_free_cma 0 0 0.00%
> nr_dirty_threshold -1316 -1274 96.81%
> nr_dirty_background_threshold -658 -637 96.81%
> pgpgin 0 0 0.00%
> pgpgout 0 0 0.00%
> pswpin 0 0 0.00%
> pswpout 0 0 0.00%
> pgalloc_dma 0 0 0.00%
> pgalloc_dma32 60130977 56323630 93.67%
> pgalloc_normal 457203182 428863437 93.80%
> pgalloc_movable 0 0 0.00%
> pgfree 517327743 485181251 93.79%
> pgactivate 2059556 1930950 93.76%
> pgdeactivate 0 0 0.00%
> pgfault 572723351 537107146 93.78%
> pgmajfault 0 0 0.00%
> pglazyfreed 0 0 0.00%
> pgrefill_dma 0 0 0.00%
> pgrefill_dma32 0 0 0.00%
> pgrefill_normal 0 0 0.00%
> pgrefill_movable 0 0 0.00%
> pgsteal_kswapd_dma 0 0 0.00%
> pgsteal_kswapd_dma32 0 0 0.00%
> pgsteal_kswapd_normal 0 0 0.00%
> pgsteal_kswapd_movable 0 0 0.00%
> pgsteal_direct_dma 0 0 0.00%
> pgsteal_direct_dma32 0 0 0.00%
> pgsteal_direct_normal 0 0 0.00%
> pgsteal_direct_movable 0 0 0.00%
> pgscan_kswapd_dma 0 0 0.00%
> pgscan_kswapd_dma32 0 0 0.00%
> pgscan_kswapd_normal 0 0 0.00%
> pgscan_kswapd_movable 0 0 0.00%
> pgscan_direct_dma 0 0 0.00%
> pgscan_direct_dma32 0 0 0.00%
> pgscan_direct_normal 0 0 0.00%
> pgscan_direct_movable 0 0 0.00%
> pgscan_direct_throttle 0 0 0.00%
> zone_reclaim_failed 0 0 0.00%
> pginodesteal 0 0 0.00%
> slabs_scanned 0 0 0.00%
> kswapd_inodesteal 0 0 0.00%
> kswapd_low_wmark_hit_quickly 0 0 0.00%
> kswapd_high_wmark_hit_quickly 0 0 0.00%
> pageoutrun 0 0 0.00%
> allocstall 0 0 0.00%
> pgrotated 0 0 0.00%
> drop_pagecache 0 0 0.00%
> drop_slab 0 0 0.00%
> numa_pte_updates 0 0 0.00%
> numa_huge_pte_updates 0 0 0.00%
> numa_hint_faults 0 0 0.00%
> numa_hint_faults_local 0 0 0.00%
> numa_pages_migrated 0 0 0.00%
> pgmigrate_success 0 0 0.00%
> pgmigrate_fail 0 0 0.00%
> compact_migrate_scanned 0 0 0.00%
> compact_free_scanned 0 0 0.00%
> compact_isolated 0 0 0.00%
> compact_stall 0 0 0.00%
> compact_fail 0 0 0.00%
> compact_success 0 0 0.00%
> compact_daemon_wake 0 0 0.00%
> htlb_buddy_alloc_success 0 0 0.00%
> htlb_buddy_alloc_fail 0 0 0.00%
> unevictable_pgs_culled 0 0 0.00%
> unevictable_pgs_scanned 0 0 0.00%
> unevictable_pgs_rescued 0 0 0.00%
> unevictable_pgs_mlocked 0 0 0.00%
> unevictable_pgs_munlocked 0 0 0.00%
> unevictable_pgs_cleared 0 0 0.00%
> unevictable_pgs_stranded 0 0 0.00%
> thp_fault_alloc 22731 21604 95.04%
> thp_fault_fallback 0 0 0.00%
> thp_collapse_alloc 1 0 0.00%
> thp_collapse_alloc_failed 0 0 0.00%
> thp_split_page 0 0 0.00%
> thp_split_page_failed 0 0 0.00%
> thp_deferred_split_page 22731 21604 95.04%
> thp_split_pmd 0 0 0.00%
> thp_zero_page_alloc 0 0 0.00%
> thp_zero_page_alloc_failed 0 0 0.00%
> balloon_inflate 0 0 0.00%
> balloon_deflate 0 0 0.00%
> balloon_migrate 0 0 0.00%
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [mm] 5c0a85fad9: unixbench.score -6.3% regression
@ 2016-06-17 19:26 ` Huang, Ying
0 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2016-06-17 19:26 UTC (permalink / raw)
To: lkp
[-- Attachment #1: Type: text/plain, Size: 18205 bytes --]
Minchan Kim <minchan@kernel.org> writes:
> On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
>> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> >>
>> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
>> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
>> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
>> >> >> > >
>> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> >> >> > > >
>> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> >> >> > > >>>
>> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> >> >> > > >>>
>> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >> >> > > >>>
>> >> >> > > >>> in testcase: unixbench
>> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>> Details are as below:
>> >> >> > > >>> -------------------------------------------------------------------------------------------------->
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>> =========================================================================================
>> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> >> >> > > >>>
>> >> >> > > >>> commit:
>> >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>> >> >> > > >>>
>> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
>> >> >> > > >>> ---------------- --------------------------
>> >> >> > > >>> fail:runs %reproduction fail:runs
>> >> >> > > >>> | | |
>> >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >> >> > > >>> %stddev %change %stddev
>> >> >> > > >>> \ | \
>> >> >> > > >>> 14321 . 0% -6.3% 13425 . 0% unixbench.score
>> >> >> > > >>> 1996897 . 0% -6.1% 1874635 . 0% unixbench.time.involuntary_context_switches
>> >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08 . 0% unixbench.time.minor_page_faults
>> >> >> > > >>> 758.65 . 0% -3.0% 735.86 . 0% unixbench.time.system_time
>> >> >> > > >>> 387.66 . 0% +5.4% 408.49 . 0% unixbench.time.user_time
>> >> >> > > >>> 5950278 . 0% -6.2% 5583456 . 0% unixbench.time.voluntary_context_switches
>> >> >> > > >>
>> >> >> > > >> That's weird.
>> >> >> > > >>
>> >> >> > > >> I don't understand why the change would reduce number or minor faults.
>> >> >> > > >> It should stay the same on x86-64. Rise of user_time is puzzling too.
>> >> >> > > >
>> >> >> > > > unixbench runs in fixed time mode. That is, the total time to run
>> >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
>> >> >> > > > change may reflect only the work done.
>> >> >> > > >
>> >> >> > > >> Hm. Is reproducible? Across reboot?
>> >> >> > > >
>> >> >> > >
>> >> >> > > And FYI, there is no swap setup for test, all root file system including
>> >> >> > > benchmark files are in tmpfs, so no real page reclaim will be
>> >> >> > > triggered. But it appears that active file cache reduced after the
>> >> >> > > commit.
>> >> >> > >
>> >> >> > > 111331 . 1% -13.3% 96503 . 0% meminfo.Active
>> >> >> > > 27603 . 1% -43.9% 15486 . 0% meminfo.Active(file)
>> >> >> > >
>> >> >> > > I think this is the expected behavior of the commit?
>> >> >> >
>> >> >> > Yes, it's expected.
>> >> >> >
>> >> >> > After the change faularound would produce old pte. It means there's more
>> >> >> > chance for these pages to be on inactive lru, unless somebody actually
>> >> >> > touch them and flip accessed bit.
>> >> >>
>> >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM shouldn't scan
>> >> >> anonymous LRU list on swapless system so I really wonder why active file
>> >> >> LRU is shrunk.
>> >> >
>> >> > Hm. Good point. I don't why we have anything on file lru if there's no
>> >> > filesystems except tmpfs.
>> >> >
>> >> > Ying, how do you get stuff to the tmpfs?
>> >>
>> >> We put root file system and benchmark into a set of compressed cpio
>> >> archive, then concatenate them into one initrd, and finally kernel use
>> >> that initrd as initramfs.
>> >
>> > I see.
>> >
>> > Could you share your 4 full vmstat(/proc/vmstat) files?
>> >
>> > old:
>> >
>> > cat /proc/vmstat > before.old.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.old.vmstat
>> >
>> > new:
>> >
>> > cat /proc/vmstat > before.new.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.new.vmstat
>> >
>> > IOW, I want to see stats related to reclaim.
>>
>> Hi,
>>
>> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
>> bad commit (fbc-proc-vmstat.gz) are attached with the email.
>>
>> The contents of the file is more than the vmstat before and after
>> benchmark running, but are sampled every 1 seconds. Every sample begin
>> with "time: <time>". You can check the first and last samples. The
>> first /proc/vmstat capturing is started at the same time of the
>> benchmark, so it is not exactly the vmstat before the benchmark running.
>>
>
> Thanks for the testing!
>
> nr_active_file was shrunk 48% but the vaule itself is not huge so
> I don't think it affects performance a lot.
>
> There was no reclaim activity for testing. :(
>
> pgfault, 6% reduced. Given that, pgalloc/free reduced 6%, too
> because unixbench was time fixed mode and 6% regressed so no
> doubt.
>
> No interesting data.
>
> It seems you tested it with THP, maybe always mode?
Yes. With following in kconfig.
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
> I'm so sorry but could you test it with disabling CONFIG_TRANSPARENT_HUGEPAGE=n
> again? it might you already did.
> Is it still 6% regressed with disabling THP?
Yes. I disabled THP via
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
The regression is the same as before.
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never
commit:
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
5c0a85fad949212b3e059692deecdeed74ae7ec7
4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
---------------- --------------------------
%stddev %change %stddev
\ | \
14332 ± 0% -6.2% 13438 ± 0% unixbench.score
6662206 ± 0% -6.2% 6252260 ± 0% unixbench.time.involuntary_context_switches
5.734e+08 ± 0% -6.2% 5.376e+08 ± 0% unixbench.time.minor_page_faults
2527 ± 0% -3.2% 2446 ± 0% unixbench.time.system_time
1291 ± 0% +5.4% 1361 ± 0% unixbench.time.user_time
19875455 ± 0% -6.3% 18622488 ± 0% unixbench.time.voluntary_context_switches
6570355 ± 0% -11.9% 5787517 ± 0% cpuidle.C1-HSW.usage
17257 ± 34% -59.1% 7055 ± 7% latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
5976 ± 0% -43.0% 3404 ± 0% proc-vmstat.nr_active_file
45729 ± 1% -22.5% 35439 ± 1% meminfo.Active
23905 ± 0% -43.0% 13619 ± 0% meminfo.Active(file)
8465 ± 3% -29.8% 5940 ± 3% slabinfo.pid.active_objs
8476 ± 3% -29.9% 5940 ± 3% slabinfo.pid.num_objs
3.46 ± 0% +12.5% 3.89 ± 0% turbostat.CPU%c3
67.09 ± 0% -2.1% 65.65 ± 0% turbostat.PkgWatt
96090 ± 0% -5.8% 90479 ± 0% vmstat.system.cs
9083 ± 0% -2.7% 8833 ± 0% vmstat.system.in
467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.MIN_vruntime.avg
7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.MIN_vruntime.max
1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.MIN_vruntime.stddev
467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.max_vruntime.avg
7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.max_vruntime.max
1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.max_vruntime.stddev
-10724 ± -7% -12.0% -9433 ± -3% sched_debug.cfs_rq:/.spread0.avg
-17721 ± -4% -9.8% -15978 ± -2% sched_debug.cfs_rq:/.spread0.min
90355 ± 9% +14.1% 103099 ± 5% sched_debug.cpu.avg_idle.min
0.12 ± 35% +325.0% 0.52 ± 46% sched_debug.cpu.cpu_load[0].min
21913 ± 2% +29.1% 28288 ± 14% sched_debug.cpu.curr->pid.avg
49953 ± 3% +30.2% 65038 ± 0% sched_debug.cpu.curr->pid.max
23062 ± 2% +30.1% 29996 ± 4% sched_debug.cpu.curr->pid.stddev
274.39 ± 5% -10.2% 246.27 ± 3% sched_debug.cpu.nr_uninterruptible.max
242.73 ± 4% -13.5% 209.90 ± 2% sched_debug.cpu.nr_uninterruptible.stddev
Best Regards,
Huang, Ying
> nr_free_pages -6663 -6461 96.97%
> nr_alloc_batch 2594 4013 154.70%
> nr_inactive_anon 112 112 100.00%
> nr_active_anon 2536 2159 85.13%
> nr_inactive_file -567 -227 40.04%
> nr_active_file 648 315 48.61%
> nr_unevictable 0 0 0.00%
> nr_mlock 0 0 0.00%
> nr_anon_pages 2634 2161 82.04%
> nr_mapped 511 530 103.72%
> nr_file_pages 207 215 103.86%
> nr_dirty -7 -6 85.71%
> nr_writeback 0 0 0.00%
> nr_slab_reclaimable 158 328 207.59%
> nr_slab_unreclaimable 2208 2115 95.79%
> nr_page_table_pages 268 247 92.16%
> nr_kernel_stack 143 80 55.94%
> nr_unstable 1 1 100.00%
> nr_bounce 0 0 0.00%
> nr_vmscan_write 0 0 0.00%
> nr_vmscan_immediate_reclaim 0 0 0.00%
> nr_writeback_temp 0 0 0.00%
> nr_isolated_anon 0 0 0.00%
> nr_isolated_file 0 0 0.00%
> nr_shmem 131 131 100.00%
> nr_dirtied 67 78 116.42%
> nr_written 74 84 113.51%
> nr_pages_scanned 0 0 0.00%
> numa_hit 483752446 453696304 93.79%
> numa_miss 0 0 0.00%
> numa_foreign 0 0 0.00%
> numa_interleave 0 0 0.00%
> numa_local 483752445 453696304 93.79%
> numa_other 1 0 0.00%
> workingset_refault 0 0 0.00%
> workingset_activate 0 0 0.00%
> workingset_nodereclaim 0 0 0.00%
> nr_anon_transparent_hugepages 1 0 0.00%
> nr_free_cma 0 0 0.00%
> nr_dirty_threshold -1316 -1274 96.81%
> nr_dirty_background_threshold -658 -637 96.81%
> pgpgin 0 0 0.00%
> pgpgout 0 0 0.00%
> pswpin 0 0 0.00%
> pswpout 0 0 0.00%
> pgalloc_dma 0 0 0.00%
> pgalloc_dma32 60130977 56323630 93.67%
> pgalloc_normal 457203182 428863437 93.80%
> pgalloc_movable 0 0 0.00%
> pgfree 517327743 485181251 93.79%
> pgactivate 2059556 1930950 93.76%
> pgdeactivate 0 0 0.00%
> pgfault 572723351 537107146 93.78%
> pgmajfault 0 0 0.00%
> pglazyfreed 0 0 0.00%
> pgrefill_dma 0 0 0.00%
> pgrefill_dma32 0 0 0.00%
> pgrefill_normal 0 0 0.00%
> pgrefill_movable 0 0 0.00%
> pgsteal_kswapd_dma 0 0 0.00%
> pgsteal_kswapd_dma32 0 0 0.00%
> pgsteal_kswapd_normal 0 0 0.00%
> pgsteal_kswapd_movable 0 0 0.00%
> pgsteal_direct_dma 0 0 0.00%
> pgsteal_direct_dma32 0 0 0.00%
> pgsteal_direct_normal 0 0 0.00%
> pgsteal_direct_movable 0 0 0.00%
> pgscan_kswapd_dma 0 0 0.00%
> pgscan_kswapd_dma32 0 0 0.00%
> pgscan_kswapd_normal 0 0 0.00%
> pgscan_kswapd_movable 0 0 0.00%
> pgscan_direct_dma 0 0 0.00%
> pgscan_direct_dma32 0 0 0.00%
> pgscan_direct_normal 0 0 0.00%
> pgscan_direct_movable 0 0 0.00%
> pgscan_direct_throttle 0 0 0.00%
> zone_reclaim_failed 0 0 0.00%
> pginodesteal 0 0 0.00%
> slabs_scanned 0 0 0.00%
> kswapd_inodesteal 0 0 0.00%
> kswapd_low_wmark_hit_quickly 0 0 0.00%
> kswapd_high_wmark_hit_quickly 0 0 0.00%
> pageoutrun 0 0 0.00%
> allocstall 0 0 0.00%
> pgrotated 0 0 0.00%
> drop_pagecache 0 0 0.00%
> drop_slab 0 0 0.00%
> numa_pte_updates 0 0 0.00%
> numa_huge_pte_updates 0 0 0.00%
> numa_hint_faults 0 0 0.00%
> numa_hint_faults_local 0 0 0.00%
> numa_pages_migrated 0 0 0.00%
> pgmigrate_success 0 0 0.00%
> pgmigrate_fail 0 0 0.00%
> compact_migrate_scanned 0 0 0.00%
> compact_free_scanned 0 0 0.00%
> compact_isolated 0 0 0.00%
> compact_stall 0 0 0.00%
> compact_fail 0 0 0.00%
> compact_success 0 0 0.00%
> compact_daemon_wake 0 0 0.00%
> htlb_buddy_alloc_success 0 0 0.00%
> htlb_buddy_alloc_fail 0 0 0.00%
> unevictable_pgs_culled 0 0 0.00%
> unevictable_pgs_scanned 0 0 0.00%
> unevictable_pgs_rescued 0 0 0.00%
> unevictable_pgs_mlocked 0 0 0.00%
> unevictable_pgs_munlocked 0 0 0.00%
> unevictable_pgs_cleared 0 0 0.00%
> unevictable_pgs_stranded 0 0 0.00%
> thp_fault_alloc 22731 21604 95.04%
> thp_fault_fallback 0 0 0.00%
> thp_collapse_alloc 1 0 0.00%
> thp_collapse_alloc_failed 0 0 0.00%
> thp_split_page 0 0 0.00%
> thp_split_page_failed 0 0 0.00%
> thp_deferred_split_page 22731 21604 95.04%
> thp_split_pmd 0 0 0.00%
> thp_zero_page_alloc 0 0 0.00%
> thp_zero_page_alloc_failed 0 0 0.00%
> balloon_inflate 0 0 0.00%
> balloon_deflate 0 0 0.00%
> balloon_migrate 0 0 0.00%
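The table above is a counter-by-counter ratio of two /proc/vmstat snapshots (new value as a percentage of the old). A comparison like this can be produced with a short script along the following lines; it is a sketch, and the function names and file contents are illustrative, not part of the thread:

```python
# Sketch: compare two /proc/vmstat-style snapshots the way the table above does.

def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def vmstat_ratios(old_text, new_text):
    """Return {counter: (old, new, new/old percent)} for counters in both snapshots."""
    old, new = parse_vmstat(old_text), parse_vmstat(new_text)
    out = {}
    for name, o in old.items():
        if name in new:
            n = new[name]
            pct = (n / o * 100.0) if o else 0.0  # table prints 0.00% when old is 0
            out[name] = (o, n, pct)
    return out

if __name__ == "__main__":
    old = "pgfault 572723351\nnr_active_file 648\n"
    new = "pgfault 537107146\nnr_active_file 315\n"
    for name, (o, n, pct) in vmstat_ratios(old, new).items():
        print(f"{name:20s} {o:12d} {n:12d} {pct:7.2f}%")
```

With snapshots taken before and after each benchmark run (one per kernel), the per-counter deltas rather than raw values would normally be compared; the sketch only reproduces the ratio column shown above.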
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
2016-06-17 19:26 ` Huang, Ying
@ 2016-06-20 0:06 ` Minchan Kim
-1 siblings, 0 replies; 46+ messages in thread
From: Minchan Kim @ 2016-06-20 0:06 UTC (permalink / raw)
To: Huang, Ying
Cc: Minchan Kim, Kirill A. Shutemov, Kirill A. Shutemov,
Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
Vinayak Menon, Mel Gorman, Andrew Morton, lkp, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, x86
On Fri, Jun 17, 2016 at 12:26:51PM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >>
> >> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
> >> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> >> >>
> >> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> >> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> >> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> >> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
> >> >> >> > >
> >> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >> >> >> > > >
> >> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >> >> >> > > >>>
> >> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >> >> >> > > >>>
> >> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> >> >> > > >>>
> >> >> >> > > >>> in testcase: unixbench
> >> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >> >> >> > > >>>
> >> >> >> > > >>>
> >> >> >> > > >>> Details are as below:
> >> >> >> > > >>> -------------------------------------------------------------------------------------------------->
> >> >> >> > > >>>
> >> >> >> > > >>>
> >> >> >> > > >>> =========================================================================================
> >> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >> >> >> > > >>> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >> >> >> > > >>>
> >> >> >> > > >>> commit:
> >> >> >> > > >>> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >> >> >> > > >>> 5c0a85fad949212b3e059692deecdeed74ae7ec7
> >> >> >> > > >>>
> >> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> >> >> >> > > >>> ---------------- --------------------------
> >> >> >> > > >>> fail:runs %reproduction fail:runs
> >> >> >> > > >>> | | |
> >> >> >> > > >>> 3:4 -75% :4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >> >> >> > > >>> %stddev %change %stddev
> >> >> >> > > >>> \ | \
> >> >> >> > > >>> 14321 ± 0% -6.3% 13425 ± 0% unixbench.score
> >> >> >> > > >>> 1996897 ± 0% -6.1% 1874635 ± 0% unixbench.time.involuntary_context_switches
> >> >> >> > > >>> 1.721e+08 ± 0% -6.2% 1.613e+08 ± 0% unixbench.time.minor_page_faults
> >> >> >> > > >>> 758.65 ± 0% -3.0% 735.86 ± 0% unixbench.time.system_time
> >> >> >> > > >>> 387.66 ± 0% +5.4% 408.49 ± 0% unixbench.time.user_time
> >> >> >> > > >>> 5950278 ± 0% -6.2% 5583456 ± 0% unixbench.time.voluntary_context_switches
> >> >> >> > > >>
> >> >> >> > > >> That's weird.
> >> >> >> > > >>
> >> >> >> > > >> I don't understand why the change would reduce the number of minor faults.
> >> >> >> > > >> It should stay the same on x86-64. The rise in user_time is puzzling too.
> >> >> >> > > >
> >> >> >> > > > unixbench runs in fixed time mode. That is, the total time to run
> >> >> >> > > > unixbench is fixed, but the work done varies. So the minor_page_faults
> >> >> >> > > > change may reflect only the work done.
> >> >> >> > > >
> >> >> >> > > >> Hm. Is it reproducible? Across reboots?
> >> >> >> > > >
> >> >> >> > >
> >> >> >> > > And FYI, there is no swap set up for the test; the whole root file system,
> >> >> >> > > including the benchmark files, is in tmpfs, so no real page reclaim will be
> >> >> >> > > triggered. But it appears that the active file cache shrank after the
> >> >> >> > > commit.
> >> >> >> > >
> >> >> >> > > 111331 ± 1% -13.3% 96503 ± 0% meminfo.Active
> >> >> >> > > 27603 ± 1% -43.9% 15486 ± 0% meminfo.Active(file)
> >> >> >> > >
> >> >> >> > > I think this is the expected behavior of the commit?
> >> >> >> >
> >> >> >> > Yes, it's expected.
> >> >> >> >
> >> >> >> > After the change, faultaround produces old PTEs. That means these pages have
> >> >> >> > a higher chance of staying on the inactive LRU, unless somebody actually
> >> >> >> > touches them and flips the accessed bit.
> >> >> >>
> >> >> >> Hmm, tmpfs pages should be on the anonymous LRU list, and the VM shouldn't
> >> >> >> scan the anonymous LRU list on a swapless system, so I really wonder why the
> >> >> >> active file LRU shrank.
> >> >> >
> >> >> > Hm. Good point. I don't know why we would have anything on the file LRU if
> >> >> > there are no filesystems except tmpfs.
> >> >> >
> >> >> > Ying, how do you get stuff to the tmpfs?
> >> >>
> >> >> We put the root file system and the benchmark into a set of compressed cpio
> >> >> archives, then concatenate them into one initrd, and finally the kernel uses
> >> >> that initrd as its initramfs.
> >> >
> >> > I see.
> >> >
> >> > Could you share your 4 full vmstat(/proc/vmstat) files?
> >> >
> >> > old:
> >> >
> >> > cat /proc/vmstat > before.old.vmstat
> >> > do benchmark
> >> > cat /proc/vmstat > after.old.vmstat
> >> >
> >> > new:
> >> >
> >> > cat /proc/vmstat > before.new.vmstat
> >> > do benchmark
> >> > cat /proc/vmstat > after.new.vmstat
> >> >
> >> > IOW, I want to see stats related to reclaim.
> >>
> >> Hi,
> >>
> >> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
> >> bad commit (fbc-proc-vmstat.gz) are attached with the email.
> >>
> >> The files contain more than just the vmstat before and after the benchmark
> >> run: /proc/vmstat is sampled every second, and every sample begins
> >> with "time: <time>". You can check the first and last samples. The
> >> first /proc/vmstat capture starts at the same time as the
> >> benchmark, so it is not exactly the vmstat from before the benchmark run.
> >>
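Given the capture format described above (samples concatenated in one file, each introduced by a "time: <time>" line), pulling out the first and last samples could look like the sketch below; the helper name and the demo data are mine, not from the thread:

```python
# Sketch: split a periodically-sampled /proc/vmstat capture into samples.
# Each sample is introduced by a "time: <time>" line, per the description above;
# the first and last samples approximate before/after-benchmark snapshots.

def split_samples(text):
    """Return a list of (time, body) tuples, one per 'time:'-delimited sample."""
    samples = []
    time, body = None, []
    for line in text.splitlines():
        if line.startswith("time:"):
            if time is not None:
                samples.append((time, "\n".join(body)))
            time, body = line.split(":", 1)[1].strip(), []
        elif time is not None:
            body.append(line)
    if time is not None:
        samples.append((time, "\n".join(body)))
    return samples

if __name__ == "__main__":
    capture = (
        "time: 100\npgfault 10\n"
        "time: 101\npgfault 20\n"
        "time: 102\npgfault 30\n"
    )
    samples = split_samples(capture)
    print(samples[0])    # ('100', 'pgfault 10')
    print(samples[-1])   # ('102', 'pgfault 30')
```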
> >
> > Thanks for the testing!
> >
> > nr_active_file shrank by 48%, but the value itself is not huge, so
> > I don't think it affects performance much.
> >
> > There was no reclaim activity during the test. :(
> >
> > pgfault is reduced by 6%. Given that, pgalloc/pgfree are reduced by 6%, too;
> > since unixbench runs in fixed-time mode and regressed by 6%, that is no
> > surprise.
> >
> > No interesting data.
> >
> > It seems you tested with THP enabled, maybe in always mode?
>
> Yes. With the following in the kconfig.
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
>
> > I'm sorry, but could you test it again with CONFIG_TRANSPARENT_HUGEPAGE=n?
> > You might have done so already.
> > Is it still regressed by 6% with THP disabled?
>
> Yes. I disabled THP via
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
>
> The regression is the same as before.
Still a 6% user_time regression, and there is no difference from the previous
experiment with THP enabled, so I agree the regression is caused purely by the
CPU setting the accessed bit, which is rather surprising to me.
I don't know how the unixbench shell-script test touches memory,
but it should be a per-page overhead, and a 6% regression from that is too heavy.
Anyway, at the least, it would be better to notify the x86 maintainers.
Thanks for the test!
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
> gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never
>
> commit:
> 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> 5c0a85fad949212b3e059692deecdeed74ae7ec7
>
> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
> ---------------- --------------------------
> %stddev %change %stddev
> \ | \
> 14332 ± 0% -6.2% 13438 ± 0% unixbench.score
> 6662206 ± 0% -6.2% 6252260 ± 0% unixbench.time.involuntary_context_switches
> 5.734e+08 ± 0% -6.2% 5.376e+08 ± 0% unixbench.time.minor_page_faults
> 2527 ± 0% -3.2% 2446 ± 0% unixbench.time.system_time
> 1291 ± 0% +5.4% 1361 ± 0% unixbench.time.user_time
> 19875455 ± 0% -6.3% 18622488 ± 0% unixbench.time.voluntary_context_switches
> 6570355 ± 0% -11.9% 5787517 ± 0% cpuidle.C1-HSW.usage
> 17257 ± 34% -59.1% 7055 ± 7% latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
> 5976 ± 0% -43.0% 3404 ± 0% proc-vmstat.nr_active_file
> 45729 ± 1% -22.5% 35439 ± 1% meminfo.Active
> 23905 ± 0% -43.0% 13619 ± 0% meminfo.Active(file)
> 8465 ± 3% -29.8% 5940 ± 3% slabinfo.pid.active_objs
> 8476 ± 3% -29.9% 5940 ± 3% slabinfo.pid.num_objs
> 3.46 ± 0% +12.5% 3.89 ± 0% turbostat.CPU%c3
> 67.09 ± 0% -2.1% 65.65 ± 0% turbostat.PkgWatt
> 96090 ± 0% -5.8% 90479 ± 0% vmstat.system.cs
> 9083 ± 0% -2.7% 8833 ± 0% vmstat.system.in
> 467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.MIN_vruntime.avg
> 7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.MIN_vruntime.max
> 1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.MIN_vruntime.stddev
> 467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.max_vruntime.avg
> 7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.max_vruntime.max
> 1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.max_vruntime.stddev
> -10724 ± -7% -12.0% -9433 ± -3% sched_debug.cfs_rq:/.spread0.avg
> -17721 ± -4% -9.8% -15978 ± -2% sched_debug.cfs_rq:/.spread0.min
> 90355 ± 9% +14.1% 103099 ± 5% sched_debug.cpu.avg_idle.min
> 0.12 ± 35% +325.0% 0.52 ± 46% sched_debug.cpu.cpu_load[0].min
> 21913 ± 2% +29.1% 28288 ± 14% sched_debug.cpu.curr->pid.avg
> 49953 ± 3% +30.2% 65038 ± 0% sched_debug.cpu.curr->pid.max
> 23062 ± 2% +30.1% 29996 ± 4% sched_debug.cpu.curr->pid.stddev
> 274.39 ± 5% -10.2% 246.27 ± 3% sched_debug.cpu.nr_uninterruptible.max
> 242.73 ± 4% -13.5% 209.90 ± 2% sched_debug.cpu.nr_uninterruptible.stddev
>
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2016-06-20 0:06 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-06 2:27 [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression kernel test robot
2016-06-06 2:27 ` kernel test robot
2016-06-06 9:51 ` [lkp] " Kirill A. Shutemov
2016-06-06 9:51 ` Kirill A. Shutemov
2016-06-08 7:21 ` [LKP] [lkp] " Huang, Ying
2016-06-08 7:21 ` Huang, Ying
2016-06-08 8:41 ` [LKP] [lkp] " Huang, Ying
2016-06-08 8:41 ` Huang, Ying
2016-06-08 8:58 ` [LKP] [lkp] " Kirill A. Shutemov
2016-06-08 8:58 ` Kirill A. Shutemov
2016-06-12 0:49 ` [LKP] [lkp] " Huang, Ying
2016-06-12 0:49 ` Huang, Ying
2016-06-12 1:02 ` [LKP] [lkp] " Linus Torvalds
2016-06-12 1:02 ` Linus Torvalds
2016-06-13 9:02 ` [LKP] [lkp] " Huang, Ying
2016-06-13 9:02 ` Huang, Ying
2016-06-14 13:38 ` [LKP] [lkp] " Minchan Kim
2016-06-14 13:38 ` Minchan Kim
2016-06-15 23:42 ` [LKP] [lkp] " Huang, Ying
2016-06-15 23:42 ` Huang, Ying
2016-06-13 12:52 ` [LKP] [lkp] " Kirill A. Shutemov
2016-06-13 12:52 ` Kirill A. Shutemov
2016-06-14 6:11 ` [LKP] [lkp] " Linus Torvalds
2016-06-14 6:11 ` Linus Torvalds
2016-06-14 8:26 ` [LKP] [lkp] " Kirill A. Shutemov
2016-06-14 8:26 ` Kirill A. Shutemov
2016-06-14 16:07 ` [LKP] [lkp] " Rik van Riel
2016-06-14 16:07 ` Rik van Riel
2016-06-14 14:03 ` [LKP] [lkp] " Christian Borntraeger
2016-06-14 14:03 ` Christian Borntraeger
2016-06-14 8:57 ` [LKP] [lkp] " Minchan Kim
2016-06-14 8:57 ` Minchan Kim
2016-06-14 14:34 ` [LKP] [lkp] " Kirill A. Shutemov
2016-06-14 14:34 ` Kirill A. Shutemov
2016-06-15 23:52 ` [LKP] [lkp] " Huang, Ying
2016-06-15 23:52 ` Huang, Ying
2016-06-16 0:13 ` [LKP] [lkp] " Minchan Kim
2016-06-16 0:13 ` Minchan Kim
2016-06-16 22:27 ` [LKP] [lkp] " Huang, Ying
2016-06-16 22:27 ` Huang, Ying
2016-06-17 5:41 ` [LKP] [lkp] " Minchan Kim
2016-06-17 5:41 ` Minchan Kim
2016-06-17 19:26 ` [LKP] [lkp] " Huang, Ying
2016-06-17 19:26 ` Huang, Ying
2016-06-20 0:06 ` [LKP] [lkp] " Minchan Kim
2016-06-20 0:06 ` Minchan Kim