linux-kernel.vger.kernel.org archive mirror
* [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
@ 2016-06-06  2:27 kernel test robot
  2016-06-06  9:51 ` Kirill A. Shutemov
  0 siblings, 1 reply; 23+ messages in thread
From: kernel test robot @ 2016-06-06  2:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Michal Hocko, Minchan Kim, Rik van Riel,
	Mel Gorman, Michal Hocko, Vinayak Menon, Andrew Morton, LKML,
	lkp

[-- Attachment #1: Type: text/plain, Size: 4496 bytes --]


FYI, we noticed a -6.3% regression of unixbench.score due to commit:

commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

in testcase: unixbench
on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8


Details are as below:
-------------------------------------------------------------------------------------------------->


=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
  gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench

commit: 
  4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
  5c0a85fad949212b3e059692deecdeed74ae7ec7

4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
---------------- -------------------------- 
       fail:runs  %reproduction    fail:runs
           |             |             |    
          3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
         %stddev     %change         %stddev
             \          |                \  
     14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
   1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
 1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
    758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
    387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
   5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
   1960642 ±  0%     -11.4%    1737753 ±  0%  cpuidle.C1-HSW.usage
      5851 ±  0%     -43.8%       3286 ±  1%  proc-vmstat.nr_active_file
     46185 ±  0%     -21.2%      36385 ±  2%  meminfo.Active
     23404 ±  0%     -43.8%      13147 ±  1%  meminfo.Active(file)
      4109 ±  5%     -19.6%       3302 ±  4%  slabinfo.pid.active_objs
      4109 ±  5%     -19.6%       3302 ±  4%  slabinfo.pid.num_objs
     94603 ±  0%      -5.7%      89247 ±  0%  vmstat.system.cs
      8976 ±  0%      -2.5%       8754 ±  0%  vmstat.system.in
      3.38 ±  2%     +11.8%       3.77 ±  0%  turbostat.CPU%c3
      0.24 ±101%     -86.3%       0.03 ± 54%  turbostat.Pkg%pc3
     66.53 ±  0%      -1.7%      65.41 ±  0%  turbostat.PkgWatt
      2061 ±  1%      -8.5%       1886 ±  0%  sched_debug.cfs_rq:/.exec_clock.stddev
    737154 ±  5%     +10.8%     817107 ±  3%  sched_debug.cpu.avg_idle.max
    133057 ±  5%     -33.2%      88864 ± 11%  sched_debug.cpu.avg_idle.min
    181562 ±  8%     +15.9%     210434 ±  3%  sched_debug.cpu.avg_idle.stddev
      0.97 ±  7%     +19.0%       1.16 ±  8%  sched_debug.cpu.clock.stddev
      0.97 ±  7%     +19.0%       1.16 ±  8%  sched_debug.cpu.clock_task.stddev
    248.06 ± 11%     +31.0%     324.94 ±  8%  sched_debug.cpu.cpu_load[1].max
     55.65 ± 14%     +28.1%      71.30 ±  8%  sched_debug.cpu.cpu_load[1].stddev
    233.38 ± 10%     +34.4%     313.56 ±  8%  sched_debug.cpu.cpu_load[2].max
     49.79 ± 15%     +35.6%      67.50 ±  9%  sched_debug.cpu.cpu_load[2].stddev
    233.25 ± 12%     +29.9%     302.94 ±  6%  sched_debug.cpu.cpu_load[3].max
     46.56 ±  8%     +12.2%      52.25 ±  6%  sched_debug.cpu.cpu_load[3].min
     48.51 ± 15%     +31.4%      63.76 ±  7%  sched_debug.cpu.cpu_load[3].stddev
    238.44 ± 12%     +19.0%     283.69 ±  3%  sched_debug.cpu.cpu_load[4].max
     49.56 ±  9%     +13.4%      56.19 ±  4%  sched_debug.cpu.cpu_load[4].min
     48.22 ± 13%     +20.1%      57.93 ±  5%  sched_debug.cpu.cpu_load[4].stddev
     14792 ± 30%     +71.9%      25424 ± 17%  sched_debug.cpu.curr->pid.avg
     42862 ±  1%     +42.6%      61121 ±  0%  sched_debug.cpu.curr->pid.max
     19466 ± 10%     +35.4%      26351 ±  9%  sched_debug.cpu.curr->pid.stddev
      1067 ±  6%     -14.9%     909.35 ±  4%  sched_debug.cpu.ttwu_local.stddev



To reproduce:

        git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
        cd lkp-tests
        bin/lkp install job.yaml  # job file is attached in this email
        bin/lkp run     job.yaml


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Xiaolong

[-- Attachment #2: job.yaml --]
[-- Type: text/plain, Size: 3497 bytes --]

---
LKP_SERVER: inn
LKP_CGI_PORT: 80
LKP_CIFS_PORT: 139
testcase: unixbench
default-monitors:
  wait: activate-monitor
  kmsg: 
  uptime: 
  iostat: 
  heartbeat: 
  vmstat: 
  numa-numastat: 
  numa-vmstat: 
  numa-meminfo: 
  proc-vmstat: 
  proc-stat:
    interval: 10
  meminfo: 
  slabinfo: 
  interrupts: 
  lock_stat: 
  latency_stats: 
  softirqs: 
  bdi_dev_mapping: 
  diskstats: 
  nfsstat: 
  cpuidle: 
  cpufreq-stats: 
  turbostat: 
  pmeter: 
  sched_debug:
    interval: 60
cpufreq_governor: performance
NFS_HANG_DF_TIMEOUT: 200
NFS_HANG_CHECK_INTERVAL: 900
default-watchdogs:
  oom-killer: 
  watchdog: 
  nfs-hang: 
commit: 5c0a85fad949212b3e059692deecdeed74ae7ec7
model: Haswell High-end Desktop
nr_cpu: 16
memory: 16G
hdd_partitions: 
swap_partitions: 
rootfs_partition: 
description: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
category: benchmark
nr_task: 1
unixbench:
  test: shell8
queue: bisect
testbox: lituya
tbox_group: lituya
kconfig: x86_64-rhel
enqueue_time: 2016-06-04 03:26:52.444586006 +08:00
compiler: gcc-4.9
rootfs: debian-x86_64-2015-02-07.cgz
id: 101932ca34f6ff20613b88f6bed66fbc4afdfb95
user: lkp
head_commit: 73aa85b30706f742655a10c967c033b56c731aff
base_commit: 1a695a905c18548062509178b98bc91e67510864
branch: internal-devel/devel-hourly-2016060108-internal
result_root: "/result/unixbench/performance-1-shell8/lituya/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/1"
job_file: "/lkp/scheduled/lituya/bisect_unixbench-performance-1-shell8-debian-x86_64-2015-02-07.cgz-x86_64-rhel-5c0a85fad949212b3e059692deecdeed74ae7ec7-20160604-57400-1fovod8-1.yaml"
max_uptime: 1032.28
initrd: "/osimage/debian/debian-x86_64-2015-02-07.cgz"
bootloader_append:
- root=/dev/ram0
- user=lkp
- job=/lkp/scheduled/lituya/bisect_unixbench-performance-1-shell8-debian-x86_64-2015-02-07.cgz-x86_64-rhel-5c0a85fad949212b3e059692deecdeed74ae7ec7-20160604-57400-1fovod8-1.yaml
- ARCH=x86_64
- kconfig=x86_64-rhel
- branch=internal-devel/devel-hourly-2016060108-internal
- commit=5c0a85fad949212b3e059692deecdeed74ae7ec7
- BOOT_IMAGE=/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/vmlinuz-4.6.0-06629-g5c0a85f
- max_uptime=1032
- RESULT_ROOT=/result/unixbench/performance-1-shell8/lituya/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/1
- LKP_SERVER=inn
- |2-


  earlyprintk=ttyS0,115200 systemd.log_level=err
  debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100
  panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0
  console=ttyS0,115200 console=tty0 vga=normal

  rw
lkp_initrd: "/lkp/lkp/lkp-x86_64.cgz"
modules_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/modules.cgz"
bm_initrd: "/osimage/deps/debian-x86_64-2015-02-07.cgz/lkp.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/run-ipconfig.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/turbostat.cgz,/lkp/benchmarks/turbostat.cgz,/lkp/benchmarks/unixbench.cgz"
linux_headers_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/linux-headers.cgz"
repeat_to: 2
kernel: "/pkg/linux/x86_64-rhel/gcc-4.9/5c0a85fad949212b3e059692deecdeed74ae7ec7/vmlinuz-4.6.0-06629-g5c0a85f"
dequeue_time: 2016-06-04 03:40:50.807274201 +08:00
job_state: finished
loadavg: 5.70 2.70 1.05 1/257 3744
start_time: '1465010834'
end_time: '1465011023'
version: "/lkp/lkp/.src-20160603-214427"

[-- Attachment #3: reproduce --]
[-- Type: text/plain, Size: 1532 bytes --]

2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu12/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu13/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu14/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu15/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor
2016-06-04 11:23:32 echo performance > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor
2016-06-04 11:23:33 ./Run shell8 -c 1
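
The sixteen per-CPU governor writes above can be collapsed into one loop. A sketch (the `set_governor` helper name and the base-directory parameter are illustrative, not part of the lkp scripts; on the real test box the base would be /sys/devices/system/cpu):

```shell
# set_governor GOV BASE: write GOV into every cpu*/cpufreq/scaling_governor
# under BASE, as the reproduce log above does one CPU at a time.
set_governor() {
    gov=$1
    base=$2
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue    # skip CPUs without cpufreq support
        echo "$gov" > "$f"
    done
}

# On the test machine this would be:
#   set_governor performance /sys/devices/system/cpu
```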

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-06  2:27 [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression kernel test robot
@ 2016-06-06  9:51 ` Kirill A. Shutemov
  2016-06-08  7:21   ` [LKP] " Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2016-06-06  9:51 UTC (permalink / raw)
  To: kernel test robot
  Cc: Linus Torvalds, Michal Hocko, Minchan Kim, Rik van Riel,
	Mel Gorman, Michal Hocko, Vinayak Menon, Andrew Morton, LKML,
	lkp

On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> 
> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> 
> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> 
> in testcase: unixbench
> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> 
> commit: 
>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>   5c0a85fad949212b3e059692deecdeed74ae7ec7
> 
> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
> ---------------- -------------------------- 
>        fail:runs  %reproduction    fail:runs
>            |             |             |    
>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>          %stddev     %change         %stddev
>              \          |                \  
>      14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
>    1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
>  1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
>     758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
>    5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches

That's weird.

I don't understand why the change would reduce the number of minor faults.
It should stay the same on x86-64. The rise in user_time is puzzling too.

Hm. Is it reproducible? Across reboots?

-- 
 Kirill A. Shutemov


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-06  9:51 ` Kirill A. Shutemov
@ 2016-06-08  7:21   ` Huang, Ying
  2016-06-08  8:41     ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2016-06-08  7:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kernel test robot, Rik van Riel, Michal Hocko, lkp, LKML,
	Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
	Andrew Morton, Linus Torvalds

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> 
>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> 
>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> 
>> in testcase: unixbench
>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> 
>> 
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>> 
>> 
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> 
>> commit: 
>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
>> 
>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
>> ---------------- -------------------------- 
>>        fail:runs  %reproduction    fail:runs
>>            |             |             |    
>>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>>          %stddev     %change         %stddev
>>              \          |                \  
>>      14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
>>    1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
>>  1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
>>     758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
>>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
>>    5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
>
> That's weird.
>
> I don't understand why the change would reduce the number of minor faults.
> It should stay the same on x86-64. The rise in user_time is puzzling too.

unixbench runs in fixed-time mode.  That is, the total time to run
unixbench is fixed, but the amount of work done varies.  So the
minor_page_faults change may reflect only the difference in work done.
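
The fixed-time behavior described above can be illustrated with a toy driver (a sketch, not the actual UnixBench harness; `fixed_time_run` is a hypothetical helper): the wall-clock budget is held constant, so any extra per-iteration cost surfaces as fewer iterations completed — and hence fewer total page faults — rather than as a longer run.

```shell
# fixed_time_run SECONDS: run units of work until the wall-clock budget
# expires, then print how many completed. A score derived from this count
# drops when each unit gets slower, even though total runtime is unchanged.
fixed_time_run() {
    budget=$1
    end=$(( $(date +%s) + budget ))
    count=0
    while [ "$(date +%s)" -lt "$end" ]; do
        /bin/true            # stand-in for one shell8 work unit
        count=$((count + 1))
    done
    echo "$count"
}
```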

> Hm. Is reproducible? Across reboot?

Yes.  LKP runs every benchmark after a reboot via kexec.  We ran the test
3 times for both the commit and its parent, and the result is quite
stable: the standard deviation in percent is near 0 across the different
runs.  Here is another comparison, with profile data.

=========================================================================================
compiler/cpufreq_governor/debug-setup/kconfig/nr_task/rootfs/tbox_group/test/testcase:
  gcc-4.9/performance/profile/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench

commit: 
  4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
  5c0a85fad949212b3e059692deecdeed74ae7ec7

4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     14056 ±  0%      -6.3%      13172 ±  0%  unixbench.score
   6464046 ±  0%      -6.1%    6071922 ±  0%  unixbench.time.involuntary_context_switches
 5.555e+08 ±  0%      -6.2%  5.211e+08 ±  0%  unixbench.time.minor_page_faults
      2537 ±  0%      -3.2%       2455 ±  0%  unixbench.time.system_time
      1284 ±  0%      +5.8%       1359 ±  0%  unixbench.time.user_time
  19192611 ±  0%      -6.2%   18010830 ±  0%  unixbench.time.voluntary_context_switches
   7709931 ±  0%     -11.0%    6860574 ±  0%  cpuidle.C1-HSW.usage
      6900 ±  1%     -43.9%       3871 ±  0%  proc-vmstat.nr_active_file
     40813 ±  1%     -77.9%       9015 ±114%  softirqs.NET_RX
    111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
     27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)
     93169 ±  0%      -5.8%      87766 ±  0%  vmstat.system.cs
     19768 ±  0%      -1.7%      19437 ±  0%  vmstat.system.in
      6.22 ±  0%     +10.3%       6.86 ±  0%  turbostat.CPU%c3
      0.02 ± 20%     -85.7%       0.00 ±141%  turbostat.Pkg%pc3
     68.99 ±  0%      -1.7%      67.84 ±  0%  turbostat.PkgWatt
      1.38 ±  5%     -42.0%       0.80 ±  5%  perf-profile.cycles-pp.page_remove_rmap.unmap_page_range.unmap_single_vma.unmap_vmas.exit_mmap
      0.83 ±  4%     +28.8%       1.07 ± 21%  perf-profile.cycles-pp.release_pages.free_pages_and_swap_cache.tlb_flush_mmu_free.tlb_finish_mmu.exit_mmap
      1.55 ±  3%     -10.6%       1.38 ±  2%  perf-profile.cycles-pp.unmap_single_vma.unmap_vmas.exit_mmap.mmput.flush_old_exec
      1.59 ±  3%      -9.8%       1.44 ±  3%  perf-profile.cycles-pp.unmap_vmas.exit_mmap.mmput.flush_old_exec.load_elf_binary
    389.00 ±  0%     +32.1%     514.00 ±  8%  slabinfo.file_lock_cache.active_objs
    389.00 ±  0%     +32.1%     514.00 ±  8%  slabinfo.file_lock_cache.num_objs
      7075 ±  3%     -17.7%       5823 ±  7%  slabinfo.pid.active_objs
      7075 ±  3%     -17.7%       5823 ±  7%  slabinfo.pid.num_objs
      0.67 ± 34%     +86.4%       1.24 ± 30%  sched_debug.cfs_rq:/.runnable_load_avg.min
     -9013 ± -1%     +14.4%     -10315 ± -9%  sched_debug.cfs_rq:/.spread0.avg
     83127 ±  5%     +16.9%      97163 ±  8%  sched_debug.cpu.avg_idle.min
     17777 ± 16%     +66.6%      29608 ± 22%  sched_debug.cpu.curr->pid.avg
     50223 ± 10%     +49.3%      74974 ±  0%  sched_debug.cpu.curr->pid.max
     22281 ± 13%     +51.8%      33816 ±  6%  sched_debug.cpu.curr->pid.stddev
    251.79 ±  5%     -13.8%     217.15 ±  5%  sched_debug.cpu.nr_uninterruptible.max
   -261.12 ± -2%     -13.4%    -226.03 ± -1%  sched_debug.cpu.nr_uninterruptible.min
    221.14 ±  3%     -14.7%     188.60 ±  1%  sched_debug.cpu.nr_uninterruptible.stddev
  1.94e+11 ±  0%      -5.8%  1.827e+11 ±  0%  perf-stat.L1-dcache-load-misses
 3.496e+12 ±  0%      -6.5%  3.268e+12 ±  0%  perf-stat.L1-dcache-loads
 2.262e+12 ±  1%      -5.5%  2.137e+12 ±  0%  perf-stat.L1-dcache-stores
 9.711e+10 ±  0%      -3.7%  9.353e+10 ±  0%  perf-stat.L1-icache-load-misses
 8.051e+08 ±  0%      -8.8%  7.343e+08 ±  1%  perf-stat.LLC-load-misses
 7.184e+10 ±  1%      -5.6%   6.78e+10 ±  0%  perf-stat.LLC-loads
 5.867e+08 ±  2%      -7.0%  5.456e+08 ±  0%  perf-stat.LLC-store-misses
 1.524e+10 ±  1%      -5.6%  1.438e+10 ±  0%  perf-stat.LLC-stores
 2.711e+12 ±  0%      -6.3%  2.539e+12 ±  0%  perf-stat.branch-instructions
 5.948e+10 ±  0%      -3.9%  5.715e+10 ±  0%  perf-stat.branch-load-misses
 2.715e+12 ±  0%      -6.4%  2.542e+12 ±  0%  perf-stat.branch-loads
 5.947e+10 ±  0%      -3.9%  5.713e+10 ±  0%  perf-stat.branch-misses
 1.448e+09 ±  0%      -9.3%  1.313e+09 ±  1%  perf-stat.cache-misses
 1.931e+11 ±  0%      -5.8%  1.818e+11 ±  0%  perf-stat.cache-references
  58882705 ±  0%      -5.8%   55467522 ±  0%  perf-stat.context-switches
  17037466 ±  0%      -6.1%   15999111 ±  0%  perf-stat.cpu-migrations
 6.732e+09 ±  1%     +90.7%  1.284e+10 ±  0%  perf-stat.dTLB-load-misses
 3.474e+12 ±  0%      -6.6%  3.245e+12 ±  0%  perf-stat.dTLB-loads
 1.215e+09 ±  0%      -5.5%  1.149e+09 ±  0%  perf-stat.dTLB-store-misses
 2.286e+12 ±  0%      -5.8%  2.153e+12 ±  0%  perf-stat.dTLB-stores
 3.511e+09 ±  0%     +20.4%  4.226e+09 ±  0%  perf-stat.iTLB-load-misses
 2.317e+09 ±  0%      -6.8%   2.16e+09 ±  0%  perf-stat.iTLB-loads
 1.343e+13 ±  0%      -6.0%  1.263e+13 ±  0%  perf-stat.instructions
 5.504e+08 ±  0%      -6.2%  5.163e+08 ±  0%  perf-stat.minor-faults
  8.09e+08 ±  1%      -9.0%   7.36e+08 ±  1%  perf-stat.node-loads
 5.932e+08 ±  0%      -8.7%  5.417e+08 ±  1%  perf-stat.node-stores
 5.504e+08 ±  0%      -6.2%  5.163e+08 ±  0%  perf-stat.page-faults

Best Regards,
Huang, Ying


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-08  7:21   ` [LKP] " Huang, Ying
@ 2016-06-08  8:41     ` Huang, Ying
  2016-06-08  8:58       ` Kirill A. Shutemov
  0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2016-06-08  8:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML,
	Linus Torvalds, Michal Hocko, Minchan Kim, Vinayak Menon,
	Mel Gorman, Andrew Morton, lkp

"Huang, Ying" <ying.huang@intel.com> writes:

> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>
>> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>>> 
>>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>>> 
>>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>> 
>>> in testcase: unixbench
>>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>>> 
>>> 
>>> Details are as below:
>>> -------------------------------------------------------------------------------------------------->
>>> 
>>> 
>>> =========================================================================================
>>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>>> 
>>> commit: 
>>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
>>> 
>>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
>>> ---------------- -------------------------- 
>>>        fail:runs  %reproduction    fail:runs
>>>            |             |             |    
>>>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>>>          %stddev     %change         %stddev
>>>              \          |                \  
>>>      14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
>>>    1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
>>>  1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
>>>     758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
>>>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
>>>    5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
>>
>> That's weird.
>>
>> I don't understand why the change would reduce the number of minor faults.
>> It should stay the same on x86-64. The rise in user_time is puzzling too.
>
> unixbench runs in fixed time mode.  That is, the total time to run
> unixbench is fixed, but the work done varies.  So the minor_page_faults
> change may reflect only the work done.
>
>> Hm. Is it reproducible? Across reboots?
>

And FYI, there is no swap set up for the test; the whole root file
system, including the benchmark files, is in tmpfs, so no real page
reclaim will be triggered.  But it appears that the active file cache
shrank after the commit.

    111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
     27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)

I think this is the expected behavior of the commit?
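
For reference, the setup described above (no swap, initrd/tmpfs-backed root) can be checked from inside the test environment with standard /proc interfaces — a sketch, not output captured from the test box; `swap_devices` is an illustrative helper:

```shell
# swap_devices: count configured swap devices. /proc/swaps always carries
# one header line, so a swap-less setup like this test box prints 0.
swap_devices() {
    awk 'NR > 1' /proc/swaps | wc -l
}
swap_devices

# Filesystem type of the root mount; on the LKP initrd setup described
# above this would be expected to report a ram-backed filesystem.
awk '$2 == "/" { print $3; exit }' /proc/mounts
```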

Best Regards,
Huang, Ying


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-08  8:41     ` Huang, Ying
@ 2016-06-08  8:58       ` Kirill A. Shutemov
  2016-06-12  0:49         ` Huang, Ying
  2016-06-14  8:57         ` Minchan Kim
  0 siblings, 2 replies; 23+ messages in thread
From: Kirill A. Shutemov @ 2016-06-08  8:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
	Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, lkp

On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> "Huang, Ying" <ying.huang@intel.com> writes:
> 
> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >
> >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >>> 
> >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >>> 
> >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >>> 
> >>> in testcase: unixbench
> >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >>> 
> >>> 
> >>> Details are as below:
> >>> -------------------------------------------------------------------------------------------------->
> >>> 
> >>> 
> >>> =========================================================================================
> >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >>> 
> >>> commit: 
> >>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
> >>> 
> >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
> >>> ---------------- -------------------------- 
> >>>        fail:runs  %reproduction    fail:runs
> >>>            |             |             |    
> >>>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >>>          %stddev     %change         %stddev
> >>>              \          |                \  
> >>>      14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
> >>>    1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
> >>>  1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
> >>>     758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
> >>>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
> >>>    5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
> >>
> >> That's weird.
> >>
> >> I don't understand why the change would reduce the number of minor faults.
> >> It should stay the same on x86-64. The rise in user_time is puzzling too.
> >
> > unixbench runs in fixed time mode.  That is, the total time to run
> > unixbench is fixed, but the work done varies.  So the minor_page_faults
> > change may reflect only the work done.
> >
> >> Hm. Is it reproducible? Across reboots?
> >
> 
> And FYI, there is no swap setup for test, all root file system including
> benchmark files are in tmpfs, so no real page reclaim will be
> triggered.  But it appears that active file cache reduced after the
> commit.
> 
>     111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
>      27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)
> 
> I think this is the expected behavior of the commit?

Yes, it's expected.

After the change, faultaround produces old ptes. That means there's a better
chance for these pages to end up on the inactive lru, unless somebody actually
touches them and flips the accessed bit.

I wonder whether this regression can be attributed to the cost of setting the
accessed bit. It looks too high, but who knows.

I don't have time to do the testing myself right now. I will put this on my
todo list.

-- 
 Kirill A. Shutemov


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-08  8:58       ` Kirill A. Shutemov
@ 2016-06-12  0:49         ` Huang, Ying
  2016-06-12  1:02           ` Linus Torvalds
  2016-06-14  8:57         ` Minchan Kim
  1 sibling, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2016-06-12  0:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds,
	Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
	Andrew Morton, lkp

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> "Huang, Ying" <ying.huang@intel.com> writes:
>> 
>> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> >
>> >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> >>> 
>> >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> >>> 
>> >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >>> 
>> >>> in testcase: unixbench
>> >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> >>> 
>> >>> 
>> >>> Details are as below:
>> >>> -------------------------------------------------------------------------------------------------->
>> >>> 
>> >>> 
>> >>> =========================================================================================
>> >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> >>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> >>> 
>> >>> commit: 
>> >>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> >>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
>> >>> 
>> >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
>> >>> ---------------- -------------------------- 
>> >>>        fail:runs  %reproduction    fail:runs
>> >>>            |             |             |    
>> >>>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >>>          %stddev     %change         %stddev
>> >>>              \          |                \  
>> >>>      14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
>> >>>    1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
>> >>>  1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
>> >>>     758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
>> >>>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
>> >>>    5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
>> >>
>> >> That's weird.
>> >>
>> >> I don't understand why the change would reduce the number of minor faults.
>> >> It should stay the same on x86-64. The rise in user_time is puzzling too.
>> >
>> > unixbench runs in fixed time mode.  That is, the total time to run
>> > unixbench is fixed, but the work done varies.  So the minor_page_faults
>> > change may reflect only the work done.
>> >
>> >> Hm. Is it reproducible? Across reboots?
>> >
>> 
>> And FYI, there is no swap set up for the test; the whole root file
>> system, including the benchmark files, is in tmpfs, so no real page
>> reclaim will be triggered.  But it appears that the active file cache
>> shrank after the commit.
>> 
>>     111331 .  1%     -13.3%      96503 .  0%  meminfo.Active
>>      27603 .  1%     -43.9%      15486 .  0%  meminfo.Active(file)
>> 
>> I think this is the expected behavior of the commit?
>
> Yes, it's expected.
>
> After the change, faultaround produces old PTEs. It means there's more
> chance for these pages to be on the inactive LRU, unless somebody
> actually touches them and flips the accessed bit.
>
> I wonder if this regression can be attributed to the cost of setting
> the accessed bit. It looks too high, but who knows.

From the perf profile, the time spent in page_fault and its child
functions is almost the same (7.85% vs 7.81%).  So the time spent in
page fault and page table operations themselves didn't change much.
So, you mean the CPU may be slower to load the page table entry into
the TLB if the accessed bit is not set?

> I don't have time to do testing myself right now. I will put this on todo
> list.

Which kind of test do you want to do?  I want to check whether I can help.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-12  0:49         ` Huang, Ying
@ 2016-06-12  1:02           ` Linus Torvalds
  2016-06-13  9:02             ` Huang, Ying
  2016-06-13 12:52             ` Kirill A. Shutemov
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2016-06-12  1:02 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kirill A. Shutemov, Rik van Riel, Michal Hocko, LKML,
	Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
	Andrew Morton, LKP

On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
>
> From the perf profile, the time spent in page_fault and its child
> functions is almost the same (7.85% vs 7.81%).  So the time spent in
> page fault and page table operations themselves didn't change much.
> So, you mean the CPU may be slower to load the page table entry into
> the TLB if the accessed bit is not set?

So the CPU does take a microfault internally when it needs to set the
accessed/dirty bit. It's not architecturally visible, but you can see
it when you do timing loops.

I've timed it at over a thousand cycles on at least some CPU's, but
that's still peanuts compared to a real page fault. It shouldn't be
*that* noticeable, ie no way it's a 6% regression on its own.

           Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-12  1:02           ` Linus Torvalds
@ 2016-06-13  9:02             ` Huang, Ying
  2016-06-14 13:38               ` Minchan Kim
  2016-06-13 12:52             ` Kirill A. Shutemov
  1 sibling, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2016-06-13  9:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Huang, Ying, Kirill A. Shutemov, Rik van Riel, Michal Hocko,
	LKML, Michal Hocko, Minchan Kim, Vinayak Menon, Mel Gorman,
	Andrew Morton, LKP

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> From the perf profile, the time spent in page_fault and its child
>> functions is almost the same (7.85% vs 7.81%).  So the time spent in
>> page fault and page table operations themselves didn't change much.
>> So, you mean the CPU may be slower to load the page table entry into
>> the TLB if the accessed bit is not set?
>
> So the CPU does take a microfault internally when it needs to set the
> accessed/dirty bit. It's not architecturally visible, but you can see
> it when you do timing loops.
>
> I've timed it at over a thousand cycles on at least some CPU's, but
> that's still peanuts compared to a real page fault. It shouldn't be
> *that* noticeable, ie no way it's a 6% regression on its own.

I did some simple counting and found that about 3.15e9 PTEs were set to
old during the test after the commit.  This may explain the user_time
increase below, because these accessed-bit microfaults are accounted as
user time.

    387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time

I also made a one-line debug patch, as below, on top of the commit to
set the PTE to young unconditionally, which recovers the regression.

modified   mm/filemap.c
@@ -2193,7 +2193,7 @@ repeat:
 		if (file->f_ra.mmap_miss > 0)
 			file->f_ra.mmap_miss--;
 		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
-		do_set_pte(vma, addr, page, pte, false, false, true);
+		do_set_pte(vma, addr, page, pte, false, false, false);
 		unlock_page(page);
 		atomic64_inc(&old_pte_count);
 		goto next;

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-12  1:02           ` Linus Torvalds
  2016-06-13  9:02             ` Huang, Ying
@ 2016-06-13 12:52             ` Kirill A. Shutemov
  2016-06-14  6:11               ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2016-06-13 12:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko,
	Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP,
	Dave Hansen

On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
> >
> > From the perf profile, the time spent in page_fault and its child
> > functions is almost the same (7.85% vs 7.81%).  So the time spent in
> > page fault and page table operations themselves didn't change much.
> > So, you mean the CPU may be slower to load the page table entry into
> > the TLB if the accessed bit is not set?
> 
> So the CPU does take a microfault internally when it needs to set the
> accessed/dirty bit. It's not architecturally visible, but you can see
> it when you do timing loops.
> 
> I've timed it at over a thousand cycles on at least some CPU's, but
> that's still peanuts compared to a real page fault. It shouldn't be
> *that* noticeable, ie no way it's a 6% regression on its own.

Looks like setting the accessed bit is the problem.

Without mkold:

Score: 1952.9

  Performance counter stats for './Run shell8 -c 1' (3 runs):
 
    468,562,316,621      cycles:u                                                      ( +-  0.02% )
      4,596,299,472      dtlb_load_misses_walk_duration:u                                     ( +-  0.07% )
      5,245,488,559      itlb_misses_walk_duration:u                                     ( +-  0.10% )
 
      189.336404566 seconds time elapsed                                          ( +-  0.01% )

With mkold:

Score: 1885.5

  Performance counter stats for './Run shell8 -c 1' (3 runs):
 
    503,185,676,256      cycles:u                                                      ( +-  0.06% )
      8,137,007,894      dtlb_load_misses_walk_duration:u                                     ( +-  0.85% )
      7,220,632,283      itlb_misses_walk_duration:u                                     ( +-  1.40% )
 
      189.363223499 seconds time elapsed                                          ( +-  0.01% )

We spend 36% more time in page walks alone, about 1% of total userspace
time.  Combining this with the page-walk footprint on caches, I guess we
can get to the 3.5% score difference I see.

I'm not sure if there's anything we can do to solve the issue without
screwing up the reclaim logic again. :(

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-13 12:52             ` Kirill A. Shutemov
@ 2016-06-14  6:11               ` Linus Torvalds
  2016-06-14  8:26                 ` Kirill A. Shutemov
  2016-06-14 14:03                 ` Christian Borntraeger
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2016-06-14  6:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko,
	Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP,
	Dave Hansen

On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
>>
>> I've timed it at over a thousand cycles on at least some CPU's, but
>> that's still peanuts compared to a real page fault. It shouldn't be
>> *that* noticeable, ie no way it's a 6% regression on its own.
>
> Looks like setting the accessed bit is the problem.

Ok. I've definitely seen it as an issue, but never to the point of
several percent on a real benchmark that wasn't explicitly testing
that cost.

I reported the excessive dirty/accessed bit cost to Intel back in the
P4 days, but apparently it's never been a high enough priority for
anybody to care.

> We spend 36% more time in page walks alone, about 1% of total userspace
> time.  Combining this with the page-walk footprint on caches, I guess we
> can get to the 3.5% score difference I see.
>
> I'm not sure if there's anything we can do to solve the issue without
> screwing up the reclaim logic again. :(

I think we should say "screw the reclaim logic" and revert
commit 5c0a85fad949 for now.

Considering how much trouble the accessed bit is on some other
architectures too, I wonder if we should strive to simply not care
about it, and always leaving it set. And then rely entirely on just
unmapping the pages and making the "we took a page fault after
unmapping" be the real activity tester.

So get rid of the "if the page is young, mark it old but leave it in
the page tables" logic entirely. When we unmap a page, it will always
either be in the swap cache or the page cache anyway, so faulting it
in again should be just a minor fault with no actual IO happening.

That might be less of an impact in the end - yes, the unmap and
re-fault is much more expensive, but it presumably happens to much
fewer pages.
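
Sketched as kernel-style pseudocode (hypothetical, not a real patch),
the proposal amounts to:

```c
/*
 * Hypothetical sketch of the proposal above -- not real kernel code.
 * Instead of clearing the accessed bit on young pages and rescanning,
 * reclaim would unmap the page outright; the page stays in the page
 * cache (or swap cache), so the next access is a minor fault with no
 * I/O, and that re-fault becomes the activity signal.
 */
static void age_page(struct page *page)
{
	/* old behavior: clear the A bit, keep the page mapped, and
	 * rescan later via ptep_test_and_clear_young() */

	/* proposed behavior: always leave PTEs young; to test for
	 * activity, unmap the page and wait for a re-fault */
	try_to_unmap(page, TTU_BATCH_FLUSH);
	/* page remains in the page/swap cache; a later minor fault
	 * on it means "this page is really used" */
}
```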

What do you think?

             Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-14  6:11               ` Linus Torvalds
@ 2016-06-14  8:26                 ` Kirill A. Shutemov
  2016-06-14 16:07                   ` Rik van Riel
  2016-06-14 14:03                 ` Christian Borntraeger
  1 sibling, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2016-06-14  8:26 UTC (permalink / raw)
  To: Linus Torvalds, Rik van Riel, Mel Gorman
  Cc: Kirill A. Shutemov, Huang, Ying, Michal Hocko, LKML,
	Michal Hocko, Minchan Kim, Vinayak Menon, Andrew Morton, LKP,
	Dave Hansen, Vladimir Davydov

On Mon, Jun 13, 2016 at 11:11:05PM -0700, Linus Torvalds wrote:
> On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > On Sat, Jun 11, 2016 at 06:02:57PM -0700, Linus Torvalds wrote:
> >>
> >> I've timed it at over a thousand cycles on at least some CPU's, but
> >> that's still peanuts compared to a real page fault. It shouldn't be
> >> *that* noticeable, ie no way it's a 6% regression on its own.
> >
> > Looks like setting the accessed bit is the problem.
> 
> Ok. I've definitely seen it as an issue, but never to the point of
> several percent on a real benchmark that wasn't explicitly testing
> that cost.
> 
> I reported the excessive dirty/accessed bit cost to Intel back in the
> P4 days, but apparently it's never been a high enough priority for
> anybody to care.
> 
> > We spend 36% more time in page walks alone, about 1% of total userspace
> > time.  Combining this with the page-walk footprint on caches, I guess we
> > can get to the 3.5% score difference I see.
> >
> > I'm not sure if there's anything we can do to solve the issue without
> > screwing up the reclaim logic again. :(
> 
> I think we should say "screw the reclaim logic" and revert
> commit 5c0a85fad949 for now.

Okay. I'll prepare the patch.

> Considering how much trouble the accessed bit is on some other
> architectures too, I wonder if we should strive to simply not care
> about it, and always leaving it set. And then rely entirely on just
> unmapping the pages and making the "we took a page fault after
> unmapping" be the real activity tester.
> 
> So get rid of the "if the page is young, mark it old but leave it in
> the page tables" logic entirely. When we unmap a page, it will always
> either be in the swap cache or the page cache anyway, so faulting it
> in again should be just a minor fault with no actual IO happening.
> 
> That might be less of an impact in the end - yes, the unmap and
> re-fault is much more expensive, but it presumably happens to much
> fewer pages.
> 
> What do you think?

Well, we cannot do this for anonymous memory. No swap -- no swap cache,
if I read the code correctly.

I guess it's doable for file mappings, although I would expect
regressions in other benchmarks. IIUC, it would require page unmapping
to propagate the page to the active list, which is suboptimal.

And the implications for page_idle are not clear to me.

Rik, Mel, any comments?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-08  8:58       ` Kirill A. Shutemov
  2016-06-12  0:49         ` Huang, Ying
@ 2016-06-14  8:57         ` Minchan Kim
  2016-06-14 14:34           ` Kirill A. Shutemov
  1 sibling, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2016-06-14  8:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Linus Torvalds,
	Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton, lkp

On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> > "Huang, Ying" <ying.huang@intel.com> writes:
> > 
> > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> > >
> > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> > >>> 
> > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> > >>> 
> > >>> [...]
> > >
> > 
> > And FYI, there is no swap set up for the test; the whole root file
> > system, including the benchmark files, is in tmpfs, so no real page
> > reclaim will be triggered.  But it appears that the active file cache
> > was reduced after the commit.
> > 
> >     111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
> >      27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)
> > 
> > I think this is the expected behavior of the commit?
> 
> Yes, it's expected.
> 
> After the change, faultaround produces old PTEs. It means there's more
> chance for these pages to be on the inactive LRU, unless somebody
> actually touches them and flips the accessed bit.

Hmm, tmpfs pages should be on the anonymous LRU list, and the VM
shouldn't scan the anonymous LRU list on a swapless system, so I really
wonder why the active file LRU shrank.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-13  9:02             ` Huang, Ying
@ 2016-06-14 13:38               ` Minchan Kim
  2016-06-15 23:42                 ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2016-06-14 13:38 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Linus Torvalds, Kirill A. Shutemov, Rik van Riel, Michal Hocko,
	LKML, Michal Hocko, Vinayak Menon, Mel Gorman, Andrew Morton,
	LKP

On Mon, Jun 13, 2016 at 05:02:15PM +0800, Huang, Ying wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> From the perf profile, the time spent in page_fault and its child
> >> functions is almost the same (7.85% vs 7.81%).  So the time spent in
> >> page fault and page table operations themselves didn't change much.
> >> So, you mean the CPU may be slower to load the page table entry into
> >> the TLB if the accessed bit is not set?
> >
> > So the CPU does take a microfault internally when it needs to set the
> > accessed/dirty bit. It's not architecturally visible, but you can see
> > it when you do timing loops.
> >
> > I've timed it at over a thousand cycles on at least some CPU's, but
> > that's still peanuts compared to a real page fault. It shouldn't be
> > *that* noticeable, ie no way it's a 6% regression on its own.
> 
> I did some simple counting and found that about 3.15e9 PTEs were set to
> old during the test after the commit.  This may explain the user_time
> increase below, because these accessed-bit microfaults are accounted as
> user time.
> 
>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
> 
> I also made a one-line debug patch, as below, on top of the commit to
> set the PTE to young unconditionally, which recovers the regression.

With this patch, is meminfo.Active(file) almost the same, unlike in the
previous experiment?

> 
> modified   mm/filemap.c
> @@ -2193,7 +2193,7 @@ repeat:
>  		if (file->f_ra.mmap_miss > 0)
>  			file->f_ra.mmap_miss--;
>  		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> -		do_set_pte(vma, addr, page, pte, false, false, true);
> +		do_set_pte(vma, addr, page, pte, false, false, false);
>  		unlock_page(page);
>  		atomic64_inc(&old_pte_count);
>  		goto next;
> 
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-14  6:11               ` Linus Torvalds
  2016-06-14  8:26                 ` Kirill A. Shutemov
@ 2016-06-14 14:03                 ` Christian Borntraeger
  1 sibling, 0 replies; 23+ messages in thread
From: Christian Borntraeger @ 2016-06-14 14:03 UTC (permalink / raw)
  To: Linus Torvalds, Kirill A. Shutemov
  Cc: Huang, Ying, Rik van Riel, Michal Hocko, LKML, Michal Hocko,
	Minchan Kim, Vinayak Menon, Mel Gorman, Andrew Morton, LKP,
	Dave Hansen, Martin Schwidefsky, linux-s390

On 06/14/2016 08:11 AM, Linus Torvalds wrote:
> On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> [...]
> 
> I think we should say "screw the reclaim logic" and revert
> commit 5c0a85fad949 for now.
> 
> Considering how much trouble the accessed bit is on some other
> architectures too, I wonder if we should strive to simply not care
> about it, and always leaving it set. And then rely entirely on just
> unmapping the pages and making the "we took a page fault after
> unmapping" be the real activity tester.
> 
> So get rid of the "if the page is young, mark it old but leave it in
> the page tables" logic entirely. When we unmap a page, it will always
> either be in the swap cache or the page cache anyway, so faulting it
> in again should be just a minor fault with no actual IO happening.
> 
> That might be less of an impact in the end - yes, the unmap and
> re-fault is much more expensive, but it presumably happens to much
> fewer pages.

FWIW, something like that is what Martin did for s390 3 years ago.
We now use invalidation and page faults to implement the *young
functions in pgtable.h (basically using a SW young bit). This
helped us to get rid of the storage keys (which contain the HW
reference bit). The performance did not seem to suffer.

See commit 0944fe3f4a323f436180d39402cae7f9c46ead17
s390/mm: implement software referenced bits

> 
> What do you think?

Your proposal would be to do the software tracking via
invalidation/fault as part of the generic mm code, and not to hide it
in the architecture backend. Correct?

> 
>              Linus
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-14  8:57         ` Minchan Kim
@ 2016-06-14 14:34           ` Kirill A. Shutemov
  2016-06-15 23:52             ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2016-06-14 14:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kirill A. Shutemov, Huang, Ying, Rik van Riel, Michal Hocko,
	LKML, Linus Torvalds, Michal Hocko, Vinayak Menon, Mel Gorman,
	Andrew Morton, lkp

On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> > > "Huang, Ying" <ying.huang@intel.com> writes:
> > > 
> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> > > >
> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> > > >>> 
> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> > > >>> 
> > > [...]
> > > 
> > > And FYI, there is no swap set up for the test; the whole root file
> > > system, including the benchmark files, is in tmpfs, so no real page
> > > reclaim will be triggered.  But it appears that the active file cache
> > > was reduced after the commit.
> > > 
> > >     111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
> > >      27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)
> > > 
> > > I think this is the expected behavior of the commit?
> > 
> > Yes, it's expected.
> > 
> > After the change, faultaround produces old PTEs. It means there's more
> > chance for these pages to be on the inactive LRU, unless somebody
> > actually touches them and flips the accessed bit.
> 
> Hmm, tmpfs pages should be on the anonymous LRU list, and the VM
> shouldn't scan the anonymous LRU list on a swapless system, so I really
> wonder why the active file LRU shrank.

Hm. Good point. I don't know why we have anything on the file LRU if
there are no filesystems except tmpfs.

Ying, how do you get stuff to the tmpfs?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-14  8:26                 ` Kirill A. Shutemov
@ 2016-06-14 16:07                   ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2016-06-14 16:07 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Mel Gorman
  Cc: Kirill A. Shutemov, Huang, Ying, Michal Hocko, LKML,
	Michal Hocko, Minchan Kim, Vinayak Menon, Andrew Morton, LKP,
	Dave Hansen, Vladimir Davydov

[-- Attachment #1: Type: text/plain, Size: 3579 bytes --]

On Tue, 2016-06-14 at 11:26 +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 13, 2016 at 11:11:05PM -0700, Linus Torvalds wrote:
> > 
> > On Mon, Jun 13, 2016 at 5:52 AM, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > > 
> > > [...]
> > I think we should say "screw the reclaim logic" and revert
> > commit 5c0a85fad949 for now.
> Okay. I'll prepare the patch.
> 
> > 
> > Considering how much trouble the accessed bit is on some other
> > architectures too, I wonder if we should strive to simply not care
> > about it, and always leaving it set. And then rely entirely on just
> > unmapping the pages and making the "we took a page fault after
> > unmapping" be the real activity tester.
> > 
> > So get rid of the "if the page is young, mark it old but leave it in
> > the page tables" logic entirely. When we unmap a page, it will always
> > either be in the swap cache or the page cache anyway, so faulting it
> > in again should be just a minor fault with no actual IO happening.
> > 
> > That might be less of an impact in the end - yes, the unmap and
> > re-fault is much more expensive, but it presumably happens to much
> > fewer pages.
> > 
> > What do you think?
> Well, we cannot do this for anonymous memory. No swap -- no swap cache,
> if I read the code correctly.
> 
> I guess it's doable for file mappings, although I would expect
> regressions in other benchmarks. IIUC, it would require page unmapping
> to propagate the page to the active list, which is suboptimal.
> 
> And the implications for page_idle are not clear to me.
> 
> Rik, Mel, any comments?

We can clear the accessed/young bit when anon pages are moved
from the active to the inactive list.

Reclaim does not care about the young bit on active anon pages
at all. For anon pages it uses a two hand clock algorithm, with
only pages on the inactive list being cared about.

For file pages, I believe we do look at the young bit on mapped
pages when they reach the end of the inactive list. Again, we
only care about the young bit on inactive pages.

One option may be to count on actively used file pages actually
being on the active list, and always set the young bit on ptes
when the page is already active.

Then we can let reclaim do its thing with the smaller number of
pages that are on the inactive list, while doing the faster thing
for pages that are on the active list.
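
A hypothetical sketch of that heuristic against the faultaround hunk
quoted earlier in the thread (not a real patch): PTEs are made old only
for pages on the inactive list, so reclaim keeps its signal while
already-active pages skip the accessed-bit microfault.

```c
/*
 * Hypothetical sketch, not a real patch.  In the faultaround path,
 * pass "old" to do_set_pte() only when the page sits on the inactive
 * list; active pages get young PTEs, since reclaim ignores the young
 * bit on active pages anyway.
 */
		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
		/* old only if reclaim might actually look at the bit */
		do_set_pte(vma, addr, page, pte, false, false,
			   !PageActive(page));
```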

Does that make sense?

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-14 13:38               ` Minchan Kim
@ 2016-06-15 23:42                 ` Huang, Ying
  0 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2016-06-15 23:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Linus Torvalds, Kirill A. Shutemov, Rik van Riel,
	Michal Hocko, LKML, Michal Hocko, Vinayak Menon, Mel Gorman,
	Andrew Morton, LKP

Minchan Kim <minchan@kernel.org> writes:

> On Mon, Jun 13, 2016 at 05:02:15PM +0800, Huang, Ying wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> 
>> > On Sat, Jun 11, 2016 at 5:49 PM, Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> From the perf profile, the time spent in page_fault and its children
>> >> functions is almost the same (7.85% vs 7.81%).  So the time spent in
>> >> page fault and page table operations itself didn't change much.  So,
>> >> you mean the CPU may be slower to load the page table entry into the
>> >> TLB if the accessed bit is not set?
>> >
>> > So the CPU does take a microfault internally when it needs to set the
>> > accessed/dirty bit. It's not architecturally visible, but you can see
>> > it when you do timing loops.
>> >
>> > I've timed it at over a thousand cycles on at least some CPU's, but
>> > that's still peanuts compared to a real page fault. It shouldn't be
>> > *that* noticeable, ie no way it's a 6% regression on its own.
>> 
>> I did some simple counting, and found that about 3.15e9 PTEs are set
>> old during the test after the commit.  This may explain the user_time
>> increase below, because these accessed-bit microfaults are accounted
>> as user time.
>> 
>>     387.66 .  0%      +5.4%     408.49 .  0%  unixbench.time.user_time
>> 
>> I also made a one-line debug patch, as below, on top of the commit to
>> set the PTE to young unconditionally, which recovers the regression.
>
> With this patch, meminfo.Active(file) is almost the same, unlike the
> previous experiment?

Yes.  meminfo.Active(file) is almost the same as with the parent commit
of the first bad commit.

Best Regards,
Huang, Ying

>> 
>> modified   mm/filemap.c
>> @@ -2193,7 +2193,7 @@ repeat:
>>  		if (file->f_ra.mmap_miss > 0)
>>  			file->f_ra.mmap_miss--;
>>  		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
>> -		do_set_pte(vma, addr, page, pte, false, false, true);
>> +		do_set_pte(vma, addr, page, pte, false, false, false);
>>  		unlock_page(page);
>>  		atomic64_inc(&old_pte_count);
>>  		goto next;
>> 
>> Best Regards,
>> Huang, Ying


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-14 14:34           ` Kirill A. Shutemov
@ 2016-06-15 23:52             ` Huang, Ying
  2016-06-16  0:13               ` Minchan Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2016-06-15 23:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Minchan Kim, Kirill A. Shutemov, Huang, Ying, Rik van Riel,
	Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon,
	Mel Gorman, Andrew Morton, lkp

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
>> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
>> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> > > "Huang, Ying" <ying.huang@intel.com> writes:
>> > > 
>> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> > > >
>> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> > > >>> 
>> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> > > >>> 
>> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> > > >>> 
>> > > >>> in testcase: unixbench
>> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> > > >>> 
>> > > >>> 
>> > > >>> Details are as below:
>> > > >>> -------------------------------------------------------------------------------------------------->
>> > > >>> 
>> > > >>> 
>> > > >>> =========================================================================================
>> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> > > >>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> > > >>> 
>> > > >>> commit: 
>> > > >>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> > > >>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
>> > > >>> 
>> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
>> > > >>> ---------------- -------------------------- 
>> > > >>>        fail:runs  %reproduction    fail:runs
>> > > >>>            |             |             |    
>> > > >>>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> > > >>>          %stddev     %change         %stddev
>> > > >>>              \          |                \  
>> > > >>>      14321 .  0%      -6.3%      13425 .  0%  unixbench.score
>> > > >>>    1996897 .  0%      -6.1%    1874635 .  0%  unixbench.time.involuntary_context_switches
>> > > >>>  1.721e+08 .  0%      -6.2%  1.613e+08 .  0%  unixbench.time.minor_page_faults
>> > > >>>     758.65 .  0%      -3.0%     735.86 .  0%  unixbench.time.system_time
>> > > >>>     387.66 .  0%      +5.4%     408.49 .  0%  unixbench.time.user_time
>> > > >>>    5950278 .  0%      -6.2%    5583456 .  0%  unixbench.time.voluntary_context_switches
>> > > >>
>> > > >> That's weird.
>> > > >>
>> > > >> I don't understand why the change would reduce the number of minor
>> > > >> faults. It should stay the same on x86-64. The rise of user_time is
>> > > >> puzzling too.
>> > > >
>> > > > unixbench runs in fixed time mode.  That is, the total time to run
>> > > > unixbench is fixed, but the work done varies.  So the minor_page_faults
>> > > > change may reflect only the work done.
>> > > >
>> > > >> Hm. Is it reproducible? Across reboots?
>> > > >
>> > > 
>> > > And FYI, there is no swap set up for the test; the whole root file
>> > > system, including the benchmark files, is in tmpfs, so no real page
>> > > reclaim will be triggered.  But it appears that the active file cache
>> > > is reduced after the commit.
>> > > 
>> > >     111331 .  1%     -13.3%      96503 .  0%  meminfo.Active
>> > >      27603 .  1%     -43.9%      15486 .  0%  meminfo.Active(file)
>> > > 
>> > > I think this is the expected behavior of the commit?
>> > 
>> > Yes, it's expected.
>> > 
>> > After the change, faultaround produces old ptes. It means there's more
>> > chance for these pages to be on the inactive lru, unless somebody
>> > actually touches them and flips the accessed bit.
>> 
>> Hmm, tmpfs pages should be on the anonymous LRU list, and the VM
>> shouldn't scan the anonymous LRU list on a swapless system, so I really
>> wonder why the active file LRU is shrunk.
>
> Hm. Good point. I don't know why we have anything on the file lru if
> there are no filesystems except tmpfs.
>
> Ying, how do you get stuff to the tmpfs?

We put the root file system and the benchmark into a set of compressed
cpio archives, then concatenate them into one initrd, and finally the
kernel uses that initrd as its initramfs.

Best Regards,
Huang, Ying


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-15 23:52             ` Huang, Ying
@ 2016-06-16  0:13               ` Minchan Kim
  2016-06-16 22:27                 ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2016-06-16  0:13 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel,
	Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon,
	Mel Gorman, Andrew Morton, lkp

On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > [...]
> >> > 
> >> > Yes, it's expected.
> >> > 
> >> > After the change, faultaround produces old ptes. It means there's more
> >> > chance for these pages to be on the inactive lru, unless somebody
> >> > actually touches them and flips the accessed bit.
> >> 
> >> Hmm, tmpfs pages should be on the anonymous LRU list, and the VM
> >> shouldn't scan the anonymous LRU list on a swapless system, so I really
> >> wonder why the active file LRU is shrunk.
> >
> > Hm. Good point. I don't know why we have anything on the file lru if
> > there are no filesystems except tmpfs.
> >
> > Ying, how do you get stuff to the tmpfs?
> 
> We put the root file system and the benchmark into a set of compressed
> cpio archives, then concatenate them into one initrd, and finally the
> kernel uses that initrd as its initramfs.

I see.

Could you share your 4 full vmstat(/proc/vmstat) files?

old:

cat /proc/vmstat > before.old.vmstat
do benchmark
cat /proc/vmstat > after.old.vmstat

new:

cat /proc/vmstat > before.new.vmstat
do benchmark
cat /proc/vmstat > after.new.vmstat

IOW, I want to see stats related to reclaim.

Thanks.


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-16  0:13               ` Minchan Kim
@ 2016-06-16 22:27                 ` Huang, Ying
  2016-06-17  5:41                   ` Minchan Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2016-06-16 22:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Kirill A. Shutemov, Kirill A. Shutemov,
	Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
	Vinayak Menon, Mel Gorman, Andrew Morton, lkp

[-- Attachment #1: Type: text/plain, Size: 5440 bytes --]

Minchan Kim <minchan@kernel.org> writes:

> On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
>> [...]
>> >
>> > Hm. Good point. I don't know why we have anything on the file lru if
>> > there are no filesystems except tmpfs.
>> >
>> > Ying, how do you get stuff to the tmpfs?
>> 
>> We put the root file system and the benchmark into a set of compressed
>> cpio archives, then concatenate them into one initrd, and finally the
>> kernel uses that initrd as its initramfs.
>
> I see.
>
> Could you share your 4 full vmstat(/proc/vmstat) files?
>
> old:
>
> cat /proc/vmstat > before.old.vmstat
> do benchmark
> cat /proc/vmstat > after.old.vmstat
>
> new:
>
> cat /proc/vmstat > before.new.vmstat
> do benchmark
> cat /proc/vmstat > after.new.vmstat
>
> IOW, I want to see stats related to reclaim.

Hi,

The /proc/vmstat files for the parent commit (parent-proc-vmstat.gz) and
the first bad commit (fbc-proc-vmstat.gz) are attached to the email.

The files contain more than just the vmstat before and after the
benchmark run: /proc/vmstat is sampled every second, and each sample
begins with "time: <time>".  You can check the first and last samples.
The first /proc/vmstat capture starts at the same time as the benchmark,
so it is not exactly the vmstat from before the benchmark run.


[-- Attachment #2: parent-proc-vmstat.gz --]
[-- Type: application/gzip, Size: 78486 bytes --]

[-- Attachment #3: fbc-proc-vmstat.gz --]
[-- Type: application/gzip, Size: 77915 bytes --]

[-- Attachment #4: Type: text/plain, Size: 27 bytes --]


Best Regards,
Huang, Ying


* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-16 22:27                 ` Huang, Ying
@ 2016-06-17  5:41                   ` Minchan Kim
  2016-06-17 19:26                     ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2016-06-17  5:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Rik van Riel,
	Michal Hocko, LKML, Linus Torvalds, Michal Hocko, Vinayak Menon,
	Mel Gorman, Andrew Morton, lkp

On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > [...]
> >
> > I see.
> >
> > Could you share your 4 full vmstat(/proc/vmstat) files?
> >
> > old:
> >
> > cat /proc/vmstat > before.old.vmstat
> > do benchmark
> > cat /proc/vmstat > after.old.vmstat
> >
> > new:
> >
> > cat /proc/vmstat > before.new.vmstat
> > do benchmark
> > cat /proc/vmstat > after.new.vmstat
> >
> > IOW, I want to see stats related to reclaim.
> 
> Hi,
> 
> The /proc/vmstat files for the parent commit (parent-proc-vmstat.gz) and
> the first bad commit (fbc-proc-vmstat.gz) are attached to the email.
> 
> The files contain more than just the vmstat before and after the
> benchmark run: /proc/vmstat is sampled every second, and each sample
> begins with "time: <time>".  You can check the first and last samples.
> The first /proc/vmstat capture starts at the same time as the benchmark,
> so it is not exactly the vmstat from before the benchmark run.
> 

Thanks for the testing!

nr_active_file shrank by 48%, but the value itself is not huge, so I
don't think it affects performance a lot.

There was no reclaim activity during the test. :(

pgfault is reduced by 6%.  Given that unixbench runs in fixed-time mode
and regressed by 6%, pgalloc/pgfree being reduced by 6% too is no
surprise.

No interesting data.

It seems you tested with THP enabled, maybe in always mode?
I'm sorry, but could you test it again with CONFIG_TRANSPARENT_HUGEPAGE
disabled? Maybe you already did.
Is it still a 6% regression with THP disabled?

                                    parent  first-bad   new/old
                 nr_free_pages      -6663      -6461    96.97%
                nr_alloc_batch       2594       4013   154.70%
              nr_inactive_anon        112        112   100.00%
                nr_active_anon       2536       2159    85.13%
              nr_inactive_file       -567       -227    40.04%
                nr_active_file        648        315    48.61%
                nr_unevictable          0          0     0.00%
                      nr_mlock          0          0     0.00%
                 nr_anon_pages       2634       2161    82.04%
                     nr_mapped        511        530   103.72%
                 nr_file_pages        207        215   103.86%
                      nr_dirty         -7         -6    85.71%
                  nr_writeback          0          0     0.00%
           nr_slab_reclaimable        158        328   207.59%
         nr_slab_unreclaimable       2208       2115    95.79%
           nr_page_table_pages        268        247    92.16%
               nr_kernel_stack        143         80    55.94%
                   nr_unstable          1          1   100.00%
                     nr_bounce          0          0     0.00%
               nr_vmscan_write          0          0     0.00%
   nr_vmscan_immediate_reclaim          0          0     0.00%
             nr_writeback_temp          0          0     0.00%
              nr_isolated_anon          0          0     0.00%
              nr_isolated_file          0          0     0.00%
                      nr_shmem        131        131   100.00%
                    nr_dirtied         67         78   116.42%
                    nr_written         74         84   113.51%
              nr_pages_scanned          0          0     0.00%
                      numa_hit  483752446  453696304    93.79%
                     numa_miss          0          0     0.00%
                  numa_foreign          0          0     0.00%
               numa_interleave          0          0     0.00%
                    numa_local  483752445  453696304    93.79%
                    numa_other          1          0     0.00%
            workingset_refault          0          0     0.00%
           workingset_activate          0          0     0.00%
        workingset_nodereclaim          0          0     0.00%
 nr_anon_transparent_hugepages          1          0     0.00%
                   nr_free_cma          0          0     0.00%
            nr_dirty_threshold      -1316      -1274    96.81%
 nr_dirty_background_threshold       -658       -637    96.81%
                        pgpgin          0          0     0.00%
                       pgpgout          0          0     0.00%
                        pswpin          0          0     0.00%
                       pswpout          0          0     0.00%
                   pgalloc_dma          0          0     0.00%
                 pgalloc_dma32   60130977   56323630    93.67%
                pgalloc_normal  457203182  428863437    93.80%
               pgalloc_movable          0          0     0.00%
                        pgfree  517327743  485181251    93.79%
                    pgactivate    2059556    1930950    93.76%
                  pgdeactivate          0          0     0.00%
                       pgfault  572723351  537107146    93.78%
                    pgmajfault          0          0     0.00%
                   pglazyfreed          0          0     0.00%
                  pgrefill_dma          0          0     0.00%
                pgrefill_dma32          0          0     0.00%
               pgrefill_normal          0          0     0.00%
              pgrefill_movable          0          0     0.00%
            pgsteal_kswapd_dma          0          0     0.00%
          pgsteal_kswapd_dma32          0          0     0.00%
         pgsteal_kswapd_normal          0          0     0.00%
        pgsteal_kswapd_movable          0          0     0.00%
            pgsteal_direct_dma          0          0     0.00%
          pgsteal_direct_dma32          0          0     0.00%
         pgsteal_direct_normal          0          0     0.00%
        pgsteal_direct_movable          0          0     0.00%
             pgscan_kswapd_dma          0          0     0.00%
           pgscan_kswapd_dma32          0          0     0.00%
          pgscan_kswapd_normal          0          0     0.00%
         pgscan_kswapd_movable          0          0     0.00%
             pgscan_direct_dma          0          0     0.00%
           pgscan_direct_dma32          0          0     0.00%
          pgscan_direct_normal          0          0     0.00%
         pgscan_direct_movable          0          0     0.00%
        pgscan_direct_throttle          0          0     0.00%
           zone_reclaim_failed          0          0     0.00%
                  pginodesteal          0          0     0.00%
                 slabs_scanned          0          0     0.00%
             kswapd_inodesteal          0          0     0.00%
  kswapd_low_wmark_hit_quickly          0          0     0.00%
 kswapd_high_wmark_hit_quickly          0          0     0.00%
                    pageoutrun          0          0     0.00%
                    allocstall          0          0     0.00%
                     pgrotated          0          0     0.00%
                drop_pagecache          0          0     0.00%
                     drop_slab          0          0     0.00%
              numa_pte_updates          0          0     0.00%
         numa_huge_pte_updates          0          0     0.00%
              numa_hint_faults          0          0     0.00%
        numa_hint_faults_local          0          0     0.00%
           numa_pages_migrated          0          0     0.00%
             pgmigrate_success          0          0     0.00%
                pgmigrate_fail          0          0     0.00%
       compact_migrate_scanned          0          0     0.00%
          compact_free_scanned          0          0     0.00%
              compact_isolated          0          0     0.00%
                 compact_stall          0          0     0.00%
                  compact_fail          0          0     0.00%
               compact_success          0          0     0.00%
           compact_daemon_wake          0          0     0.00%
      htlb_buddy_alloc_success          0          0     0.00%
         htlb_buddy_alloc_fail          0          0     0.00%
        unevictable_pgs_culled          0          0     0.00%
       unevictable_pgs_scanned          0          0     0.00%
       unevictable_pgs_rescued          0          0     0.00%
       unevictable_pgs_mlocked          0          0     0.00%
     unevictable_pgs_munlocked          0          0     0.00%
       unevictable_pgs_cleared          0          0     0.00%
      unevictable_pgs_stranded          0          0     0.00%
               thp_fault_alloc      22731      21604    95.04%
            thp_fault_fallback          0          0     0.00%
            thp_collapse_alloc          1          0     0.00%
     thp_collapse_alloc_failed          0          0     0.00%
                thp_split_page          0          0     0.00%
         thp_split_page_failed          0          0     0.00%
       thp_deferred_split_page      22731      21604    95.04%
                 thp_split_pmd          0          0     0.00%
           thp_zero_page_alloc          0          0     0.00%
    thp_zero_page_alloc_failed          0          0     0.00%
               balloon_inflate          0          0     0.00%
               balloon_deflate          0          0     0.00%
               balloon_migrate          0          0     0.00%

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-17  5:41                   ` Minchan Kim
@ 2016-06-17 19:26                     ` Huang, Ying
  2016-06-20  0:06                       ` Minchan Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2016-06-17 19:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Kirill A. Shutemov, Kirill A. Shutemov,
	Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
	Vinayak Menon, Mel Gorman, Andrew Morton, lkp

Minchan Kim <minchan@kernel.org> writes:

> On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
>> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> >> 
>> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
>> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
>> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
>> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
>> >> >> > > 
>> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>> >> >> > > >
>> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
>> >> >> > > >>> 
>> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
>> >> >> > > >>> 
>> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
>> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >> >> > > >>> 
>> >> >> > > >>> in testcase: unixbench
>> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
>> >> >> > > >>> 
>> >> >> > > >>> 
>> >> >> > > >>> Details are as below:
>> >> >> > > >>> -------------------------------------------------------------------------------------------------->
>> >> >> > > >>> 
>> >> >> > > >>> 
>> >> >> > > >>> =========================================================================================
>> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> >> >> > > >>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> >> >> > > >>> 
>> >> >> > > >>> commit: 
>> >> >> > > >>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> >> >> > > >>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
>> >> >> > > >>> 
>> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
>> >> >> > > >>> ---------------- -------------------------- 
>> >> >> > > >>>        fail:runs  %reproduction    fail:runs
>> >> >> > > >>>            |             |             |    
>> >> >> > > >>>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >> >> > > >>>          %stddev     %change         %stddev
>> >> >> > > >>>              \          |                \  
>> >> >> > > >>>      14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
>> >> >> > > >>>    1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
>> >> >> > > >>>  1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
>> >> >> > > >>>     758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
>> >> >> > > >>>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
>> >> >> > > >>>    5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
>> >> >> > > >>
>> >> >> > > >> That's weird.
>> >> >> > > >>
>> >> >> > > >> I don't understand why the change would reduce the number of minor faults.
>> >> >> > > >> It should stay the same on x86-64. The rise in user_time is puzzling too.
>> >> >> > > >
>> >> >> > > > unixbench runs in fixed time mode.  That is, the total time to run
>> >> >> > > > unixbench is fixed, but the work done varies.  So the minor_page_faults
>> >> >> > > > change may reflect only the work done.
>> >> >> > > >
>> >> >> > > >> Hm. Is it reproducible? Across reboots?
>> >> >> > > >
>> >> >> > > 
>> >> >> > > And FYI, there is no swap setup for test, all root file system including
>> >> >> > > benchmark files are in tmpfs, so no real page reclaim will be
>> >> >> > > triggered.  But it appears that active file cache reduced after the
>> >> >> > > commit.
>> >> >> > > 
>> >> >> > >     111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
>> >> >> > >      27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)
>> >> >> > > 
>> >> >> > > I think this is the expected behavior of the commit?
>> >> >> > 
>> >> >> > Yes, it's expected.
>> >> >> > 
>> >> >> > After the change, faultaround produces old PTEs. It means there's more
>> >> >> > chance for these pages to end up on the inactive LRU, unless somebody
>> >> >> > actually touches them and flips the accessed bit.
>> >> >> 
>> >> >> Hmm, tmpfs pages should be on the anonymous LRU list, and the VM shouldn't
>> >> >> scan the anonymous LRU list on a swapless system, so I really wonder why the
>> >> >> active file LRU shrank.
>> >> >
>> >> > Hm. Good point. I don't know why we have anything on the file LRU if there
>> >> > are no filesystems except tmpfs.
>> >> >
>> >> > Ying, how do you get stuff to the tmpfs?
>> >> 
>> >> We put the root file system and the benchmark into a set of compressed cpio
>> >> archives, then concatenate them into one initrd, and finally the kernel uses
>> >> that initrd as the initramfs.
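[ The concatenated-initrd setup described above works because both gzip and the
kernel's initramfs loader accept back-to-back compressed members. A minimal
sketch of that property in Python (payload contents are made up for
illustration):

```python
import gzip

# Two independently compressed "archives" (stand-ins for a rootfs.cgz
# and a benchmark.cgz).
rootfs = gzip.compress(b"root filesystem payload\n")
bench = gzip.compress(b"benchmark payload\n")

# Concatenating the compressed members yields one stream that decompresses
# to the concatenation of the payloads -- the same reason the kernel can
# treat several appended cpio.gz archives as a single initramfs.
initrd = rootfs + bench
assert gzip.decompress(initrd) == b"root filesystem payload\nbenchmark payload\n"
print("multi-member decompress OK")
```
]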
>> >
>> > I see.
>> >
>> > Could you share your 4 full vmstat(/proc/vmstat) files?
>> >
>> > old:
>> >
>> > cat /proc/vmstat > before.old.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.old.vmstat
>> >
>> > new:
>> >
>> > cat /proc/vmstat > before.new.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.new.vmstat
>> >
>> > IOW, I want to see stats related to reclaim.
>> 
>> Hi,
>> 
>> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
>> bad commit (fbc-proc-vmstat.gz) are attached with the email.
>> 
>> The files contain more than just the vmstat before and after the
>> benchmark run; /proc/vmstat is sampled every second.  Every sample begins
>> with "time: <time>".  You can check the first and last samples.  The
>> first /proc/vmstat capture is started at the same time as the
>> benchmark, so it is not exactly the vmstat from before the benchmark run.
>> 
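[ Pulling the first and last samples out of such a time-stamped dump is easy to
script; a hedged sketch of the format described above (a real dump may differ
in detail):

```python
def parse_vmstat_samples(text):
    """Split a concatenated /proc/vmstat dump into per-sample dicts.

    Each sample is assumed to begin with a "time: <time>" line,
    followed by "name value" lines, as described in the thread.
    """
    samples = []
    for line in text.splitlines():
        if line.startswith("time:"):
            samples.append({})
        elif samples and line.strip():
            name, value = line.split()
            samples[-1][name] = int(value)
    return samples

dump = """time: 100
pgfault 10
nr_active_file 5
time: 200
pgfault 900
nr_active_file 3
"""
samples = parse_vmstat_samples(dump)
first, last = samples[0], samples[-1]
# Delta over the run for a cumulative counter:
print("pgfault delta:", last["pgfault"] - first["pgfault"])  # -> pgfault delta: 890
```
]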
>
> Thanks for the testing!
>
> nr_active_file shrank by 48%, but the value itself is not huge, so
> I don't think it affects performance a lot.
>
> There was no reclaim activity for testing. :(
>
> pgfault is reduced by 6%. Given that, pgalloc/free are reduced by 6% too,
> which is no surprise, because unixbench runs in fixed-time mode and
> regressed by 6%.
>
> No interesting data.
>
> It seems you tested it with THP enabled, maybe in always mode?

Yes.  With the following in the kconfig.

CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y

> I'm sorry, but could you test it again with CONFIG_TRANSPARENT_HUGEPAGE=n?
> Maybe you already did.
> Is it still 6% regressed with THP disabled?

Yes.  I disabled THP via

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

The regression is the same as before.

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
  gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never

commit: 
  4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
  5c0a85fad949212b3e059692deecdeed74ae7ec7

4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     14332 ±  0%      -6.2%      13438 ±  0%  unixbench.score
   6662206 ±  0%      -6.2%    6252260 ±  0%  unixbench.time.involuntary_context_switches
 5.734e+08 ±  0%      -6.2%  5.376e+08 ±  0%  unixbench.time.minor_page_faults
      2527 ±  0%      -3.2%       2446 ±  0%  unixbench.time.system_time
      1291 ±  0%      +5.4%       1361 ±  0%  unixbench.time.user_time
  19875455 ±  0%      -6.3%   18622488 ±  0%  unixbench.time.voluntary_context_switches
   6570355 ±  0%     -11.9%    5787517 ±  0%  cpuidle.C1-HSW.usage
     17257 ± 34%     -59.1%       7055 ±  7%  latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
      5976 ±  0%     -43.0%       3404 ±  0%  proc-vmstat.nr_active_file
     45729 ±  1%     -22.5%      35439 ±  1%  meminfo.Active
     23905 ±  0%     -43.0%      13619 ±  0%  meminfo.Active(file)
      8465 ±  3%     -29.8%       5940 ±  3%  slabinfo.pid.active_objs
      8476 ±  3%     -29.9%       5940 ±  3%  slabinfo.pid.num_objs
      3.46 ±  0%     +12.5%       3.89 ±  0%  turbostat.CPU%c3
     67.09 ±  0%      -2.1%      65.65 ±  0%  turbostat.PkgWatt
     96090 ±  0%      -5.8%      90479 ±  0%  vmstat.system.cs
      9083 ±  0%      -2.7%       8833 ±  0%  vmstat.system.in
    467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.MIN_vruntime.avg
      7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.MIN_vruntime.max
      1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.MIN_vruntime.stddev
    467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.max_vruntime.avg
      7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.max_vruntime.max
      1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.max_vruntime.stddev
    -10724 ± -7%     -12.0%      -9433 ± -3%  sched_debug.cfs_rq:/.spread0.avg
    -17721 ± -4%      -9.8%     -15978 ± -2%  sched_debug.cfs_rq:/.spread0.min
     90355 ±  9%     +14.1%     103099 ±  5%  sched_debug.cpu.avg_idle.min
      0.12 ± 35%    +325.0%       0.52 ± 46%  sched_debug.cpu.cpu_load[0].min
     21913 ±  2%     +29.1%      28288 ± 14%  sched_debug.cpu.curr->pid.avg
     49953 ±  3%     +30.2%      65038 ±  0%  sched_debug.cpu.curr->pid.max
     23062 ±  2%     +30.1%      29996 ±  4%  sched_debug.cpu.curr->pid.stddev
    274.39 ±  5%     -10.2%     246.27 ±  3%  sched_debug.cpu.nr_uninterruptible.max
    242.73 ±  4%     -13.5%     209.90 ±  2%  sched_debug.cpu.nr_uninterruptible.stddev

Best Regards,
Huang, Ying

>                  nr_free_pages      -6663      -6461    96.97%
>                 nr_alloc_batch       2594       4013   154.70%
>               nr_inactive_anon        112        112   100.00%
>                 nr_active_anon       2536       2159    85.13%
>               nr_inactive_file       -567       -227    40.04%
>                 nr_active_file        648        315    48.61%
>                 nr_unevictable          0          0     0.00%
>                       nr_mlock          0          0     0.00%
>                  nr_anon_pages       2634       2161    82.04%
>                      nr_mapped        511        530   103.72%
>                  nr_file_pages        207        215   103.86%
>                       nr_dirty         -7         -6    85.71%
>                   nr_writeback          0          0     0.00%
>            nr_slab_reclaimable        158        328   207.59%
>          nr_slab_unreclaimable       2208       2115    95.79%
>            nr_page_table_pages        268        247    92.16%
>                nr_kernel_stack        143         80    55.94%
>                    nr_unstable          1          1   100.00%
>                      nr_bounce          0          0     0.00%
>                nr_vmscan_write          0          0     0.00%
>    nr_vmscan_immediate_reclaim          0          0     0.00%
>              nr_writeback_temp          0          0     0.00%
>               nr_isolated_anon          0          0     0.00%
>               nr_isolated_file          0          0     0.00%
>                       nr_shmem        131        131   100.00%
>                     nr_dirtied         67         78   116.42%
>                     nr_written         74         84   113.51%
>               nr_pages_scanned          0          0     0.00%
>                       numa_hit  483752446  453696304    93.79%
>                      numa_miss          0          0     0.00%
>                   numa_foreign          0          0     0.00%
>                numa_interleave          0          0     0.00%
>                     numa_local  483752445  453696304    93.79%
>                     numa_other          1          0     0.00%
>             workingset_refault          0          0     0.00%
>            workingset_activate          0          0     0.00%
>         workingset_nodereclaim          0          0     0.00%
>  nr_anon_transparent_hugepages          1          0     0.00%
>                    nr_free_cma          0          0     0.00%
>             nr_dirty_threshold      -1316      -1274    96.81%
>  nr_dirty_background_threshold       -658       -637    96.81%
>                         pgpgin          0          0     0.00%
>                        pgpgout          0          0     0.00%
>                         pswpin          0          0     0.00%
>                        pswpout          0          0     0.00%
>                    pgalloc_dma          0          0     0.00%
>                  pgalloc_dma32   60130977   56323630    93.67%
>                 pgalloc_normal  457203182  428863437    93.80%
>                pgalloc_movable          0          0     0.00%
>                         pgfree  517327743  485181251    93.79%
>                     pgactivate    2059556    1930950    93.76%
>                   pgdeactivate          0          0     0.00%
>                        pgfault  572723351  537107146    93.78%
>                     pgmajfault          0          0     0.00%
>                    pglazyfreed          0          0     0.00%
>                   pgrefill_dma          0          0     0.00%
>                 pgrefill_dma32          0          0     0.00%
>                pgrefill_normal          0          0     0.00%
>               pgrefill_movable          0          0     0.00%
>             pgsteal_kswapd_dma          0          0     0.00%
>           pgsteal_kswapd_dma32          0          0     0.00%
>          pgsteal_kswapd_normal          0          0     0.00%
>         pgsteal_kswapd_movable          0          0     0.00%
>             pgsteal_direct_dma          0          0     0.00%
>           pgsteal_direct_dma32          0          0     0.00%
>          pgsteal_direct_normal          0          0     0.00%
>         pgsteal_direct_movable          0          0     0.00%
>              pgscan_kswapd_dma          0          0     0.00%
>            pgscan_kswapd_dma32          0          0     0.00%
>           pgscan_kswapd_normal          0          0     0.00%
>          pgscan_kswapd_movable          0          0     0.00%
>              pgscan_direct_dma          0          0     0.00%
>            pgscan_direct_dma32          0          0     0.00%
>           pgscan_direct_normal          0          0     0.00%
>          pgscan_direct_movable          0          0     0.00%
>         pgscan_direct_throttle          0          0     0.00%
>            zone_reclaim_failed          0          0     0.00%
>                   pginodesteal          0          0     0.00%
>                  slabs_scanned          0          0     0.00%
>              kswapd_inodesteal          0          0     0.00%
>   kswapd_low_wmark_hit_quickly          0          0     0.00%
>  kswapd_high_wmark_hit_quickly          0          0     0.00%
>                     pageoutrun          0          0     0.00%
>                     allocstall          0          0     0.00%
>                      pgrotated          0          0     0.00%
>                 drop_pagecache          0          0     0.00%
>                      drop_slab          0          0     0.00%
>               numa_pte_updates          0          0     0.00%
>          numa_huge_pte_updates          0          0     0.00%
>               numa_hint_faults          0          0     0.00%
>         numa_hint_faults_local          0          0     0.00%
>            numa_pages_migrated          0          0     0.00%
>              pgmigrate_success          0          0     0.00%
>                 pgmigrate_fail          0          0     0.00%
>        compact_migrate_scanned          0          0     0.00%
>           compact_free_scanned          0          0     0.00%
>               compact_isolated          0          0     0.00%
>                  compact_stall          0          0     0.00%
>                   compact_fail          0          0     0.00%
>                compact_success          0          0     0.00%
>            compact_daemon_wake          0          0     0.00%
>       htlb_buddy_alloc_success          0          0     0.00%
>          htlb_buddy_alloc_fail          0          0     0.00%
>         unevictable_pgs_culled          0          0     0.00%
>        unevictable_pgs_scanned          0          0     0.00%
>        unevictable_pgs_rescued          0          0     0.00%
>        unevictable_pgs_mlocked          0          0     0.00%
>      unevictable_pgs_munlocked          0          0     0.00%
>        unevictable_pgs_cleared          0          0     0.00%
>       unevictable_pgs_stranded          0          0     0.00%
>                thp_fault_alloc      22731      21604    95.04%
>             thp_fault_fallback          0          0     0.00%
>             thp_collapse_alloc          1          0     0.00%
>      thp_collapse_alloc_failed          0          0     0.00%
>                 thp_split_page          0          0     0.00%
>          thp_split_page_failed          0          0     0.00%
>        thp_deferred_split_page      22731      21604    95.04%
>                  thp_split_pmd          0          0     0.00%
>            thp_zero_page_alloc          0          0     0.00%
>     thp_zero_page_alloc_failed          0          0     0.00%
>                balloon_inflate          0          0     0.00%
>                balloon_deflate          0          0     0.00%
>                balloon_migrate          0          0     0.00%

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LKP] [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression
  2016-06-17 19:26                     ` Huang, Ying
@ 2016-06-20  0:06                       ` Minchan Kim
  0 siblings, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2016-06-20  0:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, Kirill A. Shutemov, Kirill A. Shutemov,
	Rik van Riel, Michal Hocko, LKML, Linus Torvalds, Michal Hocko,
	Vinayak Menon, Mel Gorman, Andrew Morton, lkp, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Jun 17, 2016 at 12:26:51PM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
> >> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> >> >> 
> >> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
> >> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov wrote:
> >> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying wrote:
> >> >> >> > > "Huang, Ying" <ying.huang@intel.com> writes:
> >> >> >> > > 
> >> >> >> > > > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >> >> >> > > >
> >> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800, kernel test robot wrote:
> >> >> >> > > >>> 
> >> >> >> > > >>> FYI, we noticed a -6.3% regression of unixbench.score due to commit:
> >> >> >> > > >>> 
> >> >> >> > > >>> commit 5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old ptes")
> >> >> >> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> >> >> > > >>> 
> >> >> >> > > >>> in testcase: unixbench
> >> >> >> > > >>> on test machine: lituya: 16 threads Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
> >> >> >> > > >>> with following parameters: cpufreq_governor=performance/nr_task=1/test=shell8
> >> >> >> > > >>> 
> >> >> >> > > >>> 
> >> >> >> > > >>> Details are as below:
> >> >> >> > > >>> -------------------------------------------------------------------------------------------------->
> >> >> >> > > >>> 
> >> >> >> > > >>> 
> >> >> >> > > >>> =========================================================================================
> >> >> >> > > >>> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
> >> >> >> > > >>>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
> >> >> >> > > >>> 
> >> >> >> > > >>> commit: 
> >> >> >> > > >>>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
> >> >> >> > > >>>   5c0a85fad949212b3e059692deecdeed74ae7ec7
> >> >> >> > > >>> 
> >> >> >> > > >>> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
> >> >> >> > > >>> ---------------- -------------------------- 
> >> >> >> > > >>>        fail:runs  %reproduction    fail:runs
> >> >> >> > > >>>            |             |             |    
> >> >> >> > > >>>           3:4          -75%            :4     kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
> >> >> >> > > >>>          %stddev     %change         %stddev
> >> >> >> > > >>>              \          |                \  
> >> >> >> > > >>>      14321 ±  0%      -6.3%      13425 ±  0%  unixbench.score
> >> >> >> > > >>>    1996897 ±  0%      -6.1%    1874635 ±  0%  unixbench.time.involuntary_context_switches
> >> >> >> > > >>>  1.721e+08 ±  0%      -6.2%  1.613e+08 ±  0%  unixbench.time.minor_page_faults
> >> >> >> > > >>>     758.65 ±  0%      -3.0%     735.86 ±  0%  unixbench.time.system_time
> >> >> >> > > >>>     387.66 ±  0%      +5.4%     408.49 ±  0%  unixbench.time.user_time
> >> >> >> > > >>>    5950278 ±  0%      -6.2%    5583456 ±  0%  unixbench.time.voluntary_context_switches
> >> >> >> > > >>
> >> >> >> > > >> That's weird.
> >> >> >> > > >>
> >> >> >> > > >> I don't understand why the change would reduce the number of minor faults.
> >> >> >> > > >> It should stay the same on x86-64. The rise in user_time is puzzling too.
> >> >> >> > > >
> >> >> >> > > > unixbench runs in fixed time mode.  That is, the total time to run
> >> >> >> > > > unixbench is fixed, but the work done varies.  So the minor_page_faults
> >> >> >> > > > change may reflect only the work done.
> >> >> >> > > >
> >> >> >> > > >> Hm. Is it reproducible? Across reboots?
> >> >> >> > > >
> >> >> >> > > 
> >> >> >> > > And FYI, there is no swap setup for test, all root file system including
> >> >> >> > > benchmark files are in tmpfs, so no real page reclaim will be
> >> >> >> > > triggered.  But it appears that active file cache reduced after the
> >> >> >> > > commit.
> >> >> >> > > 
> >> >> >> > >     111331 ±  1%     -13.3%      96503 ±  0%  meminfo.Active
> >> >> >> > >      27603 ±  1%     -43.9%      15486 ±  0%  meminfo.Active(file)
> >> >> >> > > 
> >> >> >> > > I think this is the expected behavior of the commit?
> >> >> >> > 
> >> >> >> > Yes, it's expected.
> >> >> >> > 
> >> >> >> > After the change, faultaround produces old PTEs. It means there's more
> >> >> >> > chance for these pages to end up on the inactive LRU, unless somebody
> >> >> >> > actually touches them and flips the accessed bit.
> >> >> >> 
> >> >> >> Hmm, tmpfs pages should be on the anonymous LRU list, and the VM shouldn't
> >> >> >> scan the anonymous LRU list on a swapless system, so I really wonder why the
> >> >> >> active file LRU shrank.
> >> >> >
> >> >> > Hm. Good point. I don't know why we have anything on the file LRU if there
> >> >> > are no filesystems except tmpfs.
> >> >> >
> >> >> > Ying, how do you get stuff to the tmpfs?
> >> >> 
> >> >> We put the root file system and the benchmark into a set of compressed cpio
> >> >> archives, then concatenate them into one initrd, and finally the kernel uses
> >> >> that initrd as the initramfs.
> >> >
> >> > I see.
> >> >
> >> > Could you share your 4 full vmstat(/proc/vmstat) files?
> >> >
> >> > old:
> >> >
> >> > cat /proc/vmstat > before.old.vmstat
> >> > do benchmark
> >> > cat /proc/vmstat > after.old.vmstat
> >> >
> >> > new:
> >> >
> >> > cat /proc/vmstat > before.new.vmstat
> >> > do benchmark
> >> > cat /proc/vmstat > after.new.vmstat
> >> >
> >> > IOW, I want to see stats related to reclaim.
> >> 
> >> Hi,
> >> 
> >> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
> >> bad commit (fbc-proc-vmstat.gz) are attached with the email.
> >> 
> >> The files contain more than just the vmstat before and after the
> >> benchmark run; /proc/vmstat is sampled every second.  Every sample begins
> >> with "time: <time>".  You can check the first and last samples.  The
> >> first /proc/vmstat capture is started at the same time as the
> >> benchmark, so it is not exactly the vmstat from before the benchmark run.
> >> 
> >
> > Thanks for the testing!
> >
> > nr_active_file shrank by 48%, but the value itself is not huge, so
> > I don't think it affects performance a lot.
> >
> > There was no reclaim activity for testing. :(
> >
> > pgfault is reduced by 6%. Given that, pgalloc/free are reduced by 6% too,
> > which is no surprise, because unixbench runs in fixed-time mode and
> > regressed by 6%.
> >
> > No interesting data.
> >
> > It seems you tested it with THP enabled, maybe in always mode?
> 
> Yes.  With the following in the kconfig.
> 
> CONFIG_TRANSPARENT_HUGEPAGE=y
> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
> 
> > I'm sorry, but could you test it again with CONFIG_TRANSPARENT_HUGEPAGE=n?
> > Maybe you already did.
> > Is it still 6% regressed with THP disabled?
> 
> Yes.  I disabled THP via
> 
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
> 
> The regression is the same as before.

Still, there is a 6% user_time regression, and there is no difference from the
previous experiment with THP enabled, so I agree the regression is caused just
by accessed-bit setting on the CPU side, which is rather surprising to me.

I don't know how the unixbench shell-script test touches memory, but this
should be a per-page overhead, and a 6% regression for that is too heavy.
Anyway, at least, it would be better to bring it to the attention of the x86
maintainers.

Thanks for the test!
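[ The per-page cost under discussion shows up to userspace as minor faults
taken on first access to mapped pages. On Linux, getrusage() exposes the
process-wide minor fault count, so the effect can be counted from userspace;
a hedged sketch (not the unixbench workload itself, just a way to observe
per-page fault accounting):

```python
import mmap
import resource

PAGE = mmap.PAGESIZE
NPAGES = 64

# Fresh anonymous mapping; nothing is faulted in yet.
buf = mmap.mmap(-1, NPAGES * PAGE)

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
for i in range(NPAGES):
    buf[i * PAGE] = 1  # first touch of each page takes a minor fault
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

# Each first touch costs at least one minor fault; interpreter activity
# may add a few more on top, so this is a lower bound.
print("minor faults for", NPAGES, "pages:", after - before)
assert after - before >= NPAGES
buf.close()
```
]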

> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
>   gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never
> 
> commit: 
>   4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>   5c0a85fad949212b3e059692deecdeed74ae7ec7
> 
> 4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de 
> ---------------- -------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>      14332 ±  0%      -6.2%      13438 ±  0%  unixbench.score
>    6662206 ±  0%      -6.2%    6252260 ±  0%  unixbench.time.involuntary_context_switches
>  5.734e+08 ±  0%      -6.2%  5.376e+08 ±  0%  unixbench.time.minor_page_faults
>       2527 ±  0%      -3.2%       2446 ±  0%  unixbench.time.system_time
>       1291 ±  0%      +5.4%       1361 ±  0%  unixbench.time.user_time
>   19875455 ±  0%      -6.3%   18622488 ±  0%  unixbench.time.voluntary_context_switches
>    6570355 ±  0%     -11.9%    5787517 ±  0%  cpuidle.C1-HSW.usage
>      17257 ± 34%     -59.1%       7055 ±  7%  latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
>       5976 ±  0%     -43.0%       3404 ±  0%  proc-vmstat.nr_active_file
>      45729 ±  1%     -22.5%      35439 ±  1%  meminfo.Active
>      23905 ±  0%     -43.0%      13619 ±  0%  meminfo.Active(file)
>       8465 ±  3%     -29.8%       5940 ±  3%  slabinfo.pid.active_objs
>       8476 ±  3%     -29.9%       5940 ±  3%  slabinfo.pid.num_objs
>       3.46 ±  0%     +12.5%       3.89 ±  0%  turbostat.CPU%c3
>      67.09 ±  0%      -2.1%      65.65 ±  0%  turbostat.PkgWatt
>      96090 ±  0%      -5.8%      90479 ±  0%  vmstat.system.cs
>       9083 ±  0%      -2.7%       8833 ±  0%  vmstat.system.in
>     467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.MIN_vruntime.avg
>       7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.MIN_vruntime.max
>       1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.MIN_vruntime.stddev
>     467.35 ± 78%    +416.7%       2414 ± 45%  sched_debug.cfs_rq:/.max_vruntime.avg
>       7477 ± 78%    +327.7%      31981 ± 39%  sched_debug.cfs_rq:/.max_vruntime.max
>       1810 ± 78%    +360.1%       8327 ± 40%  sched_debug.cfs_rq:/.max_vruntime.stddev
>     -10724 ± -7%     -12.0%      -9433 ± -3%  sched_debug.cfs_rq:/.spread0.avg
>     -17721 ± -4%      -9.8%     -15978 ± -2%  sched_debug.cfs_rq:/.spread0.min
>      90355 ±  9%     +14.1%     103099 ±  5%  sched_debug.cpu.avg_idle.min
>       0.12 ± 35%    +325.0%       0.52 ± 46%  sched_debug.cpu.cpu_load[0].min
>      21913 ±  2%     +29.1%      28288 ± 14%  sched_debug.cpu.curr->pid.avg
>      49953 ±  3%     +30.2%      65038 ±  0%  sched_debug.cpu.curr->pid.max
>      23062 ±  2%     +30.1%      29996 ±  4%  sched_debug.cpu.curr->pid.stddev
>     274.39 ±  5%     -10.2%     246.27 ±  3%  sched_debug.cpu.nr_uninterruptible.max
>     242.73 ±  4%     -13.5%     209.90 ±  2%  sched_debug.cpu.nr_uninterruptible.stddev
> 
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-06-20  0:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-06  2:27 [lkp] [mm] 5c0a85fad9: unixbench.score -6.3% regression kernel test robot
2016-06-06  9:51 ` Kirill A. Shutemov
2016-06-08  7:21   ` [LKP] " Huang, Ying
2016-06-08  8:41     ` Huang, Ying
2016-06-08  8:58       ` Kirill A. Shutemov
2016-06-12  0:49         ` Huang, Ying
2016-06-12  1:02           ` Linus Torvalds
2016-06-13  9:02             ` Huang, Ying
2016-06-14 13:38               ` Minchan Kim
2016-06-15 23:42                 ` Huang, Ying
2016-06-13 12:52             ` Kirill A. Shutemov
2016-06-14  6:11               ` Linus Torvalds
2016-06-14  8:26                 ` Kirill A. Shutemov
2016-06-14 16:07                   ` Rik van Riel
2016-06-14 14:03                 ` Christian Borntraeger
2016-06-14  8:57         ` Minchan Kim
2016-06-14 14:34           ` Kirill A. Shutemov
2016-06-15 23:52             ` Huang, Ying
2016-06-16  0:13               ` Minchan Kim
2016-06-16 22:27                 ` Huang, Ying
2016-06-17  5:41                   ` Minchan Kim
2016-06-17 19:26                     ` Huang, Ying
2016-06-20  0:06                       ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).