* [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
From: kernel test robot @ 2023-11-20  7:11 UTC
  To: Christian Brauner
  Cc: oe-lkp, lkp, linux-kernel, Jann Horn, Linus Torvalds, linux-doc,
	linuxppc-dev, intel-gfx, linux-fsdevel, gfs2, bpf, ying.huang,
	feng.tang, fengwei.yin, oliver.sang



Hello,

kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:


commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

testcase: will-it-scale
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
parameters:

	nr_task: 16
	mode: thread
	test: poll2
	cpufreq_governor: performance
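
For context, the poll2 case in will-it-scale has each benchmark thread call
poll() on a batch of pipe file descriptors in a tight loop, so per-entry fd
lookup dominates the profile. A rough sketch of such a hot loop follows; it
is illustrative only (the fd count and iteration count are invented, and
this is not the actual will-it-scale source):

	#include <poll.h>
	#include <stdio.h>
	#include <unistd.h>

	#define NR_FDS 128			/* illustrative batch size */

	int main(void)
	{
		struct pollfd pfd[NR_FDS];
		int pipefd[2];
		unsigned long ops = 0;
		int i, n;

		if (pipe(pipefd))
			return 1;
		for (i = 0; i < NR_FDS; i++) {
			pfd[i].fd = pipefd[0];	/* poll the same read end repeatedly */
			pfd[i].events = POLLIN;
		}
		/*
		 * Every poll() call looks up and releases each fd once: that
		 * is the __fget_light()/fput() pair dominating the profile
		 * below.
		 */
		for (n = 0; n < 1000000; n++)
			if (poll(pfd, NR_FDS, 0) >= 0)
				ops++;
		printf("%lu ops\n", ops);
		return 0;
	}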




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311201406.2022ca3f-oliver.sang@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231120/202311201406.2022ca3f-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit: 
  93faf426e3 ("vfs: shave work on failed file open")
  0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      0.01 ±  9%  +58125.6%       4.17 ±175%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     89056            -2.0%      87309        proc-vmstat.nr_slab_unreclaimable
     97958 ±  7%      -9.7%      88449 ±  4%  sched_debug.cpu.avg_idle.stddev
      0.00 ± 12%     +24.2%       0.00 ± 17%  sched_debug.cpu.next_balance.stddev
   6391048            -2.9%    6208584        will-it-scale.16.threads
    399440            -2.9%     388036        will-it-scale.per_thread_ops
   6391048            -2.9%    6208584        will-it-scale.workload
     19.99 ±  4%      -2.2       17.74        perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      1.27 ±  5%      +0.8        2.11 ±  3%  perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     32.69 ±  4%      +5.0       37.70        perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.00           +27.9       27.85        perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
     20.00 ±  4%      -2.3       17.75        perf-profile.children.cycles-pp.fput
      0.24 ± 10%      -0.1        0.18 ±  2%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.48 ±  5%      +0.5        1.98 ±  3%  perf-profile.children.cycles-pp.__fdget
     31.85 ±  4%      +6.0       37.86        perf-profile.children.cycles-pp.__fget_light
      0.00           +27.7       27.67        perf-profile.children.cycles-pp.__get_file_rcu
     30.90 ±  4%     -20.6       10.35 ±  2%  perf-profile.self.cycles-pp.__fget_light
     19.94 ±  4%      -2.4       17.53        perf-profile.self.cycles-pp.fput
      9.81 ±  4%      -2.4        7.42 ±  2%  perf-profile.self.cycles-pp.do_poll
      0.23 ± 11%      -0.1        0.17 ±  4%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.00           +26.5       26.48        perf-profile.self.cycles-pp.__get_file_rcu
 2.146e+10 ±  2%      +8.5%  2.329e+10 ±  2%  perf-stat.i.branch-instructions
      0.22 ± 14%      -0.0        0.19 ± 14%  perf-stat.i.branch-miss-rate%
 1.404e+10 ±  2%      +8.7%  1.526e+10 ±  2%  perf-stat.i.dTLB-stores
     70.87            -2.3       68.59        perf-stat.i.iTLB-load-miss-rate%
   5267608            -5.5%    4979133 ±  2%  perf-stat.i.iTLB-load-misses
   2102507            +5.4%    2215725        perf-stat.i.iTLB-loads
     18791 ±  3%     +10.5%      20757 ±  2%  perf-stat.i.instructions-per-iTLB-miss
    266.67 ±  2%      +6.8%     284.75 ±  2%  perf-stat.i.metric.M/sec
      0.01 ± 10%     -10.5%       0.01 ±  5%  perf-stat.overall.MPKI
      0.19            -0.0        0.17        perf-stat.overall.branch-miss-rate%
      0.65            -3.1%       0.63        perf-stat.overall.cpi
      0.00 ±  4%      -0.0        0.00 ±  4%  perf-stat.overall.dTLB-store-miss-rate%
     71.48            -2.3       69.21        perf-stat.overall.iTLB-load-miss-rate%
     18757           +10.0%      20629        perf-stat.overall.instructions-per-iTLB-miss
      1.54            +3.2%       1.59        perf-stat.overall.ipc
   4795147            +6.4%    5100406        perf-stat.overall.path-length
  2.14e+10 ±  2%      +8.5%  2.322e+10 ±  2%  perf-stat.ps.branch-instructions
   1.4e+10 ±  2%      +8.7%  1.522e+10 ±  2%  perf-stat.ps.dTLB-stores
   5253923            -5.5%    4966218 ±  2%  perf-stat.ps.iTLB-load-misses
   2095770            +5.4%    2208605        perf-stat.ps.iTLB-loads
 3.065e+13            +3.3%  3.167e+13        perf-stat.total.instructions




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
From: Mateusz Guzik @ 2023-11-20  7:41 UTC
  To: kernel test robot
  Cc: Christian Brauner, oe-lkp, lkp, linux-kernel, Jann Horn,
	Linus Torvalds, linux-doc, linuxppc-dev, intel-gfx,
	linux-fsdevel, gfs2, bpf, ying.huang, feng.tang, fengwei.yin

On Mon, Nov 20, 2023 at 03:11:31PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:
> 
> 
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 
> ---------------- --------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
[snip]
>      30.90 ±  4%     -20.6       10.35 ±  2%  perf-profile.self.cycles-pp.__fget_light
>       0.00           +26.5       26.48        perf-profile.self.cycles-pp.__get_file_rcu
[snip]

So __fget_light() now makes an actual function call.

I don't know if this is worth patching (and benchmarking after), but if
sorting this out is of interest, the triviality below is probably the
easiest way out:

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..d8d3e18800c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -856,14 +856,14 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__get_file_rcu(struct file __rcu **f)
+static __always_inline struct file *__get_file_rcu(struct file __rcu **f)
 {
 	struct file __rcu *file;
 	struct file __rcu *file_reloaded;
 	struct file __rcu *file_reloaded_cmp;
 
 	file = rcu_dereference_raw(*f);
-	if (!file)
+	if (unlikely(!file))
 		return NULL;
 
 	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
@@ -891,7 +891,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	 * If the pointers don't match the file has been reallocated by
 	 * SLAB_TYPESAFE_BY_RCU.
 	 */
-	if (file == file_reloaded_cmp)
+	if (likely(file == file_reloaded_cmp))
 		return file_reloaded;
 
 	fput(file);
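
(For reference: likely() and unlikely() expand to __builtin_expect() and
only steer block placement and branch-prediction hints, while
__always_inline is what actually removes the call that the profile
attributes to __get_file_rcu(); the two changes in this patch address those
two separate costs.)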


* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
From: Linus Torvalds @ 2023-11-26 20:23 UTC
  To: kernel test robot
  Cc: Christian Brauner, oe-lkp, lkp, linux-kernel, Jann Horn,
	linux-doc, linuxppc-dev, intel-gfx, linux-fsdevel, gfs2, bpf,
	ying.huang, feng.tang, fengwei.yin

[-- Attachment #1: Type: text/plain, Size: 3134 bytes --]

On Sun, 19 Nov 2023 at 23:11, kernel test robot <oliver.sang@intel.com> wrote:
>
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:
>
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")

Ok, so __fget_light() is one of our more performance-critical things,
and that commit ends up making it call a rather more expensive version
in __get_file_rcu(), so we have:

>      30.90 ±  4%     -20.6       10.35 ±  2%  perf-profile.self.cycles-pp.__fget_light
>       0.00           +26.5       26.48        perf-profile.self.cycles-pp.__get_file_rcu

and that "20% decrease balanced by 26% increase elsewhere" then
directly causes the ~3% regression.

I took a look at the code generation, and honestly, I think we're
better off just making __fget_files_rcu() have special logic for this
all, and not use __get_file_rcu().

The 'fd' case really is special because we need to do that
non-speculative pointer access.

Because it turns out that we already have to use array_index_nospec()
to safely generate that pointer to the fd entry, and once you have to
do that "use non-speculative accesses to generate a safe pointer", you
might as well just go whole hog.

So this takes a different approach, and uses the
array_index_mask_nospec() thing that we have exactly to generate that
non-speculative mask for these things:

        /* Mask is a 0 for invalid fd's, ~0 for valid ones */
        mask = array_index_mask_nospec(fd, fdt->max_fds);

and then it does something you can consider either horribly clever, or
horribly ugly (this first part is basically just
array_index_nospec()):

        /* fdentry points to the 'fd' offset, or fdt->fd[0] */
        fdentry = fdt->fd + (fd&mask);

and then we can just *unconditionally* do the load - but since we
might be loading fd[0] for an invalid case, we need to mask the result
too:

        /* Do the load, then mask any invalid result */
        file = rcu_dereference_raw(*fdentry);
        file = (void *)(mask & (unsigned long)file);

but now we have done everything without any conditionals, and the only
conditional is now "did we load NULL" - which includes that "we masked
the bad value".
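
For reference, here is a minimal userspace sketch of that branchless
bounds-check-plus-masked-load pattern. The mask computation is modeled on
the kernel's generic C fallback for array_index_mask_nospec() (x86 uses
inline asm instead), and the table contents and fd values are invented:

	#include <stdio.h>

	#define BITS_PER_LONG ((int)(8 * sizeof(long)))

	/*
	 * ~0UL when index < size, 0UL otherwise, computed without a branch
	 * the CPU could mispredict past. Like the kernel, this relies on
	 * arithmetic right shift of a negative long.
	 */
	static unsigned long index_mask(unsigned long index, unsigned long size)
	{
		return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
	}

	int main(void)
	{
		const char *table[4] = { "fd0", "fd1", "fd2", "fd3" };
		unsigned long fds[] = { 2, 7 };	/* one valid, one out of bounds */

		for (int i = 0; i < 2; i++) {
			unsigned long fd = fds[i];
			unsigned long mask = index_mask(fd, 4);

			/* Load slot 0 for a bad fd, then mask the result to NULL */
			const char *entry = table[fd & mask];
			entry = (const char *)(mask & (unsigned long)entry);

			printf("fd=%lu mask=%#lx entry=%s\n", fd, mask,
			       entry ? entry : "(null)");
		}
		return 0;
	}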

Then we just do that atomic_long_inc_not_zero() on the file, and
re-check the pointer chain we used.

I made files_lookup_fd_raw() do the same thing.

The end result is much nicer code generation at least from my quick
check. And I assume the regression will be gone, and hopefully even
turned into an improvement since this is so hot.

Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that
__get_file_rcu() does, and I don't get it. Both things come from
volatile accesses, I don't see the point of those games, but I also
didn't care, since it's no longer in a critical code path.
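
(For reference on that macro: in the kernel's compiler.h,
OPTIMIZER_HIDE_VAR(var) is an empty inline asm that takes the variable as
both input and output, which stops the compiler from value-tracking it,
e.g. from assuming two loads of the same location returned the same pointer
and folding the comparison away.)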

Christian?

NOTE! This patch is not well tested. I verified an earlier version of
this, but have been playing with it since, so caveat emptor.

IOW, I might have messed up some "trivial cleanup" when prepping for
sending it out...

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 4120 bytes --]

 fs/file.c               | 48 ++++++++++++++++++++++++++++++------------------
 include/linux/fdtable.h | 15 ++++++++++-----
 2 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..c74a6e8687d9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -959,31 +959,42 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		struct file *file;
 		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
 		struct file __rcu **fdentry;
+		unsigned long mask;
 
-		if (unlikely(fd >= fdt->max_fds))
+		/* Mask is a 0 for invalid fd's, ~0 for valid ones */
+		mask = array_index_mask_nospec(fd, fdt->max_fds);
+
+		/* fdentry points to the 'fd' offset, or fdt->fd[0] */
+		fdentry = fdt->fd + (fd&mask);
+
+		/* Do the load, then mask any invalid result */
+		file = rcu_dereference_raw(*fdentry);
+		file = (void *)(mask & (unsigned long)file);
+
+		if (unlikely(!file))
 			return NULL;
 
-		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
+		/*
+		 * Ok, we have a file pointer that was valid at
+		 * some point, but it might have become stale since.
+		 *
+		 * We need to confirm it by incrementing the refcount
+		 * and then check the lookup again.
+		 *
+		 * atomic_long_inc_not_zero() gives us a full memory
+		 * barrier. We only really need an 'acquire' one to
+		 * protect the loads below, but we don't have that.
+		 */
+		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+			continue;
 
 		/*
-		 * Ok, we have a file pointer. However, because we do
-		 * this all locklessly under RCU, we may be racing with
-		 * that file being closed.
-		 *
 		 * Such a race can take two forms:
 		 *
 		 *  (a) the file ref already went down to zero and the
 		 *      file hasn't been reused yet or the file count
 		 *      isn't zero but the file has already been reused.
-		 */
-		file = __get_file_rcu(fdentry);
-		if (unlikely(!file))
-			return NULL;
-
-		if (unlikely(IS_ERR(file)))
-			continue;
-
-		/*
+		 *
 		 *  (b) the file table entry has changed under us.
 		 *       Note that we don't need to re-check the 'fdt->fd'
 		 *       pointer having changed, because it always goes
@@ -991,7 +1002,8 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * If so, we need to put our ref and try again.
 		 */
-		if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
+		if (unlikely(file != rcu_dereference_raw(*fdentry)) ||
+		    unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
 			fput(file);
 			continue;
 		}
@@ -1128,13 +1140,13 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 	 * atomic_read_acquire() pairs with atomic_dec_and_test() in
 	 * put_files_struct().
 	 */
-	if (atomic_read_acquire(&files->count) == 1) {
+	if (likely(atomic_read_acquire(&files->count) == 1)) {
 		file = files_lookup_fd_raw(files, fd);
 		if (!file || unlikely(file->f_mode & mask))
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget_files(files, fd, mask);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index bc4c3287a65e..a8a8b4d24619 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,12 +83,17 @@ struct dentry;
 static inline struct file *files_lookup_fd_raw(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+	unsigned long mask = array_index_mask_nospec(fd, fdt->max_fds);
+	struct file *needs_masking;
 
-	if (fd < fdt->max_fds) {
-		fd = array_index_nospec(fd, fdt->max_fds);
-		return rcu_dereference_raw(fdt->fd[fd]);
-	}
-	return NULL;
+	/*
+	 * 'mask' is zero for an out-of-bounds fd, all ones for ok.
+	 * 'fd&mask' is 'fd' for ok, or 0 for out of bounds.
+	 *
+	 * Accessing fdt->fd[0] is ok, but needs masking of the result.
+	 */
+	needs_masking = rcu_dereference_raw(fdt->fd[fd&mask]);
+	return (struct file *)(mask & (unsigned long)needs_masking);
 }
 
 static inline struct file *files_lookup_fd_locked(struct files_struct *files, unsigned int fd)


* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
From: Patchwork @ 2023-11-26 20:50 UTC
  To: Linus Torvalds; +Cc: intel-gfx

== Series Details ==

Series: 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
URL   : https://patchwork.freedesktop.org/series/126899/
State : warning

== Summary ==

Error: dim checkpatch failed
e2225568ae06 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
-:9: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#9: 
On Sun, 19 Nov 2023 at 23:11, kernel test robot <oliver.sang@intel.com> wrote:

-:13: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU")'
#13: 
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")

-:107: CHECK:SPACING: spaces preferred around that '&' (ctx:VxV)
#107: FILE: fs/file.c:968:
+		fdentry = fdt->fd + (fd&mask);
 		                       ^

-:202: CHECK:SPACING: spaces preferred around that '&' (ctx:VxV)
#202: FILE: include/linux/fdtable.h:95:
+	needs_masking = rcu_dereference_raw(fdt->fd[fd&mask]);
 	                                              ^

-:206: ERROR:MISSING_SIGN_OFF: Missing Signed-off-by: line(s)

total: 2 errors, 1 warnings, 2 checks, 104 lines checked




* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
From: Patchwork @ 2023-11-26 20:50 UTC
  To: Linus Torvalds; +Cc: intel-gfx

== Series Details ==

Series: 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
URL   : https://patchwork.freedesktop.org/series/126899/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2



* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
From: Linus Torvalds @ 2023-11-26 23:20 UTC
  To: kernel test robot
  Cc: Christian Brauner, Jann Horn, intel-gfx, linux-doc, linux-kernel,
	fengwei.yin, gfs2, linux-fsdevel, feng.tang, ying.huang, oe-lkp,
	bpf, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 698 bytes --]

On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> IOW, I might have messed up some "trivial cleanup" when prepping for
> sending it out...

Bah. Famous last words. One of the "trivial cleanups" made the code
more "obvious" by renaming the nospec mask as just "mask".

And that trivial rename broke that patch *entirely*, because now that
name shadowed the "fmode_t" mask argument.

Don't even ask how long it took me to go from "I *tested* this,
dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
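
For anyone who wants to see the failure mode in isolation, here is a tiny
C program demonstrating the same class of bug; the names and values are
invented and this is not the kernel code. A local declared in a nested
block silently shadows a same-named function parameter:

	#include <stdio.h>

	typedef unsigned int fmode_t;

	/* Meant to return nonzero if f_mode has bits the caller forbids. */
	static int mode_forbidden(unsigned int fd, fmode_t mask)
	{
		unsigned int f_mode = 0x4;	/* pretend file->f_mode */

		for (;;) {
			/*
			 * BUG: this 'mask' shadows the fmode_t parameter, so
			 * the test below uses the bounds mask instead of the
			 * caller's mode mask.
			 */
			unsigned long mask = (fd < 16) ? ~0UL : 0UL;

			return (f_mode & mask) != 0;
		}
	}

	int main(void)
	{
		/* Caller passes mask=0 ("forbid nothing"), yet gets 1 back. */
		printf("%d\n", mode_forbidden(3, 0));
		return 0;
	}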

So that nobody else would waste any time on this, attached is a new
attempt. This time actually tested *after* the changes.

                  Linus

[-- Attachment #2: 0001-Improve-__fget_files_rcu-code-generation-and-thus-__.patch --]
[-- Type: text/x-patch, Size: 5014 bytes --]

From 45f70b5413a654d20ead410c533ec1d604bdb1e2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 26 Nov 2023 12:24:38 -0800
Subject: [PATCH] Improve __fget_files_rcu() code generation (and thus __fget_light())

Commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU") caused a
performance regression as reported by the kernel test robot.

The __fget_light() function is one of those critical ones for some
loads, and the code generation was unnecessarily impacted.  Let's just
write that function better.

Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Closes: https://lore.kernel.org/oe-lkp/202311201406.2022ca3f-oliver.sang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/file.c               | 48 +++++++++++++++++++++++++----------------
 include/linux/fdtable.h | 15 ++++++++-----
 2 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..608b4802c214 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -959,31 +959,42 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		struct file *file;
 		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
 		struct file __rcu **fdentry;
+		unsigned long nospec;
 
-		if (unlikely(fd >= fdt->max_fds))
+		/* Mask is a 0 for invalid fd's, ~0 for valid ones */
+		nospec = array_index_mask_nospec(fd, fdt->max_fds);
+
+		/* fdentry points to the 'fd' offset, or fdt->fd[0] */
+		fdentry = fdt->fd + (fd&nospec);
+
+		/* Do the load, then mask any invalid result */
+		file = rcu_dereference_raw(*fdentry);
+		file = (void *)(nospec & (unsigned long)file);
+
+		if (unlikely(!file))
 			return NULL;
 
-		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
+		/*
+		 * Ok, we have a file pointer that was valid at
+		 * some point, but it might have become stale since.
+		 *
+		 * We need to confirm it by incrementing the refcount
+		 * and then check the lookup again.
+		 *
+		 * atomic_long_inc_not_zero() gives us a full memory
+		 * barrier. We only really need an 'acquire' one to
+		 * protect the loads below, but we don't have that.
+		 */
+		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+			continue;
 
 		/*
-		 * Ok, we have a file pointer. However, because we do
-		 * this all locklessly under RCU, we may be racing with
-		 * that file being closed.
-		 *
 		 * Such a race can take two forms:
 		 *
 		 *  (a) the file ref already went down to zero and the
 		 *      file hasn't been reused yet or the file count
 		 *      isn't zero but the file has already been reused.
-		 */
-		file = __get_file_rcu(fdentry);
-		if (unlikely(!file))
-			return NULL;
-
-		if (unlikely(IS_ERR(file)))
-			continue;
-
-		/*
+		 *
 		 *  (b) the file table entry has changed under us.
 		 *       Note that we don't need to re-check the 'fdt->fd'
 		 *       pointer having changed, because it always goes
@@ -991,7 +1002,8 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * If so, we need to put our ref and try again.
 		 */
-		if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
+		if (unlikely(file != rcu_dereference_raw(*fdentry)) ||
+		    unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
 			fput(file);
 			continue;
 		}
@@ -1128,13 +1140,13 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 	 * atomic_read_acquire() pairs with atomic_dec_and_test() in
 	 * put_files_struct().
 	 */
-	if (atomic_read_acquire(&files->count) == 1) {
+	if (likely(atomic_read_acquire(&files->count) == 1)) {
 		file = files_lookup_fd_raw(files, fd);
 		if (!file || unlikely(file->f_mode & mask))
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget_files(files, fd, mask);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index bc4c3287a65e..80bd7789bab1 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,12 +83,17 @@ struct dentry;
 static inline struct file *files_lookup_fd_raw(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+	unsigned long mask = array_index_mask_nospec(fd, fdt->max_fds);
+	struct file *needs_masking;
 
-	if (fd < fdt->max_fds) {
-		fd = array_index_nospec(fd, fdt->max_fds);
-		return rcu_dereference_raw(fdt->fd[fd]);
-	}
-	return NULL;
+	/*
+	 * 'mask' is zero for an out-of-bounds fd, all ones for ok.
+	 * 'fd&mask' is 'fd' for ok, or 0 for out of bounds.
+	 *
+	 * Accessing fdt->fd[0] is ok, but needs masking of the result.
+	 */
+	needs_masking = rcu_dereference_raw(fdt->fd[fd&mask]);
+	return (struct file *)(mask & (unsigned long)needs_masking);
 }
 
 static inline struct file *files_lookup_fd_locked(struct files_struct *files, unsigned int fd)
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for 0ede61d858: will-it-scale.per_thread_ops -2.9% regression (rev2)
  2023-11-20  7:11 ` [Intel-gfx] " kernel test robot
                   ` (5 preceding siblings ...)
  (?)
@ 2023-11-26 23:26 ` Patchwork
  -1 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2023-11-26 23:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: intel-gfx

== Series Details ==

Series: 0ede61d858: will-it-scale.per_thread_ops -2.9% regression (rev2)
URL   : https://patchwork.freedesktop.org/series/126899/
State : failure

== Summary ==

Error: patch https://patchwork.freedesktop.org/api/1.0/series/126899/revisions/2/mbox/ not applied
Patch is empty.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To record the empty patch as an empty commit, run "git am --allow-empty".
To restore the original branch and stop patching, run "git am --abort".
Build failed, no error log produced



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
  2023-11-26 23:20     ` Linus Torvalds
  (?)
@ 2023-11-27  6:58       ` Oliver Sang
  -1 siblings, 0 replies; 30+ messages in thread
From: Oliver Sang @ 2023-11-27  6:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, oe-lkp, lkp, linux-kernel, Jann Horn,
	linux-doc, linuxppc-dev, intel-gfx, linux-fsdevel, gfs2, bpf,
	ying.huang, feng.tang, fengwei.yin, oliver.sang

hi, Linus,

On Sun, Nov 26, 2023 at 03:20:58PM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > IOW, I might have messed up some "trivial cleanup" when prepping for
> > sending it out...
> 
> Bah. Famous last words. One of the "trivial cleanups" made the code
> more "obvious" by renaming the nospec mask as just "mask".
> 
> And that trivial rename broke that patch *entirely*, because now that
> name shadowed the "fmode_t" mask argument.
> 
> Don't even ask how long it took me to go from "I *tested* this,
> dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
> 
> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

we applied the new patch on top of 0ede61d858 and confirmed the regression
is gone; it's even 3.4% better than 93faf426e3 now.

Tested-by: kernel test robot <oliver.sang@intel.com>

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
  93faf426e3 ("vfs: shave work on failed file open")
  0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")
  c712b4365b ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 c712b4365b5b4dbe1d1380edd37
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    228481 ±  4%      -4.6%     217900 ±  6%     -11.7%     201857 ±  5%  meminfo.DirectMap4k
     89056            -2.0%      87309            -1.6%      87606        proc-vmstat.nr_slab_unreclaimable
     16.28            -0.7%      16.16            -1.0%      16.12        turbostat.RAMWatt
      0.01 ±  9%  +58125.6%       4.17 ±175%  +23253.5%       1.67 ±222%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    781.67 ± 10%      +6.5%     832.50 ± 19%     -14.3%     670.17 ±  4%  perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     97958 ±  7%      -9.7%      88449 ±  4%      -0.6%      97399 ±  4%  sched_debug.cpu.avg_idle.stddev
      0.00 ± 12%     +24.2%       0.00 ± 17%      -5.2%       0.00 ±  7%  sched_debug.cpu.next_balance.stddev
   6391048            -2.9%    6208584            +3.4%    6605584        will-it-scale.16.threads
    399440            -2.9%     388036            +3.4%     412848        will-it-scale.per_thread_ops
   6391048            -2.9%    6208584            +3.4%    6605584        will-it-scale.workload
     19.99 ±  4%      -2.2       17.74            +1.2       21.18 ±  2%  perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      1.27 ±  5%      +0.8        2.11 ±  3%     +31.1       32.36 ±  2%  perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     32.69 ±  4%      +5.0       37.70           -32.7        0.00        perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.00           +27.9       27.85            +0.0        0.00        perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
     20.00 ±  4%      -2.3       17.75            +0.4       20.43 ±  2%  perf-profile.children.cycles-pp.fput
      0.24 ± 10%      -0.1        0.18 ±  2%      -0.1        0.18 ± 10%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.48 ±  5%      +0.5        1.98 ±  3%     +30.8       32.32 ±  2%  perf-profile.children.cycles-pp.__fdget
     31.85 ±  4%      +6.0       37.86           -31.8        0.00        perf-profile.children.cycles-pp.__fget_light
      0.00           +27.7       27.67            +0.0        0.00        perf-profile.children.cycles-pp.__get_file_rcu
     30.90 ±  4%     -20.6       10.35 ±  2%     -30.9        0.00        perf-profile.self.cycles-pp.__fget_light
     19.94 ±  4%      -2.4       17.53            -0.3       19.62 ±  2%  perf-profile.self.cycles-pp.fput
      9.81 ±  4%      -2.4        7.42 ±  2%      +1.7       11.51 ±  4%  perf-profile.self.cycles-pp.do_poll
      0.23 ± 11%      -0.1        0.17 ±  4%      -0.1        0.18 ± 11%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.44 ±  7%      +0.0        0.45 ±  5%      +0.1        0.52 ±  4%  perf-profile.self.cycles-pp.__poll
      0.85 ±  4%      +0.1        0.92 ±  3%     +30.3       31.17 ±  2%  perf-profile.self.cycles-pp.__fdget
      0.00           +26.5       26.48            +0.0        0.00        perf-profile.self.cycles-pp.__get_file_rcu
 2.146e+10 ±  2%      +8.5%  2.329e+10 ±  2%      -2.1%  2.101e+10        perf-stat.i.branch-instructions
      0.22 ± 14%      -0.0        0.19 ± 14%      -0.0        0.20 ±  3%  perf-stat.i.branch-miss-rate%
 2.424e+10 ±  2%      +4.1%  2.524e+10 ±  2%      -4.7%  2.311e+10        perf-stat.i.dTLB-loads
 1.404e+10 ±  2%      +8.7%  1.526e+10 ±  2%      -6.2%  1.316e+10        perf-stat.i.dTLB-stores
     70.87            -2.3       68.59            -1.0       69.90        perf-stat.i.iTLB-load-miss-rate%
   5267608            -5.5%    4979133 ±  2%      -0.4%    5244253        perf-stat.i.iTLB-load-misses
   2102507            +5.4%    2215725            +5.7%    2222286        perf-stat.i.iTLB-loads
     18791 ±  3%     +10.5%      20757 ±  2%      -1.8%      18446        perf-stat.i.instructions-per-iTLB-miss
    266.67 ±  2%      +6.8%     284.75 ±  2%      -4.1%     255.70        perf-stat.i.metric.M/sec
      0.01 ± 10%     -10.5%       0.01 ±  5%      -1.8%       0.01 ±  6%  perf-stat.overall.MPKI
      0.19            -0.0        0.17            +0.0        0.20        perf-stat.overall.branch-miss-rate%
      0.65            -3.1%       0.63            +6.1%       0.69        perf-stat.overall.cpi
      0.00 ±  4%      -0.0        0.00 ±  4%      +0.0        0.00 ±  4%  perf-stat.overall.dTLB-store-miss-rate%
     71.48            -2.3       69.21            -1.2       70.24        perf-stat.overall.iTLB-load-miss-rate%
     18757           +10.0%      20629            -3.2%      18161        perf-stat.overall.instructions-per-iTLB-miss
      1.54            +3.2%       1.59            -5.8%       1.45        perf-stat.overall.ipc
   4795147            +6.4%    5100406            -9.0%    4365017        perf-stat.overall.path-length
  2.14e+10 ±  2%      +8.5%  2.322e+10 ±  2%      -2.1%  2.094e+10        perf-stat.ps.branch-instructions
 2.417e+10 ±  2%      +4.1%  2.516e+10 ±  2%      -4.7%  2.303e+10        perf-stat.ps.dTLB-loads
   1.4e+10 ±  2%      +8.7%  1.522e+10 ±  2%      -6.3%  1.312e+10        perf-stat.ps.dTLB-stores
   5253923            -5.5%    4966218 ±  2%      -0.5%    5228207        perf-stat.ps.iTLB-load-misses
   2095770            +5.4%    2208605            +5.7%    2214962        perf-stat.ps.iTLB-loads
 3.065e+13            +3.3%  3.167e+13            -5.9%  2.883e+13        perf-stat.total.instructions

> 
>                   Linus


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
  2023-11-26 20:23   ` [Intel-gfx] " Linus Torvalds
  (?)
@ 2023-11-27 10:13     ` Christian Brauner
  -1 siblings, 0 replies; 30+ messages in thread
From: Christian Brauner @ 2023-11-27 10:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Jann Horn,
	linux-doc, linuxppc-dev, intel-gfx, linux-fsdevel, gfs2, bpf,
	ying.huang, feng.tang, fengwei.yin

> I took a look at the code generation, and honestly, I think we're
> better off just making __fget_files_rcu() have special logic for this
> all, and not use __get_file_rcu().

My initial massaging of the patch did that btw. Then I sat there
wondering whether it would matter if we just made it possible to reuse
that code and I went through a bunch of iterations. Oh well, it seems to
matter.

> Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that

Concept looks sane to me.

> __get_file_rcu() does, and I don't get it. Both things come from
> volatile accesses, I don't see the point of those games, but I also
> didn't care, since it's no longer in a critical code path.
> 
> Christian?

Puts his completely imagined "I understand RCU" head on.
SLAB_TYPESAFE_BY_RCU makes the RCU consume memory ordering, which the
compiler doesn't officially support (afaik), a bit wonky.

So the thinking was that we could have code patterns where you could
free the object and reallocate it while legitimately passing the pointer
recheck. In that case there is no memory ordering between the allocation
and the pointer recheck, because the last (re)allocation could have been
after the rcu_dereference().

To combat that, all future loads were made to have a dependency on the
first load using the hidevar trick.

I guess that might only be theoretically possible but not in practice?
But then I liked that we explicitly commented on it as a reminder.
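
(Condensed, and with the error handling trimmed, the pattern in
__get_file_rcu() looks roughly like this - a sketch given the
struct file __rcu **f slot, not the verbatim code:)

	struct file *file, *file_reloaded, *file_reloaded_cmp;

	file = rcu_dereference_raw(*f);
	if (!file)
		return NULL;

	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
		return ERR_PTR(-EAGAIN);

	file_reloaded = rcu_dereference_raw(*f);

	/*
	 * Launder the reloaded pointer so that the recheck (and
	 * everything after it) depends on this load and cannot be
	 * folded into an earlier one by the compiler.
	 */
	file_reloaded_cmp = file_reloaded;
	OPTIMIZER_HIDE_VAR(file_reloaded_cmp);

	/* Reused or replaced under us? Then the ref we took is stale. */
	if (file != file_reloaded_cmp) {
		fput(file);
		return ERR_PTR(-EAGAIN);
	}
	return file_reloaded;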

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
  2023-11-26 23:20     ` Linus Torvalds
  (?)
@ 2023-11-27 10:27       ` Christian Brauner
  -1 siblings, 0 replies; 30+ messages in thread
From: Christian Brauner @ 2023-11-27 10:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Jann Horn,
	linux-doc, linuxppc-dev, intel-gfx, linux-fsdevel, gfs2, bpf,
	ying.huang, feng.tang, fengwei.yin

> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

So I've picked up your patch (vfs.misc). It's clever alright, so thanks
for the comments in there; otherwise I would've stared at this for far
too long.

It's a little unpleasant because of the cast-orama going on before we
check the file pointer, but I don't see that it's in any way wrong. And
given how focused people are on __fget_* performance, I think it might
even be the right thing to do.

But the cleverness means we have the same logic, slightly differently,
in two places. Not too bad ofc, but not too nice either, especially
because that rcu lookup is pretty complicated already.

A few days ago I did just write a long explanatory off-list email to
someone who had questions about this and who is fairly experienced, so
we're not making it easy on people. But performance or simplicity: one
can't necessarily always have both.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
  2023-11-27 10:27       ` [Intel-gfx] " Christian Brauner
  (?)
@ 2023-11-27 17:10         ` Linus Torvalds
  -1 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2023-11-27 17:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Jann Horn,
	linux-doc, linuxppc-dev, intel-gfx, linux-fsdevel, gfs2, bpf,
	ying.huang, feng.tang, fengwei.yin

On Mon, 27 Nov 2023 at 02:27, Christian Brauner <brauner@kernel.org> wrote:
>
> So I've picked up your patch (vfs.misc). It's clever alright, so thanks
> for the comments in there; otherwise I would've stared at this for far
> too long.

Note that I should probably have commented on one other thing: that
whole "just load from fd[0] is always safe, because the fd[] array
always exists".

IOW, that whole "load and mask" thing only works when you know the
array exists at all.

Doing that "just mask the index" wouldn't be valid if "size = 0" is an
option and might mean that we don't have an array at all (ie if "->fd"
itself could be NULL.

But we never have a completely empty file descriptor array, and
fdp->fd is never NULL.  At a minimum 'max_fds' is NR_OPEN_DEFAULT.

(The whole 'tsk->files' could be NULL, but only for kernel threads or
when exiting, so fget_task() will check for *that*, but it's a
separate thing)

So that's why it's safe to *entirely* remove the whole

                if (unlikely(fd >= fdt->max_fds))

test, and do it *all* with just "mask the index, and mask the resulting load".

Because we can *always* do that load at "fdt->fd[0]", and we want to
check the result for NULL anyway, so the "mask at the end and check
for NULL" is both natural and generates very good code.

Anyway, not a big deal, but it might be worth noting before somebody
tries the same trick on some other array that *could* be zero-sized
and with a NULL base pointer, and where that 'array[0]' access isn't
necessarily guaranteed to be ok.
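
(For the curious, the same trick in a stand-alone toy form - the table
and the index below are made up for illustration, and like the kernel's
generic array_index_mask_nospec() this assumes arithmetic right shifts
on signed longs:)

	#include <stdio.h>

	/* ~0UL if idx < size, 0UL otherwise, computed without a branch */
	static unsigned long index_mask(unsigned long idx, unsigned long size)
	{
		return ~(long)(idx | (size - 1UL - idx)) >> (sizeof(long) * 8 - 1);
	}

	int main(void)
	{
		const char *table[4] = { "zero", "one", "two", "three" };
		unsigned long idx = 7;	/* out of bounds on purpose */
		unsigned long mask = index_mask(idx, 4);
		const char *p;

		/*
		 * Load slot (idx & mask) - slot 0 when out of bounds, which
		 * is fine because slot 0 always exists - then mask the
		 * loaded value so a bad index yields NULL.
		 */
		p = (const char *)(mask & (unsigned long)table[idx & mask]);
		printf("%s\n", p ? p : "(null)");
		return 0;
	}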

> It's a little unpleasant because of the cast-orama going on before we
> check the file pointer, but I don't see that it's in any way wrong.

In my cleanup phase - which was a bit messy - I did wonder if I should
have some helper for it, since it shows up in both __fget_files_rcu()
and in files_lookup_fd_raw().

So I *could* have tried to add something like a
"masked_rcu_dereference()" that took the base pointer, the index, and
the mask, and did that whole dance.
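
(To make the shape concrete, here is a sketch of what that hypothetical
helper could have looked like - the name and signature just follow the
description above, nothing like it exists in the tree:)

	/*
	 * Hypothetical: mask 'index' with a precomputed nospec 'mask',
	 * load the RCU-protected slot, then mask the loaded pointer so
	 * an out-of-bounds index yields NULL. Only valid when slot 0
	 * is always accessible, as it is for fdt->fd[].
	 */
	static inline struct file *masked_rcu_dereference(struct file __rcu **base,
							  unsigned int index,
							  unsigned long mask)
	{
		struct file *file = rcu_dereference_raw(base[index & mask]);

		return (struct file *)(mask & (unsigned long)file);
	}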

Or I could have had just a "mask_pointer()" function, which we
occasionally do in other places too (ie we hide data in low bits, and
then we mask them away when the pointer is used as a pointer).

But with only two users, it seemed to add more conceptual complexity
than it's worth, and I was not convinced that we'd want to expose that
pattern and have others use it.

So having a helper might clarify things, but it might also encourage
wrong users. I dunno.

I suspect the only real use for this ends up being this very special
"access the fdt->fd[] array using a file descriptor".

Anyway, that's why I largely just did it with comments, and commented
both places - and just kept the cast there in the open.

             Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
  2023-11-27 17:10         ` [Intel-gfx] " Linus Torvalds
@ 2023-11-28 15:52           ` Christian Brauner
  -1 siblings, 0 replies; 30+ messages in thread
From: Christian Brauner @ 2023-11-28 15:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Jann Horn,
	linux-doc, linuxppc-dev, intel-gfx, linux-fsdevel, gfs2, bpf,
	ying.huang, feng.tang, fengwei.yin

On Mon, Nov 27, 2023 at 09:10:54AM -0800, Linus Torvalds wrote:
> On Mon, 27 Nov 2023 at 02:27, Christian Brauner <brauner@kernel.org> wrote:
> >
> > So I've picked up your patch (vfs.misc). It's clever alright so thanks
> > for the comments in there otherwise I would've stared at this for far
> > too long.
> 
> Note that I should probably have commented on one other thing: that
> whole "just load from fd[0] is always safe, because the fd[] array
> always exists".

I added a comment to that effect in the code.
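
Roughly along these lines (paraphrased here, not the literal in-tree
wording):

/*
 * The fd may be outside [0, max_fds), but the masked load is still
 * safe: the mask clamps an out-of-range fd to 0, fdt->fd[] always
 * has at least NR_OPEN_DEFAULT entries so fd[0] always exists, and
 * the same mask then turns the bogus result back into NULL.
 */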

> 
> IOW, that whole "load and mask" thing only works when you know the
> array exists at all.
> 
> Doing that "just mask the index" wouldn't be valid if "size = 0" is an
> option and might mean that we don't have an array at all (ie if "->fd"
> itself could be NULL).
> 
> But we never have a completely empty file descriptor array, and
> fdt->fd is never NULL.  At a minimum 'max_fds' is NR_OPEN_DEFAULT.
> 
> (The whole 'tsk->files' could be NULL, but only for kernel threads or
> when exiting, so fget_task() will check for *that*, but it's a
> separate thing)

Yep.

> 
> So that's why it's safe to *entirely* remove the whole
> 
>                 if (unlikely(fd >= fdt->max_fds))
> 
> test, and do it *all* with just "mask the index, and mask the resulting load".

Yep.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-11-28 15:53 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-20  7:11 [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression kernel test robot
2023-11-20  7:11 ` kernel test robot
2023-11-20  7:11 ` [Intel-gfx] " kernel test robot
2023-11-20  7:41 ` Mateusz Guzik
2023-11-20  7:41   ` [Intel-gfx] " Mateusz Guzik
2023-11-20  7:41   ` Mateusz Guzik
2023-11-26 20:23 ` Linus Torvalds
2023-11-26 20:23   ` Linus Torvalds
2023-11-26 20:23   ` [Intel-gfx] " Linus Torvalds
2023-11-26 23:20   ` Linus Torvalds
2023-11-26 23:20     ` Linus Torvalds
2023-11-26 23:20     ` Linus Torvalds
2023-11-27  6:58     ` Oliver Sang
2023-11-27  6:58       ` Oliver Sang
2023-11-27  6:58       ` [Intel-gfx] " Oliver Sang
2023-11-27 10:27     ` Christian Brauner
2023-11-27 10:27       ` Christian Brauner
2023-11-27 10:27       ` [Intel-gfx] " Christian Brauner
2023-11-27 17:10       ` Linus Torvalds
2023-11-27 17:10         ` Linus Torvalds
2023-11-27 17:10         ` [Intel-gfx] " Linus Torvalds
2023-11-28 15:52         ` Christian Brauner
2023-11-28 15:52           ` Christian Brauner
2023-11-28 15:52           ` [Intel-gfx] " Christian Brauner
2023-11-27 10:13   ` Christian Brauner
2023-11-27 10:13     ` Christian Brauner
2023-11-27 10:13     ` [Intel-gfx] " Christian Brauner
2023-11-26 20:50 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2023-11-26 20:50 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-11-26 23:26 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for 0ede61d858: will-it-scale.per_thread_ops -2.9% regression (rev2) Patchwork
