* [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
@ 2022-05-12  3:04 Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
                   ` (15 more replies)
  0 siblings, 16 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Hello!
Please find attached a proof-of-concept version of call_rcu_lazy(). It gives
a lot of power savings when the CPUs are relatively idle. Huge thanks to
Rushikesh Kadam from Intel for investigating it with me.

Some numbers below:

The following are power savings we see on top of RCU_NOCB_CPU on an Intel
platform. The observation is that, due to a 'trickle down' effect of RCU
callbacks, the system is very lightly loaded yet constantly runs a few RCU
callbacks very often. This leads the power-management hardware to believe the
system is active, when it is in fact idle.

For example, when the ChromeOS screen is off and the user is not doing
anything on the system, we see big power savings.
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58
CorWatt = 0.04

After:
Pk%pc10 = 81.28
PkgWatt = 0.41
CorWatt = 0.03

Further, when the ChromeOS screen is ON but the system is idle or lightly
loaded, we can see that the display pipeline is constantly queuing RCU
callbacks due to the open/close of file descriptors associated with graphics
buffers. This is attributed to the file_free_rcu() path, which this patch
series also touches.

This patch series adds a simple, effective, and lockless implementation of
RCU callback batching. On memory pressure, a timeout, or a per-CPU queue
growing too big, we initiate a flush of one or more per-CPU lists.
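
To give a rough idea before the diffs, here is a heavily condensed sketch of
the core of patch 01; the real code in kernel/rcu/lazy.c additionally handles
debug-objects double-free checks and registers a shrinker for the
memory-pressure case. lazy_rcu_flush_cpu() simply does an llist_del_all() and
hands each batched callback to the regular call_rcu():

  // Condensed sketch of patch 01: per-CPU, lockless batching of RCU callbacks.
  struct rcu_lazy_pcp {
  	struct llist_head head;      // lockless list of deferred callbacks
  	struct delayed_work work;    // flushes the list after MAX_LAZY_JIFFIES
  	atomic_t count;              // number of callbacks currently batched
  };
  DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);

  void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
  {
  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
  	struct rcu_lazy_pcp *rlp;

  	preempt_disable();
  	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
  	preempt_enable();

  	// Queue on the per-CPU lockless list.
  	head->func = func;
  	llist_add(&head->llist_node, &rlp->head);

  	// Flush right away if the batch grew too big, else arm the timer.
  	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
  		lazy_rcu_flush_cpu(rlp);
  	else if (!delayed_work_pending(&rlp->work))
  		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
  }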

Similar results can be achieved by increasing jiffies_till_first_fqs; however,
that also has the effect of slowing down RCU. In particular, I saw a huge
slowdown of the function graph tracer when increasing it.

One drawback of this series is that if another frequent RCU callback that is
not lazy creeps up in the future, it will again hurt power. However, I believe
identifying and fixing those callbacks is a more reasonable approach than
slowing down RCU for the whole system.

NOTE: A debug patch is added at the end of the series; toggle
/proc/sys/kernel/rcu_lazy at runtime to turn the feature on or off globally.
It defaults to on. Further, please use the sysctls added in lazy.c for tuning
the parameters that affect flushing.

Disclaimer 1: Don't boot your personal system on it yet anticipating power
savings, as TREE07 still causes RCU stalls and I am looking more into that, but
I believe this series should be good for general testing.

Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
of review and agreements.

Joel Fernandes (Google) (14):
  rcu: Add a lock-less lazy RCU implementation
  workqueue: Add a lazy version of queue_rcu_work()
  block/blk-ioc: Move call_rcu() to call_rcu_lazy()
  cred: Move call_rcu() to call_rcu_lazy()
  fs: Move call_rcu() to call_rcu_lazy() in some paths
  kernel: Move various core kernel usages to call_rcu_lazy()
  security: Move call_rcu() to call_rcu_lazy()
  net/core: Move call_rcu() to call_rcu_lazy()
  lib: Move call_rcu() to call_rcu_lazy()
  kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  i915: Move call_rcu() to call_rcu_lazy()
  rcu/kfree: remove useless monitor_todo flag
  rcu/kfree: Fix kfree_rcu_shrink_count() return value
  DEBUG: Toggle rcu_lazy and tune at runtime

 block/blk-ioc.c                            |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
 fs/dcache.c                                |   4 +-
 fs/eventpoll.c                             |   2 +-
 fs/file_table.c                            |   3 +-
 fs/inode.c                                 |   2 +-
 include/linux/rcupdate.h                   |   6 +
 include/linux/sched/sysctl.h               |   4 +
 include/linux/workqueue.h                  |   1 +
 kernel/cred.c                              |   2 +-
 kernel/exit.c                              |   2 +-
 kernel/pid.c                               |   2 +-
 kernel/rcu/Kconfig                         |   8 ++
 kernel/rcu/Makefile                        |   1 +
 kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
 kernel/rcu/rcu.h                           |   5 +
 kernel/rcu/tree.c                          |  28 ++--
 kernel/sysctl.c                            |  23 ++++
 kernel/time/posix-timers.c                 |   2 +-
 kernel/workqueue.c                         |  25 ++++
 lib/radix-tree.c                           |   2 +-
 lib/xarray.c                               |   2 +-
 net/core/dst.c                             |   2 +-
 security/security.c                        |   2 +-
 security/selinux/avc.c                     |   4 +-
 25 files changed, 255 insertions(+), 34 deletions(-)
 create mode 100644 kernel/rcu/lazy.c

-- 
2.36.0.550.gb090851708-goog



* [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12 23:56   ` Paul E. McKenney
  2022-05-17  9:07   ` Uladzislau Rezki
  2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Implement timer-based RCU callback batching. The batch is flushed whenever a
certain amount of time has passed, or the batch on a particular CPU grows too
big. Memory pressure can also flush it.

Locking is avoided to reduce lock contention when queuing and dequeuing happen
on different CPUs of a per-CPU list, such as when the shrinker context runs on
a different CPU. Not having to use locks also keeps the per-CPU structure size
small.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcupdate.h |   6 ++
 kernel/rcu/Kconfig       |   8 +++
 kernel/rcu/Makefile      |   1 +
 kernel/rcu/lazy.c        | 145 +++++++++++++++++++++++++++++++++++++++
 kernel/rcu/rcu.h         |   5 ++
 kernel/rcu/tree.c        |   2 +
 6 files changed, 167 insertions(+)
 create mode 100644 kernel/rcu/lazy.c

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 88b42eb46406..d0a6c4f5172c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
+#else
+#define call_rcu_lazy(head, func) call_rcu(head, func)
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active __read_mostly;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bf8e341e75b4..c09715079829 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default y
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
index 0cfb009a99b9..8968b330d6e0 100644
--- a/kernel/rcu/Makefile
+++ b/kernel/rcu/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
 obj-$(CONFIG_TREE_RCU) += tree.o
 obj-$(CONFIG_TINY_RCU) += tiny.o
 obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
+obj-$(CONFIG_RCU_LAZY) += lazy.o
diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
new file mode 100644
index 000000000000..55e406cfc528
--- /dev/null
+++ b/kernel/rcu/lazy.c
@@ -0,0 +1,145 @@
+/*
+ * Lockless lazy-RCU implementation.
+ */
+#include <linux/rcupdate.h>
+#include <linux/shrinker.h>
+#include <linux/workqueue.h>
+#include "rcu.h"
+
+// How much to batch before flushing?
+#define MAX_LAZY_BATCH		2048
+
+// How much to wait before flushing?
+#define MAX_LAZY_JIFFIES	10000
+
+// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
+// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
+// later to ensure that rcu_head and lazy_rcu_head are of the same size.
+struct lazy_rcu_head {
+	struct llist_node llist_node;
+	void (*func)(struct callback_head *head);
+} __attribute__((aligned(sizeof(void *))));
+
+struct rcu_lazy_pcp {
+	struct llist_head head;
+	struct delayed_work work;
+	atomic_t count;
+};
+DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
+
+// Lockless flush of CPU, can be called concurrently.
+static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
+{
+	struct llist_node *node = llist_del_all(&rlp->head);
+	struct lazy_rcu_head *cursor, *temp;
+
+	if (!node)
+		return;
+
+	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
+		struct rcu_head *rh = (struct rcu_head *)cursor;
+		debug_rcu_head_unqueue(rh);
+		call_rcu(rh, rh->func);
+		atomic_dec(&rlp->count);
+	}
+}
+
+void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
+{
+	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
+	struct rcu_lazy_pcp *rlp;
+
+	preempt_disable();
+	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
+	preempt_enable();
+
+	if (debug_rcu_head_queue((void *)head)) {
+		// Probable double call_rcu(), just leak.
+		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
+				__func__, head);
+
+		// Mark as success and leave.
+		return;
+	}
+
+	// Queue to per-cpu llist
+	head->func = func;
+	llist_add(&head->llist_node, &rlp->head);
+
+	// Flush queue if too big
+	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
+		lazy_rcu_flush_cpu(rlp);
+	} else {
+		if (!delayed_work_pending(&rlp->work)) {
+			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
+		}
+	}
+}
+
+static unsigned long
+lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	unsigned long count = 0;
+	int cpu;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
+
+		count += atomic_read(&rlp->count);
+	}
+
+	return count;
+}
+
+static unsigned long
+lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu, freed = 0;
+
+	for_each_possible_cpu(cpu) {
+		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
+		unsigned long count;
+
+		count = atomic_read(&rlp->count);
+		lazy_rcu_flush_cpu(rlp);
+		sc->nr_to_scan -= count;
+		freed += count;
+		if (sc->nr_to_scan <= 0)
+			break;
+	}
+
+	return freed == 0 ? SHRINK_STOP : freed;
+}
+
+/*
+ * This function is invoked after MAX_LAZY_JIFFIES timeout.
+ */
+static void lazy_work(struct work_struct *work)
+{
+	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
+
+	lazy_rcu_flush_cpu(rlp);
+}
+
+static struct shrinker lazy_rcu_shrinker = {
+	.count_objects = lazy_rcu_shrink_count,
+	.scan_objects = lazy_rcu_shrink_scan,
+	.batch = 0,
+	.seeks = DEFAULT_SEEKS,
+};
+
+void __init rcu_lazy_init(void)
+{
+	int cpu;
+
+	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
+
+	for_each_possible_cpu(cpu) {
+		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
+		INIT_DELAYED_WORK(&rlp->work, lazy_work);
+	}
+
+	if (register_shrinker(&lazy_rcu_shrinker))
+		pr_err("Failed to register lazy_rcu shrinker!\n");
+}
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 24b5f2c2de87..a5f4b44f395f 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
 static inline void show_rcu_tasks_trace_gp_kthread(void) {}
 #endif
 
+#ifdef CONFIG_RCU_LAZY
+void rcu_lazy_init(void);
+#else
+static inline void rcu_lazy_init(void) {}
+#endif
 #endif /* __LINUX_RCU_H */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a4c25a6283b0..ebdf6f7c9023 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4775,6 +4775,8 @@ void __init rcu_init(void)
 		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
 	else
 		qovld_calc = qovld;
+
+	rcu_lazy_init();
 }
 
 #include "tree_stall.h"
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12 23:58   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This will be used in kfree_rcu() later to make it do call_rcu() lazily.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7fee9b6cfede..2678a6b5b3f3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -444,6 +444,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *dwork, unsigned long delay);
 extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
+extern bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork);
 
 extern void flush_workqueue(struct workqueue_struct *wq);
 extern void drain_workqueue(struct workqueue_struct *wq);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 33f1106b4f99..9444949cc148 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1796,6 +1796,31 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 }
 EXPORT_SYMBOL(queue_rcu_work);
 
+/**
+ * queue_rcu_work_lazy - queue work after a RCU grace period
+ * @wq: workqueue to use
+ * @rwork: work to queue
+ *
+ * Return: %false if @rwork was already pending, %true otherwise.  Note
+ * that a full RCU grace period is guaranteed only after a %true return.
+ * While @rwork is guaranteed to be executed after a %false return, the
+ * execution may happen before a full RCU grace period has passed.
+ */
+bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork)
+{
+	struct work_struct *work = &rwork->work;
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		rwork->wq = wq;
+		call_rcu_lazy(&rwork->rcu, rcu_work_rcufn);
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(queue_rcu_work_lazy);
+
+
 /**
  * worker_enter_idle - enter idle state
  * @worker: worker which is entering idle state
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:00   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 block/blk-ioc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 11f49f78db32..96d5de5df0b6 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -98,7 +98,7 @@ static void ioc_destroy_icq(struct io_cq *icq)
 	 */
 	icq->__rcu_icq_cache = et->icq_cache;
 	icq->flags |= ICQ_DESTROYED;
-	call_rcu(&icq->__rcu_head, icq_free_icq_rcu);
+	call_rcu_lazy(&icq->__rcu_head, icq_free_icq_rcu);
 }
 
 /*
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 04/14] cred: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:02   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/cred.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 933155c96922..f4d69d7f2763 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
 	if (cred->non_rcu)
 		put_cred_rcu(&cred->rcu);
 	else
-		call_rcu(&cred->rcu, put_cred_rcu);
+		call_rcu_lazy(&cred->rcu, put_cred_rcu);
 }
 EXPORT_SYMBOL(__put_cred);
 
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:07   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy() Joel Fernandes (Google)
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

When testing, we found that these paths were invoked often when the
system is not doing anything (screen is ON but otherwise idle).

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 fs/dcache.c     | 4 ++--
 fs/eventpoll.c  | 2 +-
 fs/file_table.c | 3 ++-
 fs/inode.c      | 2 +-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c84269c6e8bf..517e02cde103 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
 	if (unlikely(dname_external(dentry))) {
 		struct external_name *p = external_name(dentry);
 		if (likely(atomic_dec_and_test(&p->u.count))) {
-			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
+			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
 	}
@@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
 	if (dentry->d_flags & DCACHE_NORCU)
 		__d_free(&dentry->d_u.d_rcu);
 	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
+		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
 }
 
 /*
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e2daa940ebce..10a24cca2cff 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -728,7 +728,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
 	 * use of the rbn field.
 	 */
-	call_rcu(&epi->rcu, epi_rcu_free);
+	call_rcu_lazy(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
 
diff --git a/fs/file_table.c b/fs/file_table.c
index 7d2e692b66a9..415815d3ef80 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -56,7 +56,8 @@ static inline void file_free(struct file *f)
 	security_file_free(f);
 	if (!(f->f_mode & FMODE_NOACCOUNT))
 		percpu_counter_dec(&nr_files);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+
+	call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index 63324df6fa27..b288a5bef4c7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
 			return;
 	}
 	inode->free_inode = ops->free_inode;
-	call_rcu(&inode->i_rcu, i_callback);
+	call_rcu_lazy(&inode->i_rcu, i_callback);
 }
 
 /**
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 07/14] security: Move call_rcu() " Joel Fernandes (Google)
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/exit.c              | 2 +-
 kernel/pid.c               | 2 +-
 kernel/time/posix-timers.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab9..6d84ec8e1fd0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -177,7 +177,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 void put_task_struct_rcu_user(struct task_struct *task)
 {
 	if (refcount_dec_and_test(&task->rcu_users))
-		call_rcu(&task->rcu, delayed_put_task_struct);
+		call_rcu_lazy(&task->rcu, delayed_put_task_struct);
 }
 
 void release_task(struct task_struct *p)
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..5a5144519d70 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -153,7 +153,7 @@ void free_pid(struct pid *pid)
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
-	call_rcu(&pid->rcu, delayed_put_pid);
+	call_rcu_lazy(&pid->rcu, delayed_put_pid);
 }
 
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 1cd10b102c51..04e191dfa91e 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -485,7 +485,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	call_rcu(&tmr->rcu, k_itimer_rcu_free);
+	call_rcu_lazy(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static int common_timer_create(struct k_itimer *new_timer)
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 07/14] security: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 08/14] net/core: " Joel Fernandes (Google)
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 security/security.c    | 2 +-
 security/selinux/avc.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index 22261d79f333..42c165631e8f 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1039,7 +1039,7 @@ void security_inode_free(struct inode *inode)
 	 * The inode will be freed after the RCU grace period too.
 	 */
 	if (inode->i_security)
-		call_rcu((struct rcu_head *)inode->i_security,
+		call_rcu_lazy((struct rcu_head *)inode->i_security,
 				inode_free_by_rcu);
 }
 
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index abcd9740d10f..8c639c6df955 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -442,7 +442,7 @@ static void avc_node_free(struct rcu_head *rhead)
 static void avc_node_delete(struct selinux_avc *avc, struct avc_node *node)
 {
 	hlist_del_rcu(&node->list);
-	call_rcu(&node->rhead, avc_node_free);
+	call_rcu_lazy(&node->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
@@ -458,7 +458,7 @@ static void avc_node_replace(struct selinux_avc *avc,
 			     struct avc_node *new, struct avc_node *old)
 {
 	hlist_replace_rcu(&old->list, &new->list);
-	call_rcu(&old->rhead, avc_node_free);
+	call_rcu_lazy(&old->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 08/14] net/core: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 07/14] security: Move call_rcu() " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 09/14] lib: " Joel Fernandes (Google)
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 net/core/dst.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index d16c2c9bfebd..68c240a4a0d7 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -174,7 +174,7 @@ void dst_release(struct dst_entry *dst)
 			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
 					     __func__, dst, newrefcnt);
 		if (!newrefcnt)
-			call_rcu(&dst->rcu_head, dst_destroy_rcu);
+			call_rcu_lazy(&dst->rcu_head, dst_destroy_rcu);
 	}
 }
 EXPORT_SYMBOL(dst_release);
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 09/14] lib: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (7 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 08/14] net/core: " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Move radix-tree and xarray to call_rcu_lazy(). This is required to prevent
callbacks from triggering the RCU machinery too quickly and too often, which
increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 lib/radix-tree.c | 2 +-
 lib/xarray.c     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b3afafe46fff..1526dc9e1d93 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -305,7 +305,7 @@ void radix_tree_node_rcu_free(struct rcu_head *head)
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
diff --git a/lib/xarray.c b/lib/xarray.c
index 6f47f6375808..29d6abf6b1ff 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -255,7 +255,7 @@ static void xa_node_free(struct xa_node *node)
 {
 	XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
 	node->array = XA_RCU_FREE;
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (8 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 09/14] lib: " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:12   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ebdf6f7c9023..3baf29014f86 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3407,7 +3407,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
 			// be that the work is in the pending state when
 			// channels have been detached following by each
 			// other.
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			queue_rcu_work_lazy(system_wq, &krwp->rcu_work);
 		}
 	}
 
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (9 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/i915/gem/i915_gem_object.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index d87b508b59b1..b54d1be3ee68 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -343,7 +343,7 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 		__i915_gem_free_object(obj);
 
 		/* But keep the pointer alive for RCU-protected lookups */
-		call_rcu(&obj->rcu, __i915_gem_free_object_rcu);
+		call_rcu_lazy(&obj->rcu, __i915_gem_free_object_rcu);
 		cond_resched();
 	}
 }
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (10 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13 14:53   ` Uladzislau Rezki
  2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

monitor_todo is not needed, as the work struct already tracks whether work is
pending. Just use the delayed_work_pending() helper to know whether work is
pending.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3baf29014f86..3828ac3bf1c4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
  * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
  * @lock: Synchronize access to this structure
  * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
- * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
  * @initialized: The @rcu_work fields have been initialized
  * @count: Number of objects for which GP not started
  * @bkvcache:
@@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
 	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
 	raw_spinlock_t lock;
 	struct delayed_work monitor_work;
-	bool monitor_todo;
 	bool initialized;
 	int count;
 
@@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
 	// of the channels that is still busy we should rearm the
 	// work to repeat an attempt. Because previous batches are
 	// still in progress.
-	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
-		krcp->monitor_todo = false;
-	else
+	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
 		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
 
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
@@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 
 	// Set timer to drain after KFREE_DRAIN_JIFFIES.
 	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
-	    !krcp->monitor_todo) {
-		krcp->monitor_todo = true;
+	    !delayed_work_pending(&krcp->monitor_work))
 		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
-	}
 
 unlock_return:
 	krc_this_cpu_unlock(krcp, flags);
@@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
 		raw_spin_lock_irqsave(&krcp->lock, flags);
-		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
-				krcp->monitor_todo) {
-			raw_spin_unlock_irqrestore(&krcp->lock, flags);
-			continue;
+		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
+			if (delayed_work_pending(&krcp->monitor_work)) {
+				schedule_delayed_work_on(cpu, &krcp->monitor_work,
+						KFREE_DRAIN_JIFFIES);
+			}
 		}
-		krcp->monitor_todo = true;
-		schedule_delayed_work_on(cpu, &krcp->monitor_work,
-					 KFREE_DRAIN_JIFFIES);
 		raw_spin_unlock_irqrestore(&krcp->lock, flags);
 	}
 }
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (11 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13 14:54   ` Uladzislau Rezki
  2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

As per the comments in include/linux/shrinker.h, the .count_objects callback
should return the number of freeable items; if there are no objects to free,
SHRINK_EMPTY should be returned. 0 should be returned only when we are unable
to determine the number of objects, or when the cache should be skipped for
another reason.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3828ac3bf1c4..f191542cdf5e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3637,7 +3637,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 		atomic_set(&krcp->backoff_page_cache_fill, 1);
 	}
 
-	return count;
+	return count == 0 ? SHRINK_EMPTY : count;
 }
 
 static unsigned long
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (12 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:16   ` Paul E. McKenney
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
  2022-06-13 18:53 ` Joel Fernandes
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Add sysctl knobs, just for easier debugging/testing, to tune the maximum
batch size and the maximum time to wait before flushing, and to turn the
feature off entirely.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched/sysctl.h |  4 ++++
 kernel/rcu/lazy.c            | 12 ++++++++++--
 kernel/sysctl.c              | 23 +++++++++++++++++++++++
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c19dd5a2c05c..55ffc61beed1 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
 
 extern unsigned int sysctl_sched_child_runs_first;
 
+extern unsigned int sysctl_rcu_lazy;
+extern unsigned int sysctl_rcu_lazy_batch;
+extern unsigned int sysctl_rcu_lazy_jiffies;
+
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
 	SCHED_TUNABLESCALING_LOG,
diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
index 55e406cfc528..0af9fb67c92b 100644
--- a/kernel/rcu/lazy.c
+++ b/kernel/rcu/lazy.c
@@ -12,6 +12,10 @@
 // How much to wait before flushing?
 #define MAX_LAZY_JIFFIES	10000
 
+unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
+unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
+unsigned int sysctl_rcu_lazy = 1;
+
 // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
 // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
 // later to ensure that rcu_head and lazy_rcu_head are of the same size.
@@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
 	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
 	struct rcu_lazy_pcp *rlp;
 
+	if (!sysctl_rcu_lazy) {
+		return call_rcu(head_rcu, func);
+	}
+
 	preempt_disable();
 	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
 	preempt_enable();
@@ -67,11 +75,11 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
 	llist_add(&head->llist_node, &rlp->head);
 
 	// Flush queue if too big
-	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
+	if (atomic_inc_return(&rlp->count) >= sysctl_rcu_lazy_batch) {
 		lazy_rcu_flush_cpu(rlp);
 	} else {
 		if (!delayed_work_pending(&rlp->work)) {
-			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
+			schedule_delayed_work(&rlp->work, sysctl_rcu_lazy_jiffies);
 		}
 	}
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5ae443b2882e..2ba830ca71ec 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1659,6 +1659,29 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_RCU_LAZY
+	{
+		.procname	= "rcu_lazy",
+		.data		= &sysctl_rcu_lazy,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "rcu_lazy_batch",
+		.data		= &sysctl_rcu_lazy_batch,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "rcu_lazy_jiffies",
+		.data		= &sysctl_rcu_lazy_jiffies,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",
-- 
2.36.0.550.gb090851708-goog



* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (13 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
@ 2022-05-12  3:17 ` Joel Fernandes
  2022-05-12 13:09   ` Uladzislau Rezki
  2022-05-13  0:23   ` Paul E. McKenney
  2022-06-13 18:53 ` Joel Fernandes
  15 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-12  3:17 UTC (permalink / raw)
  To: rcu
  Cc: Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Paul E. McKenney,
	Steven Rostedt

On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> [...]
>
> Disclaimer 1: Don't boot your personal system on it yet anticipating power
> savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> I believe this series should be good for general testing.
>
> Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> of review and agreements.

I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
support for that definitely needs work.
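
The underlying issue is that rcu_barrier() only waits for callbacks that have
already been handed to call_rcu(); callbacks still parked on the per-CPU lazy
lists are invisible to it. As a rough sketch of the kind of support needed
(rcu_lazy_flush_all() below is a hypothetical helper, not part of this series;
it only reuses lazy_rcu_flush_cpu() from patch 01), the lazy lists would have
to be flushed before the barrier is allowed to proceed:

  // Hypothetical sketch: push every CPU's lazy callbacks into call_rcu() so
  // that a following rcu_barrier() actually waits for them.
  static void rcu_lazy_flush_all(void)
  {
  	int cpu;

  	for_each_possible_cpu(cpu)
  		lazy_rcu_flush_cpu(per_cpu_ptr(&rcu_lazy_pcp_ins, cpu));
  }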

thanks,

 - Joel


* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
@ 2022-05-12 13:09   ` Uladzislau Rezki
  2022-05-12 13:56     ` Uladzislau Rezki
  2022-05-13  0:23   ` Paul E. McKenney
  1 sibling, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 13:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

Hello, Joel!

Which kernel version have you used for this series?

--
Uladzislau Rezki

On Thu, May 12, 2022 at 5:18 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> [...]



-- 
Uladzislau Rezki


* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 13:09   ` Uladzislau Rezki
@ 2022-05-12 13:56     ` Uladzislau Rezki
  2022-05-12 14:03       ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 13:56 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

Never mind. I port it into 5.10

On Thu, May 12, 2022 at 3:09 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> Hello, Joel!
>
> Which kernel version have you used for this series?
>
> [...]



-- 
Uladzislau Rezki


* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 13:56     ` Uladzislau Rezki
@ 2022-05-12 14:03       ` Joel Fernandes
  2022-05-12 14:37         ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-12 14:03 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> Never mind. I port it into 5.10

Oh, this is on mainline. Sorry about that. If you want, I have a tree here for
5.10, although that does not have the kfree changes; everything else is
ditto.
https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4

thanks,

 - Joel




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 14:03       ` Joel Fernandes
@ 2022-05-12 14:37         ` Uladzislau Rezki
  2022-05-12 16:09           ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 14:37 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Paul E. McKenney, Steven Rostedt

> On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > Never mind. I port it into 5.10
> 
> Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> 5.10 , although that does not have the kfree changes, everything else is
> ditto.
> https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> 
No problem. The two kfree_rcu patches are not so important in this series,
so I have backported them into my 5.10 kernel, because the latest kernel
is not so easy to bring up and run on my device :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 14:37         ` Uladzislau Rezki
@ 2022-05-12 16:09           ` Joel Fernandes
  2022-05-12 16:32             ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-12 16:09 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > Never mind. I port it into 5.10
> >
> > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > 5.10 , although that does not have the kfree changes, everything else is
> > ditto.
> > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> >
> No problem. kfree_rcu two patches are not so important in this series.
> So i have backported them into my 5.10 kernel because the latest kernel
> is not so easy to up and run on my device :)

Actually, I was going to write here that apparently some tests are showing
kfree_rcu()->call_rcu_lazy() causing a possible regression. So it is
good to drop those for your initial testing!

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 16:09           ` Joel Fernandes
@ 2022-05-12 16:32             ` Uladzislau Rezki
       [not found]               ` <Yn5e7w8NWzThUARb@pc638.lan>
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 16:32 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Paul E. McKenney, Steven Rostedt

> On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > Never mind. I port it into 5.10
> > >
> > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > 5.10 , although that does not have the kfree changes, everything else is
> > > ditto.
> > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > >
> > No problem. kfree_rcu two patches are not so important in this series.
> > So i have backported them into my 5.10 kernel because the latest kernel
> > is not so easy to up and run on my device :)
> 
> Actually I was going to write here, apparently some tests are showing
> kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> good to drop those for your initial testing!
> 
Yep, I dropped both. The one that makes use of call_rcu_lazy() does not seem
so important for kfree_rcu(), because we do batch requests there anyway.
One thing that I would like to improve in kfree_rcu() is better utilization
of page slots.

I will share my results either tomorrow or on Monday. I hope that is fine.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
@ 2022-05-12 23:56   ` Paul E. McKenney
  2022-05-14 15:08     ` Joel Fernandes
  2022-06-01 14:24     ` Frederic Weisbecker
  2022-05-17  9:07   ` Uladzislau Rezki
  1 sibling, 2 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-12 23:56 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> Implement timer-based RCU callback batching. The batch is flushed
> whenever a certain amount of time has passed, or the batch on a
> particular CPU grows too big. Also memory pressure can flush it.
> 
> Locking is avoided to reduce lock contention when queuing and dequeuing
> happens on different CPUs of a per-cpu list, such as when shrinker
> context is running on different CPU. Also not having to use locks keeps
> the per-CPU structure size small.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

It is very good to see this!  Inevitable comments and questions below.

							Thanx, Paul

> ---
>  include/linux/rcupdate.h |   6 ++
>  kernel/rcu/Kconfig       |   8 +++
>  kernel/rcu/Makefile      |   1 +
>  kernel/rcu/lazy.c        | 145 +++++++++++++++++++++++++++++++++++++++
>  kernel/rcu/rcu.h         |   5 ++
>  kernel/rcu/tree.c        |   2 +
>  6 files changed, 167 insertions(+)
>  create mode 100644 kernel/rcu/lazy.c
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 88b42eb46406..d0a6c4f5172c 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> +#else
> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active __read_mostly;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index bf8e341e75b4..c09715079829 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default y

This "default y" is OK for experimentation, but for mainline we should
not be forcing this on unsuspecting people.  ;-)

> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +          pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 0cfb009a99b9..8968b330d6e0 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
>  obj-$(CONFIG_TREE_RCU) += tree.o
>  obj-$(CONFIG_TINY_RCU) += tiny.o
>  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> +obj-$(CONFIG_RCU_LAZY) += lazy.o
> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> new file mode 100644
> index 000000000000..55e406cfc528
> --- /dev/null
> +++ b/kernel/rcu/lazy.c
> @@ -0,0 +1,145 @@
> +/*
> + * Lockless lazy-RCU implementation.
> + */
> +#include <linux/rcupdate.h>
> +#include <linux/shrinker.h>
> +#include <linux/workqueue.h>
> +#include "rcu.h"
> +
> +// How much to batch before flushing?
> +#define MAX_LAZY_BATCH		2048
> +
> +// How much to wait before flushing?
> +#define MAX_LAZY_JIFFIES	10000

That is more than a minute on a HZ=100 system.  Are you sure that you
did not mean "(10 * HZ)" or some such?
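
To spell out the arithmetic: 10000 raw jiffies is 10 seconds only at
HZ=1000, but 40 seconds at HZ=250 and 100 seconds at HZ=100.  An
HZ-relative definition keeps the intended delay fixed, for example
(illustrative only, not a claim about the right value):

	/* Flush after 10 seconds regardless of the HZ configuration. */
	#define MAX_LAZY_JIFFIES	(10 * HZ)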

> +
> +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> +struct lazy_rcu_head {
> +	struct llist_node llist_node;
> +	void (*func)(struct callback_head *head);
> +} __attribute__((aligned(sizeof(void *))));

This needs a build-time check that rcu_head and lazy_rcu_head are of
the same size.  Maybe something like this in some appropriate context:

	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct lazy_rcu_head));

Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
you also mention this in the above comments.  ;-)

> +
> +struct rcu_lazy_pcp {
> +	struct llist_head head;
> +	struct delayed_work work;
> +	atomic_t count;
> +};
> +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> +
> +// Lockless flush of CPU, can be called concurrently.
> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> +{
> +	struct llist_node *node = llist_del_all(&rlp->head);
> +	struct lazy_rcu_head *cursor, *temp;
> +
> +	if (!node)
> +		return;

At this point, the list is empty but the count is non-zero.  Can
that cause a problem?  (For the existing callback lists, this would
be OK.)

> +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> +		struct rcu_head *rh = (struct rcu_head *)cursor;
> +		debug_rcu_head_unqueue(rh);

Good to see this check!

> +		call_rcu(rh, rh->func);
> +		atomic_dec(&rlp->count);
> +	}
> +}
> +
> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> +{
> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> +	struct rcu_lazy_pcp *rlp;
> +
> +	preempt_disable();
> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);

Whitespace issue, please fix.

> +	preempt_enable();
> +
> +	if (debug_rcu_head_queue((void *)head)) {
> +		// Probable double call_rcu(), just leak.
> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> +				__func__, head);
> +
> +		// Mark as success and leave.
> +		return;
> +	}
> +
> +	// Queue to per-cpu llist
> +	head->func = func;
> +	llist_add(&head->llist_node, &rlp->head);

Suppose that there are a bunch of preemptions between the preempt_enable()
above and this point, so that the current CPU's list has lots of
callbacks, but zero ->count.  Can that cause a problem?

In the past, this sort of thing has been an issue for rcu_barrier()
and friends.
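
One way to narrow that window, just as a sketch and not something the
patch currently does, would be to keep the enqueue and the counter
update in a single non-preemptible section:

	preempt_disable();
	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
	head->func = func;
	llist_add(&head->llist_node, &rlp->head);
	atomic_inc(&rlp->count);
	preempt_enable();

That way each callback is counted before the task can be preempted away,
so ->count cannot lag arbitrarily far behind the list contents.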

> +	// Flush queue if too big

You will also need to check for early boot use.

I -think- it suffices to simply skip the following "if" statement when
rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
kernels won't expire them until the softirq kthreads have been spawned.

Which is OK, as it just means that call_rcu_lazy() is a bit more
lazy than expected that early.

Except that call_rcu() can be invoked even before rcu_init() has been
invoked, which is therefore also before rcu_init_lazy() has been invoked.

I therefore suggest something like this at the very start of this function:

	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) {
		call_rcu(head_rcu, func);
		return;
	}

The goal is that people can replace call_rcu() with call_rcu_lazy()
without having to worry about invocation during early boot.

Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
though "rhp" is more consistent with the RCU pointer initials approach.

> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> +		lazy_rcu_flush_cpu(rlp);
> +	} else {
> +		if (!delayed_work_pending(&rlp->work)) {

This check is racy because the work might run to completion right at
this point.  Wouldn't it be better to rely on the internal check of
WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
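
In other words (sketch only), the enqueue path could simply be:

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);

relying on queue_delayed_work_on(), underneath schedule_delayed_work(),
to do its own atomic test_and_set_bit() of WORK_STRUCT_PENDING_BIT and
silently do nothing if the work is already queued.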

> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> +		}
> +	}
> +}
> +
> +static unsigned long
> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	/* Snapshot count of all CPUs */
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +
> +		count += atomic_read(&rlp->count);
> +	}
> +
> +	return count;
> +}
> +
> +static unsigned long
> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	int cpu, freed = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		unsigned long count;
> +
> +		count = atomic_read(&rlp->count);
> +		lazy_rcu_flush_cpu(rlp);
> +		sc->nr_to_scan -= count;
> +		freed += count;
> +		if (sc->nr_to_scan <= 0)
> +			break;
> +	}
> +
> +	return freed == 0 ? SHRINK_STOP : freed;

This is a bit surprising given the stated aim of SHRINK_STOP to indicate
potential deadlocks.  But this pattern is common, including on the
kvfree_rcu() path, so OK!  ;-)

Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
that as well.

> +}
> +
> +/*
> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> + */
> +static void lazy_work(struct work_struct *work)
> +{
> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> +
> +	lazy_rcu_flush_cpu(rlp);
> +}
> +
> +static struct shrinker lazy_rcu_shrinker = {
> +	.count_objects = lazy_rcu_shrink_count,
> +	.scan_objects = lazy_rcu_shrink_scan,
> +	.batch = 0,
> +	.seeks = DEFAULT_SEEKS,
> +};
> +
> +void __init rcu_lazy_init(void)
> +{
> +	int cpu;
> +
> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> +	}
> +
> +	if (register_shrinker(&lazy_rcu_shrinker))
> +		pr_err("Failed to register lazy_rcu shrinker!\n");
> +}
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 24b5f2c2de87..a5f4b44f395f 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
>  #endif
>  
> +#ifdef CONFIG_RCU_LAZY
> +void rcu_lazy_init(void);
> +#else
> +static inline void rcu_lazy_init(void) {}
> +#endif
>  #endif /* __LINUX_RCU_H */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a4c25a6283b0..ebdf6f7c9023 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
>  	else
>  		qovld_calc = qovld;
> +
> +	rcu_lazy_init();
>  }
>  
>  #include "tree_stall.h"
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work()
  2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
@ 2022-05-12 23:58   ` Paul E. McKenney
  2022-05-14 14:44     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-12 23:58 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:30AM +0000, Joel Fernandes (Google) wrote:
> This will be used in kfree_rcu() later to make it do call_rcu() lazily.

Given that kfree_rcu() does its own laziness in filling in the page
of pointers, do we really need to add additional laziness?

Or are you measuring a significant benefit from doing this?

							Thanx, Paul

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/workqueue.h |  1 +
>  kernel/workqueue.c        | 25 +++++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 7fee9b6cfede..2678a6b5b3f3 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -444,6 +444,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  			struct delayed_work *dwork, unsigned long delay);
>  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> +extern bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork);
>  
>  extern void flush_workqueue(struct workqueue_struct *wq);
>  extern void drain_workqueue(struct workqueue_struct *wq);
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 33f1106b4f99..9444949cc148 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1796,6 +1796,31 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
>  }
>  EXPORT_SYMBOL(queue_rcu_work);
>  
> +/**
> + * queue_rcu_work_lazy - queue work after a RCU grace period
> + * @wq: workqueue to use
> + * @rwork: work to queue
> + *
> + * Return: %false if @rwork was already pending, %true otherwise.  Note
> + * that a full RCU grace period is guaranteed only after a %true return.
> + * While @rwork is guaranteed to be executed after a %false return, the
> + * execution may happen before a full RCU grace period has passed.
> + */
> +bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork)
> +{
> +	struct work_struct *work = &rwork->work;
> +
> +	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> +		rwork->wq = wq;
> +		call_rcu_lazy(&rwork->rcu, rcu_work_rcufn);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL(queue_rcu_work_lazy);
> +
> +
>  /**
>   * worker_enter_idle - enter idle state
>   * @worker: worker which is entering idle state
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-13  0:00   ` Paul E. McKenney
  0 siblings, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:00 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:31AM +0000, Joel Fernandes (Google) wrote:
> This is required to prevent callbacks triggering RCU machinery too
> quickly and too often, which adds more power to the system.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Hard to argue with this one given that icq_free_icq_rcu() just does
a kmem_cache_free().  ;-)

							Thanx, Paul

> ---
>  block/blk-ioc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 11f49f78db32..96d5de5df0b6 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -98,7 +98,7 @@ static void ioc_destroy_icq(struct io_cq *icq)
>  	 */
>  	icq->__rcu_icq_cache = et->icq_cache;
>  	icq->flags |= ICQ_DESTROYED;
> -	call_rcu(&icq->__rcu_head, icq_free_icq_rcu);
> +	call_rcu_lazy(&icq->__rcu_head, icq_free_icq_rcu);
>  }
>  
>  /*
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 04/14] cred: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
@ 2022-05-13  0:02   ` Paul E. McKenney
  2022-05-14 14:41     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:02 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:32AM +0000, Joel Fernandes (Google) wrote:
> This is required to prevent callbacks triggering RCU machinery too
> quickly and too often, which adds more power to the system.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

The put_cred_rcu() does some debugging that would be less effective
with a 10-second delay.  Which is probably OK assuming that significant
testing happens on CONFIG_RCU_LAZY=n kernels.

							Thanx, Paul

> ---
>  kernel/cred.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 933155c96922..f4d69d7f2763 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
>  	if (cred->non_rcu)
>  		put_cred_rcu(&cred->rcu);
>  	else
> -		call_rcu(&cred->rcu, put_cred_rcu);
> +		call_rcu_lazy(&cred->rcu, put_cred_rcu);
>  }
>  EXPORT_SYMBOL(__put_cred);
>  
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
@ 2022-05-13  0:07   ` Paul E. McKenney
  2022-05-14 14:40     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:07 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:33AM +0000, Joel Fernandes (Google) wrote:
> This is required to prevent callbacks triggering RCU machinery too
> quickly and too often, which adds more power to the system.
> 
> When testing, we found that these paths were invoked often when the
> system is not doing anything (screen is ON but otherwise idle).
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Some of these callbacks do additional work.  I don't immediately see a
problem, but careful and thorough testing is required.

							Thanx, Paul

> ---
>  fs/dcache.c     | 4 ++--
>  fs/eventpoll.c  | 2 +-
>  fs/file_table.c | 3 ++-
>  fs/inode.c      | 2 +-
>  4 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index c84269c6e8bf..517e02cde103 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
>  	if (unlikely(dname_external(dentry))) {
>  		struct external_name *p = external_name(dentry);
>  		if (likely(atomic_dec_and_test(&p->u.count))) {
> -			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
> +			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
>  			return;
>  		}
>  	}
> @@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
>  	if (dentry->d_flags & DCACHE_NORCU)
>  		__d_free(&dentry->d_u.d_rcu);
>  	else
> -		call_rcu(&dentry->d_u.d_rcu, __d_free);
> +		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
>  }
>  
>  /*
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e2daa940ebce..10a24cca2cff 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -728,7 +728,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>  	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
>  	 * use of the rbn field.
>  	 */
> -	call_rcu(&epi->rcu, epi_rcu_free);
> +	call_rcu_lazy(&epi->rcu, epi_rcu_free);
>  
>  	percpu_counter_dec(&ep->user->epoll_watches);
>  
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 7d2e692b66a9..415815d3ef80 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -56,7 +56,8 @@ static inline void file_free(struct file *f)
>  	security_file_free(f);
>  	if (!(f->f_mode & FMODE_NOACCOUNT))
>  		percpu_counter_dec(&nr_files);
> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> +
> +	call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu);
>  }
>  
>  /*
> diff --git a/fs/inode.c b/fs/inode.c
> index 63324df6fa27..b288a5bef4c7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
>  			return;
>  	}
>  	inode->free_inode = ops->free_inode;
> -	call_rcu(&inode->i_rcu, i_callback);
> +	call_rcu_lazy(&inode->i_rcu, i_callback);
>  }
>  
>  /**
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
@ 2022-05-13  0:12   ` Paul E. McKenney
  2022-05-13 14:55     ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:12 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Again, given that kfree_rcu() is doing its own laziness, is this really
helping?  If so, would it instead make sense to adjust the kfree_rcu()
timeouts?
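
For example (purely illustrative, not a recommendation of a particular
value), stretching the existing drain interval in kernel/rcu/tree.c would
lengthen the batching window without stacking a second layer of laziness
on top:

	/* Hypothetical: drain kfree_rcu() batches only after 10 seconds. */
	#define KFREE_DRAIN_JIFFIES	(10 * HZ)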

							Thanx, Paul

> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index ebdf6f7c9023..3baf29014f86 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3407,7 +3407,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  			// be that the work is in the pending state when
>  			// channels have been detached following by each
>  			// other.
> -			queue_rcu_work(system_wq, &krwp->rcu_work);
> +			queue_rcu_work_lazy(system_wq, &krwp->rcu_work);
>  		}
>  	}
>  
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
@ 2022-05-13  0:16   ` Paul E. McKenney
  2022-05-14 14:38     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:16 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:42AM +0000, Joel Fernandes (Google) wrote:
> Add sysctl knobs just for easier debugging/testing, to tune the maximum
> batch size, maximum time to wait before flush, and turning off the
> feature entirely.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

This is good, and might also be needed longer term.

One thought below.

							Thanx, Paul

> ---
>  include/linux/sched/sysctl.h |  4 ++++
>  kernel/rcu/lazy.c            | 12 ++++++++++--
>  kernel/sysctl.c              | 23 +++++++++++++++++++++++
>  3 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index c19dd5a2c05c..55ffc61beed1 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
>  
>  extern unsigned int sysctl_sched_child_runs_first;
>  
> +extern unsigned int sysctl_rcu_lazy;
> +extern unsigned int sysctl_rcu_lazy_batch;
> +extern unsigned int sysctl_rcu_lazy_jiffies;
> +
>  enum sched_tunable_scaling {
>  	SCHED_TUNABLESCALING_NONE,
>  	SCHED_TUNABLESCALING_LOG,
> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> index 55e406cfc528..0af9fb67c92b 100644
> --- a/kernel/rcu/lazy.c
> +++ b/kernel/rcu/lazy.c
> @@ -12,6 +12,10 @@
>  // How much to wait before flushing?
>  #define MAX_LAZY_JIFFIES	10000
>  
> +unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
> +unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
> +unsigned int sysctl_rcu_lazy = 1;
> +
>  // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
>  // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
>  // later to ensure that rcu_head and lazy_rcu_head are of the same size.
> @@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
>  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
>  	struct rcu_lazy_pcp *rlp;
>  
> +	if (!sysctl_rcu_lazy) {

This is the place to check for early boot use.  Or, alternatively,
initialize sysctl_rcu_lazy to zero and set it to one once boot is far
enough along to allow all the pieces to work reasonably.
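
A sketch of the first option, reusing the fallback path already in the
patch:

	if (!sysctl_rcu_lazy ||
	    rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) {
		return call_rcu(head_rcu, func);
	}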

> +		return call_rcu(head_rcu, func);
> +	}
> +
>  	preempt_disable();
>          rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
>  	preempt_enable();
> @@ -67,11 +75,11 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
>  	llist_add(&head->llist_node, &rlp->head);
>  
>  	// Flush queue if too big
> -	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> +	if (atomic_inc_return(&rlp->count) >= sysctl_rcu_lazy_batch) {
>  		lazy_rcu_flush_cpu(rlp);
>  	} else {
>  		if (!delayed_work_pending(&rlp->work)) {
> -			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> +			schedule_delayed_work(&rlp->work, sysctl_rcu_lazy_jiffies);
>  		}
>  	}
>  }
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 5ae443b2882e..2ba830ca71ec 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1659,6 +1659,29 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +#ifdef CONFIG_RCU_LAZY
> +	{
> +		.procname	= "rcu_lazy",
> +		.data		= &sysctl_rcu_lazy,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "rcu_lazy_batch",
> +		.data		= &sysctl_rcu_lazy_batch,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "rcu_lazy_jiffies",
> +		.data		= &sysctl_rcu_lazy_jiffies,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +#endif
>  #ifdef CONFIG_SCHEDSTATS
>  	{
>  		.procname	= "sched_schedstats",
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
  2022-05-12 13:09   ` Uladzislau Rezki
@ 2022-05-13  0:23   ` Paul E. McKenney
  2022-05-13 14:45     ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:23 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > Hello!
> > Please find the proof of concept version of call_rcu_lazy() attached. This
> > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > Rushikesh Kadam from Intel for investigating it with me.
> >
> > Some numbers below:
> >
> > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > system is very lightly loaded but constantly running few RCU callbacks very
> > often. This confuses the power management hardware that the system is active,
> > when it is in fact idle.
> >
> > For example, when ChromeOS screen is off and user is not doing anything on the
> > system, we can see big power savings.
> > Before:
> > Pk%pc10 = 72.13
> > PkgWatt = 0.58
> > CorWatt = 0.04
> >
> > After:
> > Pk%pc10 = 81.28
> > PkgWatt = 0.41
> > CorWatt = 0.03
> >
> > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > can see that the display pipeline is constantly doing RCU callback queuing due
> > to open/close of file descriptors associated with graphics buffers. This is
> > attributed to the file_free_rcu() path which this patch series also touches.
> >
> > This patch series adds a simple but effective, and lockless implementation of
> > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > initiate a flush of one or more per-CPU lists.
> >
> > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > of function graph tracer when increasing that.
> >
> > One drawback of this series is, if another frequent RCU callback creeps up in
> > the future, that's not lazy, then that will again hurt the power. However, I
> > believe identifying and fixing those is a more reasonable approach than slowing
> > RCU down for the whole system.
> >
> > NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> > at runtime to turn it on or off globally. It is default to on. Further, please
> > use the sysctls in lazy.c for further tuning of parameters that effect the
> > flushing.
> >
> > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > I believe this series should be good for general testing.

Sometimes OOM conditions result in stalls.

> > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> > of review and agreements.

We will of course need them to look at the call_rcu_lazy() conversions
at some point, but in the meantime, experimentation is fine.  I looked
at a few, but quickly decided to defer to the people with a better
understanding of the code.

> I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> support for that definitely needs work.

Good to know.  ;-)

With this in place, can the system survive a userspace close(open())
loop, or does that result in OOM?  (I am not worried about battery
lifetime while close(open()) is running, just OOM resistance.)
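
For concreteness, the kind of loop meant here is nothing more elaborate
than, say, the following userspace program, which hammers the
file_free_rcu() path as fast as possible (the file name is arbitrary):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;)
			close(open("/dev/null", O_RDONLY));
	}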

Does waiting for the shrinker to kick in suffice, or should the
system pressure be taken into account?  As in the "total" numbers
from /proc/pressure/memory.

Again, it is very good to see this series!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-13  0:23   ` Paul E. McKenney
@ 2022-05-13 14:45     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-13 14:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Thu, May 12, 2022 at 8:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> > On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > Hello!
> > > Please find the proof of concept version of call_rcu_lazy() attached. This
> > > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > > Rushikesh Kadam from Intel for investigating it with me.
> > >
> > > Some numbers below:
> > >
> > > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > > system is very lightly loaded but constantly running few RCU callbacks very
> > > often. This confuses the power management hardware that the system is active,
> > > when it is in fact idle.
> > >
> > > For example, when ChromeOS screen is off and user is not doing anything on the
> > > system, we can see big power savings.
> > > Before:
> > > Pk%pc10 = 72.13
> > > PkgWatt = 0.58
> > > CorWatt = 0.04
> > >
> > > After:
> > > Pk%pc10 = 81.28
> > > PkgWatt = 0.41
> > > CorWatt = 0.03
> > >
> > > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > > can see that the display pipeline is constantly doing RCU callback queuing due
> > > to open/close of file descriptors associated with graphics buffers. This is
> > > attributed to the file_free_rcu() path which this patch series also touches.
> > >
> > > This patch series adds a simple but effective, and lockless implementation of
> > > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > > initiate a flush of one or more per-CPU lists.
> > >
> > > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > > of function graph tracer when increasing that.
> > >
> > > One drawback of this series is, if another frequent RCU callback creeps up in
> > > the future, that's not lazy, then that will again hurt the power. However, I
> > > believe identifying and fixing those is a more reasonable approach than slowing
> > > RCU down for the whole system.
> > >
> > > NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> > > at runtime to turn it on or off globally. It is default to on. Further, please
> > > use the sysctls in lazy.c for further tuning of parameters that effect the
> > > flushing.
> > >
> > > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > > I believe this series should be good for general testing.
>
> Sometimes OOM conditions result in stalls.

I see.

> > I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> > support for that definitely needs work.
>
> Good to know.  ;-)
>
> With this in place, can the system survive a userspace close(open())
> loop, or does that result in OOM?  (I am not worried about battery
> lifetime while close(open()) is running, just OOM resistance.)

Yes, in my testing it survived. I even dropped memory to 512MB and did
the open/close loop test. I believe it survives also because we don't
let the list grow too big (other than shrinker flushing).

>
> Does waiting for the shrinker to kick in suffice, or should the
> system pressure be taken into account?  As in the "total" numbers
> from /proc/pressure/memory.

I did not find that taking system memory pressure into account is necessary.

> Again, it is very good to see this series!

Thanks, I appreciate that. I am excited about battery life savings in
millions of battery-powered devices ;-) Even on my grandmom's Android
phone ;-)

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
       [not found]               ` <Yn5e7w8NWzThUARb@pc638.lan>
@ 2022-05-13 14:51                 ` Joel Fernandes
  2022-05-13 15:43                   ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-13 14:51 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Steven Rostedt

On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > Never mind. I port it into 5.10
> > > > >
> > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > ditto.
> > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > >
> > > > No problem. kfree_rcu two patches are not so important in this series.
> > > > So i have backported them into my 5.10 kernel because the latest kernel
> > > > is not so easy to up and run on my device :)
> > >
> > > Actually I was going to write here, apparently some tests are showing
> > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > good to drop those for your initial testing!
> > >
> > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not
> > so important for kfree_rcu() because we do batch requests there anyway.
> > One thing that i would like to improve in kfree_rcu() is a better utilization
> > of page slots.
> >
> > I will share my results either tomorrow or on Monday. I hope that is fine.
> >
>
> Here we go with some data on our Android handset that runs 5.10 kernel. The test
> case i have checked was a "static image" use case. Condition is: screen ON with
> disabled all connectivity.
>
> 1.
> First data i took is how many wakeups cause an RCU subsystem during this test case
> when everything is pretty idling. Duration is 360 seconds:
>
> <snip>
> serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu

Nice! Do you mind sharing this script? I was just telling Rushikesh
that we want something like this during testing. Appreciate it. Also,
if we could dump timer wakeup reasons/callbacks, that would also be awesome.

FWIW, I wrote a BPF tool that periodically dumps callbacks and can
share that with you on request as well. It is probably not in
shape for mainline though (Makefile missing and such).

> name:                         rcub/0 pid:         16 woken-up     2     interval: min 86772734  max 86772734    avg 43386367
> name:                        rcuop/7 pid:         69 woken-up     4     interval: min  4189     max  8050       avg  5049
> name:                        rcuop/6 pid:         62 woken-up    55     interval: min  6910     max 42592159    avg 3818752
[..]

> There is a big improvement in lazy case. Number of wake-ups got reduced quite a lot
> and it is really good!

Cool!

> 2.
> Please find in attachment two power plots. The same test case. One is related to a
> regular use of call_rcu() and second one is "lazy" usage. There is light a difference
> in power, it is ~2mA. Event though it is rather small but it is detectable and solid
> what is also important, thus it proofs the concept. Please note it might be more power
> efficient for other arches and platforms. Because of different HW design that is related
> to C-states of CPU and energy that is needed to in/out of those deep power states.

Nice! I wonder if you still have other frequent callbacks on your
system that are getting queued during the tests. Could you dump the
rcu_callback trace event and see if you have any CBs frequently
called that the series did not address?

Also, one more thing I was curious about is - do you see savings when
you pin the rcu threads to the LITTLE CPUs of the system? The theory
being, not disturbing the BIG CPUs which are more power hungry may let
them go into a deeper idle state and save power (due to leakage
current and so forth).

> So a front-lazy-batching is something worth to have, IMHO :)

Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
it, we can also add your data about this investigation to the LPC
slides (with attribution to you).

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
@ 2022-05-13 14:53   ` Uladzislau Rezki
  2022-05-14 14:35     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 14:53 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt

> monitor_todo is not needed as the work struct already tracks if work is
> pending. Just use that to know if work is pending using
> delayed_work_pending() helper.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree.c | 22 +++++++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3baf29014f86..3828ac3bf1c4 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
>   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
>   * @lock: Synchronize access to this structure
>   * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
> - * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
>   * @initialized: The @rcu_work fields have been initialized
>   * @count: Number of objects for which GP not started
>   * @bkvcache:
> @@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
>  	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
>  	raw_spinlock_t lock;
>  	struct delayed_work monitor_work;
> -	bool monitor_todo;
>  	bool initialized;
>  	int count;
>  
> @@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  	// of the channels that is still busy we should rearm the
>  	// work to repeat an attempt. Because previous batches are
>  	// still in progress.
> -	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
> -		krcp->monitor_todo = false;
> -	else
> +	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
>  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
>  
>  	raw_spin_unlock_irqrestore(&krcp->lock, flags);
> @@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  
>  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
>  	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> -	    !krcp->monitor_todo) {
> -		krcp->monitor_todo = true;
> +	    !delayed_work_pending(&krcp->monitor_work))
>  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> -	}
>  
>  unlock_return:
>  	krc_this_cpu_unlock(krcp, flags);
> @@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
>  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
>  
>  		raw_spin_lock_irqsave(&krcp->lock, flags);
> -		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
> -				krcp->monitor_todo) {
> -			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> -			continue;
> +		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
> +			if (delayed_work_pending(&krcp->monitor_work)) {
> +				schedule_delayed_work_on(cpu, &krcp->monitor_work,
> +						KFREE_DRAIN_JIFFIES);
> +			}
>  		}
> -		krcp->monitor_todo = true;
> -		schedule_delayed_work_on(cpu, &krcp->monitor_work,
> -					 KFREE_DRAIN_JIFFIES);
>  		raw_spin_unlock_irqrestore(&krcp->lock, flags);
>  	}
>  }
> -- 
>
Looks good to me at first glance, but let me have a look
at it more closely.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value
  2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
@ 2022-05-13 14:54   ` Uladzislau Rezki
  2022-05-14 14:34     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 14:54 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt

On Thu, May 12, 2022 at 03:04:41AM +0000, Joel Fernandes (Google) wrote:
> As per the comments in include/linux/shrinker.h, .count_objects callback
> should return the number of freeable items, but if there are no objects
> to free, SHRINK_EMPTY should be returned. The only time 0 is returned
> should be when we are unable to determine the number of objects, or the
> cache should be skipped for another reason.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3828ac3bf1c4..f191542cdf5e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3637,7 +3637,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  		atomic_set(&krcp->backoff_page_cache_fill, 1);
>  	}
>  
> -	return count;
> +	return count == 0 ? SHRINK_EMPTY : count;
>  }
>  
>  static unsigned long
> -- 
> 2.36.0.550.gb090851708-goog
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-13  0:12   ` Paul E. McKenney
@ 2022-05-13 14:55     ` Uladzislau Rezki
  2022-05-14 14:33       ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 14:55 UTC (permalink / raw)
  To: Paul E. McKenney, Joel Fernandes (Google)
  Cc: Joel Fernandes (Google),
	rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

> On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Again, given that kfree_rcu() is doing its own laziness, is this really
> helping?  If so, would it instead make sense to adjust the kfree_rcu()
> timeouts?
> 
IMHO, this patch does not help much. As Paul has mentioned, we use
batching anyway.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-13 14:51                 ` Joel Fernandes
@ 2022-05-13 15:43                   ` Uladzislau Rezki
  2022-05-14 14:25                     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 15:43 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Paul E. McKenney, rcu, Rushikesh S Kadam,
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

[-- Attachment #1: Type: text/plain, Size: 6754 bytes --]

> On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > Never mind. I port it into 5.10
> > > > > >
> > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > > ditto.
> > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > >
> > > > > No problem. kfree_rcu two patches are not so important in this series.
> > > > > So i have backported them into my 5.10 kernel because the latest kernel
> > > > > is not so easy to up and run on my device :)
> > > >
> > > > Actually I was going to write here, apparently some tests are showing
> > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > > good to drop those for your initial testing!
> > > >
> > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not
> > > so important for kfree_rcu() because we do batch requests there anyway.
> > > One thing that i would like to improve in kfree_rcu() is a better utilization
> > > of page slots.
> > >
> > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > >
> >
> > Here we go with some data on our Android handset that runs 5.10 kernel. The test
> > case i have checked was a "static image" use case. Condition is: screen ON with
> > disabled all connectivity.
> >
> > 1.
> > First data i took is how many wakeups cause an RCU subsystem during this test case
> > when everything is pretty idling. Duration is 360 seconds:
> >
> > <snip>
> > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> 
> Nice! Do you mind sharing this script? I was just talking to Rushikesh
> that we want something like this during testing. Appreciate it. Also,
> if we dump timer wakeup reasons/callbacks that would also be awesome.
> 
Please find it in the attachment. I wrote it once upon a time and use it
to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
so just compile it.

How to use it:
1. run perf: './perf sched record -a -- sleep <seconds of data to collect>'
2. ./perf script -i ./perf.data > foo.script
3. ./perf_script_parser ./foo.script

>
> FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> share that with you on request as well. That is probably not in a
> shape for mainline though (Makefile missing and such).
> 
Yep, please share!

> > name:                         rcub/0 pid:         16 woken-up     2     interval: min 86772734  max 86772734    avg 43386367
> > name:                        rcuop/7 pid:         69 woken-up     4     interval: min  4189     max  8050       avg  5049
> > name:                        rcuop/6 pid:         62 woken-up    55     interval: min  6910     max 42592159    avg 3818752
> [..]
> 
> > There is a big improvement in lazy case. Number of wake-ups got reduced quite a lot
> > and it is really good!
> 
> Cool!
> 
> > 2.
> > Please find in attachment two power plots. The same test case. One is related to a
> > regular use of call_rcu() and second one is "lazy" usage. There is light a difference
> > in power, it is ~2mA. Event though it is rather small but it is detectable and solid
> > what is also important, thus it proofs the concept. Please note it might be more power
> > efficient for other arches and platforms. Because of different HW design that is related
> > to C-states of CPU and energy that is needed to in/out of those deep power states.
> 
> Nice! I wonder if you still have other frequent callbacks on your
> system that are getting queued during the tests. Could you dump the
> rcu_callbacks trace event and see if you have any CBs frequently
> called that the series did not address?
>
It looks pretty much like this:
<snip>
    rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
    rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
    rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
    rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
    rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
    rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
<snip>

so almost everything is batched.

>
> Also, one more thing I was curious about is - do you see savings when
> you pin the rcu threads to the LITTLE CPUs of the system? The theory
> being, not disturbing the BIG CPUs which are more power hungry may let
> them go into a deeper idle state and save power (due to leakage
> current and so forth).
> 
I did some experimenting with pinning the nocb threads to the little cluster.
For idle use cases I did not see any power gain. For a heavy one I see that
the "big" CPUs are also invoking callbacks and are busy with them quite often.
Probably I should think of some use case where I can detect the power
difference. If you have something, please let me know.

> > So front-lazy-batching is something worth having, IMHO :)
> 
> Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> it, we can also add your investigation data to the LPC slides
> (with attribution to you).
> 
No problem; since we will give a talk at LPC, the more data we have, the
more convincing we are :)

--
Uladzislau Rezki

[-- Attachment #2: perf_script_parser.c --]
[-- Type: text/x-csrc, Size: 19453 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <ctype.h>
#include <string.h>

/*
 * Splay-tree implementation to store data: key,value
 * See https://en.wikipedia.org/wiki/Splay_tree
 */
#define offsetof(TYPE, MEMBER)	((unsigned long)&((TYPE *)0)->MEMBER)
#define container_of(ptr, type, member)			\
({												\
	void *__mptr = (void *)(ptr);					\
	((type *)(__mptr - offsetof(type, member)));	\
})

#define SP_INIT_NODE(node)									\
	((node)->left = (node)->right = (node)->parent = NULL)

struct splay_node {
	struct splay_node *left;
	struct splay_node *right;
	struct splay_node *parent;
	unsigned long val;
};

struct splay_root {
	struct splay_node *sp_root;
};

static inline void
set_parent(struct splay_node *n, struct splay_node *p)
{
	if (n)
		n->parent = p;
}

static inline void
change_child(struct splay_node *p,
	struct splay_node *old, struct splay_node *new)
{
	if (p) {
		if (p->left == old)
			p->left = new;
		else
			p->right = new;
	}
}

/*
 * left rotation of node (r), (rc) is (r)'s right child
 */
static inline struct splay_node *
left_pivot(struct splay_node *r)
{
	struct splay_node *rc;

	/*
	 * set (rc) to be the new root
	 */
	rc = r->right;

	/*
	 * point parent to new left/right child
	 */
	rc->parent = r->parent;

	/*
	 * change child of the p->parent.
	 */
	change_child(r->parent, r, rc);

	/*
	 * set (r)'s right child to be (rc)'s left child
	 */
	r->right = rc->left;

	/*
	 * change parent of rc's left child
	 */
	set_parent(rc->left, r);

	/*
	 * set new parent of rotated node
	 */
	r->parent = rc;

	/*
	 * set (rc)'s left child to be (r)
	 */
	rc->left = r;

	/*
	 * return the new root
	 */
	return rc;
}

/*
 * right rotation of node (r), (lc) is (r)'s left child
 */
static inline struct splay_node *
right_pivot(struct splay_node *r)
{
	struct splay_node *lc;

	/*
	 * set (lc) to be the new root
	 */
	lc = r->left;

	/*
	 * point parent to new left/right child
	 */
	lc->parent = r->parent;

	/*
	 * change child of the p->parent.
	 */
	change_child(r->parent, r, lc);

	/*
	 * set (r)'s left child to be (lc)'s right child
	 */
	r->left = lc->right;

	/*
	 * change parent of lc's right child
	 */
	set_parent(lc->right, r);

	/*
	 * set new parent of rotated node
	 */
	r->parent = lc;

	/*
	 * set (lc)'s right child to be (r)
	 */
	lc->right = r;

	/*
	 * return the new root
	 */
	return lc;
}

static struct splay_node *
top_down_splay(unsigned long vstart,
	struct splay_node *root, struct splay_root *sp_root)
{
	/*
	 * During the splitting process two temporary trees are formed.
	 * "l" contains all keys less than the search key/vstart and "r"
	 * contains all keys greater than the search key/vstart.
	 */
	struct splay_node head, *ltree_max, *rtree_max;
	struct splay_node *ltree_prev, *rtree_prev;

	if (root == NULL)
		return NULL;

	SP_INIT_NODE(&head);
	ltree_max = rtree_max = &head;
	ltree_prev = rtree_prev = NULL;

	while (1) {
		if (vstart < root->val && root->left) {
			if (vstart < root->left->val) {
				root = right_pivot(root);

				if (root->left == NULL)
					break;
			}

			/*
			 * Build right subtree.
			 */
			rtree_max->left = root;
			rtree_max->left->parent = rtree_prev;
			rtree_max = rtree_max->left;
			rtree_prev = root;
			root = root->left;
		} else if (vstart > root->val && root->right) {
			if (vstart > root->right->val) {
				root = left_pivot(root);

				if (root->right == NULL)
					break;
			}

			/*
			 * Build left subtree.
			 */
			ltree_max->right = root;
			ltree_max->right->parent = ltree_prev;
			ltree_max = ltree_max->right;
			ltree_prev = root;
			root = root->right;
		} else {
			break;
		}
	}

	/*
	 * Assemble the tree.
	 */
	ltree_max->right = root->left;
	rtree_max->left = root->right;
	root->left = head.right;
	root->right = head.left;

	set_parent(ltree_max->right, ltree_max);
	set_parent(rtree_max->left, rtree_max);
	set_parent(root->left, root);
	set_parent(root->right, root);
	root->parent = NULL;

	/*
	 * Set new root. Please note it might be the same.
	 */
	sp_root->sp_root = root;
	return sp_root->sp_root;
}

struct splay_node *
splay_search(unsigned long key, struct splay_root *root)
{
	struct splay_node *n;

	n = top_down_splay(key, root->sp_root, root);
	if (n && n->val == key)
		return n;

	return NULL;
}

static bool
splay_insert(struct splay_node *n, struct splay_root *sp_root)
{
	struct splay_node *r;

	SP_INIT_NODE(n);

	r = top_down_splay(n->val, sp_root->sp_root, sp_root);
	if (r == NULL) {
		/* First element in the tree */
		sp_root->sp_root = n;
		return false;
	}

	if (n->val < r->val) {
		n->left = r->left;
		n->right = r;

		set_parent(r->left, n);
		r->parent = n;
		r->left = NULL;
	} else if (n->val > r->val) {
		n->right = r->right;
		n->left = r;

		set_parent(r->right, n);
		r->parent = n;
		r->right = NULL;
	} else {
		/*
		 * Same, indicate as not success insertion.
		 */
		return false;
	}

	sp_root->sp_root = n;
	return true;
}

static bool
splay_delete_init(struct splay_node *n, struct splay_root *sp_root)
{
	struct splay_node *subtree[2];
	unsigned long val = n->val;

	/* 1. Splay the node to the root. */
	n = top_down_splay(n->val, sp_root->sp_root, sp_root);
	if (n == NULL || n->val != val)
		return false;

	/* 2. Save left/right sub-trees. */
	subtree[0] = n->left;
	subtree[1] = n->right;

	/* 3. Now remove the node. */
	SP_INIT_NODE(n);

	if (subtree[0]) {
		/* 4. Splay the largest node in left sub-tree to the root. */
		top_down_splay(val, subtree[0], sp_root);

		/* 5. Attach the right sub-tree as the right child of the left sub-tree. */
		sp_root->sp_root->right = subtree[1];

		/* 6. Update the parent of right sub-tree */
		set_parent(subtree[1], sp_root->sp_root);
	} else {
		/* 7. Left sub-tree is NULL, just point to right one. */
		sp_root->sp_root = subtree[1];
	}

	/* 8. Set parent of root node to NULL. */
	if (sp_root->sp_root)
		sp_root->sp_root->parent = NULL;

	return true;
}

static FILE *
open_perf_script_file(const char *path)
{
	FILE *f = NULL;

	if (path == NULL)
		goto out;

	f = fopen(path, "r");
	if (!f)
		goto out;

out:
	return f;
}

static int
get_one_line(FILE *file, char *buf, size_t len)
{
	int i = 0;

	memset(buf, '\0', len);

	for (i = 0; i < len - 1; i++) {
		int c = fgetc(file);

		if (c == EOF)
			return EOF;

		if (c == '\n')
			break;

		if (c != '\r')
			buf[i] = c;
	}

	return i;
}

static int
read_string_till_string(char *buf, char *out, size_t out_len, char *in, size_t in_len)
{
	int i, j;

	memset(out, '\0', out_len);

	for (i = 0; i < out_len; i++) {
		if (buf[i] != in[0]) {
			out[i] = buf[i];
			continue;
		}

		for (j = 0; j < in_len; j++) {
			if (buf[i + j] != in[j])
				break;
		}

		/* Found. */
		if (j == in_len)
			return 1;
	}

	return 0;
}

/*
 * find pattern is  "something [003] 8640.034785: something"
 */
static inline void
get_cpu_sec_usec_in_string(const char *s, int *cpu, int *sec, int *usec)
{
	char usec_buf[32] = {'\0'};
	char sec_buf[32] = {'\0'};
	char cpu_buf[32] = {'\0'};
	bool found_sec = false;
	bool found_usec = false;
	bool found_cpu = false;
	int i, j, dot;

	*cpu = *sec = *usec = -1;

	for (i = 0, j = 0; s[i] != '\0'; i++) {
		if (s[i] == '.') {
			dot = i++;

			/* take microseconds */
			for (j = 0; j < sizeof(usec_buf); j++) {
				if (isdigit(s[i])) {
					usec_buf[j] = s[i];
				} else {
					if (s[i] == ':' && j > 0)
						found_usec = true;
					else
						found_usec = false;

					/* Terminate here. */
					break;
				}

				i++;
			}

			if (found_usec) {
				/* roll back */
				while (s[i] != ' ' && i > 0)
					i--;

				/* take seconds */
				for (j = 0; j < sizeof(sec_buf); j++) {
					if (isdigit(s[++i])) {
						sec_buf[j] = s[i];
					} else {
						if (s[i] == '.' && j > 0)
							found_sec = true;
						else
							found_sec = false;
						
						/* Terminate here. */
						break;
					}
				}
			}

			if (found_sec && found_usec) {
				/* roll back */
				while (s[i] != '[' && i > 0)
					i--;

				/* take seconds */
				for (j = 0; j < sizeof(cpu_buf); j++) {
					if (isdigit(s[++i])) {
						cpu_buf[j] = s[i];
					} else {
						if (s[i] == ']' && j > 0)
							found_cpu = true;
						else
							found_cpu = false;
						
						/* Terminate here. */
						break;
					}
				}

				if (found_cpu && found_sec && found_usec) {
					*sec = atoi(sec_buf);
					*usec = atoi(usec_buf);
					*cpu = atoi(cpu_buf);
					return;
				}
			}

			/*
			 * Check next dot pattern.
			 */
			found_sec = false;
			found_usec = false;
			found_cpu = false;
			i = dot;
		}
	}
}

/*
 * find pattern is  "something comm=foo android thr1 pid=123 something"
 */
static inline int
get_comm_pid_in_string(const char *buf, char *comm, ssize_t len, int *pid)
{
	char *sc, *sp = NULL;	/* keep sp initialized in case "comm=" is absent */
	int rv, i;

	memset(comm, '\0', len);

	sc = strstr(buf, "comm=");
	if (sc)
		sp = strstr(sc, " pid=");

	if (!sc || !sp)
		return -1;

	for (i = 0, sc += 5; sc != sp; i++) {
		if (i < len) {
			if (*sc == ' ')
				comm[i] = '-';
			else
				comm[i] = *sc;

			sc++;
		}
	}

	/* Read pid. */
	rv = sscanf(sp, " pid=%d", pid);
	if (rv != 1)
		return -1;

	return 1;
}

static void
perf_script_softirq_delay(FILE *file, int delay_usec)
{
	char buf[4096] = { '\0' };
	char buf_1[4096] = { '\0' };
	long offset;
	char *s;
	int rv;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		offset = ftell(file);

		if (rv != EOF) {
			s = strstr(buf, "irq:softirq_raise:");
			if (s) {
				char extra[512] = {'\0'};
				int sec_0, usec_0;
				int sec_1, usec_1;
				int handle_vector;
				int rise_vector;
				int cpu_0;
				int cpu_1;

				/*
				 * swapper     0    [000]  6010.619854:  irq:softirq_raise: vec=7 [action=SCHED]
				 * android.bg  3052 [001]  6000.076212:  irq:softirq_entry: vec=9 [action=RCU]
				 */
				(void) sscanf(s, "%s vec=%d", extra, &rise_vector);
				get_cpu_sec_usec_in_string(buf, &cpu_0, &sec_0, &usec_0);

				while (1) {
					rv = get_one_line(file, buf_1, sizeof(buf_1));
					if (rv == EOF)
						break;

					s = strstr(buf_1, "irq:softirq_entry:");
					if (s) {
						(void) sscanf(s, "%s vec=%d", extra, &handle_vector);
						get_cpu_sec_usec_in_string(buf_1, &cpu_1, &sec_1, &usec_1);

						if (cpu_0 == cpu_1 && rise_vector == handle_vector) {
							int delta_time_usec = (sec_1 - sec_0) * 1000000 + (usec_1 - usec_0);

							if (delta_time_usec > delay_usec)
								fprintf(stdout, "{\n%s\n%s\n} diff %d usec\n", buf, buf_1, delta_time_usec);
							break;
						}
					}
				}
			}

			rv = fseek(file, offset, SEEK_SET);
			if (rv)
				fprintf(stdout, "fseek error !!!\n");
		} else {
			break;
		}
	}
}

static void
perf_script_softirq_duration(FILE *file, int duration_usec)
{
	char buf[4096] = { '\0' };
	char buf_1[4096] = { '\0' };
	long offset;
	char *s;
	int rv;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		offset = ftell(file);

		if (rv != EOF) {
			s = strstr(buf, "irq:softirq_entry:");
			if (s) {
				char extra[512] = {'\0'};
				int sec_0, usec_0;
				int sec_1, usec_1;
				int handle_vector;
				int rise_vector;
				int cpu_0;
				int cpu_1;

				/*
				 * swapper     0    [000]  6010.619854:  irq:softirq_entry: vec=7 [action=SCHED]
				 * android.bg  3052 [001]  6000.076212:  irq:softirq_exit: vec=9 [action=RCU]
				 */
				(void) sscanf(s, "%s vec=%d", extra, &rise_vector);
				get_cpu_sec_usec_in_string(buf, &cpu_0, &sec_0, &usec_0);

				while (1) {
					rv = get_one_line(file, buf_1, sizeof(buf_1));
					if (rv == EOF)
						break;

					s = strstr(buf_1, "irq:softirq_exit:");
					if (s) {
						(void) sscanf(s, "%s vec=%d", extra, &handle_vector);
						get_cpu_sec_usec_in_string(buf_1, &cpu_1, &sec_1, &usec_1);

						if (cpu_0 == cpu_1 && rise_vector == handle_vector) {
							int delta_time_usec = (sec_1 - sec_0) * 1000000 + (usec_1 - usec_0);

							if (delta_time_usec > duration_usec)
								fprintf(stdout, "{\n%s\n%s\n} diff %d usec\n", buf, buf_1, delta_time_usec);
							break;
						}
					}
				}
			}

			rv = fseek(file, offset, SEEK_SET);
			if (rv)
				fprintf(stdout, "fseek error !!!\n");
		} else {
			break;
		}
	}
}

static void
perf_script_hardirq_duration(FILE *file, int duration_msec)
{
	char buf[4096] = { '\0' };
	char buf_1[4096] = { '\0' };
	long offset;
	char *s;
	int rv;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		offset = ftell(file);

		if (rv != EOF) {
			s = strstr(buf, "irq:irq_handler_entry:");
			if (s) {
				char extra[512] = {'\0'};
				int sec_0, usec_0;
				int sec_1, usec_1;
				int handle_vector;
				int rise_vector;
				int cpu_0;
				int cpu_1;

				/*
				 * swapper     0 [002]  6205.804133: irq:irq_handler_entry: irq=11 name=arch_timer
				 * swapper     0 [002]  6205.804228:  irq:irq_handler_exit: irq=11 ret=handled
				 */
				(void) sscanf(s, "%s irq=%d", extra, &rise_vector);
				get_cpu_sec_usec_in_string(buf, &cpu_0, &sec_0, &usec_0);

				while (1) {
					rv = get_one_line(file, buf_1, sizeof(buf_1));
					if (rv == EOF)
						break;

					s = strstr(buf_1, "irq:irq_handler_exit:");
					if (s) {
						(void) sscanf(s, "%s irq=%d", extra, &handle_vector);
						get_cpu_sec_usec_in_string(buf_1, &cpu_1, &sec_1, &usec_1);

						if (cpu_0 == cpu_1 && rise_vector == handle_vector) {
							int delta_time_usec = (sec_1 - sec_0) * 1000000 + (usec_1 - usec_0);

							if (delta_time_usec > duration_msec)
								fprintf(stdout, "{\n%s\n%s\n} diff %d usec\n", buf, buf_1, delta_time_usec);
							break;
						}
					}
				}
			}

			rv = fseek(file, offset, SEEK_SET);
			if (rv)
				fprintf(stdout, "fseek error !!!\n");
		} else {
			break;
		}
	}
}

struct irq_stat {
	int irq;
	int count;
	char irq_name[512];

	int min_interval;
	int max_interval;
	int avg_interval;

	unsigned int time_stamp_usec;
	struct splay_node node;
};

static struct irq_stat *
new_irq_node_init(int irq, char *irq_name)
{
	struct irq_stat *n = calloc(1, sizeof(*n));

	if (n) {
		n->irq = irq;
		(void) strncpy(n->irq_name, irq_name, sizeof(n->irq_name));
		n->node.val = irq;
	}

	return n;
}

static void
perf_script_hardirq_stat(FILE *file)
{
	struct splay_root sproot = { NULL };
	struct irq_stat *node;
	char buf[4096] = { '\0' };
	char extra[256] = {'\0'};
	char irq_name[256] = {'\0'};
	unsigned int time_stamp_usec;
	int cpu, sec, usec;
	int rv, irq;
	char *s;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		if (rv == EOF)
			break;

		 s = strstr(buf, "irq:irq_handler_entry:");
		 if (s == NULL)
			 continue;

		 /*
		  * format is as follow one:
		  * sleep  1418 [003]  8780.957112:             irq:irq_handler_entry: irq=11 name=arch_timer
		  */
		 rv = sscanf(s, "%s irq=%d name=%s", extra, &irq, irq_name);
	 	 if (rv != 3)
	 		 continue;

		 get_cpu_sec_usec_in_string(buf, &cpu, &sec, &usec);
		 time_stamp_usec = (sec * 1000000) + usec;

		 if (sproot.sp_root == NULL) {
			 node = new_irq_node_init(irq, irq_name);
			 if (node)
				 splay_insert(&node->node, &sproot);
		 }

		 top_down_splay(irq, sproot.sp_root, &sproot);
		 node = container_of(sproot.sp_root, struct irq_stat, node);

		 /* Found the entry in the tree. */
		 if (node->irq == irq) {
			 if (node->time_stamp_usec) {
				 unsigned int delta = time_stamp_usec - node->time_stamp_usec;

				 if (delta < node->min_interval || !node->min_interval)
					 node->min_interval = delta;

				 if (delta > node->max_interval)
					 node->max_interval = delta;

				 node->avg_interval += delta;
			 }

			 /* Save the last time for this IRQ entry. */
			 node->time_stamp_usec = time_stamp_usec;
		 } else {
			 /* Allocate a new record and place it to the tree. */
			 node = new_irq_node_init(irq, irq_name);
			 if (node)
				 splay_insert(&node->node, &sproot);

		 }

		 /* Update the timestamp for this entry. */
		 node->time_stamp_usec = time_stamp_usec;
		 node->count++;
	}

	/* Dump the tree. */
	while (sproot.sp_root) {
		node = container_of(sproot.sp_root, struct irq_stat, node);

		fprintf(stdout, "irq: %5d name: %30s count: %7d, min: %10d, max: %10d, avg: %10d\n",
				node->irq, node->irq_name, node->count,
				node->min_interval, node->max_interval, node->avg_interval / node->count);

		splay_delete_init(&node->node, &sproot);
		free(node);
	}

	fprintf(stdout, "\tRun './a.out ./perf.script | sort -nk 6' to sort by column 6.\n");
}

struct sched_waking {
	unsigned int wakeup_nr;
	char comm[4096];
	int pid;

	int min_interval;
	int max_interval;
	int avg_interval;

	unsigned int time_stamp_usec;
	struct splay_node node;
};

static struct sched_waking *
new_sched_waking_node_init(int pid, char *comm)
{
	struct sched_waking *n = calloc(1, sizeof(*n));

	if (n) {
		n->pid = pid;
		(void) strncpy(n->comm, comm, sizeof(n->comm));
		n->node.val = pid;
	}

	return n;
}

static void
perf_script_sched_waking_stat(FILE *file, const char *script)
{
	struct splay_root sroot = { NULL };
	struct sched_waking *n;
	char buf[4096] = { '\0' };
	char comm[256] = {'\0'};
	unsigned int time_stamp_usec;
	unsigned int total_waking = 0;
	int cpu, sec, usec;
	int rv, pid;
	char *s;
	
	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		if (rv == EOF)
			break;
		/*
		 * format is as follow one:
		 * foo[1] 7521 [002] 10.431216: sched:sched_waking: comm=tr pid=2 prio=120 target_cpu=006
		 */
		s = strstr(buf, "sched:sched_waking:");
		if (s == NULL)
			continue;

		rv = get_comm_pid_in_string(s, comm, sizeof(comm), &pid);
		if (rv < 0) {
			printf("ERROR: skip entry...\n");
			continue;
		}

		get_cpu_sec_usec_in_string(buf, &cpu, &sec, &usec);
		time_stamp_usec = (sec * 1000000) + usec;

		if (sroot.sp_root == NULL) {
			n = new_sched_waking_node_init(pid, comm);
			if (n)
				splay_insert(&n->node, &sroot);
		}

		top_down_splay(pid, sroot.sp_root, &sroot);
		n = container_of(sroot.sp_root, struct sched_waking, node);

		/* Found the entry in the tree. */
		if (n->pid == pid) {
			if (n->time_stamp_usec) {
				unsigned int delta = time_stamp_usec - n->time_stamp_usec;

				if (delta < n->min_interval || !n->min_interval)
					n->min_interval = delta;

				if (delta > n->max_interval)
					n->max_interval = delta;

				n->avg_interval += delta;
			}

			/* Save the last time for this wake-up entry. */
			n->time_stamp_usec = time_stamp_usec;
		} else {
			/* Allocate a new record and place it to the tree. */
			n = new_sched_waking_node_init(pid, comm);
			if (n)
				splay_insert(&n->node, &sroot);
		}

		/* Update the timestamp for this entry. */
		n->time_stamp_usec = time_stamp_usec;
		n->wakeup_nr++;
	}

	/* Dump the Splay-tree. */
	while (sroot.sp_root) {
		n = container_of(sroot.sp_root, struct sched_waking, node);
		fprintf(stdout, "name: %30s pid: %10d woken-up %5d\tinterval: min %5d\tmax %5d\tavg %5d\n",
			n->comm, n->pid, n->wakeup_nr,
			n->min_interval, n->max_interval, n->avg_interval / n->wakeup_nr);

		total_waking += n->wakeup_nr;
		splay_delete_init(&n->node, &sroot);
		free(n);
	}

	fprintf(stdout, "=== Total: %u ===\n", total_waking);
	fprintf(stdout, "\tRun './a.out ./%s | sort -nk 6' to sort by column 6.\n", script);
}

int main(int argc, char **argv)
{
	FILE *file;

	file = open_perf_script_file(argv[1]);
	if (file == NULL) {
		fprintf(stdout, "%s:%d failed: specify a perf script file\n", __func__, __LINE__);
		exit(-1);
	}

	/* perf_script_softirq_delay(file, 1000); */
	/* perf_script_softirq_duration(file, 500); */
	/* perf_script_hardirq_duration(file, 500); */
	/* perf_script_hardirq_stat(file); */
	perf_script_sched_waking_stat(file, argv[1]);

	return 0;
}

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-13 15:43                   ` Uladzislau Rezki
@ 2022-05-14 14:25                     ` Joel Fernandes
  2022-05-14 19:01                       ` Uladzislau Rezki
  2022-08-09  2:25                       ` Joel Fernandes
  0 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Steven Rostedt

On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > Never mind. I ported it to 5.10
> > > > > > >
> > > > > > > Oh, this is on mainline. Sorry about that. If you want, I have a tree here for
> > > > > > > 5.10, although that does not have the kfree changes; everything else is
> > > > > > > ditto.
> > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > >
> > > > > > No problem. The two kfree_rcu patches are not so important in this series.
> > > > > > So I have backported them into my 5.10 kernel, because the latest kernel
> > > > > > is not so easy to bring up and run on my device :)
> > > > >
> > > > > Actually, I was going to write here that apparently some tests are showing
> > > > > kfree_rcu()->call_rcu_lazy() causing a possible regression. So it is
> > > > > good to drop those for your initial testing!
> > > > >
> > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not
> > > > so important for kfree_rcu() because we do batch requests there anyway.
> > > > One thing that I would like to improve in kfree_rcu() is better utilization
> > > > of the page slots.
> > > >
> > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > >
> > >
> > > Here we go with some data on our Android handset that runs a 5.10 kernel. The test
> > > case I have checked was a "static image" use case. The condition is: screen ON with
> > > all connectivity disabled.
> > >
> > > 1.
> > > The first data I took is how many wakeups the RCU subsystem causes during this test
> > > case when everything is pretty much idle. The duration is 360 seconds:
> > >
> > > <snip>
> > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > 
> > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > that we want something like this during testing. Appreciate it. Also,
> > if we could dump timer wakeup reasons/callbacks, that would also be awesome.
> > 
> Please find it in the attachment. I wrote it once upon a time and make use
> of it to parse "perf script" output, i.e. raw data. The file name is
> perf_script_parser.c, so just compile it.
> 
> How to use it:
> 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> 2. ./perf script -i ./perf.data > foo.script
> 3. ./perf_script_parser ./foo.script

Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
use "perf sched record" and "perf sched report --sort latency" to get wakeup
latencies.

> > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > share that with you on request as well. That is probably not in a
> > shape for mainline though (Makefile missing and such).
> > 
> Yep, please share!

Sure, check out my bcc repo from here:
https://github.com/joelagnel/bcc

Build this project, then cd into libbpf-tools and run make. This should produce a
static binary 'rcutop' which you can push to Android. You have to build for
ARM, which bcc should have instructions for. I have also included the rcutop
diff at the end of this email for reference.

> > > 2.
> > > Please find attached two power plots for the same test case. One is for regular
> > > use of call_rcu() and the second one is for "lazy" usage. There is a slight
> > > difference in power, ~2mA. Even though it is rather small, it is detectable and
> > > solid, which is also important, so it proves the concept. Please note it might be
> > > more power efficient on other arches and platforms, because of differences in HW
> > > design related to CPU C-states and the energy needed to enter/exit those deep
> > > power states.
> > 
> > Nice! I wonder if you still have other frequent callbacks on your
> > system that are getting queued during the tests. Could you dump the
> > rcu_callbacks trace event and see if you have any CBs frequently
> > called that the series did not address?
> >
> I see pretty much the following:
> <snip>
>     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
>     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
>     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
>     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
>     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
>     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> <snip>
> 
> so almost everything is batched.

Nice, glad to know this is happening even without the kfree_rcu() changes.

> > Also, one more thing I was curious about is - do you see savings when
> > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > being, not disturbing the BIG CPUs which are more power hungry may let
> > them go into a deeper idle state and save power (due to leakage
> > current and so forth).
> > 
> I did some experimenting with pinning the nocb threads to the little cluster.
> For idle use cases I did not see any power gain. For a heavy one I see that
> the "big" CPUs are also invoking callbacks and are busy with them quite often.
> Probably I should think of some use case where I can detect the power
> difference. If you have something, please let me know.

Yeah, probably screen off + audio playback might be a good one, because it
lightly loads the CPUs.

> > > So front-lazy-batching is something worth having, IMHO :)
> > 
> > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> > it, we can also add your investigation data to the LPC slides
> > (with attribution to you).
> > 
> No problem; since we will give a talk at LPC, the more data we have, the
> more convincing we are :)

I forget, you did mention you are OK with presenting with us, right? It would
be great if you present your data when we come to the Android portion, if you
are OK with it. I'll start a common slide deck soon and share it so you,
Rushikesh, and I can add slides to it and present together.

thanks,

 - Joel

---8<-----------------------

From: Joel Fernandes <joelaf@google.com>
Subject: [PATCH] rcutop

Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 libbpf-tools/Makefile     |   1 +
 libbpf-tools/rcutop.bpf.c |  56 ++++++++
 libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
 libbpf-tools/rcutop.h     |   8 ++
 4 files changed, 353 insertions(+)
 create mode 100644 libbpf-tools/rcutop.bpf.c
 create mode 100644 libbpf-tools/rcutop.c
 create mode 100644 libbpf-tools/rcutop.h

diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
index e60ec409..0d4cdff2 100644
--- a/libbpf-tools/Makefile
+++ b/libbpf-tools/Makefile
@@ -42,6 +42,7 @@ APPS = \
 	klockstat \
 	ksnoop \
 	llcstat \
+	rcutop \
 	mountsnoop \
 	numamove \
 	offcputime \
diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
new file mode 100644
index 00000000..8287bbe2
--- /dev/null
+++ b/libbpf-tools/rcutop.bpf.c
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+/* Copyright (c) 2021 Hengqi Chen */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+#include "rcutop.h"
+#include "maps.bpf.h"
+
+#define MAX_ENTRIES	10240
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, void *);
+	__type(value, int);
+} cbs_queued SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, void *);
+	__type(value, int);
+} cbs_executed SEC(".maps");
+
+SEC("tracepoint/rcu/rcu_callback")
+int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
+{
+	void *key = ctx->func;
+	int *val = NULL;
+	static const int zero;
+
+	val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
+	if (val) {
+		__sync_fetch_and_add(val, 1);
+	}
+
+	return 0;
+}
+
+SEC("tracepoint/rcu/rcu_invoke_callback")
+int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
+{
+	void *key = ctx->func;
+	int *val;
+	int zero = 0;
+
+	val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
+	if (val) {
+		__sync_fetch_and_add(val, 1);
+	}
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
new file mode 100644
index 00000000..35795875
--- /dev/null
+++ b/libbpf-tools/rcutop.c
@@ -0,0 +1,288 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+
+/*
+ * rcutop
+ * Copyright (c) 2022 Joel Fernandes
+ *
+ * 05-May-2022   Joel Fernandes   Created this.
+ */
+#include <argp.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include "rcutop.h"
+#include "rcutop.skel.h"
+#include "btf_helpers.h"
+#include "trace_helpers.h"
+
+#define warn(...) fprintf(stderr, __VA_ARGS__)
+#define OUTPUT_ROWS_LIMIT 10240
+
+static volatile sig_atomic_t exiting = 0;
+
+static bool clear_screen = true;
+static int output_rows = 20;
+static int interval = 1;
+static int count = 99999999;
+static bool verbose = false;
+
+const char *argp_program_version = "rcutop 0.1";
+const char *argp_program_bug_address =
+"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
+const char argp_program_doc[] =
+"Show RCU callback queuing and execution stats.\n"
+"\n"
+"USAGE: rcutop [-h] [interval] [count]\n"
+"\n"
+"EXAMPLES:\n"
+"    rcutop            # rcu activity top, refresh every 1s\n"
+"    rcutop 5 10       # 5s summaries, 10 times\n";
+
+static const struct argp_option opts[] = {
+	{ "noclear", 'C', NULL, 0, "Don't clear the screen" },
+	{ "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
+	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
+	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long rows;
+	static int pos_args;
+
+	switch (key) {
+		case 'C':
+			clear_screen = false;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		case 'h':
+			argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
+			break;
+		case 'r':
+			errno = 0;
+			rows = strtol(arg, NULL, 10);
+			if (errno || rows <= 0) {
+				warn("invalid rows: %s\n", arg);
+				argp_usage(state);
+			}
+			output_rows = rows;
+			if (output_rows > OUTPUT_ROWS_LIMIT)
+				output_rows = OUTPUT_ROWS_LIMIT;
+			break;
+		case ARGP_KEY_ARG:
+			errno = 0;
+			if (pos_args == 0) {
+				interval = strtol(arg, NULL, 10);
+				if (errno || interval <= 0) {
+					warn("invalid interval\n");
+					argp_usage(state);
+				}
+			} else if (pos_args == 1) {
+				count = strtol(arg, NULL, 10);
+				if (errno || count <= 0) {
+					warn("invalid count\n");
+					argp_usage(state);
+				}
+			} else {
+				warn("unrecognized positional argument: %s\n", arg);
+				argp_usage(state);
+			}
+			pos_args++;
+			break;
+		default:
+			return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+	if (level == LIBBPF_DEBUG && !verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static void sig_int(int signo)
+{
+	exiting = 1;
+}
+
+static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
+		struct rcutop_bpf *obj)
+{
+	void *key, **prev_key = NULL;
+	int n, err = 0;
+	int qfd = bpf_map__fd(obj->maps.cbs_queued);
+	int efd = bpf_map__fd(obj->maps.cbs_executed);
+	const struct ksym *ksym;
+	FILE *f;
+	time_t t;
+	struct tm *tm;
+	char ts[16], buf[256];
+
+	f = fopen("/proc/loadavg", "r");
+	if (f) {
+		time(&t);
+		tm = localtime(&t);
+		strftime(ts, sizeof(ts), "%H:%M:%S", tm);
+		memset(buf, 0, sizeof(buf));
+		n = fread(buf, 1, sizeof(buf), f);
+		if (n)
+			printf("%8s loadavg: %s\n", ts, buf);
+		fclose(f);
+	}
+
+	printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
+
+	while (1) {
+		int qcount = 0, ecount = 0;
+
+		err = bpf_map_get_next_key(qfd, prev_key, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
+			return err;
+		}
+
+		err = bpf_map_lookup_elem(qfd, &key, &qcount);
+		if (err) {
+			warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
+			return err;
+		}
+		prev_key = &key;
+
+		bpf_map_lookup_elem(efd, &key, &ecount);
+
+		ksym = ksyms__map_addr(ksyms, (unsigned long)key);
+		printf("%-32s %-6d %-6d\n",
+				ksym ? ksym->name : "Unknown",
+				qcount, ecount);
+	}
+	printf("\n");
+	prev_key = NULL;
+	while (1) {
+		err = bpf_map_get_next_key(qfd, prev_key, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
+			return err;
+		}
+		err = bpf_map_delete_elem(qfd, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				continue;
+			}
+			warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
+			return err;
+		}
+
+		bpf_map_delete_elem(efd, &key);
+		prev_key = &key;
+	}
+
+	return err;
+}
+
+int main(int argc, char **argv)
+{
+	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
+	static const struct argp argp = {
+		.options = opts,
+		.parser = parse_arg,
+		.doc = argp_program_doc,
+	};
+	struct rcutop_bpf *obj;
+	int err;
+	struct syms_cache *syms_cache = NULL;
+	struct ksyms *ksyms = NULL;
+
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+	libbpf_set_print(libbpf_print_fn);
+
+	err = ensure_core_btf(&open_opts);
+	if (err) {
+		fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
+		return 1;
+	}
+
+	obj = rcutop_bpf__open_opts(&open_opts);
+	if (!obj) {
+		warn("failed to open BPF object\n");
+		return 1;
+	}
+
+	err = rcutop_bpf__load(obj);
+	if (err) {
+		warn("failed to load BPF object: %d\n", err);
+		goto cleanup;
+	}
+
+	err = rcutop_bpf__attach(obj);
+	if (err) {
+		warn("failed to attach BPF programs: %d\n", err);
+		goto cleanup;
+	}
+
+	ksyms = ksyms__load();
+	if (!ksyms) {
+		fprintf(stderr, "failed to load kallsyms\n");
+		goto cleanup;
+	}
+
+	syms_cache = syms_cache__new(0);
+	if (!syms_cache) {
+		fprintf(stderr, "failed to create syms_cache\n");
+		goto cleanup;
+	}
+
+	if (signal(SIGINT, sig_int) == SIG_ERR) {
+		warn("can't set signal handler: %s\n", strerror(errno));
+		err = 1;
+		goto cleanup;
+	}
+
+	while (1) {
+		sleep(interval);
+
+		if (clear_screen) {
+			err = system("clear");
+			if (err)
+				goto cleanup;
+		}
+
+		err = print_stat(ksyms, syms_cache, obj);
+		if (err)
+			goto cleanup;
+
+		count--;
+		if (exiting || !count)
+			goto cleanup;
+	}
+
+cleanup:
+	rcutop_bpf__destroy(obj);
+	cleanup_core_btf(&open_opts);
+
+	return err != 0;
+}
diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
new file mode 100644
index 00000000..cb2a3557
--- /dev/null
+++ b/libbpf-tools/rcutop.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __RCUTOP_H
+#define __RCUTOP_H
+
+#define PATH_MAX	4096
+#define TASK_COMM_LEN	16
+
+#endif /* __RCUTOP_H */
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-13 14:55     ` Uladzislau Rezki
@ 2022-05-14 14:33       ` Joel Fernandes
  2022-05-14 19:10         ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:33 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, rushikesh.s.kadam, neeraj.iitr10,
	frederic, rostedt

On Fri, May 13, 2022 at 04:55:34PM +0200, Uladzislau Rezki wrote:
> > On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > Again, given that kfree_rcu() is doing its own laziness, is this really
> > helping?  If so, would it instead make sense to adjust the kfree_rcu()
> > timeouts?
> > 
> IMHO, this patch does not help much. Like Paul has mentioned we use
> batching anyway.

I think that depends on the value of KFREE_DRAIN_JIFFIES. It is set to 20ms
in the code. The batching with call_rcu_lazy() is set to 10k jiffies, which is
longer: at least 10 seconds on a 1000HZ system. Before I added this patch, I
was seeing more frequent queue_rcu_work() calls which were starting grace
periods. I am not sure, though, how much power was saved by eliminating
queue_rcu_work(); I just wanted to make it go away.

Maybe, instead of this patch, can we make KFREE_DRAIN_JIFFIES a tunable or
boot parameter so systems can set it appropriately? Or we can increase the
default kfree_rcu() drain time considering that we do have a shrinker in case
reclaim needs to happen.
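
Something like the following is what I have in mind for the tunable (untested
sketch; the parameter name, default expression, and permissions are just
placeholders):

/*
 * Untested sketch: expose the kfree_rcu() drain delay as a boot/module
 * parameter instead of the compile-time KFREE_DRAIN_JIFFIES constant.
 */
static int rcu_kfree_drain_jiffies = HZ / 50;	/* current 20 ms default */
module_param(rcu_kfree_drain_jiffies, int, 0444);

/*
 * Call sites such as kfree_rcu_monitor() and kvfree_call_rcu() would then
 * pass rcu_kfree_drain_jiffies to schedule_delayed_work() instead of
 * KFREE_DRAIN_JIFFIES.
 */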

Thoughts?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value
  2022-05-13 14:54   ` Uladzislau Rezki
@ 2022-05-14 14:34     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:34 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, rushikesh.s.kadam, neeraj.iitr10, frederic, paulmck, rostedt

On Fri, May 13, 2022 at 04:54:02PM +0200, Uladzislau Rezki wrote:
> On Thu, May 12, 2022 at 03:04:41AM +0000, Joel Fernandes (Google) wrote:
> > As per the comments in include/linux/shrinker.h, .count_objects callback
> > should return the number of freeable items, but if there are no objects
> > to free, SHRINK_EMPTY should be returned. The only time 0 is returned
> > should be when we are unable to determine the number of objects, or the
> > cache should be skipped for another reason.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/rcu/tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 3828ac3bf1c4..f191542cdf5e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3637,7 +3637,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> >  		atomic_set(&krcp->backoff_page_cache_fill, 1);
> >  	}
> >  
> > -	return count;
> > +	return count == 0 ? SHRINK_EMPTY : count;
> >  }
> >  
> >  static unsigned long
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

thanks!

 - Joel

> 
> --
> Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-13 14:53   ` Uladzislau Rezki
@ 2022-05-14 14:35     ` Joel Fernandes
  2022-05-14 19:48       ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:35 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, rushikesh.s.kadam, neeraj.iitr10, frederic, paulmck, rostedt

On Fri, May 13, 2022 at 04:53:05PM +0200, Uladzislau Rezki wrote:
> > monitor_todo is not needed as the work struct already tracks if work is
> > pending. Just use that to know if work is pending using
> > delayed_work_pending() helper.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/rcu/tree.c | 22 +++++++---------------
> >  1 file changed, 7 insertions(+), 15 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 3baf29014f86..3828ac3bf1c4 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
> >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
> >   * @lock: Synchronize access to this structure
> >   * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
> > - * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
> >   * @initialized: The @rcu_work fields have been initialized
> >   * @count: Number of objects for which GP not started
> >   * @bkvcache:
> > @@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
> >  	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> >  	raw_spinlock_t lock;
> >  	struct delayed_work monitor_work;
> > -	bool monitor_todo;
> >  	bool initialized;
> >  	int count;
> >  
> > @@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
> >  	// of the channels that is still busy we should rearm the
> >  	// work to repeat an attempt. Because previous batches are
> >  	// still in progress.
> > -	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
> > -		krcp->monitor_todo = false;
> > -	else
> > +	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
> >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> >  
> >  	raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > @@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> >  
> >  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
> >  	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> > -	    !krcp->monitor_todo) {
> > -		krcp->monitor_todo = true;
> > +	    !delayed_work_pending(&krcp->monitor_work))
> >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> > -	}
> >  
> >  unlock_return:
> >  	krc_this_cpu_unlock(krcp, flags);
> > @@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
> >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> >  
> >  		raw_spin_lock_irqsave(&krcp->lock, flags);
> > -		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
> > -				krcp->monitor_todo) {
> > -			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > -			continue;
> > +		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
> > +			if (delayed_work_pending(&krcp->monitor_work)) {
> > +				schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > +						KFREE_DRAIN_JIFFIES);
> > +			}
> >  		}
> > -		krcp->monitor_todo = true;
> > -		schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > -					 KFREE_DRAIN_JIFFIES);
> >  		raw_spin_unlock_irqrestore(&krcp->lock, flags);
> >  	}
> >  }
> > -- 
> >
> Looks good to me at first glance, but let me have a closer look
> at it.

Thanks, I appreciate it.

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-13  0:16   ` Paul E. McKenney
@ 2022-05-14 14:38     ` Joel Fernandes
  2022-05-14 16:21       ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 05:16:22PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:42AM +0000, Joel Fernandes (Google) wrote:
> > Add sysctl knobs just for easier debugging/testing, to tune the maximum
> > batch size, maximum time to wait before flush, and turning off the
> > feature entirely.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> This is good, and might also be needed longer term.
> 
> One thought below.
> 
> 							Thanx, Paul
> 
> > ---
> >  include/linux/sched/sysctl.h |  4 ++++
> >  kernel/rcu/lazy.c            | 12 ++++++++++--
> >  kernel/sysctl.c              | 23 +++++++++++++++++++++++
> >  3 files changed, 37 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index c19dd5a2c05c..55ffc61beed1 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
> >  
> >  extern unsigned int sysctl_sched_child_runs_first;
> >  
> > +extern unsigned int sysctl_rcu_lazy;
> > +extern unsigned int sysctl_rcu_lazy_batch;
> > +extern unsigned int sysctl_rcu_lazy_jiffies;
> > +
> >  enum sched_tunable_scaling {
> >  	SCHED_TUNABLESCALING_NONE,
> >  	SCHED_TUNABLESCALING_LOG,
> > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > index 55e406cfc528..0af9fb67c92b 100644
> > --- a/kernel/rcu/lazy.c
> > +++ b/kernel/rcu/lazy.c
> > @@ -12,6 +12,10 @@
> >  // How much to wait before flushing?
> >  #define MAX_LAZY_JIFFIES	10000
> >  
> > +unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
> > +unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
> > +unsigned int sysctl_rcu_lazy = 1;
> > +
> >  // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> >  // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> >  // later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > @@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> >  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> >  	struct rcu_lazy_pcp *rlp;
> >  
> > +	if (!sysctl_rcu_lazy) {
> 
> This is the place to check for early boot use.  Or, alternatively,
> initialize sysctl_rcu_lazy to zero and set it to one once boot is far
> enough along to allow all the pieces to work reasonably.

Sure, I was also thinking of perhaps making this a static branch. That way we
don't have to pay the cost of the branch after it is set up at boot.
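
Roughly something like this is what I was picturing (untested sketch; the key
and enable-hook names are made up):

#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(rcu_lazy_key);

void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
{
	/* Early boot or laziness disabled: fall back to plain call_rcu(). */
	if (!static_branch_likely(&rcu_lazy_key)) {
		call_rcu(head_rcu, func);
		return;
	}

	/* ... existing lockless enqueue onto the per-CPU lazy list ... */
}

/* Flip the key once boot is far enough along for laziness to be safe. */
void rcu_lazy_enable(void)
{
	static_branch_enable(&rcu_lazy_key);
}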

But I think, as you mentioned on IRC, if we were to promote this to a
non-DEBUG patch, we would need to handle conditions similar to Frederic's
work on toggling offloading.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-05-13  0:07   ` Paul E. McKenney
@ 2022-05-14 14:40     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 05:07:17PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:33AM +0000, Joel Fernandes (Google) wrote:
> > This is required to prevent callbacks triggering RCU machinery too
> > quickly and too often, which adds more power to the system.
> > 
> > When testing, we found that these paths were invoked often when the
> > system is not doing anything (screen is ON but otherwise idle).
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Some of these callbacks do additional work.  I don't immediately see a
> problem, but careful and thorough testing is required.

Fully agreed. FWIW, I did choose several of them by looking at what the
callback did, but I agree additional review and testing is needed.

thanks,

 - Joel


> 
> 							Thanx, Paul
> 
> > ---
> >  fs/dcache.c     | 4 ++--
> >  fs/eventpoll.c  | 2 +-
> >  fs/file_table.c | 3 ++-
> >  fs/inode.c      | 2 +-
> >  4 files changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index c84269c6e8bf..517e02cde103 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
> >  	if (unlikely(dname_external(dentry))) {
> >  		struct external_name *p = external_name(dentry);
> >  		if (likely(atomic_dec_and_test(&p->u.count))) {
> > -			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
> > +			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
> >  			return;
> >  		}
> >  	}
> > @@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
> >  	if (dentry->d_flags & DCACHE_NORCU)
> >  		__d_free(&dentry->d_u.d_rcu);
> >  	else
> > -		call_rcu(&dentry->d_u.d_rcu, __d_free);
> > +		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
> >  }
> >  
> >  /*
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e2daa940ebce..10a24cca2cff 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -728,7 +728,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> >  	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
> >  	 * use of the rbn field.
> >  	 */
> > -	call_rcu(&epi->rcu, epi_rcu_free);
> > +	call_rcu_lazy(&epi->rcu, epi_rcu_free);
> >  
> >  	percpu_counter_dec(&ep->user->epoll_watches);
> >  
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index 7d2e692b66a9..415815d3ef80 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -56,7 +56,8 @@ static inline void file_free(struct file *f)
> >  	security_file_free(f);
> >  	if (!(f->f_mode & FMODE_NOACCOUNT))
> >  		percpu_counter_dec(&nr_files);
> > -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> > +
> > +	call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu);
> >  }
> >  
> >  /*
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 63324df6fa27..b288a5bef4c7 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
> >  			return;
> >  	}
> >  	inode->free_inode = ops->free_inode;
> > -	call_rcu(&inode->i_rcu, i_callback);
> > +	call_rcu_lazy(&inode->i_rcu, i_callback);
> >  }
> >  
> >  /**
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 04/14] cred: Move call_rcu() to call_rcu_lazy()
  2022-05-13  0:02   ` Paul E. McKenney
@ 2022-05-14 14:41     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 05:02:53PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:32AM +0000, Joel Fernandes (Google) wrote:
> > This is required to prevent callbacks triggering RCU machinery too
> > quickly and too often, which adds more power to the system.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> The put_cred_rcu() does some debugging that would be less effective
> with a 10-second delay.  Which is probably OK assuming that significant
> testing happens on CONFIG_RCU_LAZY=n kernels.

Good point, I'll add that to the commit message, thanks!

thanks,

 - Joel

> 							Thanx, Paul
> 
> > ---
> >  kernel/cred.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/cred.c b/kernel/cred.c
> > index 933155c96922..f4d69d7f2763 100644
> > --- a/kernel/cred.c
> > +++ b/kernel/cred.c
> > @@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
> >  	if (cred->non_rcu)
> >  		put_cred_rcu(&cred->rcu);
> >  	else
> > -		call_rcu(&cred->rcu, put_cred_rcu);
> > +		call_rcu_lazy(&cred->rcu, put_cred_rcu);
> >  }
> >  EXPORT_SYMBOL(__put_cred);
> >  
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work()
  2022-05-12 23:58   ` Paul E. McKenney
@ 2022-05-14 14:44     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 04:58:18PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:30AM +0000, Joel Fernandes (Google) wrote:
> > This will be used in kfree_rcu() later to make it do call_rcu() lazily.
> 
> Given that kfree_rcu() does its own laziness in filling in the page
> of pointers, do we really need to add additional laziness?
> 
> Or are you measuring a significant benefit from doing this?

The benefit I was measuring was that, prior to this, I was seeing frequent
calls to queue_rcu_work(). Based on the latest test results, though, I am not
seeing a significant benefit, and Vlad also confirms it. So I might drop this
patch completely. For easier testing though, it would be nice to make
KFREE_DRAIN_JIFFIES tunable and mainline that change as well, since I believe
we can achieve a similar effect by just increasing that timeout.

thanks!

 - Joel

> 
> 							Thanx, Paul
> 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  include/linux/workqueue.h |  1 +
> >  kernel/workqueue.c        | 25 +++++++++++++++++++++++++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> > index 7fee9b6cfede..2678a6b5b3f3 100644
> > --- a/include/linux/workqueue.h
> > +++ b/include/linux/workqueue.h
> > @@ -444,6 +444,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
> >  extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
> >  			struct delayed_work *dwork, unsigned long delay);
> >  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> > +extern bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork);
> >  
> >  extern void flush_workqueue(struct workqueue_struct *wq);
> >  extern void drain_workqueue(struct workqueue_struct *wq);
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 33f1106b4f99..9444949cc148 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -1796,6 +1796,31 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
> >  }
> >  EXPORT_SYMBOL(queue_rcu_work);
> >  
> > +/**
> > + * queue_rcu_work_lazy - queue work after a RCU grace period
> > + * @wq: workqueue to use
> > + * @rwork: work to queue
> > + *
> > + * Return: %false if @rwork was already pending, %true otherwise.  Note
> > + * that a full RCU grace period is guaranteed only after a %true return.
> > + * While @rwork is guaranteed to be executed after a %false return, the
> > + * execution may happen before a full RCU grace period has passed.
> > + */
> > +bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork)
> > +{
> > +	struct work_struct *work = &rwork->work;
> > +
> > +	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> > +		rwork->wq = wq;
> > +		call_rcu_lazy(&rwork->rcu, rcu_work_rcufn);
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL(queue_rcu_work_lazy);
> > +
> > +
> >  /**
> >   * worker_enter_idle - enter idle state
> >   * @worker: worker which is entering idle state
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12 23:56   ` Paul E. McKenney
@ 2022-05-14 15:08     ` Joel Fernandes
  2022-05-14 16:34       ` Paul E. McKenney
  2022-06-01 14:24     ` Frederic Weisbecker
  1 sibling, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 15:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

Hi Paul,

Thanks for bearing with my slightly late reply, I used the "time blocking"
technique to work on RCU today! ;-)

On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > Implement timer-based RCU callback batching. The batch is flushed
> > whenever a certain amount of time has passed, or the batch on a
> > particular CPU grows too big. Also memory pressure can flush it.
> > 
> > Locking is avoided to reduce lock contention when queuing and dequeuing
> > happens on different CPUs of a per-cpu list, such as when shrinker
> > context is running on different CPU. Also not having to use locks keeps
> > the per-CPU structure size small.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> It is very good to see this!  Inevitable comments and questions below.

Thanks for taking a look!

> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index bf8e341e75b4..c09715079829 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> >  	  Say N here if you hate read-side memory barriers.
> >  	  Take the default if you are unsure.
> >  
> > +config RCU_LAZY
> > +	bool "RCU callback lazy invocation functionality"
> > +	depends on RCU_NOCB_CPU
> > +	default y
> 
> This "default y" is OK for experimentation, but for mainline we should
> not be forcing this on unsuspecting people.  ;-)

Agreed. I'll default 'n' :)

> > +	help
> > +	  To save power, batch RCU callbacks and flush after delay, memory
> > +          pressure or callback list growing too big.
> > +
> >  endmenu # "RCU Subsystem"
> > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > index 0cfb009a99b9..8968b330d6e0 100644
> > --- a/kernel/rcu/Makefile
> > +++ b/kernel/rcu/Makefile
> > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> >  obj-$(CONFIG_TREE_RCU) += tree.o
> >  obj-$(CONFIG_TINY_RCU) += tiny.o
> >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > new file mode 100644
> > index 000000000000..55e406cfc528
> > --- /dev/null
> > +++ b/kernel/rcu/lazy.c
> > @@ -0,0 +1,145 @@
> > +/*
> > + * Lockless lazy-RCU implementation.
> > + */
> > +#include <linux/rcupdate.h>
> > +#include <linux/shrinker.h>
> > +#include <linux/workqueue.h>
> > +#include "rcu.h"
> > +
> > +// How much to batch before flushing?
> > +#define MAX_LAZY_BATCH		2048
> > +
> > +// How much to wait before flushing?
> > +#define MAX_LAZY_JIFFIES	10000
> 
> That is more than a minute on a HZ=10 system.  Are you sure that you
> did not mean "(10 * HZ)" or some such?

Yes, you are right. I need to change that to be constant regardless of HZ. I
will make the change as you suggest.
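
Something like the following is what I have in mind (illustration only):

	// 10 seconds regardless of HZ, rather than 10000 raw jiffies.
	#define MAX_LAZY_JIFFIES	(10 * HZ)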

> > +
> > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > +struct lazy_rcu_head {
> > +	struct llist_node llist_node;
> > +	void (*func)(struct callback_head *head);
> > +} __attribute__((aligned(sizeof(void *))));
> 
> This needs a build-time check that rcu_head and lazy_rcu_head are of
> the same size.  Maybe something like this in some appropriate context:
> 
> 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> 
> Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> you also mention this in the above comments.  ;-)

Cool, great minds think alike! ;-)

> > +
> > +struct rcu_lazy_pcp {
> > +	struct llist_head head;
> > +	struct delayed_work work;
> > +	atomic_t count;
> > +};
> > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > +
> > +// Lockless flush of CPU, can be called concurrently.
> > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > +{
> > +	struct llist_node *node = llist_del_all(&rlp->head);
> > +	struct lazy_rcu_head *cursor, *temp;
> > +
> > +	if (!node)
> > +		return;
> 
> At this point, the list is empty but the count is non-zero.  Can
> that cause a problem?  (For the existing callback lists, this would
> be OK.)
> 
> > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > +		debug_rcu_head_unqueue(rh);
> 
> Good to see this check!
> 
> > +		call_rcu(rh, rh->func);
> > +		atomic_dec(&rlp->count);
> > +	}
> > +}
> > +
> > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > +{
> > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > +	struct rcu_lazy_pcp *rlp;
> > +
> > +	preempt_disable();
> > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> 
> Whitespace issue, please fix.

Fixed, thanks.

> > +	preempt_enable();
> > +
> > +	if (debug_rcu_head_queue((void *)head)) {
> > +		// Probable double call_rcu(), just leak.
> > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > +				__func__, head);
> > +
> > +		// Mark as success and leave.
> > +		return;
> > +	}
> > +
> > +	// Queue to per-cpu llist
> > +	head->func = func;
> > +	llist_add(&head->llist_node, &rlp->head);
> 
> Suppose that there are a bunch of preemptions between the preempt_enable()
> above and this point, so that the current CPU's list has lots of
> callbacks, but zero ->count.  Can that cause a problem?
> 
> In the past, this sort of thing has been an issue for rcu_barrier()
> and friends.

Thanks, I think I dropped the ball on this. You have given me something to
think about. My first thought is that setting the count in advance of
populating the list should do the trick. I will look into it more.

> > +	// Flush queue if too big
> 
> You will also need to check for early boot use.
> 
> I -think- it suffices to simply skip the following "if" statement when
> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> kernels won't expire them until the softirq kthreads have been spawned.
> 
> Which is OK, as it just means that call_rcu_lazy() is a bit more
> lazy than expected that early.
> 
> Except that call_rcu() can be invoked even before rcu_init() has been
> invoked, which is therefore also before rcu_init_lazy() has been invoked.

In other words, you are concerned that too many lazy callbacks might be
pending before rcu_init() is called?

I am going through the kfree_rcu() threads/patches involving
RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
the scheduler is running causes a crash or warnings?

> I therefore suggest something like this at the very start of this function:
> 
> 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> 		call_rcu(head_rcu, func);
> 
> The goal is that people can replace call_rcu() with call_rcu_lazy()
> without having to worry about invocation during early boot.

Yes, this seems safer. I don't expect much power savings during system boot
process anyway ;-). I believe perhaps a static branch would work better to
take a branch out from what is likely a fast path.
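
For illustration, the early-boot guard I have in mind would be roughly this
(sketch only; note the added return so the callback is not also queued on the
lazy list):

	// At the very top of call_rcu_lazy():
	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) {
		// Too early for the per-CPU lazy machinery and its timer;
		// hand the callback straight to call_rcu() and be done.
		call_rcu(head_rcu, func);
		return;
	}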

> Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> though "rhp" is more consistent with the RCU pointer initials approach.

Fixed, thanks.

> > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > +		lazy_rcu_flush_cpu(rlp);
> > +	} else {
> > +		if (!delayed_work_pending(&rlp->work)) {
> 
> This check is racy because the work might run to completion right at
> this point.  Wouldn't it be better to rely on the internal check of
> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?

Oops, agreed. Will make it as you suggest.
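
That is, just the following, and let the WORK_STRUCT_PENDING_BIT test inside
queue_delayed_work_on() deal with the race (sketch of what I plan to do):

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else
		// No-op if the work is already pending; the internal
		// test_and_set_bit() of WORK_STRUCT_PENDING_BIT closes the
		// window that the explicit check left open.
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);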

thanks,

 - Joel

> > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > +		}
> > +	}
> > +}
> > +
> > +static unsigned long
> > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > +{
> > +	unsigned long count = 0;
> > +	int cpu;
> > +
> > +	/* Snapshot count of all CPUs */
> > +	for_each_possible_cpu(cpu) {
> > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > +
> > +		count += atomic_read(&rlp->count);
> > +	}
> > +
> > +	return count;
> > +}
> > +
> > +static unsigned long
> > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > +{
> > +	int cpu, freed = 0;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > +		unsigned long count;
> > +
> > +		count = atomic_read(&rlp->count);
> > +		lazy_rcu_flush_cpu(rlp);
> > +		sc->nr_to_scan -= count;
> > +		freed += count;
> > +		if (sc->nr_to_scan <= 0)
> > +			break;
> > +	}
> > +
> > +	return freed == 0 ? SHRINK_STOP : freed;
> 
> This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> potential deadlocks.  But this pattern is common, including on the
> kvfree_rcu() path, so OK!  ;-)
> 
> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> that as well.
> 
> > +}
> > +
> > +/*
> > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > + */
> > +static void lazy_work(struct work_struct *work)
> > +{
> > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > +
> > +	lazy_rcu_flush_cpu(rlp);
> > +}
> > +
> > +static struct shrinker lazy_rcu_shrinker = {
> > +	.count_objects = lazy_rcu_shrink_count,
> > +	.scan_objects = lazy_rcu_shrink_scan,
> > +	.batch = 0,
> > +	.seeks = DEFAULT_SEEKS,
> > +};
> > +
> > +void __init rcu_lazy_init(void)
> > +{
> > +	int cpu;
> > +
> > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > +	}
> > +
> > +	if (register_shrinker(&lazy_rcu_shrinker))
> > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > +}
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 24b5f2c2de87..a5f4b44f395f 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> >  #endif
> >  
> > +#ifdef CONFIG_RCU_LAZY
> > +void rcu_lazy_init(void);
> > +#else
> > +static inline void rcu_lazy_init(void) {}
> > +#endif
> >  #endif /* __LINUX_RCU_H */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index a4c25a6283b0..ebdf6f7c9023 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> >  	else
> >  		qovld_calc = qovld;
> > +
> > +	rcu_lazy_init();
> >  }
> >  
> >  #include "tree_stall.h"
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-14 14:38     ` Joel Fernandes
@ 2022-05-14 16:21       ` Paul E. McKenney
  0 siblings, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-14 16:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Sat, May 14, 2022 at 02:38:53PM +0000, Joel Fernandes wrote:
> On Thu, May 12, 2022 at 05:16:22PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:42AM +0000, Joel Fernandes (Google) wrote:
> > > Add sysctl knobs just for easier debugging/testing, to tune the maximum
> > > batch size, maximum time to wait before flush, and turning off the
> > > feature entirely.
> > > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > This is good, and might also be needed longer term.
> > 
> > One thought below.
> > 
> > 							Thanx, Paul
> > 
> > > ---
> > >  include/linux/sched/sysctl.h |  4 ++++
> > >  kernel/rcu/lazy.c            | 12 ++++++++++--
> > >  kernel/sysctl.c              | 23 +++++++++++++++++++++++
> > >  3 files changed, 37 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > index c19dd5a2c05c..55ffc61beed1 100644
> > > --- a/include/linux/sched/sysctl.h
> > > +++ b/include/linux/sched/sysctl.h
> > > @@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
> > >  
> > >  extern unsigned int sysctl_sched_child_runs_first;
> > >  
> > > +extern unsigned int sysctl_rcu_lazy;
> > > +extern unsigned int sysctl_rcu_lazy_batch;
> > > +extern unsigned int sysctl_rcu_lazy_jiffies;
> > > +
> > >  enum sched_tunable_scaling {
> > >  	SCHED_TUNABLESCALING_NONE,
> > >  	SCHED_TUNABLESCALING_LOG,
> > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > index 55e406cfc528..0af9fb67c92b 100644
> > > --- a/kernel/rcu/lazy.c
> > > +++ b/kernel/rcu/lazy.c
> > > @@ -12,6 +12,10 @@
> > >  // How much to wait before flushing?
> > >  #define MAX_LAZY_JIFFIES	10000
> > >  
> > > +unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
> > > +unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
> > > +unsigned int sysctl_rcu_lazy = 1;
> > > +
> > >  // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > >  // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > >  // later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > @@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > >  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > >  	struct rcu_lazy_pcp *rlp;
> > >  
> > > +	if (!sysctl_rcu_lazy) {
> > 
> > This is the place to check for early boot use.  Or, alternatively,
> > initialize sysctl_rcu_lazy to zero and set it to one once boot is far
> > enough along to allow all the pieces to work reasonably.
> 
> Sure, I was also thinking perhaps to set this to a static branch. That way we
> don't have to pay the cost of the branch after it is setup on boot.

A static branch would be fine, though it would be good to demonstrate
whether or not it actually provided visible benefits.  Also, please see
below.
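
For instance, a minimal sketch (completely untested, and the key's name is
invented here) of what a static branch might look like in this role:

	// Default off; flipped from the sysctl handler (or once boot is far
	// enough along) instead of testing a plain integer on every call.
	DEFINE_STATIC_KEY_FALSE(rcu_lazy_enabled);

	void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
	{
		if (!static_branch_likely(&rcu_lazy_enabled)) {
			call_rcu(head_rcu, func);
			return;
		}
		/* ... lazy path as in the patch ... */
	}

	// From the sysctl write path:
	//	static_branch_enable(&rcu_lazy_enabled);
	//	static_branch_disable(&rcu_lazy_enabled);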

> But I think as you mentioned on IRC, if we were to promote this to a
> non-DEBUG patch, we would need to handle conditions similar to Frederick's
> work on toggling offloading.

There are several ways to approach this:

o	Make call_rcu_lazy() also work for non-offloaded CPUs.
	It would clearly need to be default-disabled for this case,
	and careful thought would be required on the distro situation:
	some distros build with CONFIG_NO_HZ_FULL=y, but their users
	almost all refrain from providing a nohz_full boot parameter.

	In short, adding significant risk to the common case for the
	benefit of an uncommon case is not usually a particularly
	good idea.  ;-)

	And yes, proper segmentation of the cases is important.  As in
	the proper question here is "What happens to users not on ChromeOS
	or Android?"  So the fact that the common case probably is
	Android is not relevant to this line of thought.

o	Make call_rcu_lazy() work on a per-CPU basis, so that a lazy
	CPU cannot be de-offloaded.

o	Make call_rcu_lazy() work on a per-CPU basis, but make flushing
	the lazy queue part of the de-offloading process.
	Careful with the static branch in this case, since different
	CPUs might have different reactions to call_rcu_lazy().

o	Others' ideas here!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-14 15:08     ` Joel Fernandes
@ 2022-05-14 16:34       ` Paul E. McKenney
  2022-05-27 23:12         ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-14 16:34 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote:
> Hi Paul,
> 
> Thanks for bearing with my slightly late reply, I used the "time blocking"
> technique to work on RCU today! ;-)
> 
> On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > Implement timer-based RCU callback batching. The batch is flushed
> > > whenever a certain amount of time has passed, or the batch on a
> > > particular CPU grows too big. Also memory pressure can flush it.
> > > 
> > > Locking is avoided to reduce lock contention when queuing and dequeuing
> > > happens on different CPUs of a per-cpu list, such as when shrinker
> > > context is running on different CPU. Also not having to use locks keeps
> > > the per-CPU structure size small.
> > > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > It is very good to see this!  Inevitable comments and questions below.
> 
> Thanks for taking a look!
> 
> > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > index bf8e341e75b4..c09715079829 100644
> > > --- a/kernel/rcu/Kconfig
> > > +++ b/kernel/rcu/Kconfig
> > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> > >  	  Say N here if you hate read-side memory barriers.
> > >  	  Take the default if you are unsure.
> > >  
> > > +config RCU_LAZY
> > > +	bool "RCU callback lazy invocation functionality"
> > > +	depends on RCU_NOCB_CPU
> > > +	default y
> > 
> > This "default y" is OK for experimentation, but for mainline we should
> > not be forcing this on unsuspecting people.  ;-)
> 
> Agreed. I'll default 'n' :)
> 
> > > +	help
> > > +	  To save power, batch RCU callbacks and flush after delay, memory
> > > +          pressure or callback list growing too big.
> > > +
> > >  endmenu # "RCU Subsystem"
> > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > > index 0cfb009a99b9..8968b330d6e0 100644
> > > --- a/kernel/rcu/Makefile
> > > +++ b/kernel/rcu/Makefile
> > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> > >  obj-$(CONFIG_TREE_RCU) += tree.o
> > >  obj-$(CONFIG_TINY_RCU) += tiny.o
> > >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > new file mode 100644
> > > index 000000000000..55e406cfc528
> > > --- /dev/null
> > > +++ b/kernel/rcu/lazy.c
> > > @@ -0,0 +1,145 @@
> > > +/*
> > > + * Lockless lazy-RCU implementation.
> > > + */
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/shrinker.h>
> > > +#include <linux/workqueue.h>
> > > +#include "rcu.h"
> > > +
> > > +// How much to batch before flushing?
> > > +#define MAX_LAZY_BATCH		2048
> > > +
> > > +// How much to wait before flushing?
> > > +#define MAX_LAZY_JIFFIES	10000
> > 
> > That is more than a minute on a HZ=10 system.  Are you sure that you
> > did not mean "(10 * HZ)" or some such?
> 
> Yes, you are right. I need to change that to be constant regardless of HZ. I
> will make the change as you suggest.
> 
> > > +
> > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > +struct lazy_rcu_head {
> > > +	struct llist_node llist_node;
> > > +	void (*func)(struct callback_head *head);
> > > +} __attribute__((aligned(sizeof(void *))));
> > 
> > This needs a build-time check that rcu_head and lazy_rcu_head are of
> > the same size.  Maybe something like this in some appropriate context:
> > 
> > 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> > 
> > Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> > you also mention this in the above comments.  ;-)
> 
> Cool, great minds think alike! ;-)
> 
> > > +
> > > +struct rcu_lazy_pcp {
> > > +	struct llist_head head;
> > > +	struct delayed_work work;
> > > +	atomic_t count;
> > > +};
> > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > > +
> > > +// Lockless flush of CPU, can be called concurrently.
> > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > > +{
> > > +	struct llist_node *node = llist_del_all(&rlp->head);
> > > +	struct lazy_rcu_head *cursor, *temp;
> > > +
> > > +	if (!node)
> > > +		return;
> > 
> > At this point, the list is empty but the count is non-zero.  Can
> > that cause a problem?  (For the existing callback lists, this would
> > be OK.)
> > 
> > > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > > +		debug_rcu_head_unqueue(rh);
> > 
> > Good to see this check!
> > 
> > > +		call_rcu(rh, rh->func);
> > > +		atomic_dec(&rlp->count);
> > > +	}
> > > +}
> > > +
> > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > > +{
> > > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > > +	struct rcu_lazy_pcp *rlp;
> > > +
> > > +	preempt_disable();
> > > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > 
> > Whitespace issue, please fix.
> 
> Fixed, thanks.
> 
> > > +	preempt_enable();
> > > +
> > > +	if (debug_rcu_head_queue((void *)head)) {
> > > +		// Probable double call_rcu(), just leak.
> > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > +				__func__, head);
> > > +
> > > +		// Mark as success and leave.
> > > +		return;
> > > +	}
> > > +
> > > +	// Queue to per-cpu llist
> > > +	head->func = func;
> > > +	llist_add(&head->llist_node, &rlp->head);
> > 
> > Suppose that there are a bunch of preemptions between the preempt_enable()
> > above and this point, so that the current CPU's list has lots of
> > callbacks, but zero ->count.  Can that cause a problem?
> > 
> > In the past, this sort of thing has been an issue for rcu_barrier()
> > and friends.
> 
> Thanks, I think I dropped the ball on this. You have given me something to
> think about. My first thought is that setting the count in advance of
> populating the list should do the trick. I will look into it more.

That can work, but don't forget the need for proper memory ordering.

Another approach would be to do what the current bypass list does, namely
count the callback in the ->cblist structure.
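
For the first approach, perhaps something like this (hand-waving sketch,
untested, and the ordering argument would need to be written down and checked
in the real patch):

	int count;

	// Bump the count before publishing the node, the aim being that
	// ->count never under-estimates what is on the list.
	// atomic_inc_return() is fully ordered, and so is the llist
	// cmpxchg, but please spell that out in a comment.
	count = atomic_inc_return(&rlp->count);
	llist_add(&head->llist_node, &rlp->head);

	if (count >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);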

> > > +	// Flush queue if too big
> > 
> > You will also need to check for early boot use.
> > 
> > I -think- it suffices to simply skip the following "if" statement when
> > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
> > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> > kernels won't expire them until the softirq kthreads have been spawned.
> > 
> > Which is OK, as it just means that call_rcu_lazy() is a bit more
> > lazy than expected that early.
> > 
> > Except that call_rcu() can be invoked even before rcu_init() has been
> > invoked, which is therefore also before rcu_init_lazy() has been invoked.
> 
> In other words, you are concerned that too many lazy callbacks might be
> pending before rcu_init() is called?

In other words, I am concerned that bad things might happen if fields
in a rcu_lazy_pcp structure are used before they are initialized.

I am not worried about too many lazy callbacks before rcu_init() because
the non-lazy callbacks (which these currently are) are allowed to pile
up until RCU's grace-period kthreads have been spawned.  There might
come a time when RCU callbacks need to be invoked earlier, but that will
be a separate problem to solve when and if, but with the benefit of the
additional information derived from seeing the problem actually happen.

> I am going through the kfree_rcu() threads/patches involving
> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> the scheduler is running causes a crash or warnings?

There are a lot of different issues that arise in different phases
of boot.  In this particular case, my primary concern is the
use-before-initialized bug.

> > I therefore suggest something like this at the very start of this function:
> > 
> > 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > 		call_rcu(head_rcu, func);
> > 
> > The goal is that people can replace call_rcu() with call_rcu_lazy()
> > without having to worry about invocation during early boot.
> 
> Yes, this seems safer. I don't expect much power savings during system boot
> process anyway ;-). I believe perhaps a static branch would work better to
> take a branch out from what is likely a fast path.

A static branch would be fine.  Maybe encapsulate it in a static inline
function for all such comparisons, but most such comparisons are far from
anything resembling a fastpath, so the main potential benefit would be
added readability.  Which could be a compelling reason in and of itself.
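
Something like this is the sort of helper I have in mind (name invented here):

	// One place to answer "is it too early in boot for laziness?",
	// whether that ends up being a plain comparison or a static branch.
	static inline bool rcu_lazy_too_early(void)
	{
		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
	}

Each caller then reads as a question rather than as a comparison against an
enum value, and the implementation can be changed in one place later.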

							Thanx, Paul

> > Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> > though "rhp" is more consistent with the RCU pointer initials approach.
> 
> Fixed, thanks.
> 
> > > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > > +		lazy_rcu_flush_cpu(rlp);
> > > +	} else {
> > > +		if (!delayed_work_pending(&rlp->work)) {
> > 
> > This check is racy because the work might run to completion right at
> > this point.  Wouldn't it be better to rely on the internal check of
> > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> 
> Oops, agreed. Will make it as you suggest.
> 
> thanks,
> 
>  - Joel
> 
> > > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static unsigned long
> > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > +{
> > > +	unsigned long count = 0;
> > > +	int cpu;
> > > +
> > > +	/* Snapshot count of all CPUs */
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > +
> > > +		count += atomic_read(&rlp->count);
> > > +	}
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +static unsigned long
> > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > +{
> > > +	int cpu, freed = 0;
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > +		unsigned long count;
> > > +
> > > +		count = atomic_read(&rlp->count);
> > > +		lazy_rcu_flush_cpu(rlp);
> > > +		sc->nr_to_scan -= count;
> > > +		freed += count;
> > > +		if (sc->nr_to_scan <= 0)
> > > +			break;
> > > +	}
> > > +
> > > +	return freed == 0 ? SHRINK_STOP : freed;
> > 
> > This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> > potential deadlocks.  But this pattern is common, including on the
> > kvfree_rcu() path, so OK!  ;-)
> > 
> > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> > that as well.
> > 
> > > +}
> > > +
> > > +/*
> > > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > > + */
> > > +static void lazy_work(struct work_struct *work)
> > > +{
> > > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > > +
> > > +	lazy_rcu_flush_cpu(rlp);
> > > +}
> > > +
> > > +static struct shrinker lazy_rcu_shrinker = {
> > > +	.count_objects = lazy_rcu_shrink_count,
> > > +	.scan_objects = lazy_rcu_shrink_scan,
> > > +	.batch = 0,
> > > +	.seeks = DEFAULT_SEEKS,
> > > +};
> > > +
> > > +void __init rcu_lazy_init(void)
> > > +{
> > > +	int cpu;
> > > +
> > > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > > +	}
> > > +
> > > +	if (register_shrinker(&lazy_rcu_shrinker))
> > > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > > +}
> > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > index 24b5f2c2de87..a5f4b44f395f 100644
> > > --- a/kernel/rcu/rcu.h
> > > +++ b/kernel/rcu/rcu.h
> > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> > >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> > >  #endif
> > >  
> > > +#ifdef CONFIG_RCU_LAZY
> > > +void rcu_lazy_init(void);
> > > +#else
> > > +static inline void rcu_lazy_init(void) {}
> > > +#endif
> > >  #endif /* __LINUX_RCU_H */
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a4c25a6283b0..ebdf6f7c9023 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> > >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> > >  	else
> > >  		qovld_calc = qovld;
> > > +
> > > +	rcu_lazy_init();
> > >  }
> > >  
> > >  #include "tree_stall.h"
> > > -- 
> > > 2.36.0.550.gb090851708-goog
> > > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-14 14:25                     ` Joel Fernandes
@ 2022-05-14 19:01                       ` Uladzislau Rezki
  2022-08-09  2:25                       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-14 19:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Paul E. McKenney, rcu, Rushikesh S Kadam,
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > Never mind. I port it into 5.10
> > > > > > > >
> > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > > > > ditto.
> > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > > >
> > > > > > > No problem. kfree_rcu two patches are not so important in this series.
> > > > > > > So i have backported them into my 5.10 kernel because the latest kernel
> > > > > > > is not so easy to up and run on my device :)
> > > > > >
> > > > > > Actually I was going to write here, apparently some tests are showing
> > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > > > > good to drop those for your initial testing!
> > > > > >
> > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not
> > > > > so important for kfree_rcu() because we do batch requests there anyway.
> > > > > One thing that i would like to improve in kfree_rcu() is a better utilization
> > > > > of page slots.
> > > > >
> > > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > > >
> > > >
> > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test
> > > > case i have checked was a "static image" use case. Condition is: screen ON with
> > > > disabled all connectivity.
> > > >
> > > > 1.
> > > > First data i took is how many wakeups cause an RCU subsystem during this test case
> > > > when everything is pretty idling. Duration is 360 seconds:
> > > >
> > > > <snip>
> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > > 
> > > Nice! Do you mind sharing this script? I was just talking to Rushikesh
> > > that we want something like this during testing. Appreciate it. Also,
> > > if we dump timer wakeup reasons/callbacks that would also be awesome.
> > > 
> > Please find it in the attachment. I wrote it once upon a time and make use of it
> > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c
> > so just compile it.
> > 
> > How to use it:
> > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> > 2. ./perf script -i ./perf.data > foo.script
> > 3. ./perf_script_parser ./foo.script
> 
> Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
> use "perf sched record" and "perf sched report --sort latency" to get wakeup
> latencies.
> 
Good. "perf sched report --sort latency" it must be related to wakeup delays?

> > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > > share that with you on request as well. That is probably not in a
> > > shape for mainline though (Makefile missing and such).
> > > 
> > Yep, please share!
> 
> Sure, check out my bcc repo from here:
> https://github.com/joelagnel/bcc
> 
> Build this project, then cd libbpf-tools and run make. This should produce a
> static binary 'rcutop' which you can push to Android. You have to build for
> ARM which bcc should have instructions for. I have also included the rcutop
> diff at the end of this file for reference.
> 
Cool. Will try it later :)

> > > > 2.
> > > > Please find in attachment two power plots. The same test case. One is related to a
> > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference
> > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid
> > > > what is also important, thus it proofs the concept. Please note it might be more power
> > > > efficient for other arches and platforms. Because of different HW design that is related
> > > > to C-states of CPU and energy that is needed to in/out of those deep power states.
> > > 
> > > Nice! I wonder if you still have other frequent callbacks on your
> > > system that are getting queued during the tests. Could you dump the
> > > rcu_callbacks trace event and see if you have any CBs frequently
> > > called that the series did not address?
> > >
> > I have pretty much like this:
> > <snip>
> >     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
> >     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
> >     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
> >     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
> >     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> >     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> > 
> > so almost everything is batched.
> 
> Nice, glad to know this is happening even without the kfree_rcu() changes.
> 
kvfree_rcu() does the job quite well, even though, as I mentioned, I would like
to send a patch about better page slot utilization. But, yes, since there is a
batching mechanism already, call_rcu_lazy() does not give any additional effect.


> > > Also, one more thing I was curious about is - do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > > 
> > I did some experimenting to pin nocbs to a little cluster. For idle use
> > cases i did not see any power gain. For heavy one i see that "big" CPUs
> > are also invoking and busy with it quite often. Probably i should think
> > of some use case where i can detect the power difference. If you have
> > something please let me know.
> 
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
> 
Hm.. I will see and might check it, just in case! I have implemented a simple
call_rcu() static workload generator in order to examine this in a more
controlled environment. So I would like to see how it behaves under that static
workload generator.
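
The generator is basically a tiny module along these lines (a stripped-down
sketch of the idea, not the exact code I run; the knobs and names here are
made up):

	#include <linux/module.h>
	#include <linux/slab.h>
	#include <linux/rcupdate.h>
	#include <linux/workqueue.h>
	#include <linux/jiffies.h>

	struct test_obj {
		struct rcu_head rh;
	};

	static struct delayed_work gen_work;
	static int objs_per_tick = 100;	/* queueing-rate knobs */
	static int tick_ms = 100;

	static void test_obj_free(struct rcu_head *rh)
	{
		kfree(container_of(rh, struct test_obj, rh));
	}

	static void gen_work_fn(struct work_struct *work)
	{
		int i;

		/* Queue a fixed number of callbacks every tick. */
		for (i = 0; i < objs_per_tick; i++) {
			struct test_obj *obj = kmalloc(sizeof(*obj), GFP_KERNEL);

			if (!obj)
				break;
			call_rcu(&obj->rh, test_obj_free);
		}
		schedule_delayed_work(&gen_work, msecs_to_jiffies(tick_ms));
	}

	static int __init rcu_gen_init(void)
	{
		INIT_DELAYED_WORK(&gen_work, gen_work_fn);
		schedule_delayed_work(&gen_work, msecs_to_jiffies(tick_ms));
		return 0;
	}

	static void __exit rcu_gen_exit(void)
	{
		cancel_delayed_work_sync(&gen_work);
		rcu_barrier();
	}

	module_init(rcu_gen_init);
	module_exit(rcu_gen_exit);
	MODULE_LICENSE("GPL");

Swapping call_rcu() for call_rcu_lazy() in gen_work_fn() then gives a direct
A/B comparison at a fixed queueing rate.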

> > > > So a front-lazy-batching is something worth to have, IMHO :)
> > > 
> > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with
> > > it, we can add your data to the LPC slides as well about your
> > > investigation (with attribution to you).
> > > 
> > No problem; since we will give a talk at LPC, the more data we have, the more
> > convincing we are :)
> 
> I forget, you did mention you are Ok with presenting with us right? It would
> be great if you present your data when we come to Android, if you are OK with
> it. I'll start a common slide deck soon and share so you, Rushikesh and me
> can add slides to it and present together.
> 
I am OK with presenting together with you. I can prepare some data related to the
examination on our Android big.LITTLE platform that we use for our customers.

> thanks,
> 
>  - Joel
> 
> ---8<-----------------------
> 
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] rcutop
> 
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  libbpf-tools/Makefile     |   1 +
>  libbpf-tools/rcutop.bpf.c |  56 ++++++++
>  libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
>  libbpf-tools/rcutop.h     |   8 ++
>  4 files changed, 353 insertions(+)
>  create mode 100644 libbpf-tools/rcutop.bpf.c
>  create mode 100644 libbpf-tools/rcutop.c
>  create mode 100644 libbpf-tools/rcutop.h
> 
> diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
> index e60ec409..0d4cdff2 100644
> --- a/libbpf-tools/Makefile
> +++ b/libbpf-tools/Makefile
> @@ -42,6 +42,7 @@ APPS = \
>  	klockstat \
>  	ksnoop \
>  	llcstat \
> +	rcutop \
>  	mountsnoop \
>  	numamove \
>  	offcputime \
> diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
> new file mode 100644
> index 00000000..8287bbe2
> --- /dev/null
> +++ b/libbpf-tools/rcutop.bpf.c
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +/* Copyright (c) 2021 Hengqi Chen */
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_tracing.h>
> +#include "rcutop.h"
> +#include "maps.bpf.h"
> +
> +#define MAX_ENTRIES	10240
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, MAX_ENTRIES);
> +	__type(key, void *);
> +	__type(value, int);
> +} cbs_queued SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, MAX_ENTRIES);
> +	__type(key, void *);
> +	__type(value, int);
> +} cbs_executed SEC(".maps");
> +
> +SEC("tracepoint/rcu/rcu_callback")
> +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
> +{
> +	void *key = ctx->func;
> +	int *val = NULL;
> +	static const int zero;
> +
> +	val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
> +	if (val) {
> +		__sync_fetch_and_add(val, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +SEC("tracepoint/rcu/rcu_invoke_callback")
> +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
> +{
> +	void *key = ctx->func;
> +	int *val;
> +	int zero = 0;
> +
> +	val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
> +	if (val) {
> +		__sync_fetch_and_add(val, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
> new file mode 100644
> index 00000000..35795875
> --- /dev/null
> +++ b/libbpf-tools/rcutop.c
> @@ -0,0 +1,288 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +
> +/*
> + * rcutop
> + * Copyright (c) 2022 Joel Fernandes
> + *
> + * 05-May-2022   Joel Fernandes   Created this.
> + */
> +#include <argp.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +#include "rcutop.h"
> +#include "rcutop.skel.h"
> +#include "btf_helpers.h"
> +#include "trace_helpers.h"
> +
> +#define warn(...) fprintf(stderr, __VA_ARGS__)
> +#define OUTPUT_ROWS_LIMIT 10240
> +
> +static volatile sig_atomic_t exiting = 0;
> +
> +static bool clear_screen = true;
> +static int output_rows = 20;
> +static int interval = 1;
> +static int count = 99999999;
> +static bool verbose = false;
> +
> +const char *argp_program_version = "rcutop 0.1";
> +const char *argp_program_bug_address =
> +"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
> +const char argp_program_doc[] =
> +"Show RCU callback queuing and execution stats.\n"
> +"\n"
> +"USAGE: rcutop [-h] [interval] [count]\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    rcutop            # rcu activity top, refresh every 1s\n"
> +"    rcutop 5 10       # 5s summaries, 10 times\n";
> +
> +static const struct argp_option opts[] = {
> +	{ "noclear", 'C', NULL, 0, "Don't clear the screen" },
> +	{ "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
> +	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
> +	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
> +	{},
> +};
> +
> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +	long rows;
> +	static int pos_args;
> +
> +	switch (key) {
> +		case 'C':
> +			clear_screen = false;
> +			break;
> +		case 'v':
> +			verbose = true;
> +			break;
> +		case 'h':
> +			argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
> +			break;
> +		case 'r':
> +			errno = 0;
> +			rows = strtol(arg, NULL, 10);
> +			if (errno || rows <= 0) {
> +				warn("invalid rows: %s\n", arg);
> +				argp_usage(state);
> +			}
> +			output_rows = rows;
> +			if (output_rows > OUTPUT_ROWS_LIMIT)
> +				output_rows = OUTPUT_ROWS_LIMIT;
> +			break;
> +		case ARGP_KEY_ARG:
> +			errno = 0;
> +			if (pos_args == 0) {
> +				interval = strtol(arg, NULL, 10);
> +				if (errno || interval <= 0) {
> +					warn("invalid interval\n");
> +					argp_usage(state);
> +				}
> +			} else if (pos_args == 1) {
> +				count = strtol(arg, NULL, 10);
> +				if (errno || count <= 0) {
> +					warn("invalid count\n");
> +					argp_usage(state);
> +				}
> +			} else {
> +				warn("unrecognized positional argument: %s\n", arg);
> +				argp_usage(state);
> +			}
> +			pos_args++;
> +			break;
> +		default:
> +			return ARGP_ERR_UNKNOWN;
> +	}
> +	return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +	if (level == LIBBPF_DEBUG && !verbose)
> +		return 0;
> +	return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +	exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +		struct rcutop_bpf *obj)
> +{
> +	void *key, **prev_key = NULL;
> +	int n, err = 0;
> +	int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +	int efd = bpf_map__fd(obj->maps.cbs_executed);
> +	const struct ksym *ksym;
> +	FILE *f;
> +	time_t t;
> +	struct tm *tm;
> +	char ts[16], buf[256];
> +
> +	f = fopen("/proc/loadavg", "r");
> +	if (f) {
> +		time(&t);
> +		tm = localtime(&t);
> +		strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +		memset(buf, 0, sizeof(buf));
> +		n = fread(buf, 1, sizeof(buf), f);
> +		if (n)
> +			printf("%8s loadavg: %s\n", ts, buf);
> +		fclose(f);
> +	}
> +
> +	printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +	while (1) {
> +		int qcount = 0, ecount = 0;
> +
> +		err = bpf_map_get_next_key(qfd, prev_key, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				break;
> +			}
> +			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +
> +		err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +		if (err) {
> +			warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +		prev_key = &key;
> +
> +		bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +		ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +		printf("%-32s %-6d %-6d\n",
> +				ksym ? ksym->name : "Unknown",
> +				qcount, ecount);
> +	}
> +	printf("\n");
> +	prev_key = NULL;
> +	while (1) {
> +		err = bpf_map_get_next_key(qfd, prev_key, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				break;
> +			}
> +			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +		err = bpf_map_delete_elem(qfd, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				continue;
> +			}
> +			warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +
> +		bpf_map_delete_elem(efd, &key);
> +		prev_key = &key;
> +	}
> +
> +	return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +	static const struct argp argp = {
> +		.options = opts,
> +		.parser = parse_arg,
> +		.doc = argp_program_doc,
> +	};
> +	struct rcutop_bpf *obj;
> +	int err;
> +	struct syms_cache *syms_cache = NULL;
> +	struct ksyms *ksyms = NULL;
> +
> +	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +	if (err)
> +		return err;
> +
> +	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +	libbpf_set_print(libbpf_print_fn);
> +
> +	err = ensure_core_btf(&open_opts);
> +	if (err) {
> +		fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +		return 1;
> +	}
> +
> +	obj = rcutop_bpf__open_opts(&open_opts);
> +	if (!obj) {
> +		warn("failed to open BPF object\n");
> +		return 1;
> +	}
> +
> +	err = rcutop_bpf__load(obj);
> +	if (err) {
> +		warn("failed to load BPF object: %d\n", err);
> +		goto cleanup;
> +	}
> +
> +	err = rcutop_bpf__attach(obj);
> +	if (err) {
> +		warn("failed to attach BPF programs: %d\n", err);
> +		goto cleanup;
> +	}
> +
> +	ksyms = ksyms__load();
> +	if (!ksyms) {
> +		fprintf(stderr, "failed to load kallsyms\n");
> +		goto cleanup;
> +	}
> +
> +	syms_cache = syms_cache__new(0);
> +	if (!syms_cache) {
> +		fprintf(stderr, "failed to create syms_cache\n");
> +		goto cleanup;
> +	}
> +
> +	if (signal(SIGINT, sig_int) == SIG_ERR) {
> +		warn("can't set signal handler: %s\n", strerror(errno));
> +		err = 1;
> +		goto cleanup;
> +	}
> +
> +	while (1) {
> +		sleep(interval);
> +
> +		if (clear_screen) {
> +			err = system("clear");
> +			if (err)
> +				goto cleanup;
> +		}
> +
> +		err = print_stat(ksyms, syms_cache, obj);
> +		if (err)
> +			goto cleanup;
> +
> +		count--;
> +		if (exiting || !count)
> +			goto cleanup;
> +	}
> +
> +cleanup:
> +	rcutop_bpf__destroy(obj);
> +	cleanup_core_btf(&open_opts);
> +
> +	return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX	4096
> +#define TASK_COMM_LEN	16
> +
> +#endif /* __RCUTOP_H */
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-14 14:33       ` Joel Fernandes
@ 2022-05-14 19:10         ` Uladzislau Rezki
  0 siblings, 0 replies; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-14 19:10 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Paul E. McKenney, rcu, rushikesh.s.kadam,
	neeraj.iitr10, frederic, rostedt

> On Fri, May 13, 2022 at 04:55:34PM +0200, Uladzislau Rezki wrote:
> > > On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > Again, given that kfree_rcu() is doing its own laziness, is this really
> > > helping?  If so, would it instead make sense to adjust the kfree_rcu()
> > > timeouts?
> > > 
> > IMHO, this patch does not help much. Like Paul has mentioned we use
> > batching anyway.
> 
> I think that depends on the value of KFREE_DRAIN_JIFFIES. It is set to 20ms
> in the code. The batching with call_rcu_lazy() is set to 10k jiffies, which is
> longer: at least 10 seconds on a 1000HZ system. Before I added this
> patch, I was seeing more frequent queue_rcu_work() calls which were starting
> grace periods. I am not sure though how much was the power saving by
> eliminating queue_rcu_work() , I just wanted to make it go away.
> 
> Maybe, instead of this patch, can we make KFREE_DRAIN_JIFFIES a tunable or
> boot parameter so systems can set it appropriately? Or we can increase the
> default kfree_rcu() drain time considering that we do have a shrinker in case
> reclaim needs to happen.
> 
> Thoughts?
> 
Agreed. We need to change the behaviour of the simple KFREE_DRAIN_JIFFIES.
One thing is that we can rely on the shrinker. So making the default drain
interval, say, 1 sec sounds reasonable to me, and switching to a shorter one
if the page is full:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 222d59299a2a..89b356cee643 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3249,6 +3249,7 @@ EXPORT_SYMBOL_GPL(call_rcu);
 
 
 /* Maximum number of jiffies to wait before draining a batch. */
+#define KFREE_DRAIN_JIFFIES_SEC (HZ)
 #define KFREE_DRAIN_JIFFIES (HZ / 50)
 #define KFREE_N_BATCHES 2
 #define FREE_N_CHANNELS 2
@@ -3685,6 +3686,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 	return true;
 }
 
+static bool
+is_krc_page_full(struct kfree_rcu_cpu *krcp)
+{
+	int i;
+
+	// Check if a page is full either for first or second channels.
+	for (i = 0; i < FREE_N_CHANNELS && krcp->bkvhead[i]; i++) {
+		if (krcp->bkvhead[i]->nr_records == KVFREE_BULK_MAX_ENTR)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Queue a request for lazy invocation of the appropriate free routine
  * after a grace period.  Please note that three paths are maintained,
@@ -3701,6 +3716,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 {
 	unsigned long flags;
 	struct kfree_rcu_cpu *krcp;
+	unsigned long delay;
 	bool success;
 	void *ptr;
 
@@ -3749,7 +3765,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
 	    !krcp->monitor_todo) {
 		krcp->monitor_todo = true;
-		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
+
+		delay = is_krc_page_full(krcp) ?
+			KFREE_DRAIN_JIFFIES:KFREE_DRAIN_JIFFIES_SEC;
+
+		schedule_delayed_work(&krcp->monitor_work, delay);
 	}
 
 unlock_return:

Please note it is just for illustration because the patch is not complete.
As for the parameters, I think we can expose both via /sys/module/...
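
For example (again just an illustration; the parameter name is made up here):

	static int kfree_drain_ms = 1000;
	module_param(kfree_drain_ms, int, 0644);
	MODULE_PARM_DESC(kfree_drain_ms,
		"kvfree_rcu() drain interval when no page is full");

	// and in kvfree_call_rcu():
	delay = is_krc_page_full(krcp) ?
		KFREE_DRAIN_JIFFIES : msecs_to_jiffies(kfree_drain_ms);

module_param() works for built-in code too, so the knob would show up under
/sys/module/ next to the existing rcutree parameters.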

--
Uladzislau Rezki

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-14 14:35     ` Joel Fernandes
@ 2022-05-14 19:48       ` Uladzislau Rezki
  0 siblings, 0 replies; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-14 19:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, rushikesh.s.kadam, neeraj.iitr10,
	frederic, paulmck, rostedt

> On Fri, May 13, 2022 at 04:53:05PM +0200, Uladzislau Rezki wrote:
> > > monitor_todo is not needed as the work struct already tracks if work is
> > > pending. Just use that to know if work is pending using
> > > delayed_work_pending() helper.
> > > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > >  kernel/rcu/tree.c | 22 +++++++---------------
> > >  1 file changed, 7 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 3baf29014f86..3828ac3bf1c4 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
> > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
> > >   * @lock: Synchronize access to this structure
> > >   * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
> > > - * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
> > >   * @initialized: The @rcu_work fields have been initialized
> > >   * @count: Number of objects for which GP not started
> > >   * @bkvcache:
> > > @@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
> > >  	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > >  	raw_spinlock_t lock;
> > >  	struct delayed_work monitor_work;
> > > -	bool monitor_todo;
> > >  	bool initialized;
> > >  	int count;
> > >  
> > > @@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
> > >  	// of the channels that is still busy we should rearm the
> > >  	// work to repeat an attempt. Because previous batches are
> > >  	// still in progress.
> > > -	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
> > > -		krcp->monitor_todo = false;
> > > -	else
> > > +	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
>
Can we place those three checks into a separate inline function, say
krc_needs_offload(), since the same check is used in two places?
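
i.e. something like:

	static inline bool krc_needs_offload(struct kfree_rcu_cpu *krcp)
	{
		return krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head;
	}

so that both kfree_rcu_monitor() and kfree_rcu_scheduler_running() can use it.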

> > >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> > >  
> > >  	raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > > @@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > >  
> > >  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
> > >  	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> > > -	    !krcp->monitor_todo) {
> > > -		krcp->monitor_todo = true;
> > > +	    !delayed_work_pending(&krcp->monitor_work))
>
I think checking whether it is pending or not does not make much sense here.
schedule_delayed_work() already checks internally whether the work can be queued.

> > >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> > > -	}
> > >  
> > >  unlock_return:
> > >  	krc_this_cpu_unlock(krcp, flags);
> > > @@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
> > >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> > >  
> > >  		raw_spin_lock_irqsave(&krcp->lock, flags);
> > > -		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
> > > -				krcp->monitor_todo) {
> > > -			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > > -			continue;
> > > +		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
Same here. Moving this to the separate function makes sense, IMHO.

> > > +			if (delayed_work_pending(&krcp->monitor_work)) {
Same here. Should we check it here?

> > > +				schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > > +						KFREE_DRAIN_JIFFIES);
> > > +			}
> > >  		}
> > > -		krcp->monitor_todo = true;
> > > -		schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > > -					 KFREE_DRAIN_JIFFIES);
> > >  		raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > >  	}
> > >  }
> > > -- 
> > >
> > Looks good to me from the first glance, but let me know to have a look
> > at it more closely.
> 
> Thanks, I appreciate it.
> 
One change in design after this patch is that the drain work can be queued even
though there is already nothing to drain. I do not see that as a big issue
because it will just bail out, so I tend toward the simplification.

The monitor_todo flag guarantees that a kvfree_rcu() caller will not schedule
any work until the "monitor work" completes its job; if there is still something
to do, it rearms itself.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
  2022-05-12 23:56   ` Paul E. McKenney
@ 2022-05-17  9:07   ` Uladzislau Rezki
  2022-05-30 14:54     ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-17  9:07 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt

Some extra comments, not overlapping with those already mentioned in
this email thread:

> Implement timer-based RCU callback batching. The batch is flushed
> whenever a certain amount of time has passed, or the batch on a
> particular CPU grows too big. Also memory pressure can flush it.
> 
> Locking is avoided to reduce lock contention when queuing and dequeuing
> happens on different CPUs of a per-cpu list, such as when shrinker
> context is running on different CPU. Also not having to use locks keeps
> the per-CPU structure size small.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/rcupdate.h |   6 ++
>  kernel/rcu/Kconfig       |   8 +++
>  kernel/rcu/Makefile      |   1 +
>  kernel/rcu/lazy.c        | 145 +++++++++++++++++++++++++++++++++++++++
>  kernel/rcu/rcu.h         |   5 ++
>  kernel/rcu/tree.c        |   2 +
>  6 files changed, 167 insertions(+)
>  create mode 100644 kernel/rcu/lazy.c
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 88b42eb46406..d0a6c4f5172c 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> +#else
> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active __read_mostly;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index bf8e341e75b4..c09715079829 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default y
> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +          pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 0cfb009a99b9..8968b330d6e0 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
>  obj-$(CONFIG_TREE_RCU) += tree.o
>  obj-$(CONFIG_TINY_RCU) += tiny.o
>  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> +obj-$(CONFIG_RCU_LAZY) += lazy.o
> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> new file mode 100644
> index 000000000000..55e406cfc528
> --- /dev/null
> +++ b/kernel/rcu/lazy.c
> @@ -0,0 +1,145 @@
> +/*
> + * Lockless lazy-RCU implementation.
> + */
> +#include <linux/rcupdate.h>
> +#include <linux/shrinker.h>
> +#include <linux/workqueue.h>
> +#include "rcu.h"
> +
> +// How much to batch before flushing?
> +#define MAX_LAZY_BATCH		2048
> +
> +// How much to wait before flushing?
> +#define MAX_LAZY_JIFFIES	10000
> +
> +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> +struct lazy_rcu_head {
> +	struct llist_node llist_node;
> +	void (*func)(struct callback_head *head);
> +} __attribute__((aligned(sizeof(void *))));
> +
> +struct rcu_lazy_pcp {
> +	struct llist_head head;
> +	struct delayed_work work;
> +	atomic_t count;
> +};
> +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> +
> +// Lockless flush of CPU, can be called concurrently.
> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> +{
> +	struct llist_node *node = llist_del_all(&rlp->head);
> +	struct lazy_rcu_head *cursor, *temp;
> +
> +	if (!node)
> +		return;
> +
> +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> +		struct rcu_head *rh = (struct rcu_head *)cursor;
> +		debug_rcu_head_unqueue(rh);
> +		call_rcu(rh, rh->func);
> +		atomic_dec(&rlp->count);
> +	}
> +}
> +
> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> +{
> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> +	struct rcu_lazy_pcp *rlp;
> +
> +	preempt_disable();
> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> +	preempt_enable();
>
Can we get rid of such explicit disabling/enabling of preemption?
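For example (sketch only): raw_cpu_ptr() would do here, since after the
preempt_enable() the task can migrate anyway and the lockless llist is
fine with queueing onto another CPU's list:

	rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);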

> +
> +	if (debug_rcu_head_queue((void *)head)) {
> +		// Probable double call_rcu(), just leak.
> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> +				__func__, head);
> +
> +		// Mark as success and leave.
> +		return;
> +	}
> +
> +	// Queue to per-cpu llist
> +	head->func = func;
> +	llist_add(&head->llist_node, &rlp->head);
> +
> +	// Flush queue if too big
> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> +		lazy_rcu_flush_cpu(rlp);
>
Can we just schedule the work instead of draining from the caller
context? For example, the caller can be in hard-irq context.
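A sketch of that idea (an assumption about how it could look, not a
respin of the patch): when the batch gets too big, pull the timer in
with mod_delayed_work() so the flush runs from workqueue context rather
than from the caller:

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		// Flush soon, but from process context.
		mod_delayed_work(system_wq, &rlp->work, 0);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);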

> +	} else {
> +		if (!delayed_work_pending(&rlp->work)) {
> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> +		}
> +	}
> +}
EXPORT_SYMBOL_GPL()? To make call_rcu_lazy() usable from kernel modules.
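That is, right after the function body:

EXPORT_SYMBOL_GPL(call_rcu_lazy);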

> +
> +static unsigned long
> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	/* Snapshot count of all CPUs */
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +
> +		count += atomic_read(&rlp->count);
> +	}
> +
> +	return count;
> +}
> +
> +static unsigned long
> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	int cpu, freed = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		unsigned long count;
> +
> +		count = atomic_read(&rlp->count);
> +		lazy_rcu_flush_cpu(rlp);
> +		sc->nr_to_scan -= count;
> +		freed += count;
> +		if (sc->nr_to_scan <= 0)
> +			break;
> +	}
> +
> +	return freed == 0 ? SHRINK_STOP : freed;
> +}
> +
> +/*
> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> + */
> +static void lazy_work(struct work_struct *work)
> +{
> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> +
> +	lazy_rcu_flush_cpu(rlp);
> +}
> +
> +static struct shrinker lazy_rcu_shrinker = {
> +	.count_objects = lazy_rcu_shrink_count,
> +	.scan_objects = lazy_rcu_shrink_scan,
> +	.batch = 0,
> +	.seeks = DEFAULT_SEEKS,
> +};
> +
> +void __init rcu_lazy_init(void)
> +{
> +	int cpu;
> +
> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> +	}
> +
> +	if (register_shrinker(&lazy_rcu_shrinker))
> +		pr_err("Failed to register lazy_rcu shrinker!\n");
> +}
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 24b5f2c2de87..a5f4b44f395f 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
>  #endif
>  
> +#ifdef CONFIG_RCU_LAZY
> +void rcu_lazy_init(void);
> +#else
> +static inline void rcu_lazy_init(void) {}
> +#endif
>  #endif /* __LINUX_RCU_H */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a4c25a6283b0..ebdf6f7c9023 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
>  	else
>  		qovld_calc = qovld;
> +
> +	rcu_lazy_init();
>  }
>  
>  #include "tree_stall.h"
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-14 16:34       ` Paul E. McKenney
@ 2022-05-27 23:12         ` Joel Fernandes
  2022-05-28 17:57           ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-27 23:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Sat, May 14, 2022 at 09:34:21AM -0700, Paul E. McKenney wrote:
> On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > Thanks for bearing with my slightly late reply, I did "time blocking"
> > technique to work on RCU today! ;-)
> > 
> > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > > Implement timer-based RCU callback batching. The batch is flushed
> > > > whenever a certain amount of time has passed, or the batch on a
> > > > particular CPU grows too big. Also memory pressure can flush it.
> > > > 
> > > > Locking is avoided to reduce lock contention when queuing and dequeuing
> > > > happens on different CPUs of a per-cpu list, such as when shrinker
> > > > context is running on different CPU. Also not having to use locks keeps
> > > > the per-CPU structure size small.
> > > > 
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > It is very good to see this!  Inevitable comments and questions below.
> > 
> > Thanks for taking a look!
> > 
> > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > index bf8e341e75b4..c09715079829 100644
> > > > --- a/kernel/rcu/Kconfig
> > > > +++ b/kernel/rcu/Kconfig
> > > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> > > >  	  Say N here if you hate read-side memory barriers.
> > > >  	  Take the default if you are unsure.
> > > >  
> > > > +config RCU_LAZY
> > > > +	bool "RCU callback lazy invocation functionality"
> > > > +	depends on RCU_NOCB_CPU
> > > > +	default y
> > > 
> > > This "default y" is OK for experimentation, but for mainline we should
> > > not be forcing this on unsuspecting people.  ;-)
> > 
> > Agreed. I'll default 'n' :)
> > 
> > > > +	help
> > > > +	  To save power, batch RCU callbacks and flush after delay, memory
> > > > +          pressure or callback list growing too big.
> > > > +
> > > >  endmenu # "RCU Subsystem"
> > > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > > > index 0cfb009a99b9..8968b330d6e0 100644
> > > > --- a/kernel/rcu/Makefile
> > > > +++ b/kernel/rcu/Makefile
> > > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> > > >  obj-$(CONFIG_TREE_RCU) += tree.o
> > > >  obj-$(CONFIG_TINY_RCU) += tiny.o
> > > >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > > > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > > new file mode 100644
> > > > index 000000000000..55e406cfc528
> > > > --- /dev/null
> > > > +++ b/kernel/rcu/lazy.c
> > > > @@ -0,0 +1,145 @@
> > > > +/*
> > > > + * Lockless lazy-RCU implementation.
> > > > + */
> > > > +#include <linux/rcupdate.h>
> > > > +#include <linux/shrinker.h>
> > > > +#include <linux/workqueue.h>
> > > > +#include "rcu.h"
> > > > +
> > > > +// How much to batch before flushing?
> > > > +#define MAX_LAZY_BATCH		2048
> > > > +
> > > > +// How much to wait before flushing?
> > > > +#define MAX_LAZY_JIFFIES	10000
> > > 
> > > That is more than a minute on a HZ=10 system.  Are you sure that you
> > > did not mean "(10 * HZ)" or some such?
> > 
> > Yes, you are right. I need to change that to be constant regardless of HZ. I
> > will make the change as you suggest.
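For concreteness, the HZ-independent form being suggested would look
something like this, with the exact value still to be decided:

#define MAX_LAZY_JIFFIES	(10 * HZ)	/* ~10 seconds regardless of HZ */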
> > 
> > > > +
> > > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > > +struct lazy_rcu_head {
> > > > +	struct llist_node llist_node;
> > > > +	void (*func)(struct callback_head *head);
> > > > +} __attribute__((aligned(sizeof(void *))));
> > > 
> > > This needs a build-time check that rcu_head and lazy_rcu_head are of
> > > the same size.  Maybe something like this in some appropriate context:
> > > 
> > > 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> > > 
> > > Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> > > you also mention this in the above comments.  ;-)
> > 
> > Cool, great minds think alike! ;-)
> > 
> > > > +
> > > > +struct rcu_lazy_pcp {
> > > > +	struct llist_head head;
> > > > +	struct delayed_work work;
> > > > +	atomic_t count;
> > > > +};
> > > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > > > +
> > > > +// Lockless flush of CPU, can be called concurrently.
> > > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > > > +{
> > > > +	struct llist_node *node = llist_del_all(&rlp->head);
> > > > +	struct lazy_rcu_head *cursor, *temp;
> > > > +
> > > > +	if (!node)
> > > > +		return;
> > > 
> > > At this point, the list is empty but the count is non-zero.  Can
> > > that cause a problem?  (For the existing callback lists, this would
> > > be OK.)
> > > 
> > > > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > > > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > > > +		debug_rcu_head_unqueue(rh);
> > > 
> > > Good to see this check!
> > > 
> > > > +		call_rcu(rh, rh->func);
> > > > +		atomic_dec(&rlp->count);
> > > > +	}
> > > > +}
> > > > +
> > > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > > > +{
> > > > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > > > +	struct rcu_lazy_pcp *rlp;
> > > > +
> > > > +	preempt_disable();
> > > > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > > 
> > > Whitespace issue, please fix.
> > 
> > Fixed, thanks.
> > 
> > > > +	preempt_enable();
> > > > +
> > > > +	if (debug_rcu_head_queue((void *)head)) {
> > > > +		// Probable double call_rcu(), just leak.
> > > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > > +				__func__, head);
> > > > +
> > > > +		// Mark as success and leave.
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	// Queue to per-cpu llist
> > > > +	head->func = func;
> > > > +	llist_add(&head->llist_node, &rlp->head);
> > > 
> > > Suppose that there are a bunch of preemptions between the preempt_enable()
> > > above and this point, so that the current CPU's list has lots of
> > > callbacks, but zero ->count.  Can that cause a problem?
> > > 
> > > In the past, this sort of thing has been an issue for rcu_barrier()
> > > and friends.
> > 
> > Thanks, I think I dropped the ball on this. You have given me something to
> > think about. I am thinking on first thought that setting the count in advance
> > of populating the list should do the trick. I will look more on it.
> 
> That can work, but don't forget the need for proper memory ordering.
> 
> Another approach would be to what the current bypass list does, namely
> count the callback in the ->cblist structure.

I have been thinking about this, and I also arrived at the same conclusion
that it is easier to make it a segcblist structure which has the functions do
the memory ordering. Then we can make rcu_barrier() locklessly sample the
->len of the new per-cpu structure, I *think* it might work.

However, I was actually considering another situation outside of rcu_barrier,
where the preemptions you mentioned would cause the list to appear to be
empty if one were to rely on count, but actually have a lot of stuff queued.
This causes shrinkers to not know that there is a lot of memory available to
free. I don't think its that big an issue, if we can assume the caller's
process context to have sufficient priority. Obviously using a raw spinlock
makes the process context to be highest priority briefly but without the
lock, we don't get that. But yeah that can be sorted by just updating the
count proactively, and then doing the queuing operations.
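That is, on the queueing side, something like this sketch (keeping the
current llist/atomic layout; matching ordering is of course also needed
on the read side):

	// Publish the count before the node; atomic_inc_return() is fully
	// ordered, so the count can only over-estimate, never under-estimate,
	// what is on the list.
	bool flush = atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH;

	llist_add(&head->llist_node, &rlp->head);
	if (flush)
		lazy_rcu_flush_cpu(rlp);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);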

Another advantage of using the segcblist structure, is, we can also record
the grace period numbers in it, and avoid triggering RCU from the timer if gp
already elapsed (similar to the rcu-poll family of functions).

WDYT?

thanks,

 - Joel

> > > > +	// Flush queue if too big
> > > 
> > > You will also need to check for early boot use.
> > > 
> > > I -think- it suffice to simply skip the following "if" statement when
> > > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices.  The reason being
> > > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> > > kernels won't expire them until the softirq kthreads have been spawned.
> > > 
> > > Which is OK, as it just means that call_rcu_lazy() is a bit more
> > > lazy than expected that early.
> > > 
> > > Except that call_rcu() can be invoked even before rcu_init() has been
> > > invoked, which is therefore also before rcu_init_lazy() has been invoked.
> > 
> > In other words, you are concerned that too many lazy callbacks might be
> > pending before rcu_init() is called?
> 
> In other words, I am concerned that bad things might happen if fields
> in a rcu_lazy_pcp structure are used before they are initialized.
> 
> I am not worried about too many lazy callbacks before rcu_init() because
> the non-lazy callbacks (which these currently are) are allowed to pile
> up until RCU's grace-period kthreads have been spawned.  There might
> come a time when RCU callbacks need to be invoked earlier, but that will
> be a separate problem to solve when and if, but with the benefit of the
> additional information derived from seeing the problem actually happen.
> 
> > I am going through the kfree_rcu() threads/patches involving
> > RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> > the scheduler is running causes a crash or warnings?
> 
> There are a lot of different issues that arise in different phases
> of boot.  In this particular case, my primary concern is the
> use-before-initialized bug.
> 
> > > I thefore suggest something like this at the very start of this function:
> > > 
> > > 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > > 		call_rcu(head_rcu, func);
> > > 
> > > The goal is that people can replace call_rcu() with call_rcu_lazy()
> > > without having to worry about invocation during early boot.
> > 
> > Yes, this seems safer. I don't expect much power savings during system boot
> > process anyway ;-). I believe perhaps a static branch would work better to
> > take a branch out from what is likely a fast path.
> 
> A static branch would be fine.  Maybe encapsulate it in a static inline
> function for all such comparisons, but most such comparisons are far from
> anything resembling a fastpath, so the main potential benefit would be
> added readability.  Which could be a compelling reason in and of itself.
> 
> 							Thanx, Paul
> 
> > > Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> > > though "rhp" is more consistent with the RCU pointer initials approach.
> > 
> > Fixed, thanks.
> > 
> > > > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > > > +		lazy_rcu_flush_cpu(rlp);
> > > > +	} else {
> > > > +		if (!delayed_work_pending(&rlp->work)) {
> > > 
> > > This check is racy because the work might run to completion right at
> > > this point.  Wouldn't it be better to rely on the internal check of
> > > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> > 
> > Oops, agreed. Will make it as you suggest.
> > 
> > thanks,
> > 
> >  - Joel
> > 
> > > > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +static unsigned long
> > > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > > +{
> > > > +	unsigned long count = 0;
> > > > +	int cpu;
> > > > +
> > > > +	/* Snapshot count of all CPUs */
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > +
> > > > +		count += atomic_read(&rlp->count);
> > > > +	}
> > > > +
> > > > +	return count;
> > > > +}
> > > > +
> > > > +static unsigned long
> > > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > +{
> > > > +	int cpu, freed = 0;
> > > > +
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > +		unsigned long count;
> > > > +
> > > > +		count = atomic_read(&rlp->count);
> > > > +		lazy_rcu_flush_cpu(rlp);
> > > > +		sc->nr_to_scan -= count;
> > > > +		freed += count;
> > > > +		if (sc->nr_to_scan <= 0)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	return freed == 0 ? SHRINK_STOP : freed;
> > > 
> > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> > > potential deadlocks.  But this pattern is common, including on the
> > > kvfree_rcu() path, so OK!  ;-)
> > > 
> > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> > > that as well.
> > > 
> > > > +}
> > > > +
> > > > +/*
> > > > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > > > + */
> > > > +static void lazy_work(struct work_struct *work)
> > > > +{
> > > > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > > > +
> > > > +	lazy_rcu_flush_cpu(rlp);
> > > > +}
> > > > +
> > > > +static struct shrinker lazy_rcu_shrinker = {
> > > > +	.count_objects = lazy_rcu_shrink_count,
> > > > +	.scan_objects = lazy_rcu_shrink_scan,
> > > > +	.batch = 0,
> > > > +	.seeks = DEFAULT_SEEKS,
> > > > +};
> > > > +
> > > > +void __init rcu_lazy_init(void)
> > > > +{
> > > > +	int cpu;
> > > > +
> > > > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > > > +
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > > > +	}
> > > > +
> > > > +	if (register_shrinker(&lazy_rcu_shrinker))
> > > > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > > > +}
> > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > index 24b5f2c2de87..a5f4b44f395f 100644
> > > > --- a/kernel/rcu/rcu.h
> > > > +++ b/kernel/rcu/rcu.h
> > > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> > > >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> > > >  #endif
> > > >  
> > > > +#ifdef CONFIG_RCU_LAZY
> > > > +void rcu_lazy_init(void);
> > > > +#else
> > > > +static inline void rcu_lazy_init(void) {}
> > > > +#endif
> > > >  #endif /* __LINUX_RCU_H */
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a4c25a6283b0..ebdf6f7c9023 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> > > >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> > > >  	else
> > > >  		qovld_calc = qovld;
> > > > +
> > > > +	rcu_lazy_init();
> > > >  }
> > > >  
> > > >  #include "tree_stall.h"
> > > > -- 
> > > > 2.36.0.550.gb090851708-goog
> > > > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-27 23:12         ` Joel Fernandes
@ 2022-05-28 17:57           ` Paul E. McKenney
  2022-05-30 14:48             ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-28 17:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Fri, May 27, 2022 at 11:12:03PM +0000, Joel Fernandes wrote:
> On Sat, May 14, 2022 at 09:34:21AM -0700, Paul E. McKenney wrote:
> > On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote:
> > > Hi Paul,
> > > 
> > > Thanks for bearing with my slightly late reply, I did "time blocking"
> > > technique to work on RCU today! ;-)
> > > 
> > > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > > > Implement timer-based RCU callback batching. The batch is flushed
> > > > > whenever a certain amount of time has passed, or the batch on a
> > > > > particular CPU grows too big. Also memory pressure can flush it.
> > > > > 
> > > > > Locking is avoided to reduce lock contention when queuing and dequeuing
> > > > > happens on different CPUs of a per-cpu list, such as when shrinker
> > > > > context is running on different CPU. Also not having to use locks keeps
> > > > > the per-CPU structure size small.
> > > > > 
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > 
> > > > It is very good to see this!  Inevitable comments and questions below.
> > > 
> > > Thanks for taking a look!
> > > 
> > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > > index bf8e341e75b4..c09715079829 100644
> > > > > --- a/kernel/rcu/Kconfig
> > > > > +++ b/kernel/rcu/Kconfig
> > > > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> > > > >  	  Say N here if you hate read-side memory barriers.
> > > > >  	  Take the default if you are unsure.
> > > > >  
> > > > > +config RCU_LAZY
> > > > > +	bool "RCU callback lazy invocation functionality"
> > > > > +	depends on RCU_NOCB_CPU
> > > > > +	default y
> > > > 
> > > > This "default y" is OK for experimentation, but for mainline we should
> > > > not be forcing this on unsuspecting people.  ;-)
> > > 
> > > Agreed. I'll default 'n' :)
> > > 
> > > > > +	help
> > > > > +	  To save power, batch RCU callbacks and flush after delay, memory
> > > > > +          pressure or callback list growing too big.
> > > > > +
> > > > >  endmenu # "RCU Subsystem"
> > > > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > > > > index 0cfb009a99b9..8968b330d6e0 100644
> > > > > --- a/kernel/rcu/Makefile
> > > > > +++ b/kernel/rcu/Makefile
> > > > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> > > > >  obj-$(CONFIG_TREE_RCU) += tree.o
> > > > >  obj-$(CONFIG_TINY_RCU) += tiny.o
> > > > >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > > > > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > > > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > > > new file mode 100644
> > > > > index 000000000000..55e406cfc528
> > > > > --- /dev/null
> > > > > +++ b/kernel/rcu/lazy.c
> > > > > @@ -0,0 +1,145 @@
> > > > > +/*
> > > > > + * Lockless lazy-RCU implementation.
> > > > > + */
> > > > > +#include <linux/rcupdate.h>
> > > > > +#include <linux/shrinker.h>
> > > > > +#include <linux/workqueue.h>
> > > > > +#include "rcu.h"
> > > > > +
> > > > > +// How much to batch before flushing?
> > > > > +#define MAX_LAZY_BATCH		2048
> > > > > +
> > > > > +// How much to wait before flushing?
> > > > > +#define MAX_LAZY_JIFFIES	10000
> > > > 
> > > > That is more than a minute on a HZ=10 system.  Are you sure that you
> > > > did not mean "(10 * HZ)" or some such?
> > > 
> > > Yes, you are right. I need to change that to be constant regardless of HZ. I
> > > will make the change as you suggest.
> > > 
> > > > > +
> > > > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > > > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > > > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > > > +struct lazy_rcu_head {
> > > > > +	struct llist_node llist_node;
> > > > > +	void (*func)(struct callback_head *head);
> > > > > +} __attribute__((aligned(sizeof(void *))));
> > > > 
> > > > This needs a build-time check that rcu_head and lazy_rcu_head are of
> > > > the same size.  Maybe something like this in some appropriate context:
> > > > 
> > > > 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> > > > 
> > > > Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> > > > you also mention this in the above comments.  ;-)
> > > 
> > > Cool, great minds think alike! ;-)
> > > 
> > > > > +
> > > > > +struct rcu_lazy_pcp {
> > > > > +	struct llist_head head;
> > > > > +	struct delayed_work work;
> > > > > +	atomic_t count;
> > > > > +};
> > > > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > > > > +
> > > > > +// Lockless flush of CPU, can be called concurrently.
> > > > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > > > > +{
> > > > > +	struct llist_node *node = llist_del_all(&rlp->head);
> > > > > +	struct lazy_rcu_head *cursor, *temp;
> > > > > +
> > > > > +	if (!node)
> > > > > +		return;
> > > > 
> > > > At this point, the list is empty but the count is non-zero.  Can
> > > > that cause a problem?  (For the existing callback lists, this would
> > > > be OK.)
> > > > 
> > > > > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > > > > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > > > > +		debug_rcu_head_unqueue(rh);
> > > > 
> > > > Good to see this check!
> > > > 
> > > > > +		call_rcu(rh, rh->func);
> > > > > +		atomic_dec(&rlp->count);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > > > > +{
> > > > > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > > > > +	struct rcu_lazy_pcp *rlp;
> > > > > +
> > > > > +	preempt_disable();
> > > > > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > > > 
> > > > Whitespace issue, please fix.
> > > 
> > > Fixed, thanks.
> > > 
> > > > > +	preempt_enable();
> > > > > +
> > > > > +	if (debug_rcu_head_queue((void *)head)) {
> > > > > +		// Probable double call_rcu(), just leak.
> > > > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > > > +				__func__, head);
> > > > > +
> > > > > +		// Mark as success and leave.
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > > +	// Queue to per-cpu llist
> > > > > +	head->func = func;
> > > > > +	llist_add(&head->llist_node, &rlp->head);
> > > > 
> > > > Suppose that there are a bunch of preemptions between the preempt_enable()
> > > > above and this point, so that the current CPU's list has lots of
> > > > callbacks, but zero ->count.  Can that cause a problem?
> > > > 
> > > > In the past, this sort of thing has been an issue for rcu_barrier()
> > > > and friends.
> > > 
> > > Thanks, I think I dropped the ball on this. You have given me something to
> > > think about. I am thinking on first thought that setting the count in advance
> > > of populating the list should do the trick. I will look more on it.
> > 
> > That can work, but don't forget the need for proper memory ordering.
> > 
> > Another approach would be to what the current bypass list does, namely
> > count the callback in the ->cblist structure.
> 
> I have been thinking about this, and I also arrived at the same conclusion
> that it is easier to make it a segcblist structure which has the functions do
> the memory ordering. Then we can make rcu_barrier() locklessly sample the
> ->len of the new per-cpu structure, I *think* it might work.
> 
> However, I was actually considering another situation outside of rcu_barrier,
> where the preemptions you mentioned would cause the list to appear to be
> empty if one were to rely on count, but actually have a lot of stuff queued.
> This causes shrinkers to not know that there is a lot of memory available to
> free. I don't think its that big an issue, if we can assume the caller's
> process context to have sufficient priority. Obviously using a raw spinlock
> makes the process context to be highest priority briefly but without the
> lock, we don't get that. But yeah that can be sorted by just updating the
> count proactively, and then doing the queuing operations.
> 
> Another advantage of using the segcblist structure, is, we can also record
> the grace period numbers in it, and avoid triggering RCU from the timer if gp
> already elapsed (similar to the rcu-poll family of functions).
> 
> WDYT?

What you are seeing is that the design space for call_rcu_lazy() is huge,
and the tradeoffs between energy efficiency and complexity are not well
understood.  It is therefore necessary to quickly evaluate alternatives.
Yes, you could simply implement all of them and test them, but that
might take you longer than you would like.  Plus you likely need to
put in place more testing than rcutorture currently provides.

Given what we have seen thus far, I think that the first step has to
be to evaluate what can be obtained with this approach compared with
what can be obtained with other approaches.  What we have right now,
more or less, is this:

o	Offloading buys about 15%.

o	Slowing grace periods buys about another 7%, but requires
	userspace changes to prevent degrading user-visible performance.

o	The initial call_rcu_lazy() prototype buys about 1%.

	True, this is not apples to apples because the first two were
	measured on ChromeOS and this one on Android.  Which means that
	apples to apples evaluation is also required.  But this is the
	information I currently have at hand, and it is probably no more
	than a factor of two off of what would be seen on ChromeOS.

	Or is there some ChromeOS data that tells a different story?
	After all, for all I know, Android might still be expediting
	all normal grace periods.

At which point, the question becomes "how to make up that 7%?" After all,
it is not likely that anyone is going to leave that much battery lifetime
on the table.  Here are the possibilities that I currently see:

o	Slow down grace periods, and also bite the bullet and make
	userspace changes to speed up the RCU grace periods during
	critical operations.

o	Slow down grace periods, but leverage in-kernel changes for
	some of those operations.  For example, RCU uses pm_notifier()
	to cause grace periods to be expedited during suspend and
	hibernation.  The requests for expediting are atomically counted,
	so there can be other similar setups.

o	Choose a more moderate slowing down of the grace period so as to
	minimize (or maybe even eliminate) the need for the aforementioned
	userspace changes.  This also leaves battery lifetime on the
	table, but considerably less of it.

o	Implement more selective slowdowns, such as call_rcu_lazy().

	Other selective slowdowns include in-RCU heuristics such as
	forcing the current grace period to end once the entire
	system has gone idle.  (There is prior work detecting
	full-system idleness for NO_HZ_FULL, but smaller systems
	could use more aggressive techniques.)

I am sure that there are additional possibilities.  Note that it is
possible to do more than one of these, if that proves helpful.

But given this, one question is "What is the most you can possibly
obtain from call_rcu_lazy()?"  If you have a platform with enough
memory, one way to obtain an upper bound is to make call_rcu_lazy()
simply leak the memory.  If that amount of savings does not meet the
need, then other higher-level approaches will be needed, whether
as alternatives or as additions to the call_rcu_lazy() approach.
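Just to make that experiment concrete, such a measurement-only stub
could be as simple as (never for real use, of course):

void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
{
	/* Deliberately leak: neither queue nor invoke the callback. */
	(void)head;
	(void)func;
}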

Should call_rcu_lazy() prove to be part of the mix, here are a (very)
few of the tradeoffs:

o	Put lazy callbacks on their own list or not?

	o	If not, use the bypass list?  If so, is it the
		case that call_rcu_lazy() is just call_rcu() on
		non-rcu_nocbs CPUs?

		Or add the complexity required to use the bypass
		list on rcu_nocbs CPUs but to use ->cblist otherwise?

	o	If so:

		o	Use segmented list with marked grace periods?
			Keep in mind that such a list can track only
			two grace periods.

		o	Use a plain list and have grace-period start
			simply drain the list?

o	Does call_rcu_lazy() do anything to ensure that additional grace
	periods that exist only for the benefit of lazy callbacks are
	maximally shared among all CPUs' lazy callbacks?  If so, what?
	(Keeping in mind that failing to share such grace periods
	burns battery lifetime.)

o	When there is ample memory, how long are lazy callbacks allowed
	to sleep?  Forever?  If not forever, what feeds into computing
	the timeout?

o	What is used to determine when memory is low, so that laziness
	has become a net negative?

o	What other conditions, if any, should prompt motivating lazy
	callbacks?  (See above for the start of a grace period motivating
	lazy callbacks.)

In short, you need to cast your net pretty wide on this one.  It has
not yet been very carefully explored, so there are likely to be surprises,
maybe even good surprises.  ;-)

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> > > > > +	// Flush queue if too big
> > > > 
> > > > You will also need to check for early boot use.
> > > > 
> > > > I -think- it suffice to simply skip the following "if" statement when
> > > > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices.  The reason being
> > > > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> > > > kernels won't expire them until the softirq kthreads have been spawned.
> > > > 
> > > > Which is OK, as it just means that call_rcu_lazy() is a bit more
> > > > lazy than expected that early.
> > > > 
> > > > Except that call_rcu() can be invoked even before rcu_init() has been
> > > > invoked, which is therefore also before rcu_init_lazy() has been invoked.
> > > 
> > > In other words, you are concerned that too many lazy callbacks might be
> > > pending before rcu_init() is called?
> > 
> > In other words, I am concerned that bad things might happen if fields
> > in a rcu_lazy_pcp structure are used before they are initialized.
> > 
> > I am not worried about too many lazy callbacks before rcu_init() because
> > the non-lazy callbacks (which these currently are) are allowed to pile
> > up until RCU's grace-period kthreads have been spawned.  There might
> > come a time when RCU callbacks need to be invoked earlier, but that will
> > be a separate problem to solve when and if, but with the benefit of the
> > additional information derived from seeing the problem actually happen.
> > 
> > > I am going through the kfree_rcu() threads/patches involving
> > > RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> > > the scheduler is running causes a crash or warnings?
> > 
> > There are a lot of different issues that arise in different phases
> > of boot.  In this particular case, my primary concern is the
> > use-before-initialized bug.
> > 
> > > > I thefore suggest something like this at the very start of this function:
> > > > 
> > > > 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > > > 		call_rcu(head_rcu, func);
> > > > 
> > > > The goal is that people can replace call_rcu() with call_rcu_lazy()
> > > > without having to worry about invocation during early boot.
> > > 
> > > Yes, this seems safer. I don't expect much power savings during system boot
> > > process anyway ;-). I believe perhaps a static branch would work better to
> > > take a branch out from what is likely a fast path.
> > 
> > A static branch would be fine.  Maybe encapsulate it in a static inline
> > function for all such comparisons, but most such comparisons are far from
> > anything resembling a fastpath, so the main potential benefit would be
> > added readability.  Which could be a compelling reason in and of itself.
> > 
> > 							Thanx, Paul
> > 
> > > > Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> > > > though "rhp" is more consistent with the RCU pointer initials approach.
> > > 
> > > Fixed, thanks.
> > > 
> > > > > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > > > > +		lazy_rcu_flush_cpu(rlp);
> > > > > +	} else {
> > > > > +		if (!delayed_work_pending(&rlp->work)) {
> > > > 
> > > > This check is racy because the work might run to completion right at
> > > > this point.  Wouldn't it be better to rely on the internal check of
> > > > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> > > 
> > > Oops, agreed. Will make it as you suggest.
> > > 
> > > thanks,
> > > 
> > >  - Joel
> > > 
> > > > > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > > > > +		}
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static unsigned long
> > > > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > > > +{
> > > > > +	unsigned long count = 0;
> > > > > +	int cpu;
> > > > > +
> > > > > +	/* Snapshot count of all CPUs */
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > > +
> > > > > +		count += atomic_read(&rlp->count);
> > > > > +	}
> > > > > +
> > > > > +	return count;
> > > > > +}
> > > > > +
> > > > > +static unsigned long
> > > > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > > +{
> > > > > +	int cpu, freed = 0;
> > > > > +
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > > +		unsigned long count;
> > > > > +
> > > > > +		count = atomic_read(&rlp->count);
> > > > > +		lazy_rcu_flush_cpu(rlp);
> > > > > +		sc->nr_to_scan -= count;
> > > > > +		freed += count;
> > > > > +		if (sc->nr_to_scan <= 0)
> > > > > +			break;
> > > > > +	}
> > > > > +
> > > > > +	return freed == 0 ? SHRINK_STOP : freed;
> > > > 
> > > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> > > > potential deadlocks.  But this pattern is common, including on the
> > > > kvfree_rcu() path, so OK!  ;-)
> > > > 
> > > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> > > > that as well.
> > > > 
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > > > > + */
> > > > > +static void lazy_work(struct work_struct *work)
> > > > > +{
> > > > > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > > > > +
> > > > > +	lazy_rcu_flush_cpu(rlp);
> > > > > +}
> > > > > +
> > > > > +static struct shrinker lazy_rcu_shrinker = {
> > > > > +	.count_objects = lazy_rcu_shrink_count,
> > > > > +	.scan_objects = lazy_rcu_shrink_scan,
> > > > > +	.batch = 0,
> > > > > +	.seeks = DEFAULT_SEEKS,
> > > > > +};
> > > > > +
> > > > > +void __init rcu_lazy_init(void)
> > > > > +{
> > > > > +	int cpu;
> > > > > +
> > > > > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > > > > +
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > > > > +	}
> > > > > +
> > > > > +	if (register_shrinker(&lazy_rcu_shrinker))
> > > > > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > > > > +}
> > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > index 24b5f2c2de87..a5f4b44f395f 100644
> > > > > --- a/kernel/rcu/rcu.h
> > > > > +++ b/kernel/rcu/rcu.h
> > > > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> > > > >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> > > > >  #endif
> > > > >  
> > > > > +#ifdef CONFIG_RCU_LAZY
> > > > > +void rcu_lazy_init(void);
> > > > > +#else
> > > > > +static inline void rcu_lazy_init(void) {}
> > > > > +#endif
> > > > >  #endif /* __LINUX_RCU_H */
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index a4c25a6283b0..ebdf6f7c9023 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> > > > >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> > > > >  	else
> > > > >  		qovld_calc = qovld;
> > > > > +
> > > > > +	rcu_lazy_init();
> > > > >  }
> > > > >  
> > > > >  #include "tree_stall.h"
> > > > > -- 
> > > > > 2.36.0.550.gb090851708-goog
> > > > > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-28 17:57           ` Paul E. McKenney
@ 2022-05-30 14:48             ` Joel Fernandes
  2022-05-30 16:42               ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-30 14:48 UTC (permalink / raw)
  To: paulmck; +Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

Hi Paul,

I just set up Thunderbird on an old machine just to reply to LKML.
Apologies if the formatting is weird. Replies below:

On 5/28/22 13:57, Paul E. McKenney wrote:
[..]
>>>>>> +	preempt_enable();
>>>>>> +
>>>>>> +	if (debug_rcu_head_queue((void *)head)) {
>>>>>> +		// Probable double call_rcu(), just leak.
>>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
>>>>>> +				__func__, head);
>>>>>> +
>>>>>> +		// Mark as success and leave.
>>>>>> +		return;
>>>>>> +	}
>>>>>> +
>>>>>> +	// Queue to per-cpu llist
>>>>>> +	head->func = func;
>>>>>> +	llist_add(&head->llist_node, &rlp->head);
>>>>>
>>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
>>>>> above and this point, so that the current CPU's list has lots of
>>>>> callbacks, but zero ->count.  Can that cause a problem?
>>>>>
>>>>> In the past, this sort of thing has been an issue for rcu_barrier()
>>>>> and friends.
>>>>
>>>> Thanks, I think I dropped the ball on this. You have given me something to
>>>> think about. I am thinking on first thought that setting the count in advance
>>>> of populating the list should do the trick. I will look more on it.
>>>
>>> That can work, but don't forget the need for proper memory ordering.
>>>
>>> Another approach would be to what the current bypass list does, namely
>>> count the callback in the ->cblist structure.
>>
>> I have been thinking about this, and I also arrived at the same conclusion
>> that it is easier to make it a segcblist structure which has the functions do
>> the memory ordering. Then we can make rcu_barrier() locklessly sample the
>> ->len of the new per-cpu structure, I *think* it might work.
>>
>> However, I was actually considering another situation outside of rcu_barrier,
>> where the preemptions you mentioned would cause the list to appear to be
>> empty if one were to rely on count, but actually have a lot of stuff queued.
>> This causes shrinkers to not know that there is a lot of memory available to
>> free. I don't think its that big an issue, if we can assume the caller's
>> process context to have sufficient priority. Obviously using a raw spinlock
>> makes the process context to be highest priority briefly but without the
>> lock, we don't get that. But yeah that can be sorted by just updating the
>> count proactively, and then doing the queuing operations.
>>
>> Another advantage of using the segcblist structure, is, we can also record
>> the grace period numbers in it, and avoid triggering RCU from the timer if gp
>> already elapsed (similar to the rcu-poll family of functions).
>>
>> WDYT?
> 
> What you are seeing is that the design space for call_rcu_lazy() is huge,
> and the tradeoffs between energy efficiency and complexity are not well
> understood.  It is therefore necessary to quickly evaluate alternatives.
> Yes, you could simply implement all of them and test them, but that
> might take you longer than you would like.  Plus you likely need to
> put in place more testing than rcutorture currently provides.
> 
> Given what we have seen thus far, I think that the first step has to
> be to evaluate what can be obtained with this approach compared with
> what can be obtained with other approaches.  What we have right now,
> more or less, is this:
> 
> o	Offloading buys about 15%.
> 
> o	Slowing grace periods buys about another 7%, but requires
> 	userspace changes to prevent degrading user-visible performance.

Agreed.

> 
> o	The initial call_rcu_lazy() prototype buys about 1%.


Well, that depends on the use case. A busy system saves you less
because it is busy anyway, so RCU being quiet does not help much.

From my cover letter, you can see the idle power savings are a whopping
29% with this patch series. Check the PC10 state residencies and the
package wattage.

When the ChromeOS screen is off and the user is not doing anything on
the system:
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58

After:
Pk%pc10 = 81.28
PkgWatt = 0.41
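(That is, 0.58 W down to 0.41 W, i.e. (0.58 - 0.41) / 0.58 ~= 29% less
package power in the screen-off idle case.)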

> 
> 	True, this is not apples to apples because the first two were
> 	measured on ChromeOS and this one on Android.

Yes, agreed. I think Vlad is not seeing any jaw-dropping savings because
he's testing on ARM. But on Intel we see a lot. Though, I asked Vlad to
also check whether RCU callbacks are indeed not being queued (i.e. are
quiet) after applying the patch; if they are still being queued, there
may still be some call_rcu()s that need conversion to lazy on his setup.
For example, he might have a driver on his ARM platform that's queuing
RCU CBs constantly.

I was thinking that perhaps a debug patch can help quickly nail down
such RCU noise, in the way of a warning. Of course, debug CONFIG should
never be enabled in production, so it would be something along the lines
of how lockdep is used for debugging.
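Very roughly, something along these lines (all of the names here are
made up purely for illustration):

/* Hypothetical debug hook, called from call_rcu() with
 * __builtin_return_address(0): warn (ratelimited) about remaining
 * non-lazy callers while the lazy path is enabled. */
static void rcu_lazy_debug_noisy_caller(void *caller)
{
	if (IS_ENABLED(CONFIG_RCU_LAZY_DEBUG) && READ_ONCE(rcu_lazy_enabled))
		pr_warn_ratelimited("non-lazy call_rcu() from %pS\n", caller);
}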

>Which means that
> 	apples to apples evaluation is also required.  But this is the
> 	information I currently have at hand, and it is probably no more
> 	than a factor of two off of what would be seen on ChromeOS.
> 
> 	Or is there some ChromeOS data that tells a different story?
> 	After all, for all I know, Android might still be expediting
> 	all normal grace periods.
> 
> At which point, the question becomes "how to make up that 7%?" After all,
> it is not likely that anyone is going to leave that much battery lifetime
> on the table.  Here are the possibilities that I currently see:
> 
> o	Slow down grace periods, and also bite the bullet and make
> 	userspace changes to speed up the RCU grace periods during
> 	critical operations.

We tried this, and tracing suffers quite a lot. The system also felt
"sluggish", which I suspect is because of synchronize_rcu() slowdowns in
other paths.

> 
> o	Slow down grace periods, but leverage in-kernel changes for
> 	some of those operations.  For example, RCU uses pm_notifier()
> 	to cause grace periods to be expedited during suspend and

Nice! Did not know this.

> 	hibernation.  The requests for expediting are atomically counted,
> 	so there can be other similar setups.

I do like slowing down grace periods globally, because it's easy, but I
think it can have quite a few ripple effects if in the future a user of
call_rcu() does not expect lazy behavior :( but still gets it. Much like
how synchronize_rcu_mult() suffered back in its day.

> o	Choose a more moderate slowing down of the grace period so as to
> 	minimize (or maybe even eliminate) the need for the aforementioned
> 	userspace changes.  This also leaves battery lifetime on the
> 	table, but considerably less of it.
> 
> o	Implement more selective slowdowns, such as call_rcu_lazy().
> 
> 	Other selective slowdowns include in-RCU heuristics such as
> 	forcing the current grace period to end once the entire
> 	system has gone idle.  (There is prior work detecting
> 	full-system idleness for NO_HZ_FULL, but smaller systems
> 	could use more aggressive techniques.)
> 
> I am sure that there are additional possibilities.  Note that it is
> possible to do more than one of these, if that proves helpful.

Sure!
> But given this, one question is "What is the most you can possibly
> obtain from call_rcu_lazy()?"  If you have a platform with enough
> memory, one way to obtain an upper bound is to make call_rcu_lazy()
> simply leak the memory.  If that amount of savings does not meet the
> need, then other higher-level approaches will be needed, whether
> as alternatives or as additions to the call_rcu_lazy() approach.
> 
> Should call_rcu_lazy() prove to be part of the mix, here are a (very)
> few of the tradeoffs:
> 
> o	Put lazy callbacks on their own list or not?
> 
> 	o	If not, use the bypass list?  If so, is it the
> 		case that call_rcu_lazy() is just call_rcu() on
> 		non-rcu_nocbs CPUs?
> 
> 		Or add the complexity required to use the bypass
> 		list on rcu_nocbs CPUs but to use ->cblist otherwise?

For the bypass list, I am kind of reluctant because the "flush time" of
the bypass list may not match the laziness we seek. For example, I want
to allow the user to configure flushing the CBs only after 15 seconds or
so, if the CB list does not grow too big. That might hurt folks who use
the bypass list but require a quicker response.

Also, does doing so not prevent usage of lazy CBs on systems without
NOCB? So if we want to future-proof this, I guess that might not be a
good decision.

> 
> 	o	If so:
> 
> 		o	Use segmented list with marked grace periods?
> 			Keep in mind that such a list can track only
> 			two grace periods.
> 
> 		o	Use a plain list and have grace-period start
> 			simply drain the list?

I want to use the segmented list, regardless of whether we use the
bypass list or not, because we get those memory barriers and
rcu_barrier() lockless sampling of ->len, for free :).
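Roughly what I have in mind for the per-CPU structure (a sketch,
assuming the existing rcu_segcblist helpers can be reused as-is):

struct rcu_lazy_pcp {
	struct rcu_segcblist cblist;	/* ->len maintained with ordering */
	struct delayed_work work;	/* flush timer */
};

so that rcu_barrier() could locklessly sample rcu_segcblist_n_cbs()
instead of a separate atomic_t count.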

> 
> o	Does call_rcu_lazy() do anything to ensure that additional grace
> 	periods that exist only for the benefit of lazy callbacks are
> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> 	(Keeping in mind that failing to share such grace periods
> 	burns battery lifetime.)

I could be missing your point; can you give an example of how you want
the behavior to be?

I am not thinking of creating separate GP cycles just for lazy CBs. The
GP cycle will be the same one that non-lazy CBs use. A lazy CB will just
record the current or last GP and then wait. Once the timer expires, it
will check the current or last GP again. If a new GP was not started, it
starts one. If a new GP was started but not completed, it can simply
call_rcu(). If a new GP started and completed, it does not start a new
GP and just executes the CB, something like that. Before the flush
happens, if multiple lazy CBs were queued in the meanwhile and the GP
sequence counters moved forward, then all those lazy CBs will share the
same GP (either not starting a new one, or calling call_rcu() to share
the ongoing one).
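In pseudo-C, the flush-time decision for each lazy CB might look like
this (using the existing polled-GP API; everything around it is an
assumption, not settled code):

	/* At call_rcu_lazy() time, stored alongside the CB: */
	gp_snap = get_state_synchronize_rcu();

	/* At flush (timer or shrinker) time, for each queued rhp: */
	if (poll_state_synchronize_rcu(gp_snap))
		rhp->func(rhp);			/* a full GP already elapsed */
	else
		call_rcu(rhp, rhp->func);	/* share the ongoing/next GP */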

> o	When there is ample memory, how long are lazy callbacks allowed
> 	to sleep?  Forever?  If not forever, what feeds into computing
> 	the timeout?

Yeah, the timeout is fixed currently. I agree this is a good thing to
optimize.

> o	What is used to determine when memory is low, so that laziness
> 	has become a net negative?

The assumption I think we make is that the shrinkers will constantly
flush out lazy CBs if there is memory pressure.

> o	What other conditions, if any, should prompt motivating lazy
> 	callbacks?  (See above for the start of a grace period motivating
> 	lazy callbacks.)
> 
> In short, you need to cast your net pretty wide on this one.  It has
> not yet been very carefully explored, so there are likely to be surprises,
> maybe even good surprises.  ;-)

Cool, that sounds like a good opportunity to work on something cool ;-)

Happy memorial day! Off for lunch with some friends and then back to
tinkering.

Thanks,

- Joel

> 
> 							Thanx, Paul
> 
>> thanks,
>>
>>  - Joel
>>
>>>>>> +	// Flush queue if too big
>>>>>
>>>>> You will also need to check for early boot use.
>>>>>
>>>>> I -think- it suffice to simply skip the following "if" statement when
>>>>> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices.  The reason being
>>>>> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
>>>>> kernels won't expire them until the softirq kthreads have been spawned.
>>>>>
>>>>> Which is OK, as it just means that call_rcu_lazy() is a bit more
>>>>> lazy than expected that early.
>>>>>
>>>>> Except that call_rcu() can be invoked even before rcu_init() has been
>>>>> invoked, which is therefore also before rcu_init_lazy() has been invoked.
>>>>
>>>> In other words, you are concerned that too many lazy callbacks might be
>>>> pending before rcu_init() is called?
>>>
>>> In other words, I am concerned that bad things might happen if fields
>>> in a rcu_lazy_pcp structure are used before they are initialized.
>>>
>>> I am not worried about too many lazy callbacks before rcu_init() because
>>> the non-lazy callbacks (which these currently are) are allowed to pile
>>> up until RCU's grace-period kthreads have been spawned.  There might
>>> come a time when RCU callbacks need to be invoked earlier, but that will
>>> be a separate problem to solve when and if, but with the benefit of the
>>> additional information derived from seeing the problem actually happen.
>>>
>>>> I am going through the kfree_rcu() threads/patches involving
>>>> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
>>>> the scheduler is running causes a crash or warnings?
>>>
>>> There are a lot of different issues that arise in different phases
>>> of boot.  In this particular case, my primary concern is the
>>> use-before-initialized bug.
>>>
>>>>> I thefore suggest something like this at the very start of this function:
>>>>>
>>>>> 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
>>>>> 		call_rcu(head_rcu, func);
>>>>>
>>>>> The goal is that people can replace call_rcu() with call_rcu_lazy()
>>>>> without having to worry about invocation during early boot.
>>>>
>>>> Yes, this seems safer. I don't expect much power savings during system boot
>>>> process anyway ;-). I believe perhaps a static branch would work better to
>>>> take a branch out from what is likely a fast path.
>>>
>>> A static branch would be fine.  Maybe encapsulate it in a static inline
>>> function for all such comparisons, but most such comparisons are far from
>>> anything resembling a fastpath, so the main potential benefit would be
>>> added readability.  Which could be a compelling reason in and of itself.
>>>
>>> 							Thanx, Paul
>>>
>>>>> Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
>>>>> though "rhp" is more consistent with the RCU pointer initials approach.
>>>>
>>>> Fixed, thanks.
>>>>
>>>>>> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
>>>>>> +		lazy_rcu_flush_cpu(rlp);
>>>>>> +	} else {
>>>>>> +		if (!delayed_work_pending(&rlp->work)) {
>>>>>
>>>>> This check is racy because the work might run to completion right at
>>>>> this point.  Wouldn't it be better to rely on the internal check of
>>>>> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
>>>>
>>>> Oops, agreed. Will make it as you suggest.
>>>>
>>>> thanks,
>>>>
>>>>  - Joel
>>>>
>>>>>> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
>>>>>> +		}
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned long
>>>>>> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>>>>>> +{
>>>>>> +	unsigned long count = 0;
>>>>>> +	int cpu;
>>>>>> +
>>>>>> +	/* Snapshot count of all CPUs */
>>>>>> +	for_each_possible_cpu(cpu) {
>>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
>>>>>> +
>>>>>> +		count += atomic_read(&rlp->count);
>>>>>> +	}
>>>>>> +
>>>>>> +	return count;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned long
>>>>>> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>>>>>> +{
>>>>>> +	int cpu, freed = 0;
>>>>>> +
>>>>>> +	for_each_possible_cpu(cpu) {
>>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
>>>>>> +		unsigned long count;
>>>>>> +
>>>>>> +		count = atomic_read(&rlp->count);
>>>>>> +		lazy_rcu_flush_cpu(rlp);
>>>>>> +		sc->nr_to_scan -= count;
>>>>>> +		freed += count;
>>>>>> +		if (sc->nr_to_scan <= 0)
>>>>>> +			break;
>>>>>> +	}
>>>>>> +
>>>>>> +	return freed == 0 ? SHRINK_STOP : freed;
>>>>>
>>>>> This is a bit surprising given the stated aim of SHRINK_STOP to indicate
>>>>> potential deadlocks.  But this pattern is common, including on the
>>>>> kvfree_rcu() path, so OK!  ;-)
>>>>>
>>>>> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
>>>>> that as well.
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
>>>>>> + */
>>>>>> +static void lazy_work(struct work_struct *work)
>>>>>> +{
>>>>>> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
>>>>>> +
>>>>>> +	lazy_rcu_flush_cpu(rlp);
>>>>>> +}
>>>>>> +
>>>>>> +static struct shrinker lazy_rcu_shrinker = {
>>>>>> +	.count_objects = lazy_rcu_shrink_count,
>>>>>> +	.scan_objects = lazy_rcu_shrink_scan,
>>>>>> +	.batch = 0,
>>>>>> +	.seeks = DEFAULT_SEEKS,
>>>>>> +};
>>>>>> +
>>>>>> +void __init rcu_lazy_init(void)
>>>>>> +{
>>>>>> +	int cpu;
>>>>>> +
>>>>>> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
>>>>>> +
>>>>>> +	for_each_possible_cpu(cpu) {
>>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
>>>>>> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
>>>>>> +	}
>>>>>> +
>>>>>> +	if (register_shrinker(&lazy_rcu_shrinker))
>>>>>> +		pr_err("Failed to register lazy_rcu shrinker!\n");
>>>>>> +}
>>>>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>>>>>> index 24b5f2c2de87..a5f4b44f395f 100644
>>>>>> --- a/kernel/rcu/rcu.h
>>>>>> +++ b/kernel/rcu/rcu.h
>>>>>> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
>>>>>>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
>>>>>>  #endif
>>>>>>  
>>>>>> +#ifdef CONFIG_RCU_LAZY
>>>>>> +void rcu_lazy_init(void);
>>>>>> +#else
>>>>>> +static inline void rcu_lazy_init(void) {}
>>>>>> +#endif
>>>>>>  #endif /* __LINUX_RCU_H */
>>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>>> index a4c25a6283b0..ebdf6f7c9023 100644
>>>>>> --- a/kernel/rcu/tree.c
>>>>>> +++ b/kernel/rcu/tree.c
>>>>>> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
>>>>>>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
>>>>>>  	else
>>>>>>  		qovld_calc = qovld;
>>>>>> +
>>>>>> +	rcu_lazy_init();
>>>>>>  }
>>>>>>  
>>>>>>  #include "tree_stall.h"
>>>>>> -- 
>>>>>> 2.36.0.550.gb090851708-goog
>>>>>>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-17  9:07   ` Uladzislau Rezki
@ 2022-05-30 14:54     ` Joel Fernandes
  2022-06-01 14:12       ` Frederic Weisbecker
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-30 14:54 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, rushikesh.s.kadam, neeraj.iitr10, frederic, paulmck, rostedt

Hi Vlad,

Thanks for the comments. I replied inline:

On 5/17/22 05:07, Uladzislau Rezki wrote:
>> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
>> index 0cfb009a99b9..8968b330d6e0 100644
>> --- a/kernel/rcu/Makefile
>> +++ b/kernel/rcu/Makefile
>> @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
>>  obj-$(CONFIG_TREE_RCU) += tree.o
>>  obj-$(CONFIG_TINY_RCU) += tiny.o
>>  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
>> +obj-$(CONFIG_RCU_LAZY) += lazy.o
>> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
>> new file mode 100644
>> index 000000000000..55e406cfc528
>> --- /dev/null
>> +++ b/kernel/rcu/lazy.c
>> @@ -0,0 +1,145 @@
>> +/*
>> + * Lockless lazy-RCU implementation.
>> + */
>> +#include <linux/rcupdate.h>
>> +#include <linux/shrinker.h>
>> +#include <linux/workqueue.h>
>> +#include "rcu.h"
>> +
>> +// How much to batch before flushing?
>> +#define MAX_LAZY_BATCH		2048
>> +
>> +// How much to wait before flushing?
>> +#define MAX_LAZY_JIFFIES	10000
>> +
>> +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
>> +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
>> +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
>> +struct lazy_rcu_head {
>> +	struct llist_node llist_node;
>> +	void (*func)(struct callback_head *head);
>> +} __attribute__((aligned(sizeof(void *))));
>> +
>> +struct rcu_lazy_pcp {
>> +	struct llist_head head;
>> +	struct delayed_work work;
>> +	atomic_t count;
>> +};
>> +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
>> +
>> +// Lockless flush of CPU, can be called concurrently.
>> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
>> +{
>> +	struct llist_node *node = llist_del_all(&rlp->head);
>> +	struct lazy_rcu_head *cursor, *temp;
>> +
>> +	if (!node)
>> +		return;
>> +
>> +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
>> +		struct rcu_head *rh = (struct rcu_head *)cursor;
>> +		debug_rcu_head_unqueue(rh);
>> +		call_rcu(rh, rh->func);
>> +		atomic_dec(&rlp->count);
>> +	}
>> +}
>> +
>> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
>> +{
>> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
>> +	struct rcu_lazy_pcp *rlp;
>> +
>> +	preempt_disable();
>> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
>> +	preempt_enable();
>>
> Can we get rid of such explicit disabling/enabling preemption?

Ok I'll try. Last I checked, something needs to disable preemption to
prevent warnings with sampling the current processor ID.
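One option might be get_cpu_ptr()/put_cpu_ptr(), which bundle the
preemption disabling with the per-CPU dereference and also keep the
enqueue on the sampled CPU (just a sketch, with the debug_rcu_head_queue()
check elided):

	/* Sketch: preemption stays disabled across both the per-CPU
	 * dereference and the llist_add(), instead of a bare
	 * preempt_disable()/preempt_enable() pair around this_cpu_ptr(). */
	rlp = get_cpu_ptr(&rcu_lazy_pcp_ins);
	head->func = func;
	llist_add(&head->llist_node, &rlp->head);
	put_cpu_ptr(&rcu_lazy_pcp_ins);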

>> +
>> +	if (debug_rcu_head_queue((void *)head)) {
>> +		// Probable double call_rcu(), just leak.
>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
>> +				__func__, head);
>> +
>> +		// Mark as success and leave.
>> +		return;
>> +	}
>> +
>> +	// Queue to per-cpu llist
>> +	head->func = func;
>> +	llist_add(&head->llist_node, &rlp->head);
>> +
>> +	// Flush queue if too big
>> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
>> +		lazy_rcu_flush_cpu(rlp);
>>
> Can we just schedule the work instead of drawn from the caller context?
> For example it can be a hard-irq context.

You raise a good point. Ok I'll do that. Though if the CB list is small,
I would prefer to do it inline. I will look more into it.
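Perhaps something along these lines (a sketch of just the too-big branch;
mod_delayed_work() pulls an already-queued timer forward so the flush runs
from the workqueue rather than from a possibly hard-irq caller, and the
else branch relies on the internal WORK_STRUCT_PENDING_BIT check):

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
		/* Too big: flush soon, but from workqueue context. */
		mod_delayed_work(system_wq, &rlp->work, 0);
	} else {
		/* No-op if the delayed work is already pending. */
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
	}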

> 
>> +	} else {
>> +		if (!delayed_work_pending(&rlp->work)) {
>> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
>> +		}
>> +	}
>> +}
> EXPORT_SYMBOL_GPL()? to be able to use in kernel modules.
> 

Sure, will fix.
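(For reference, presumably just the usual export next to the definition in
lazy.c:)

	EXPORT_SYMBOL_GPL(call_rcu_lazy);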

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-30 14:48             ` Joel Fernandes
@ 2022-05-30 16:42               ` Paul E. McKenney
  2022-05-31  2:12                 ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-30 16:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote:
> Hi Paul,
> 
> I just setup thunderbird on an old machine just to reply to LKML.
> Apologies if the formatting is weird. Replies below:

Looks fine at first glance.  ;-)

> On 5/28/22 13:57, Paul E. McKenney wrote:
> [..]
> >>>>>> +	preempt_enable();
> >>>>>> +
> >>>>>> +	if (debug_rcu_head_queue((void *)head)) {
> >>>>>> +		// Probable double call_rcu(), just leak.
> >>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> >>>>>> +				__func__, head);
> >>>>>> +
> >>>>>> +		// Mark as success and leave.
> >>>>>> +		return;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	// Queue to per-cpu llist
> >>>>>> +	head->func = func;
> >>>>>> +	llist_add(&head->llist_node, &rlp->head);
> >>>>>
> >>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
> >>>>> above and this point, so that the current CPU's list has lots of
> >>>>> callbacks, but zero ->count.  Can that cause a problem?
> >>>>>
> >>>>> In the past, this sort of thing has been an issue for rcu_barrier()
> >>>>> and friends.
> >>>>
> >>>> Thanks, I think I dropped the ball on this. You have given me something to
> >>>> think about. I am thinking on first thought that setting the count in advance
> >>>> of populating the list should do the trick. I will look more on it.
> >>>
> >>> That can work, but don't forget the need for proper memory ordering.
> >>>
> >>> Another approach would be to what the current bypass list does, namely
> >>> count the callback in the ->cblist structure.
> >>
> >> I have been thinking about this, and I also arrived at the same conclusion
> >> that it is easier to make it a segcblist structure which has the functions do
> >> the memory ordering. Then we can make rcu_barrier() locklessly sample the
> >> ->len of the new per-cpu structure, I *think* it might work.
> >>
> >> However, I was actually considering another situation outside of rcu_barrier,
> >> where the preemptions you mentioned would cause the list to appear to be
> >> empty if one were to rely on count, but actually have a lot of stuff queued.
> >> This causes shrinkers to not know that there is a lot of memory available to
> >> free. I don't think its that big an issue, if we can assume the caller's
> >> process context to have sufficient priority. Obviously using a raw spinlock
> >> makes the process context to be highest priority briefly but without the
> >> lock, we don't get that. But yeah that can be sorted by just updating the
> >> count proactively, and then doing the queuing operations.
> >>
> >> Another advantage of using the segcblist structure, is, we can also record
> >> the grace period numbers in it, and avoid triggering RCU from the timer if gp
> >> already elapsed (similar to the rcu-poll family of functions).
> >>
> >> WDYT?
> > 
> > What you are seeing is that the design space for call_rcu_lazy() is huge,
> > and the tradeoffs between energy efficiency and complexity are not well
> > understood.  It is therefore necessary to quickly evaluate alternatives.
> > Yes, you could simply implement all of them and test them, but that
> > might take you longer than you would like.  Plus you likely need to
> > put in place more testing than rcutorture currently provides.
> > 
> > Given what we have seen thus far, I think that the first step has to
> > be to evaluate what can be obtained with this approach compared with
> > what can be obtained with other approaches.  What we have right now,
> > more or less, is this:
> > 
> > o	Offloading buys about 15%.
> > 
> > o	Slowing grace periods buys about another 7%, but requires
> > 	userspace changes to prevent degrading user-visible performance.
> 
> Agreed.
> 
> > 
> > o	The initial call_rcu_lazy() prototype buys about 1%.
> 
> Well, that depends on the usecase. A busy system saves you less because
> it is busy anyway, so RCU being quiet does not help much.
> 
> From my cover letter, you can see the idle power savings is a whopping
> 29% with this patch series. Check the PC10 state residencies and the
> package wattage.
> 
> When ChromeOS screen is off and user is not doing anything on the system:
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> 
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41

These numbers do make a better case for this.  On the other hand, that
patch series was missing things like rcu_barrier() support and might also
be converting more call_rcu() invocations to call_rcu_lazy() invocations.
It is therefore necessary to take these numbers with a grain of salt,
much though I hope that they turn out to be realistic.

> > 
> > 	True, this is not apples to apples because the first two were
> > 	measured on ChromeOS and this one on Android.
> 
> Yes agreed. I think Vlad is not seeing any jaw dropping savings, because
> he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to
> also check if rcu callbacks are indeed not being queued / quiet after
> applying the patch, and if not then there may still be some call_rcu()s
> that need conversion to lazy, on his setup. For example, he might have a
> driver on his ARM platform that's queueing RCU CBs constantly.
> 
> I was thinking that perhaps a debug patch can help quickly nail down
> such RCU noise, in the way of a warning. Of course, debug CONFIG should
> never be enabled in production, so it would be something along the lines
> of how lockdep is used for debugging.

The trick would be figuring out which of the callbacks are noise.
I suspect that userspace help would be required.  Or maybe just leverage
the existing event traces and let userspace sort it out.

> >Which means that
> > 	apples to apples evaluation is also required.  But this is the
> > 	information I currently have at hand, and it is probably no more
> > 	than a factor of two off of what would be seen on ChromeOS.
> > 
> > 	Or is there some ChromeOS data that tells a different story?
> > 	After all, for all I know, Android might still be expediting
> > 	all normal grace periods.
> > 
> > At which point, the question becomes "how to make up that 7%?" After all,
> > it is not likely that anyone is going to leave that much battery lifetime
> > on the table.  Here are the possibilities that I currently see:
> > 
> > o	Slow down grace periods, and also bite the bullet and make
> > 	userspace changes to speed up the RCU grace periods during
> > 	critical operations.
> 
> We tried this, and tracing suffers quite a lot. The system also felt
> "sluggish" which I suspect is because of synchronize_rcu() slow downs in
> other paths.

In what way does tracing suffer?  Removing tracepoints?

And yes, this approach absolutely requires finding code paths with
user-visible grace periods.  Which sounds like you tried the "Slow down
grace periods" part but not the "bit the bullet" part.  ;-)

> > o	Slow down grace periods, but leverage in-kernel changes for
> > 	some of those operations.  For example, RCU uses pm_notifier()
> > 	to cause grace periods to be expedited during suspend and
> 
> Nice! Did not know this.
> 
> > 	hibernation.  The requests for expediting are atomically counted,
> > 	so there can be other similar setups.
> 
> I do like slowing down grace periods globally, because its easy, but I
> think it can have quite a few ripple effects if in the future a user of
> call_rcu() does not expect lazy behavior :( but still gets it. Same like
> how synchronize_rcu_mult() suffered back in its day.

No matter what the approach, there are going to be ripple effects.

> > o	Choose a more moderate slowing down of the grace period so as to
> > 	minimize (or maybe even eliminate) the need for the aforementioned
> > 	userspace changes.  This also leaves battery lifetime on the
> > 	table, but considerably less of it.
> > 
> > o	Implement more selective slowdowns, such as call_rcu_lazy().
> > 
> > 	Other selective slowdowns include in-RCU heuristics such as
> > 	forcing the current grace period to end once the entire
> > 	system has gone idle.  (There is prior work detecting
> > 	full-system idleness for NO_HZ_FULL, but smaller systems
> > 	could use more aggressive techniques.)
> > 
> > I am sure that there are additional possibilities.  Note that it is
> > possible to do more than one of these, if that proves helpful.
> 
> Sure!
> > But given this, one question is "What is the most you can possibly
> > obtain from call_rcu_lazy()?"  If you have a platform with enough
> > memory, one way to obtain an upper bound is to make call_rcu_lazy()
> > simply leak the memory.  If that amount of savings does not meet the
> > need, then other higher-level approaches will be needed, whether
> > as alternatives or as additions to the call_rcu_lazy() approach.
> > 
> > Should call_rcu_lazy() prove to be part of the mix, here are a (very)
> > few of the tradeoffs:
> > 
> > o	Put lazy callbacks on their own list or not?
> > 
> > 	o	If not, use the bypass list?  If so, is it the
> > 		case that call_rcu_lazy() is just call_rcu() on
> > 		non-rcu_nocbs CPUs?
> > 
> > 		Or add the complexity required to use the bypass
> > 		list on rcu_nocbs CPUs but to use ->cblist otherwise?
> 
> For bypass list, I am kind of reluctant because the "flush time" of the
> bypass list may not match the laziness we seek? For example, I want to
> allow user to configure to flush the CBs only after 15 seconds or so, if
> the CB list does not grow too big. That might hurt folks using the
> bypass list, but requiring a quicker response.
> 
> Also, does doing so not prevent usage of lazy CBs on systems without
> NOCB? So if we want to future-proof this, I guess that might not be a
> good decision.

True enough, but would this future actually arrive?  After all, if
someone cared enough about energy efficiency to use call_rcu_lazy(),
why wouldn't they also offload callbacks?

On the flush-time mismatch, if there are any non-lazy callbacks in the
list, it costs you nothing to let the lazy callbacks tag along through
the grace period.  So one approach would be to use the current flush
time if there are non-lazy callbacks, but use the longer flush time if
all of the callbacks in the list are lazy callbacks.
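In sketch form, assuming something tracks how many of the bypass callbacks
are lazy (rcu_cblist_n_lazy_cbs() and LAZY_FLUSH_JIFFIES are made-up names
here):

	/* Sketch: pick the bypass flush deadline based on what is queued. */
	static unsigned long nocb_bypass_deadline(struct rcu_data *rdp)
	{
		long n = rcu_cblist_n_cbs(&rdp->nocb_bypass);
		long n_lazy = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);	/* made up */

		if (n != n_lazy)
			return jiffies + 1;	/* non-lazy present: flush promptly */
		return jiffies + LAZY_FLUSH_JIFFIES;	/* all lazy: let them sleep */
	}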

> > 	o	If so:
> > 
> > 		o	Use segmented list with marked grace periods?
> > 			Keep in mind that such a list can track only
> > 			two grace periods.
> > 
> > 		o	Use a plain list and have grace-period start
> > 			simply drain the list?
> 
> I want to use the segmented list, regardless of whether we use the
> bypass list or not, because we get those memory barriers and
> rcu_barrier() lockless sampling of ->len, for free :).

The bypass list also gets you the needed memory barriers and lockless
sampling of ->len.  As does any other type of list as long as the
->cblist.len field accounts for the lazy callbacks.

So the main difference is the tracking of grace periods, and even then,
only those grace periods for which this CPU has no non-lazy callbacks.
Or, in the previously discussed case where a single rcuoc kthread serves
multiple CPUs, only those grace periods for which this group of CPUs
has no non-lazy callbacks.

And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
for the system, at least in the common case where there is no callback
flooding?

> > o	Does call_rcu_lazy() do anything to ensure that additional grace
> > 	periods that exist only for the benefit of lazy callbacks are
> > 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> > 	(Keeping in mind that failing to share such grace periods
> > 	burns battery lifetime.)
> 
> I could be missing your point, can you give example of how you want the
> behavior to be?

CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
Given that the full price is being paid anyway, why not also invoke
those 55 lazy callbacks?

If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
again the full price of grace-period computation and rcuoc wakeups is
being paid, so why not also invoke those 26 lazy callbacks?

On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
rcuoc kthread, and with the same numbers of callbacks as in the previous
paragraph, you get to make a choice.  Do you do an extra wakeup for
CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
computation some time in the future?

Suppose further that the CPU that would be awakened for CPU 0's rcuoc
kthread is already non-idle.  At that point, this wakeup is quite cheap.
Except that the wakeup-or-don't choice is made at the beginning of the
grace period, and that CPU's idleness at that point might not be all
that well correlated with its idleness at the time of the wakeup at the
end of that grace period.  Nevertheless, idleness statistics could
help trade off needless from-idle wakeups for needless grace periods.

Which sound like a good reason to consolidate rcuoc processing on
non-busy systems.

But more to the point...

Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
RCU is otherwise idle.  We really want one (not three!) grace periods
to eventually handle those lazy callbacks, right?

> I am not thinking of creating separate GP cycles just for lazy CBs. The
> GP cycle will be the same that non-lazy uses. A lazy CB will just record
> what is the current or last GP, and then wait. Once the timer expires,
> it will check what is the current or last GP. If a new GP was not
> started, then it starts a new GP.

From my perspective, this new GP is in fact a separate GP just for lazy
callbacks.

What am I missing here?

Also, if your timer triggers the check, isn't that an additional potential
wakeup from idle?  Would it make sense to also minimize those wakeups?
(This is one motivation for grace periods to pick up lazy callbacks.)

>                                   If a new GP was started but not
> completed, it can simply call_rcu(). If a new GP started and completed,
> it does not start a new GP and just executes the CB, something like that.
> Before the flush happens, if multiple lazy CBs were queued in the
> meanwhile and the GP seq counters moved forward, then all those lazy CBs
> will share the same GP (either not starting a new one, or calling
> call_rcu() to share the ongoing one, something like that).

You could move the ready-to-invoke callbacks to the end of the DONE
segment of ->cblist, assuming that you also adjust rcu_barrier()
appropriately.  If rcu_barrier() is rare, you could simply flush whatever
lazy list before entraining the rcu_barrier() callback.
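In sketch form, against your current separate per-CPU list
(rcu_barrier_entrain_lazy() is a made-up name, to be called from
rcu_barrier()'s per-CPU loop before the barrier callback is entrained):

	/* Sketch: hand any lazy callbacks to call_rcu() first so the
	 * barrier callback queued afterwards cannot pass them. */
	static void rcu_barrier_entrain_lazy(int cpu)
	{
		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);

		if (atomic_read(&rlp->count))
			lazy_rcu_flush_cpu(rlp);
	}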

> > o	When there is ample memory, how long are lazy callbacks allowed
> > 	to sleep?  Forever?  If not forever, what feeds into computing
> > 	the timeout?
> 
> Yeah, the timeout is fixed currently. I agree this is a good thing to
> optimize.

First see what the behavior is.  If the timeout is long enough that
there is almost never a lazy-only grace period, then maybe a fixed
timeout is good enough.

> > o	What is used to determine when memory is low, so that laziness
> > 	has become a net negative?
> 
> The assumption I think we make is that the shrinkers will constantly
> flush out lazy CBs if there is memory pressure.

It might be worth looking at SPI.  That could help avoid a grace-period
wait when clearing out lazy callbacks.

> > o	What other conditions, if any, should prompt motivating lazy
> > 	callbacks?  (See above for the start of a grace period motivating
> > 	lazy callbacks.)
> > 
> > In short, you need to cast your net pretty wide on this one.  It has
> > not yet been very carefully explored, so there are likely to be surprises,
> > maybe even good surprises.  ;-)
> 
> Cool that sounds like some good opportunity to work on something cool ;-)

Which comes back around to your point of needing evaluation of extra
RCU work.  Lots of tradeoffs between different wakeup sources and grace
periods.  Fine-grained views into the black box will be helpful.  ;-)

> Happy memorial day! Off for lunch with some friends and then back to
> tinkering.

Have a good holiday!

							Thanx, Paul

> Thanks,
> 
> - Joel
> 
> > 
> > 							Thanx, Paul
> > 
> >> thanks,
> >>
> >>  - Joel
> >>
> >>>>>> +	// Flush queue if too big
> >>>>>
> >>>>> You will also need to check for early boot use.
> >>>>>
> >>>>> I -think- it suffices to simply skip the following "if" statement when
> >>>>> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
> >>>>> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> >>>>> kernels won't expire them until the softirq kthreads have been spawned.
> >>>>>
> >>>>> Which is OK, as it just means that call_rcu_lazy() is a bit more
> >>>>> lazy than expected that early.
> >>>>>
> >>>>> Except that call_rcu() can be invoked even before rcu_init() has been
> >>>>> invoked, which is therefore also before rcu_init_lazy() has been invoked.
> >>>>
> >>>> In other words, you are concerned that too many lazy callbacks might be
> >>>> pending before rcu_init() is called?
> >>>
> >>> In other words, I am concerned that bad things might happen if fields
> >>> in a rcu_lazy_pcp structure are used before they are initialized.
> >>>
> >>> I am not worried about too many lazy callbacks before rcu_init() because
> >>> the non-lazy callbacks (which these currently are) are allowed to pile
> >>> up until RCU's grace-period kthreads have been spawned.  There might
> >>> come a time when RCU callbacks need to be invoked earlier, but that will
> >>> be a separate problem to solve when and if, but with the benefit of the
> >>> additional information derived from seeing the problem actually happen.
> >>>
> >>>> I am going through the kfree_rcu() threads/patches involving
> >>>> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> >>>> the scheduler is running causes a crash or warnings?
> >>>
> >>> There are a lot of different issues that arise in different phases
> >>> of boot.  In this particular case, my primary concern is the
> >>> use-before-initialized bug.
> >>>
> >>>>> I therefore suggest something like this at the very start of this function:
> >>>>>
> >>>>> 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> >>>>> 		call_rcu(head_rcu, func);
> >>>>>
> >>>>> The goal is that people can replace call_rcu() with call_rcu_lazy()
> >>>>> without having to worry about invocation during early boot.
> >>>>
> >>>> Yes, this seems safer. I don't expect much power savings during system boot
> >>>> process anyway ;-). I believe perhaps a static branch would work better to
> >>>> take a branch out from what is likely a fast path.
> >>>
> >>> A static branch would be fine.  Maybe encapsulate it in a static inline
> >>> function for all such comparisons, but most such comparisons are far from
> >>> anything resembling a fastpath, so the main potential benefit would be
> >>> added readability.  Which could be a compelling reason in and of itself.
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>>> Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> >>>>> though "rhp" is more consistent with the RCU pointer initials approach.
> >>>>
> >>>> Fixed, thanks.
> >>>>
> >>>>>> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> >>>>>> +		lazy_rcu_flush_cpu(rlp);
> >>>>>> +	} else {
> >>>>>> +		if (!delayed_work_pending(&rlp->work)) {
> >>>>>
> >>>>> This check is racy because the work might run to completion right at
> >>>>> this point.  Wouldn't it be better to rely on the internal check of
> >>>>> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> >>>>
> >>>> Oops, agreed. Will make it as you suggest.
> >>>>
> >>>> thanks,
> >>>>
> >>>>  - Joel
> >>>>
> >>>>>> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> >>>>>> +		}
> >>>>>> +	}
> >>>>>> +}
> >>>>>> +
> >>>>>> +static unsigned long
> >>>>>> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> >>>>>> +{
> >>>>>> +	unsigned long count = 0;
> >>>>>> +	int cpu;
> >>>>>> +
> >>>>>> +	/* Snapshot count of all CPUs */
> >>>>>> +	for_each_possible_cpu(cpu) {
> >>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> >>>>>> +
> >>>>>> +		count += atomic_read(&rlp->count);
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	return count;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static unsigned long
> >>>>>> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >>>>>> +{
> >>>>>> +	int cpu, freed = 0;
> >>>>>> +
> >>>>>> +	for_each_possible_cpu(cpu) {
> >>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> >>>>>> +		unsigned long count;
> >>>>>> +
> >>>>>> +		count = atomic_read(&rlp->count);
> >>>>>> +		lazy_rcu_flush_cpu(rlp);
> >>>>>> +		sc->nr_to_scan -= count;
> >>>>>> +		freed += count;
> >>>>>> +		if (sc->nr_to_scan <= 0)
> >>>>>> +			break;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	return freed == 0 ? SHRINK_STOP : freed;
> >>>>>
> >>>>> This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> >>>>> potential deadlocks.  But this pattern is common, including on the
> >>>>> kvfree_rcu() path, so OK!  ;-)
> >>>>>
> >>>>> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> >>>>> that as well.
> >>>>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> >>>>>> + */
> >>>>>> +static void lazy_work(struct work_struct *work)
> >>>>>> +{
> >>>>>> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> >>>>>> +
> >>>>>> +	lazy_rcu_flush_cpu(rlp);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static struct shrinker lazy_rcu_shrinker = {
> >>>>>> +	.count_objects = lazy_rcu_shrink_count,
> >>>>>> +	.scan_objects = lazy_rcu_shrink_scan,
> >>>>>> +	.batch = 0,
> >>>>>> +	.seeks = DEFAULT_SEEKS,
> >>>>>> +};
> >>>>>> +
> >>>>>> +void __init rcu_lazy_init(void)
> >>>>>> +{
> >>>>>> +	int cpu;
> >>>>>> +
> >>>>>> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> >>>>>> +
> >>>>>> +	for_each_possible_cpu(cpu) {
> >>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> >>>>>> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	if (register_shrinker(&lazy_rcu_shrinker))
> >>>>>> +		pr_err("Failed to register lazy_rcu shrinker!\n");
> >>>>>> +}
> >>>>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> >>>>>> index 24b5f2c2de87..a5f4b44f395f 100644
> >>>>>> --- a/kernel/rcu/rcu.h
> >>>>>> +++ b/kernel/rcu/rcu.h
> >>>>>> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> >>>>>>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> >>>>>>  #endif
> >>>>>>  
> >>>>>> +#ifdef CONFIG_RCU_LAZY
> >>>>>> +void rcu_lazy_init(void);
> >>>>>> +#else
> >>>>>> +static inline void rcu_lazy_init(void) {}
> >>>>>> +#endif
> >>>>>>  #endif /* __LINUX_RCU_H */
> >>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >>>>>> index a4c25a6283b0..ebdf6f7c9023 100644
> >>>>>> --- a/kernel/rcu/tree.c
> >>>>>> +++ b/kernel/rcu/tree.c
> >>>>>> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> >>>>>>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> >>>>>>  	else
> >>>>>>  		qovld_calc = qovld;
> >>>>>> +
> >>>>>> +	rcu_lazy_init();
> >>>>>>  }
> >>>>>>  
> >>>>>>  #include "tree_stall.h"
> >>>>>> -- 
> >>>>>> 2.36.0.550.gb090851708-goog
> >>>>>>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-30 16:42               ` Paul E. McKenney
@ 2022-05-31  2:12                 ` Joel Fernandes
  2022-05-31  4:26                   ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31  2:12 UTC (permalink / raw)
  To: paulmck; +Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

Hi Paul,

I am sending in one more reply before I sleep ;-)

On 5/30/22 12:42, Paul E. McKenney wrote:
> On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote:
>> Hi Paul,
>>
>> I just setup thunderbird on an old machine just to reply to LKML.
>> Apologies if the formatting is weird. Replies below:
> 
> Looks fine at first glance.  ;-)
> 
>> On 5/28/22 13:57, Paul E. McKenney wrote:
>> [..]
>>>>>>>> +	preempt_enable();
>>>>>>>> +
>>>>>>>> +	if (debug_rcu_head_queue((void *)head)) {
>>>>>>>> +		// Probable double call_rcu(), just leak.
>>>>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
>>>>>>>> +				__func__, head);
>>>>>>>> +
>>>>>>>> +		// Mark as success and leave.
>>>>>>>> +		return;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	// Queue to per-cpu llist
>>>>>>>> +	head->func = func;
>>>>>>>> +	llist_add(&head->llist_node, &rlp->head);
>>>>>>>
>>>>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
>>>>>>> above and this point, so that the current CPU's list has lots of
>>>>>>> callbacks, but zero ->count.  Can that cause a problem?
>>>>>>>
>>>>>>> In the past, this sort of thing has been an issue for rcu_barrier()
>>>>>>> and friends.
>>>>>>
>>>>>> Thanks, I think I dropped the ball on this. You have given me something to
>>>>>> think about. I am thinking on first thought that setting the count in advance
>>>>>> of populating the list should do the trick. I will look more on it.
>>>>>
>>>>> That can work, but don't forget the need for proper memory ordering.
>>>>>
>>>>> Another approach would be to what the current bypass list does, namely
>>>>> count the callback in the ->cblist structure.
>>>>
>>>> I have been thinking about this, and I also arrived at the same conclusion
>>>> that it is easier to make it a segcblist structure which has the functions do
>>>> the memory ordering. Then we can make rcu_barrier() locklessly sample the
>>>> ->len of the new per-cpu structure, I *think* it might work.
>>>>
>>>> However, I was actually considering another situation outside of rcu_barrier,
>>>> where the preemptions you mentioned would cause the list to appear to be
>>>> empty if one were to rely on count, but actually have a lot of stuff queued.
>>>> This causes shrinkers to not know that there is a lot of memory available to
>>>> free. I don't think its that big an issue, if we can assume the caller's
>>>> process context to have sufficient priority. Obviously using a raw spinlock
>>>> makes the process context to be highest priority briefly but without the
>>>> lock, we don't get that. But yeah that can be sorted by just updating the
>>>> count proactively, and then doing the queuing operations.
>>>>
>>>> Another advantage of using the segcblist structure, is, we can also record
>>>> the grace period numbers in it, and avoid triggering RCU from the timer if gp
>>>> already elapsed (similar to the rcu-poll family of functions).
>>>>
>>>> WDYT?
>>>
>>> What you are seeing is that the design space for call_rcu_lazy() is huge,
>>> and the tradeoffs between energy efficiency and complexity are not well
>>> understood.  It is therefore necessary to quickly evaluate alternatives.
>>> Yes, you could simply implement all of them and test them, but that
>>> might take you longer than you would like.  Plus you likely need to
>>> put in place more testing than rcutorture currently provides.
>>>
>>> Given what we have seen thus far, I think that the first step has to
>>> be to evaluate what can be obtained with this approach compared with
>>> what can be obtained with other approaches.  What we have right now,
>>> more or less, is this:
>>>
>>> o	Offloading buys about 15%.
>>>
>>> o	Slowing grace periods buys about another 7%, but requires
>>> 	userspace changes to prevent degrading user-visible performance.
>>
>> Agreed.
>>
>>>
>>> o	The initial call_rcu_lazy() prototype buys about 1%.
>>
>> Well, that depends on the usecase. A busy system saves you less because
>> it is busy anyway, so RCU being quiet does not help much.
>>
>> From my cover letter, you can see the idle power savings is a whopping
>> 29% with this patch series. Check the PC10 state residencies and the
>> package wattage.
>>
>> When ChromeOS screen is off and user is not doing anything on the system:
>> Before:
>> Pk%pc10 = 72.13
>> PkgWatt = 0.58
>>
>> After:
>> Pk%pc10 = 81.28
>> PkgWatt = 0.41
> 
> These numbers do make a better case for this.  On the other hand, that
> patch series was missing things like rcu_barrier() support and might also
> be converting more call_rcu() invocations to call_rcu_lazy() invocations.
> It is therefore necessary to take these numbers with a grain of salt,
> much though I hope that they turn out to be realistic.

Makes sense, thank you. Actually the promising savings showed up when
also doing jiffies_till_first_fqs increases, so that's why I did the
prototype of call_rcu_lazy to prove that we can achieve the same thing
instead of slowing GP for everything. I believe the savings comes from
just not kicking the RCU GP machinery at all.

Agreed on the grain of salt part. Only a test of a final design can tell
us more accurately what the savings are.

>>>
>>> 	True, this is not apples to apples because the first two were
>>> 	measured on ChromeOS and this one on Android.
>>
>> Yes agreed. I think Vlad is not seeing any jaw dropping savings, because
>> he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to
>> also check if rcu callbacks are indeed not being queued / quiet after
>> applying the patch, and if not then there may still be some call_rcu()s
>> that need conversion to lazy, on his setup. For example, he might have a
>> driver on his ARM platform that's queueing RCU CBs constantly.
>>
>> I was thinking that perhaps a debug patch can help quickly nail down
>> such RCU noise, in the way of a warning. Of course, debug CONFIG should
>> never be enabled in production, so it would be something along the lines
>> of how lockdep is used for debugging.
> 
> The trick would be figuring out which of the callbacks are noise.
> I suspect that userspace help would be required.  Or maybe just leverage
> the existing event traces and let userspace sort it out.

Ok.

>>> Which means that
>>> 	apples to apples evaluation is also required.  But this is the
>>> 	information I currently have at hand, and it is probably no more
>>> 	than a factor of two off of what would be seen on ChromeOS.
>>>
>>> 	Or is there some ChromeOS data that tells a different story?
>>> 	After all, for all I know, Android might still be expediting
>>> 	all normal grace periods.
>>>
>>> At which point, the question becomes "how to make up that 7%?" After all,
>>> it is not likely that anyone is going to leave that much battery lifetime
>>> on the table.  Here are the possibilities that I currently see:
>>>
>>> o	Slow down grace periods, and also bite the bullet and make
>>> 	userspace changes to speed up the RCU grace periods during
>>> 	critical operations.
>>
>> We tried this, and tracing suffers quite a lot. The system also felt
>> "sluggish" which I suspect is because of synchronize_rcu() slow downs in
>> other paths.
> 
> In what way does tracing suffer?  Removing tracepoints?

Yes. Starting and stopping the function tracer goes from 5 seconds to
something like 30 seconds.

> And yes, this approach absolutely requires finding code paths with
> user-visible grace periods.  Which sounds like you tried the "Slow down
> grace periods" part but not the "bit the bullet" part.  ;-)

True, I got so scared looking at the performance that I just decided to
play it safe and do selective call_rcu() lazifying, than making the
everything lazy.

>>> But given this, one question is "What is the most you can possibly
>>> obtain from call_rcu_lazy()?"  If you have a platform with enough
>>> memory, one way to obtain an upper bound is to make call_rcu_lazy()
>>> simply leak the memory.  If that amount of savings does not meet the
>>> need, then other higher-level approaches will be needed, whether
>>> as alternatives or as additions to the call_rcu_lazy() approach.
>>>
>>> Should call_rcu_lazy() prove to be part of the mix, here are a (very)
>>> few of the tradeoffs:
>>>
>>> o	Put lazy callbacks on their own list or not?
>>>
>>> 	o	If not, use the bypass list?  If so, is it the
>>> 		case that call_rcu_lazy() is just call_rcu() on
>>> 		non-rcu_nocbs CPUs?
>>>
>>> 		Or add the complexity required to use the bypass
>>> 		list on rcu_nocbs CPUs but to use ->cblist otherwise?
>>
>> For bypass list, I am kind of reluctant because the "flush time" of the
>> bypass list may not match the laziness we seek? For example, I want to
>> allow user to configure to flush the CBs only after 15 seconds or so, if
>> the CB list does not grow too big. That might hurt folks using the
>> bypass list, but requiring a quicker response.
>>
>> Also, does doing so not prevent usage of lazy CBs on systems without
>> NOCB? So if we want to future-proof this, I guess that might not be a
>> good decision.
> 
> True enough, but would this future actually arrive?  After all, if
> someone cared enough about energy efficiency to use call_rcu_lazy(),
> why wouldn't they also offload callbacks?

I am not sure, but I also don't mind making it depend on NOCB for now
(see below).
> On the flush-time mismatch, if there are any non-lazy callbacks in the
> list, it costs you nothing to let the lazy callbacks tag along through
> the grace period.  So one approach would be to use the current flush
> time if there are non-lazy callbacks, but use the longer flush time if
> all of the callbacks in the list are lazy callbacks.

Cool!

>>> 	o	If so:
>>>
>>> 		o	Use segmented list with marked grace periods?
>>> 			Keep in mind that such a list can track only
>>> 			two grace periods.
>>>
>>> 		o	Use a plain list and have grace-period start
>>> 			simply drain the list?
>>
>> I want to use the segmented list, regardless of whether we use the
>> bypass list or not, because we get those memory barriers and
>> rcu_barrier() lockless sampling of ->len, for free :).
> 
> The bypass list also gets you the needed memory barriers and lockless
> sampling of ->len.  As does any other type of list as long as the
> ->cblist.len field accounts for the lazy callbacks.
> 
> So the main difference is the tracking of grace periods, and even then,
> only those grace periods for which this CPU has no non-lazy callbacks.
> Or, in the previously discussed case where a single rcuoc kthread serves
> multiple CPUs, only those grace periods for which this group of CPUs
> has no non-lazy callbacks.

Unless we assume that everything in the bypass list is lazy, can we
really use it to track both lazy and non-lazy CBs, and grace periods?

Currently the struct looks like this:

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};

So now, it would need to be like this?

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
        struct rcu_head *lazy_head;
        struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
        unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
        atomic_long_t lazy_len;
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};
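Or, going by your point above that any list works as long as ->cblist.len
accounts for the lazy callbacks, maybe the minimal addition is just a lazy
count (a sketch; lazy_len is a made-up field):

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
        atomic_long_t lazy_len;	/* made-up: lazy CBs already counted in len */
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};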


> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
> for the system, at least in the common case where there is no callback
> flooding?

When Rushikesh tried to reduce the number of callback threads, he did
not see much improvement in power. I don't think the additional wakeups
of extra rcuoc threads make too much difference - the big power
improvement comes from not even kicking rcu_preempt / rcuog threads...

>>> o	Does call_rcu_lazy() do anything to ensure that additional grace
>>> 	periods that exist only for the benefit of lazy callbacks are
>>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
>>> 	(Keeping in mind that failing to share such grace periods
>>> 	burns battery lifetime.)
>>
>> I could be missing your point, can you give example of how you want the
>> behavior to be?
> 
> CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
> does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
> Given that the full price is being paid anyway, why not also invoke
> those 55 lazy callbacks?
> 
> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
> again the full price of grace-period computation and rcuoc wakeups is
> being paid, so why not also invoke those 26 lazy callbacks?
> 
> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
> rcuoc kthread, and with the same numbers of callbacks as in the previous
> paragraph, you get to make a choice.  Do you do an extra wakeup for
> CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
> computation some time in the future?
> 
> Suppose further that the CPU that would be awakened for CPU 0's rcuoc
> kthread is already non-idle.  At that point, this wakeup is quite cheap.
> Except that the wakeup-or-don't choice is made at the beginning of the
> grace period, and that CPU's idleness at that point might not be all
> that well correlated with its idleness at the time of the wakeup at the
> end of that grace period.  Nevertheless, idleness statistics could
> help trade off needless from-idle wakeups for needless grace periods.
> 
> Which sound like a good reason to consolidate rcuoc processing on
> non-busy systems.
> 
> But more to the point...
> 
> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
> RCU is otherwise idle.  We really want one (not three!) grace periods
> to eventually handle those lazy callbacks, right?

Yes. I think I understand you now, yes we do want to reduce the number
of grace periods. But I think that is an optimization which is not
strictly necessary to get the power savings this patch series
demonstrates. To get the power savings shown here, we need all the RCU
threads to be quiet as much as possible, and not start the rcu_preempt
thread's state machine until necessary.

>> I am not thinking of creating separate GP cycles just for lazy CBs. The
>> GP cycle will be the same that non-lazy uses. A lazy CB will just record
>> what is the current or last GP, and then wait. Once the timer expires,
>> it will check what is the current or last GP. If a new GP was not
>> started, then it starts a new GP.
> 
> From my perspective, this new GP is in fact a separate GP just for lazy
> callbacks.
> 
> What am I missing here?

I think my thought for power savings is slightly different from yours. I
see lots of power savings when not even going to RCU when the system is idle
- that appears to be a lot like what the bypass list does - it avoids
call_rcu() completely and does an early exit:

        if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
                return; // Enqueued onto ->nocb_bypass, so just leave.

I think amortizing the cost of grace-period computation across lazy CBs may
give power savings, but that is secondary to the goal of not even poking
RCU (which an early exit like the above does). So maybe we can just talk
about that goal first?

> 
> Also, if your timer triggers the check, isn't that an additional potential
> wakeup from idle?  Would it make sense to also minimize those wakeups?
> (This is one motivation for grace periods to pick up lazy callbacks.)
> 
>>                                   If a new GP was started but not
>> completed, it can simply call_rcu(). If a new GP started and completed,
>> it does not start a new GP and just executes the CB, something like that.
>> Before the flush happens, if multiple lazy CBs were queued in the
>> meanwhile and the GP seq counters moved forward, then all those lazy CBs
>> will share the same GP (either not starting a new one, or calling
>> call_rcu() to share the ongoing one, something like that).
> 
> You could move the ready-to-invoke callbacks to the end of the DONE
> segment of ->cblist, assuming that you also adjust rcu_barrier()
> appropriately.  If rcu_barrier() is rare, you could simply flush whatever
> lazy list before entraining the rcu_barrier() callback.

Got it.

>>> o	When there is ample memory, how long are lazy callbacks allowed
>>> 	to sleep?  Forever?  If not forever, what feeds into computing
>>> 	the timeout?
>>
>> Yeah, the timeout is fixed currently. I agree this is a good thing to
>> optimize.
> 
> First see what the behavior is.  If the timeout is long enough that
> there is almost never a lazy-only grace period, then maybe a fixed
> timeout is good enough.

Ok.

>>> o	What is used to determine when memory is low, so that laziness
>>> 	has become a net negative?
>>
>> The assumption I think we make is that the shrinkers will constantly
>> flush out lazy CBs if there is memory pressure.
> 
> It might be worth looking at SPI.  That could help avoid a grace-period
> wait when clearing out lazy callbacks.

You mean PSI?

> 
>>> o	What other conditions, if any, should prompt motivating lazy
>>> 	callbacks?  (See above for the start of a grace period motivating
>>> 	lazy callbacks.)
>>>
>>> In short, you need to cast your net pretty wide on this one.  It has
>>> not yet been very carefully explored, so there are likely to be surprises,
>>> maybe even good surprises.  ;-)
>>
>> Cool that sounds like some good opportunity to work on something cool ;-)
> 
> Which comes back around to your point of needing evaluation of extra
> RCU work.  Lots of tradeoffs between different wakeup sources and grace
> periods.  Fine-grained views into the black box will be helpful.  ;-)

Thanks for the conversations. I very much like the idea of using the
bypass list, but I am not sure the current segcblist structure will
support both lazy and non-lazy CBs. Basically, if it contains both
call_rcu() and call_rcu_lazy() CBs, that means either all of them are
lazy or all of them are not, right? Unless we are also talking about
making additional changes to the rcu_segcblist struct to accommodate
both types of CBs.

I don't mind admitting to my illiteracy about callback offloading and
bypass lists, but using bypass lists for this problem might improve my
literacy nonetheless so that does motivate me to use them if you can
help me out with the above questions :D

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31  2:12                 ` Joel Fernandes
@ 2022-05-31  4:26                   ` Paul E. McKenney
  2022-05-31 16:11                     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-31  4:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Mon, May 30, 2022 at 10:12:05PM -0400, Joel Fernandes wrote:
> Hi Paul,
> 
> I am sending in one more reply before I sleep ;-)

Likewise.  ;-)

> On 5/30/22 12:42, Paul E. McKenney wrote:
> > On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote:
> >> Hi Paul,
> >>
> >> I just setup thunderbird on an old machine just to reply to LKML.
> >> Apologies if the formatting is weird. Replies below:
> > 
> > Looks fine at first glance.  ;-)
> > 
> >> On 5/28/22 13:57, Paul E. McKenney wrote:
> >> [..]
> >>>>>>>> +	preempt_enable();
> >>>>>>>> +
> >>>>>>>> +	if (debug_rcu_head_queue((void *)head)) {
> >>>>>>>> +		// Probable double call_rcu(), just leak.
> >>>>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> >>>>>>>> +				__func__, head);
> >>>>>>>> +
> >>>>>>>> +		// Mark as success and leave.
> >>>>>>>> +		return;
> >>>>>>>> +	}
> >>>>>>>> +
> >>>>>>>> +	// Queue to per-cpu llist
> >>>>>>>> +	head->func = func;
> >>>>>>>> +	llist_add(&head->llist_node, &rlp->head);
> >>>>>>>
> >>>>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
> >>>>>>> above and this point, so that the current CPU's list has lots of
> >>>>>>> callbacks, but zero ->count.  Can that cause a problem?
> >>>>>>>
> >>>>>>> In the past, this sort of thing has been an issue for rcu_barrier()
> >>>>>>> and friends.
> >>>>>>
> >>>>>> Thanks, I think I dropped the ball on this. You have given me something to
> >>>>>> think about. I am thinking on first thought that setting the count in advance
> >>>>>> of populating the list should do the trick. I will look more on it.
> >>>>>
> >>>>> That can work, but don't forget the need for proper memory ordering.
> >>>>>
> >>>>> Another approach would be to what the current bypass list does, namely
> >>>>> count the callback in the ->cblist structure.
> >>>>
> >>>> I have been thinking about this, and I also arrived at the same conclusion
> >>>> that it is easier to make it a segcblist structure which has the functions do
> >>>> the memory ordering. Then we can make rcu_barrier() locklessly sample the
> >>>> ->len of the new per-cpu structure, I *think* it might work.
> >>>>
> >>>> However, I was actually considering another situation outside of rcu_barrier,
> >>>> where the preemptions you mentioned would cause the list to appear to be
> >>>> empty if one were to rely on count, but actually have a lot of stuff queued.
> >>>> This causes shrinkers to not know that there is a lot of memory available to
> >>>> free. I don't think its that big an issue, if we can assume the caller's
> >>>> process context to have sufficient priority. Obviously using a raw spinlock
> >>>> makes the process context to be highest priority briefly but without the
> >>>> lock, we don't get that. But yeah that can be sorted by just updating the
> >>>> count proactively, and then doing the queuing operations.
> >>>>
> >>>> Another advantage of using the segcblist structure, is, we can also record
> >>>> the grace period numbers in it, and avoid triggering RCU from the timer if gp
> >>>> already elapsed (similar to the rcu-poll family of functions).
> >>>>
> >>>> WDYT?
> >>>
> >>> What you are seeing is that the design space for call_rcu_lazy() is huge,
> >>> and the tradeoffs between energy efficiency and complexity are not well
> >>> understood.  It is therefore necessary to quickly evaluate alternatives.
> >>> Yes, you could simply implement all of them and test them, but that
> >>> might take you longer than you would like.  Plus you likely need to
> >>> put in place more testing than rcutorture currently provides.
> >>>
> >>> Given what we have seen thus far, I think that the first step has to
> >>> be to evaluate what can be obtained with this approach compared with
> >>> what can be obtained with other approaches.  What we have right now,
> >>> more or less, is this:
> >>>
> >>> o	Offloading buys about 15%.
> >>>
> >>> o	Slowing grace periods buys about another 7%, but requires
> >>> 	userspace changes to prevent degrading user-visible performance.
> >>
> >> Agreed.
> >>
> >>>
> >>> o	The initial call_rcu_lazy() prototype buys about 1%.
> >>
> >> Well, that depends on the usecase. A busy system saves you less because
> >> it is busy anyway, so RCU being quiet does not help much.
> >>
> >> From my cover letter, you can see the idle power savings is a whopping
> >> 29% with this patch series. Check the PC10 state residencies and the
> >> package wattage.
> >>
> >> When ChromeOS screen is off and user is not doing anything on the system:
> >> Before:
> >> Pk%pc10 = 72.13
> >> PkgWatt = 0.58
> >>
> >> After:
> >> Pk%pc10 = 81.28
> >> PkgWatt = 0.41
> > 
> > These numbers do make a better case for this.  On the other hand, that
> > patch series was missing things like rcu_barrier() support and might also
> > be converting more call_rcu() invocations to call_rcu_lazy() invocations.
> > It is therefore necessary to take these numbers with a grain of salt,
> > much though I hope that they turn out to be realistic.
> 
> Makes sense, thank you. Actually the promising savings showed up when
> also doing jiffies_till_first_fqs increases, so that's why I did the
> prototype of call_rcu_lazy to prove that we can achieve the same thing
> instead of slowing GPs for everything. I believe the savings come from
> just not kicking the RCU GP machinery at all.

Agreed on reducing the number of grace periods.  After all, that is the
entire purpose of the earlier boot-parameter changes that slowed down
the grace periods.

But the other side of this same coin is that if the grace-period machinery
is being kicked anyway by some non-lazy callback, then making the lazy
callbacks take advantage of that forced grace period might well save
an entire grace period later on.  Take advantage of the one that is
happening anyway to possibly avoid having to do a full grace period
solely for the benefit of lazy callbacks later on.

> Agreed on the grain of salt part. Only a test of a final design can tell
> us more accurately what the savings are.

With careful experiment design, we can get bounds earlier.

> >>>
> >>> 	True, this is not apples to apples because the first two were
> >>> 	measured on ChromeOS and this one on Android.
> >>
> >> Yes, agreed. I think Vlad is not seeing any jaw-dropping savings because
> >> he's testing on ARM. But on Intel we see a lot. Still, I asked Vlad to
> >> also check whether RCU callbacks are indeed quiet (not being queued) after
> >> applying the patch; if not, then there may still be some call_rcu()s on
> >> his setup that need conversion to lazy. For example, he might have a
> >> driver on his ARM platform that's queuing RCU CBs constantly.
> >>
> >> I was thinking that perhaps a debug patch can help quickly nail down
> >> such RCU noise, in the way of a warning. Of course, debug CONFIG should
> >> never be enabled in production, so it would be something along the lines
> >> of how lockdep is used for debugging.
> > 
> > The trick would be figuring out which of the callbacks are noise.
> > I suspect that userspace help would be required.  Or maybe just leverage
> > the existing event traces and let userspace sort it out.
> 
> Ok.
> 
> >>> Which means that
> >>> 	apples to apples evaluation is also required.  But this is the
> >>> 	information I currently have at hand, and it is probably no more
> >>> 	than a factor of two off of what would be seen on ChromeOS.
> >>>
> >>> 	Or is there some ChromeOS data that tells a different story?
> >>> 	After all, for all I know, Android might still be expediting
> >>> 	all normal grace periods.
> >>>
> >>> At which point, the question becomes "how to make up that 7%?" After all,
> >>> it is not likely that anyone is going to leave that much battery lifetime
> >>> on the table.  Here are the possibilities that I currently see:
> >>>
> >>> o	Slow down grace periods, and also bite the bullet and make
> >>> 	userspace changes to speed up the RCU grace periods during
> >>> 	critical operations.
> >>
> >> We tried this, and tracing suffers quite a lot. The system also felt
> >> "sluggish", which I suspect is because of synchronize_rcu() slowdowns in
> >> other paths.
> > 
> > In what way does tracing suffer?  Removing tracepoints?
> 
> Yes. Starting and stopping the function tracer goes from 5 seconds to
> something like 30 seconds.
> 
> > And yes, this approach absolutely requires finding code paths with
> > user-visible grace periods.  Which sounds like you tried the "Slow down
> > grace periods" part but not the "bit the bullet" part.  ;-)
> 
> True, I got so scared looking at the performance that I just decided to
> play it safe and do selective call_rcu() lazifying, rather than making
> everything lazy.

For what definition of "safe", exactly?  ;-)

> >>> But given this, one question is "What is the most you can possibly
> >>> obtain from call_rcu_lazy()?"  If you have a platform with enough
> >>> memory, one way to obtain an upper bound is to make call_rcu_lazy()
> >>> simply leak the memory.  If that amount of savings does not meet the
> >>> need, then other higher-level approaches will be needed, whether
> >>> as alternatives or as additions to the call_rcu_lazy() approach.
> >>>
> >>> Should call_rcu_lazy() prove to be part of the mix, here are a (very)
> >>> few of the tradeoffs:
> >>>
> >>> o	Put lazy callbacks on their own list or not?
> >>>
> >>> 	o	If not, use the bypass list?  If so, is it the
> >>> 		case that call_rcu_lazy() is just call_rcu() on
> >>> 		non-rcu_nocbs CPUs?
> >>>
> >>> 		Or add the complexity required to use the bypass
> >>> 		list on rcu_nocbs CPUs but to use ->cblist otherwise?
> >>
> >> For bypass list, I am kind of reluctant because the "flush time" of the
> >> bypass list may not match the laziness we seek? For example, I want to
> >> allow user to configure to flush the CBs only after 15 seconds or so, if
> >> the CB list does not grow too big. That might hurt folks who use the
> >> bypass list but require a quicker response.
> >>
> >> Also, does doing so not prevent usage of lazy CBs on systems without
> >> NOCB? So if we want to future-proof this, I guess that might not be a
> >> good decision.
> > 
> > True enough, but would this future actually arrive?  After all, if
> > someone cared enough about energy efficiency to use call_rcu_lazy(),
> > why wouldn't they also offload callbacks?
> 
> I am not sure, but I also don't mind making it depend on NOCB for now
> (see below).

Very good.

> > On the flush-time mismatch, if there are any non-lazy callbacks in the
> > list, it costs you nothing to let the lazy callbacks tag along through
> > the grace period.  So one approach would be to use the current flush
> > time if there are non-lazy callbacks, but use the longer flush time if
> > all of the callbacks in the list are lazy callbacks.
> 
> Cool!
> 
> >>> 	o	If so:
> >>>
> >>> 		o	Use segmented list with marked grace periods?
> >>> 			Keep in mind that such a list can track only
> >>> 			two grace periods.
> >>>
> >>> 		o	Use a plain list and have grace-period start
> >>> 			simply drain the list?
> >>
> >> I want to use the segmented list, regardless of whether we use the
> >> bypass list or not, because we get those memory barriers and
> >> rcu_barrier() lockless sampling of ->len, for free :).
> > 
> > The bypass list also gets you the needed memory barriers and lockless
> > sampling of ->len.  As does any other type of list as long as the
> > ->cblist.len field accounts for the lazy callbacks.
> > 
> > So the main difference is the tracking of grace periods, and even then,
> > only those grace periods for which this CPU has no non-lazy callbacks.
> > Or, in the previously discussed case where a single rcuoc kthread serves
> > multiple CPUs, only those grace periods for which this group of CPUs
> > has no non-lazy callbacks.
> 
> Unless we assume that everything in the bypass list is lazy, can we
> really use it to track both lazy and non lazy CBs, and grace periods?

Why not just count the number of lazy callbacks in the bypass list?  There
is already a count of the total number of callbacks in ->nocb_bypass.len.
This list is protected by a lock, so protect the count of lazy callbacks
with that same lock.  Then just compare the number of lazy callbacks to
rcu_cblist_n_cbs(&rdp->nocb_bypass).  If they are equal, all callbacks
are lazy.
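
As a rough sketch (the nocb_bypass_all_lazy() name and the ->lazy_len field
are assumptions rather than existing kernel symbols; rcu_cblist_n_cbs() and
->nocb_bypass_lock do exist):

static bool nocb_bypass_all_lazy(struct rcu_data *rdp)
{
        lockdep_assert_held(&rdp->nocb_bypass_lock);

        /* All callbacks are lazy iff the lazy count equals the total. */
        return rdp->lazy_len &&
               rdp->lazy_len == rcu_cblist_n_cbs(&rdp->nocb_bypass);
}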

What am I missing here?

> Currently the struct looks like this:
> 
> struct rcu_segcblist {
>         struct rcu_head *head;
>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> #ifdef CONFIG_RCU_NOCB_CPU
>         atomic_long_t len;
> #else
>         long len;
> #endif
>         long seglen[RCU_CBLIST_NSEGS];
>         u8 flags;
> };
> 
> So now, it would need to be like this?
> 
> struct rcu_segcblist {
>         struct rcu_head *head;
>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> #ifdef CONFIG_RCU_NOCB_CPU
>         struct rcu_head *lazy_head;
>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
>         atomic_long_t lazy_len;
> #else
>         long len;
> #endif
>         long seglen[RCU_CBLIST_NSEGS];
>         u8 flags;
> };

I freely confess that I am not loving this arrangement.  Large increase
in state space, but little benefit that I can see.  Again, what am I
missing here?

> > And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
> > for the system, at least in the common case where there is no callback
> > flooding?
> 
> When Rushikesh tried to reduce the number of callback threads, he did
> not see much improvement in power. I don't think the additional wake ups
> of extra rcuoc threads makes too much difference - the big power
> improvement comes from not even kicking rcu_preempt / rcuog threads...

I suspect that this might come into play later on.  It is often the case
that a given technique's effectiveness depends on the starting point.

> >>> o	Does call_rcu_lazy() do anything to ensure that additional grace
> >>> 	periods that exist only for the benefit of lazy callbacks are
> >>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> >>> 	(Keeping in mind that failing to share such grace periods
> >>> 	burns battery lifetime.)
> >>
> >> I could be missing your point, can you give example of how you want the
> >> behavior to be?
> > 
> > CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
> > does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
> > Given that the full price is being paid anyway, why not also invoke
> > those 55 lazy callbacks?
> > 
> > If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
> > lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
> > again the full price of grace-period computation and rcuoc wakeups is
> > being paid, so why not also invoke those 26 lazy callbacks?
> > 
> > On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
> > rcuoc kthread, and with the same numbers of callbacks as in the previous
> > paragraph, you get to make a choice.  Do you do an extra wakeup for
> > CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
> > computation some time in the future?
> > 
> > Suppose further that the CPU that would be awakened for CPU 0's rcuoc
> > kthread is already non-idle.  At that point, this wakeup is quite cheap.
> > Except that the wakeup-or-don't choice is made at the beginning of the
> > grace period, and that CPU's idleness at that point might not be all
> > that well correlated with its idleness at the time of the wakeup at the
> > end of that grace period.  Nevertheless, idleness statistics could
> > help trade off needless from-idle wakeups for needless grace periods.
> > 
> > Which sounds like a good reason to consolidate rcuoc processing on
> > non-busy systems.
> > 
> > But more to the point...
> > 
> > Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
> > RCU is otherwise idle.  We really want one (not three!) grace periods
> > to eventually handle those lazy callbacks, right?
> 
> Yes. I think I understand you now, yes we do want to reduce the number
> of grace periods. But I think that is an optimization which is not
> strictly necessary to get the power savings this patch series
> demonstrates. To get the power savings shown here, we need all the RCU
> threads to be quiet as much as possible, and not start the rcu_preempt
> thread's state machine until necessary.

I believe that it will be both an optimization and a simplification.
Again, if you have to start the grace-period machinery anyway, you might
as well take care of the lazy callbacks while you are at it.

> >> I am not thinking of creating separate GP cycles just for lazy CBs. The
> >> GP cycle will be the same that non-lazy uses. A lazy CB will just record
> >> what is the current or last GP, and then wait. Once the timer expires,
> >> it will check what is the current or last GP. If a new GP was not
> >> started, then it starts a new GP.
> > 
> > From my perspective, this new GP is in fact a separate GP just for lazy
> > callbacks.
> > 
> > What am I missing here?
> 
> I think my thought for power savings is slightly different from yours. I
> see lots of power savings when not even going to RCU when system is idle
> - that appears to be a lot like what the bypass list does - it avoids
> call_rcu() completely and does an early exit:
> 
>         if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
>                 return; // Enqueued onto ->nocb_bypass, so just leave.
> 
> I think amortizing cost of grace period computation across lazy CBs may
> give power savings, but that is secondary to the goal of not even poking
> RCU (which an early exit like the above does). So maybe we can just talk
> about that goal first?

You completely lost me on this one.  My point is not to amortize the cost
of grace-period computation across the lazy CBs.  My point is instead
to completely avoid doing any extra grace-period computation for the
lazy CBs in cases where grace-period computations are happening anyway.
Of course, if there are no non-lazy callbacks, then there won't be
any happening-anyway grace periods, and then, yes, there will eventually
need to be a grace period specially for the benefit of the lazy
callbacks.

And taking this piecewise is unlikely to give good results, especially
given that workloads evolve over time, much though I understand your
desire for doing this one piece at a time.

> > Also, if your timer triggers the check, isn't that an additional potential
> > wakeup from idle?  Would it make sense to also minimize those wakeups?
> > (This is one motivation for grace periods to pick up lazy callbacks.)
> > 
> >>                                   If a new GP was started but not
> >> completed, it can simply call_rcu(). If a new GP started and completed,
> >> it does not start new GP and just executes CB, something like that.
> >> Before the flush happens, if multiple lazy CBs were queued in the
> >> meanwhile and the GP seq counters moved forward, then all those lazy CBs
> >> will share the same GP (either not starting a new one, or calling
> >> call_rcu() to share the ongoing one, something like that).
> > 
> > You could move the ready-to-invoke callbacks to the end of the DONE
> > segment of ->cblist, assuming that you also adjust rcu_barrier()
> > appropriately.  If rcu_barrier() is rare, you could simply flush whatever
> > lazy list before entraining the rcu_barrier() callback.
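
As a sketch of that simpler option (both helper names below are assumptions
standing in for whatever the real flush and entrain paths end up being):

static void rcu_barrier_one_cpu(struct rcu_data *rdp)
{
        /* Flush this CPU's lazy callbacks into ->cblist first... */
        flush_lazy_cbs_to_cblist(rdp);
        /* ...so the barrier callback queued here waits for them too. */
        entrain_barrier_cb(rdp);
}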
> 
> Got it.
> 
> >>> o	When there is ample memory, how long are lazy callbacks allowed
> >>> 	to sleep?  Forever?  If not forever, what feeds into computing
> >>> 	the timeout?
> >>
> >> Yeah, the timeout is fixed currently. I agree this is a good thing to
> >> optimize.
> > 
> > First see what the behavior is.  If the timeout is long enough that
> > there is almost never a lazy-only grace period, then maybe a fixed
> > timeout is good enough.
> 
> Ok.
> 
> >>> o	What is used to determine when memory is low, so that laziness
> >>> 	has become a net negative?
> >>
> >> The assumption I think we make is that the shrinkers will constantly
> >> flush out lazy CBs if there is memory pressure.
> > 
> > It might be worth looking at SPI.  That could help avoid a grace-period
> > wait when clearing out lazy callbacks.
> 
> You mean PSI?

Yes, pressure stall information.  Apologies for my confusion!

> >>> o	What other conditions, if any, should prompt motivating lazy
> >>> 	callbacks?  (See above for the start of a grace period motivating
> >>> 	lazy callbacks.)
> >>>
> >>> In short, you need to cast your net pretty wide on this one.  It has
> >>> not yet been very carefully explored, so there are likely to be surprises,
> >>> maybe even good surprises.  ;-)
> >>
> >> Cool that sounds like some good opportunity to work on something cool ;-)
> > 
> > Which comes back around to your point of needing evaluation of extra
> > RCU work.  Lots of tradeoffs between different wakeup sources and grace
> > periods.  Fine-grained views into the black box will be helpful.  ;-)
> 
> Thanks for the conversations. I very much like the idea of using bypass
> list, but I am not sure if the current segcblist structure will support
> both lazy and non-lazy CBs. Basically, if it contains both call_rcu()
> and call_rcu_lazy() CBs, that means either all of them are lazy or all
> of them are not right? Unless we are also talking about making
> additional changes to rcu_segcblist struct to accommodate both types of CBs.

Do you need to do anything different to lazy callbacks once they have
been associated with a grace period that has been started?  Your current
patch treats them the same.  If that approach continues to work, then
why do you need to track laziness downstream?

> I don't mind admitting to my illiteracy about callback offloading and
> bypass lists, but using bypass lists for this problems might improve my
> literacy nonetheless so that does motivate me to use them if you can
> help me out with the above questions :D

More detail on the grace-period machinery will also be helpful to you.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31  4:26                   ` Paul E. McKenney
@ 2022-05-31 16:11                     ` Joel Fernandes
  2022-05-31 16:45                       ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 16:11 UTC (permalink / raw)
  To: paulmck; +Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On 5/31/22 00:26, Paul E. McKenney wrote:
[...]
>>>>> Which means that
>>>>> 	apples to apples evaluation is also required.  But this is the
>>>>> 	information I currently have at hand, and it is probably no more
>>>>> 	than a factor of two off of what would be seen on ChromeOS.
>>>>>
>>>>> 	Or is there some ChromeOS data that tells a different story?
>>>>> 	After all, for all I know, Android might still be expediting
>>>>> 	all normal grace periods.
>>>>>
>>>>> At which point, the question becomes "how to make up that 7%?" After all,
>>>>> it is not likely that anyone is going to leave that much battery lifetime
>>>>> on the table.  Here are the possibilities that I currently see:
>>>>>
>>>>> o	Slow down grace periods, and also bite the bullet and make
>>>>> 	userspace changes to speed up the RCU grace periods during
>>>>> 	critical operations.
>>>>
>>>> We tried this, and tracing suffers quite a lot. The system also felt
>>>> "sluggish", which I suspect is because of synchronize_rcu() slowdowns in
>>>> other paths.
>>>
>>> In what way does tracing suffer?  Removing tracepoints?
>>
>> Yes. Starting and stopping the function tracer goes from 5 seconds to
>> something like 30 seconds.
>>
>>> And yes, this approach absolutely requires finding code paths with
>>> user-visible grace periods.  Which sounds like you tried the "Slow down
>>> grace periods" part but not the "bit the bullet" part.  ;-)
>>
>> True, I got so scared looking at the performance that I just decided to
>> play it safe and do selective call_rcu() lazifying, rather than making
>> everything lazy.
> 
> For what definition of "safe", exactly?  ;-)

I mean the usual, like somebody complaining performance issues crept up
in their workload :)
> 
>>> On the flush-time mismatch, if there are any non-lazy callbacks in the
>>> list, it costs you nothing to let the lazy callbacks tag along through
>>> the grace period.  So one approach would be to use the current flush
>>> time if there are non-lazy callbacks, but use the longer flush time if
>>> all of the callbacks in the list are lazy callbacks.
>>
>> Cool!
>>
>>>>> 	o	If so:
>>>>>
>>>>> 		o	Use segmented list with marked grace periods?
>>>>> 			Keep in mind that such a list can track only
>>>>> 			two grace periods.
>>>>>
>>>>> 		o	Use a plain list and have grace-period start
>>>>> 			simply drain the list?
>>>>
>>>> I want to use the segmented list, regardless of whether we use the
>>>> bypass list or not, because we get those memory barriers and
>>>> rcu_barrier() lockless sampling of ->len, for free :).
>>>
>>> The bypass list also gets you the needed memory barriers and lockless
>>> sampling of ->len.  As does any other type of list as long as the
>>> ->cblist.len field accounts for the lazy callbacks.
>>>
>>> So the main difference is the tracking of grace periods, and even then,
>>> only those grace periods for which this CPU has no non-lazy callbacks.
>>> Or, in the previously discussed case where a single rcuoc kthread serves
>>> multiple CPUs, only those grace periods for which this group of CPUs
>>> has no non-lazy callbacks.
>>
>> Unless we assume that everything in the bypass list is lazy, can we
>> really use it to track both lazy and non lazy CBs, and grace periods?
> 
> Why not just count the number of lazy callbacks in the bypass list?
> There
> is already a count of the total number of callbacks in ->nocb_bypass.len.
> This list is protected by a lock, so protect the count of lazy callbacks
> with that same lock.  

That's a good idea, I can try that.

Then just compare the number of lazy callbacks to
> rcu_cblist_n_cbs(&rdp->nocb_bypass).  If they are equal, all callbacks
> are lazy.
> 
> What am I missing here?

There could be more issues that are incompatible with bypass lists, but
nothing that I feel cannot be worked around.

Example:
1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is
empty).
2. Now say 10000 non-lazy CBs are queued. As per the comments, these
have to go to the bypass list to keep rcu_barrier() from breaking.
3. Because this causes the bypass list to overflow, all the lazy +
non-lazy CBs have to be flushed to the main ->cblist.

If only the non-lazy CBs are flushed, rcu_barrier() might break. If all
are flushed, then the lazy ones lose their laziness property as RCU will
be immediately kicked off to process GPs on their behalf.

This can be fixed by making rcu_barrier() queue both a lazy and a
non-lazy CB, and only flushing the non-lazy CBs to the ->cblist on a
bypass-list overflow, I think.

Or, we flush both lazy and non-lazy CBs to the ->cblist just to keep it
simple. I think that should be OK since if there are a lot of CBs queued
in a short time, I don't think there is much opportunity for power
savings anyway IMHO.
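
Something like this sketch, perhaps (the overflow test mirrors the existing
qhimark check; flush_bypass_to_cblist() and ->lazy_len are assumptions
standing in for the existing bypass-flush path and a new lazy counter):

        /* On bypass-list overflow, flush lazy and non-lazy CBs together
         * into ->cblist and zero the lazy count; the grace period being
         * started for the non-lazy flood then covers the lazy CBs too.
         */
        if (ncbs >= qhimark) {
                flush_bypass_to_cblist(rdp);
                WRITE_ONCE(rdp->lazy_len, 0);
        }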

>> Currently the struct looks like this:
>>
>> struct rcu_segcblist {
>>         struct rcu_head *head;
>>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
>> #ifdef CONFIG_RCU_NOCB_CPU
>>         atomic_long_t len;
>> #else
>>         long len;
>> #endif
>>         long seglen[RCU_CBLIST_NSEGS];
>>         u8 flags;
>> };
>>
>> So now, it would need to be like this?
>>
>> struct rcu_segcblist {
>>         struct rcu_head *head;
>>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
>> #ifdef CONFIG_RCU_NOCB_CPU
>>         struct rcu_head *lazy_head;
>>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
>>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
>>         atomic_long_t lazy_len;
>> #else
>>         long len;
>> #endif
>>         long seglen[RCU_CBLIST_NSEGS];
>>         u8 flags;
>> };
> 
> I freely confess that I am not loving this arrangement.  Large increase
> in state space, but little benefit that I can see.  Again, what am I
> missing here?

I somehow thought tracking GPs separately for the lazy CBs requires
duplication of the rcu_head pointers/double-pointers in this struct. As
you pointed out, just tracking the lazy length may be sufficient.
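
In that case the structure change quoted above collapses to something like
this sketch (the lazy_len field and its placement are an assumption; the
count could equally live in rcu_data next to ->nocb_bypass):

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
        atomic_long_t lazy_len;         /* Count of lazy CBs only. */
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};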

> 
>>> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
>>> for the system, at least in the common case where there is no callback
>>> flooding?
>>
>> When Rushikesh tried to reduce the number of callback threads, he did
>> not see much improvement in power. I don't think the additional wake ups
>> of extra rcuoc threads makes too much difference - the big power
>> improvement comes from not even kicking rcu_preempt / rcuog threads...
> 
> I suspect that this might come into play later on.  It is often the case
> that a given technique's effectiveness depends on the starting point.
> 
>>>>> o	Does call_rcu_lazy() do anything to ensure that additional grace
>>>>> 	periods that exist only for the benefit of lazy callbacks are
>>>>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
>>>>> 	(Keeping in mind that failing to share such grace periods
>>>>> 	burns battery lifetime.)
>>>>
>>>> I could be missing your point, can you give example of how you want the
>>>> behavior to be?
>>>
>>> CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
>>> does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
>>> Given that the full price is being paid anyway, why not also invoke
>>> those 55 lazy callbacks?
>>>
>>> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
>>> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
>>> again the full price of grace-period computation and rcuoc wakeups is
>>> being paid, so why not also invoke those 26 lazy callbacks?
>>>
>>> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
>>> rcuoc kthread, and with the same numbers of callbacks as in the previous
>>> paragraph, you get to make a choice.  Do you do an extra wakeup for
>>> CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
>>> computation some time in the future?
>>>
>>> Suppose further that the CPU that would be awakened for CPU 0's rcuoc
>>> kthread is already non-idle.  At that point, this wakeup is quite cheap.
>>> Except that the wakeup-or-don't choice is made at the beginning of the
>>> grace period, and that CPU's idleness at that point might not be all
>>> that well correlated with its idleness at the time of the wakeup at the
>>> end of that grace period.  Nevertheless, idleness statistics could
>>> help trade off needless from-idle wakeups for needless grace periods.
>>>
>>> Which sounds like a good reason to consolidate rcuoc processing on
>>> non-busy systems.
>>>
>>> But more to the point...
>>>
>>> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
>>> RCU is otherwise idle.  We really want one (not three!) grace periods
>>> to eventually handle those lazy callbacks, right?
>>
>> Yes. I think I understand you now, yes we do want to reduce the number
>> of grace periods. But I think that is an optimization which is not
>> strictly necessary to get the power savings this patch series
>> demonstrates. To get the power savings shown here, we need all the RCU
>> threads to be quiet as much as possible, and not start the rcu_preempt
>> thread's state machine until necessary.
> 
> I believe that it will be both an optimization and a simplification.
> Again, if you have to start the grace-period machinery anyway, you might
> as well take care of the lazy callbacks while you are at it.

Sure, sounds good. I agree with sharing grace periods instead of
starting a new one.

>>>> I am not thinking of creating separate GP cycles just for lazy CBs. The
>>>> GP cycle will be the same that non-lazy uses. A lazy CB will just record
>>>> what is the current or last GP, and then wait. Once the timer expires,
>>>> it will check what is the current or last GP. If a new GP was not
>>>> started, then it starts a new GP.
>>>
>>> From my perspective, this new GP is in fact a separate GP just for lazy
>>> callbacks.
>>>
>>> What am I missing here?
>>
>> I think my thought for power savings is slightly different from yours. I
>> see lots of power savings when not even going to RCU when system is idle
>> - that appears to be a lot like what the bypass list does - it avoids
>> call_rcu() completely and does an early exit:
>>
>>         if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
>>                 return; // Enqueued onto ->nocb_bypass, so just leave.
>>
>> I think amortizing cost of grace period computation across lazy CBs may
>> give power savings, but that is secondary to the goal of not even poking
>> RCU (which an early exit like the above does). So maybe we can just talk
>> about that goal first?
> 
> You completely lost me on this one.  My point is not to amortize the cost
> of grace-period computation across the lazy CBs.  My point is instead
> to completely avoid doing any extra grace-period computation for the
> lazy CBs in cases where grace-period computations are happening anyway.
> Of course, if there are no non-lazy callbacks, then there won't be
> any happening-anyway grace periods, and then, yes, there will eventually
> need to be a grace period specially for the benefit of the lazy
> callbacks.
> 
> And taking this piecewise is unlikely to give good results, especially
> given that workloads evolve over time, much though I understand your
> desire for doing this one piece at a time.

Sure ok, I agree with you on that.

>>> Also, if your timer triggers the check, isn't that an additional potential
>>> wakeup from idle?  Would it make sense to also minimize those wakeups?
>>> (This is one motivation for grace periods to pick up lazy callbacks.)
>>>
>>>>                                   If a new GP was started but not
>>>> completed, it can simply call_rcu(). If a new GP started and completed,
>>>> it does not start new GP and just executes CB, something like that.
>>>> Before the flush happens, if multiple lazy CBs were queued in the
>>>> meanwhile and the GP seq counters moved forward, then all those lazy CBs
>>>> will share the same GP (either not starting a new one, or calling
>>>> call_rcu() to share the ongoing one, something like that).
>>>
>>> You could move the ready-to-invoke callbacks to the end of the DONE
>>> segment of ->cblist, assuming that you also adjust rcu_barrier()
>>> appropriately.  If rcu_barrier() is rare, you could simply flush whatever
>>> lazy list before entraining the rcu_barrier() callback.
>>
>> Got it.
>>
>>>>> o	When there is ample memory, how long are lazy callbacks allowed
>>>>> 	to sleep?  Forever?  If not forever, what feeds into computing
>>>>> 	the timeout?
>>>>
>>>> Yeah, the timeout is fixed currently. I agree this is a good thing to
>>>> optimize.
>>>
>>> First see what the behavior is.  If the timeout is long enough that
>>> there is almost never a lazy-only grace period, then maybe a fixed
>>> timeout is good enough.
>>
>> Ok.
>>
>>>>> o	What is used to determine when memory is low, so that laziness
>>>>> 	has become a net negative?
>>>>
>>>> The assumption I think we make is that the shrinkers will constantly
>>>> flush out lazy CBs if there is memory pressure.
>>>
>>> It might be worth looking at SPI.  That could help avoid a grace-period
>>> wait when clearing out lazy callbacks.
>>
>> You mean PSI?
> 
> Yes, pressure stall information.  Apologies for my confusion!

Ok, I will look into it, thanks.

>>>>> o	What other conditions, if any, should prompt motivating lazy
>>>>> 	callbacks?  (See above for the start of a grace period motivating
>>>>> 	lazy callbacks.)
>>>>>
>>>>> In short, you need to cast your net pretty wide on this one.  It has
>>>>> not yet been very carefully explored, so there are likely to be surprises,
>>>>> maybe even good surprises.  ;-)
>>>>
>>>> Cool that sounds like some good opportunity to work on something cool ;-)
>>>
>>> Which comes back around to your point of needing evaluation of extra
>>> RCU work.  Lots of tradeoffs between different wakeup sources and grace
>>> periods.  Fine-grained views into the black box will be helpful.  ;-)
>>
>> Thanks for the conversations. I very much like the idea of using bypass
>> list, but I am not sure if the current segcblist structure will support
>> both lazy and non-lazy CBs. Basically, if it contains both call_rcu()
>> and call_rcu_lazy() CBs, that means either all of them are lazy or all
>> of them are not right? Unless we are also talking about making
>> additional changes to rcu_segcblist struct to accommodate both types of CBs.
> 
> Do you need to do anything different to lazy callbacks once they have
> been associated with a grace period that has been started?  Your current
> patch treats them the same.  If that approach continues to work, then
> why do you need to track laziness downstream?

Yes, I am quite confused on this part. You are saying that just tracking
the "lazy length" is sufficient to avoid kicking off the RCU machinery,
and that knowing exactly which CBs are lazy and which are not is not
needed, right? If I misunderstood that, could you clarify what you mean
by "track laziness downstream"?

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 16:11                     ` Joel Fernandes
@ 2022-05-31 16:45                       ` Paul E. McKenney
  2022-05-31 18:51                         ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-31 16:45 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 12:11:00PM -0400, Joel Fernandes wrote:
> On 5/31/22 00:26, Paul E. McKenney wrote:
> [...]
> >>>>> Which means that
> >>>>> 	apples to apples evaluation is also required.  But this is the
> >>>>> 	information I currently have at hand, and it is probably no more
> >>>>> 	than a factor of two off of what would be seen on ChromeOS.
> >>>>>
> >>>>> 	Or is there some ChromeOS data that tells a different story?
> >>>>> 	After all, for all I know, Android might still be expediting
> >>>>> 	all normal grace periods.
> >>>>>
> >>>>> At which point, the question becomes "how to make up that 7%?" After all,
> >>>>> it is not likely that anyone is going to leave that much battery lifetime
> >>>>> on the table.  Here are the possibilities that I currently see:
> >>>>>
> >>>>> o	Slow down grace periods, and also bite the bullet and make
> >>>>> 	userspace changes to speed up the RCU grace periods during
> >>>>> 	critical operations.
> >>>>
> >>>> We tried this, and tracing suffers quite a lot. The system also felt
> >>>> "sluggish", which I suspect is because of synchronize_rcu() slowdowns in
> >>>> other paths.
> >>>
> >>> In what way does tracing suffer?  Removing tracepoints?
> >>
> >> Yes. Starting and stopping the function tracer goes from 5 seconds to
> >> something like 30 seconds.
> >>
> >>> And yes, this approach absolutely requires finding code paths with
> >>> user-visible grace periods.  Which sounds like you tried the "Slow down
> >>> grace periods" part but not the "bit the bullet" part.  ;-)
> >>
> >> True, I got so scared looking at the performance that I just decided to
> >> play it safe and do selective call_rcu() lazifying, rather than making
> >> everything lazy.
> > 
> > For what definition of "safe", exactly?  ;-)
> 
> I mean the usual, like somebody complaining performance issues crept up
> in their workload :)

It is not that difficult.  Just do something like this:

	# run a statistically significant set of tracer tests
	echo 1 >  /sys/kernel/rcu_expedited
	# run a statistically significant set of tracer tests

It is quite possible that you will decide that you want grace periods
to be expedited during tracing operations even on non-battery-powered
systems.

> >>> On the flush-time mismatch, if there are any non-lazy callbacks in the
> >>> list, it costs you nothing to let the lazy callbacks tag along through
> >>> the grace period.  So one approach would be to use the current flush
> >>> time if there are non-lazy callbacks, but use the longer flush time if
> >>> all of the callbacks in the list are lazy callbacks.
> >>
> >> Cool!
> >>
> >>>>> 	o	If so:
> >>>>>
> >>>>> 		o	Use segmented list with marked grace periods?
> >>>>> 			Keep in mind that such a list can track only
> >>>>> 			two grace periods.
> >>>>>
> >>>>> 		o	Use a plain list and have grace-period start
> >>>>> 			simply drain the list?
> >>>>
> >>>> I want to use the segmented list, regardless of whether we use the
> >>>> bypass list or not, because we get those memory barriers and
> >>>> rcu_barrier() lockless sampling of ->len, for free :).
> >>>
> >>> The bypass list also gets you the needed memory barriers and lockless
> >>> sampling of ->len.  As does any other type of list as long as the
> >>> ->cblist.len field accounts for the lazy callbacks.
> >>>
> >>> So the main difference is the tracking of grace periods, and even then,
> >>> only those grace periods for which this CPU has no non-lazy callbacks.
> >>> Or, in the previously discussed case where a single rcuoc kthread serves
> >>> multiple CPUs, only those grace periods for which this group of CPUs
> >>> has no non-lazy callbacks.
> >>
> >> Unless we assume that everything in the bypass list is lazy, can we
> >> really use it to track both lazy and non lazy CBs, and grace periods?
> > 
> > Why not just count the number of lazy callbacks in the bypass list?
> > There
> > is already a count of the total number of callbacks in ->nocb_bypass.len.
> > This list is protected by a lock, so protect the count of lazy callbacks
> > with that same lock.  
> 
> That's a good idea, I can try that.
> 
> Then just compare the number of lazy callbacks to
> > rcu_cblist_n_cbs(&rdp->nocb_bypass).  If they are equal, all callbacks
> > are lazy.
> > 
> > What am I missing here?
> 
> There could be more issues that are incompatible with bypass lists, but
> nothing that I feel cannot be worked around.

Good!

> Example:
> 1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is
> empty).
> 2. Now say 10000 non-lazy CBs are queued. As per the comments, these
> have to go to the bypass list to keep rcu_barrier() from breaking.
> 3. Because this causes the bypass list to overflow, all the lazy +
> non-lazy CBs have to be flushed to the main ->cblist.
> 
> If only the non-lazy CBs are flushed, rcu_barrier() might break. If all
> are flushed, then the lazy ones lose their laziness property as RCU will
> be immediately kicked off to process GPs on their behalf.

Exactly why is this loss of laziness a problem?  You are doing that
grace period for the 10,000 non-lazy callbacks anyway, so what difference
could the five lazy callbacks possibly make?

> This can be fixed by making rcu_barrier() queue both a lazy and a
> non-lazy CB, and only flushing the non-lazy CBs to the ->cblist on a
> bypass-list overflow, I think.

I don't see anything that needs fixing.  If you are doing a grace period
anyway, just process the lazy callbacks along with the non-lazy callbacks.
After all, you are paying for that grace period anyway.  And handling
the lazy callbacks with that grace period means that you don't need a
later grace period for those five lazy callbacks.  So running the lazy
callbacks into the grace period required by the non-lazy callbacks is
a pure win, right?

If it is not a pure win, please explain exactly what is being lost.

> Or, we flush both lazy and non-lazy CBs to the ->cblist just to keep it
> simple. I think that should be OK since if there are a lot of CBs queued
> in a short time, I don't think there is much opportunity for power
> savings anyway IMHO.

I believe that it will be simpler, faster, and more energy efficient to
do it this way, flushing everything from the bypass list to ->cblist.
Again, leaving the lazy callbacks lying around means that there must be a
later battery-draining grace period that might not be required otherwise.

> >> Currently the struct looks like this:
> >>
> >> struct rcu_segcblist {
> >>         struct rcu_head *head;
> >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> >> #ifdef CONFIG_RCU_NOCB_CPU
> >>         atomic_long_t len;
> >> #else
> >>         long len;
> >> #endif
> >>         long seglen[RCU_CBLIST_NSEGS];
> >>         u8 flags;
> >> };
> >>
> >> So now, it would need to be like this?
> >>
> >> struct rcu_segcblist {
> >>         struct rcu_head *head;
> >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> >> #ifdef CONFIG_RCU_NOCB_CPU
> >>         struct rcu_head *lazy_head;
> >>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
> >>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
> >>         atomic_long_t lazy_len;
> >> #else
> >>         long len;
> >> #endif
> >>         long seglen[RCU_CBLIST_NSEGS];
> >>         u8 flags;
> >> };
> > 
> > I freely confess that I am not loving this arrangement.  Large increase
> > in state space, but little benefit that I can see.  Again, what am I
> > missing here?
> 
> I somehow thought tracking GPs separately for the lazy CBs requires
> duplication of the rcu_head pointers/double-pointers in this struct. As
> you pointed out, just tracking the lazy length may be sufficient.

Here is hoping!

After all, if you thought that taking care of applications that need
expediting of grace periods is scary, well, now...

> >>> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
> >>> for the system, at least in the common case where there is no callback
> >>> flooding?
> >>
> >> When Rushikesh tried to reduce the number of callback threads, he did
> >> not see much improvement in power. I don't think the additional wake ups
> >> of extra rcuoc threads makes too much difference - the big power
> >> improvement comes from not even kicking rcu_preempt / rcuog threads...
> > 
> > I suspect that this might come into play later on.  It is often the case
> > that a given technique's effectiveness depends on the starting point.
> > 
> >>>>> o	Does call_rcu_lazy() do anything to ensure that additional grace
> >>>>> 	periods that exist only for the benefit of lazy callbacks are
> >>>>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> >>>>> 	(Keeping in mind that failing to share such grace periods
> >>>>> 	burns battery lifetime.)
> >>>>
> >>>> I could be missing your point, can you give example of how you want the
> >>>> behavior to be?
> >>>
> >>> CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
> >>> does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
> >>> Given that the full price is being paid anyway, why not also invoke
> >>> those 55 lazy callbacks?
> >>>
> >>> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
> >>> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
> >>> again the full price of grace-period computation and rcuoc wakeups is
> >>> being paid, so why not also invoke those 26 lazy callbacks?
> >>>
> >>> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
> >>> rcuoc kthread, and with the same numbers of callbacks as in the previous
> >>> paragraph, you get to make a choice.  Do you do an extra wakeup for
> >>> CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
> >>> computation some time in the future?
> >>>
> >>> Suppose further that the CPU that would be awakened for CPU 0's rcuoc
> >>> kthread is already non-idle.  At that point, this wakeup is quite cheap.
> >>> Except that the wakeup-or-don't choice is made at the beginning of the
> >>> grace period, and that CPU's idleness at that point might not be all
> >>> that well correlated with its idleness at the time of the wakeup at the
> >>> end of that grace period.  Nevertheless, idleness statistics could
> >>> help trade off needless from-idle wakeups for needless grace periods.
> >>>
> >>> Which sounds like a good reason to consolidate rcuoc processing on
> >>> non-busy systems.
> >>>
> >>> But more to the point...
> >>>
> >>> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
> >>> RCU is otherwise idle.  We really want one (not three!) grace periods
> >>> to eventually handle those lazy callbacks, right?
> >>
> >> Yes. I think I understand you now, yes we do want to reduce the number
> >> of grace periods. But I think that is an optimization which is not
> >> strictly necessary to get the power savings this patch series
> >> demonstrates. To get the power savings shown here, we need all the RCU
> >> threads to be quiet as much as possible, and not start the rcu_preempt
> >> thread's state machine until necessary.
> > 
> > I believe that it will be both an optimization and a simplification.
> > Again, if you have to start the grace-period machinery anyway, you might
> > as well take care of the lazy callbacks while you are at it.
> 
> Sure, sounds good. I agree with sharing grace periods instead of
> starting a new one.

As you say, sounds good.  ;-)

> >>>> I am not thinking of creating separate GP cycles just for lazy CBs. The
> >>>> GP cycle will be the same that non-lazy uses. A lazy CB will just record
> >>>> what is the current or last GP, and then wait. Once the timer expires,
> >>>> it will check what is the current or last GP. If a new GP was not
> >>>> started, then it starts a new GP.
> >>>
> >>> From my perspective, this new GP is in fact a separate GP just for lazy
> >>> callbacks.
> >>>
> >>> What am I missing here?
> >>
> >> I think my thought for power savings is slightly different from yours. I
> >> see lots of power savings when not even going to RCU when system is idle
> >> - that appears to be a lot like what the bypass list does - it avoids
> >> call_rcu() completely and does an early exit:
> >>
> >>         if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
> >>                 return; // Enqueued onto ->nocb_bypass, so just leave.
> >>
> >> I think amortizing cost of grace period computation across lazy CBs may
> >> give power savings, but that is secondary to the goal of not even poking
> >> RCU (which an early exit like the above does). So maybe we can just talk
> >> about that goal first?
> > 
> > You completely lost me on this one.  My point is not to amortize the cost
> > of grace-period computation across the lazy CBs.  My point is instead
> > to completely avoid doing any extra grace-period computation for the
> > lazy CBs in cases where grace-period computations are happening anyway.
> > Of course, if there are no non-lazy callbacks, then there won't be
> > any happening-anyway grace periods, and then, yes, there will eventually
> > need to be a grace period specially for the benefit of the lazy
> > callbacks.
> > 
> > And taking this piecewise is unlikely to give good results, especially
> > given that workloads evolve over time, much though I understand your
> > desire for doing this one piece at a time.
> 
> Sure ok, I agree with you on that.

Very good!

> >>> Also, if your timer triggers the check, isn't that an additional potential
> >>> wakeup from idle?  Would it make sense to also minimize those wakeups?
> >>> (This is one motivation for grace periods to pick up lazy callbacks.)
> >>>
> >>>>                                   If a new GP was started but not
> >>>> completed, it can simply call_rcu(). If a new GP started and completed,
> >>>> it does not start new GP and just executes CB, something like that.
> >>>> Before the flush happens, if multiple lazy CBs were queued in the
> >>>> meanwhile and the GP seq counters moved forward, then all those lazy CBs
> >>>> will share the same GP (either not starting a new one, or calling
> >>>> call_rcu() to share the ongoing one, something like that).
> >>>
> >>> You could move the ready-to-invoke callbacks to the end of the DONE
> >>> segment of ->cblist, assuming that you also adjust rcu_barrier()
> >>> appropriately.  If rcu_barrier() is rare, you could simply flush whatever
> >>> lazy list before entraining the rcu_barrier() callback.
> >>
> >> Got it.
> >>
> >>>>> o	When there is ample memory, how long are lazy callbacks allowed
> >>>>> 	to sleep?  Forever?  If not forever, what feeds into computing
> >>>>> 	the timeout?
> >>>>
> >>>> Yeah, the timeout is fixed currently. I agree this is a good thing to
> >>>> optimize.
> >>>
> >>> First see what the behavior is.  If the timeout is long enough that
> >>> there is almost never a lazy-only grace period, then maybe a fixed
> >>> timeout is good enough.
> >>
> >> Ok.
> >>
> >>>>> o	What is used to determine when memory is low, so that laziness
> >>>>> 	has become a net negative?
> >>>>
> >>>> The assumption I think we make is that the shrinkers will constantly
> >>>> flush out lazy CBs if there is memory pressure.
> >>>
> >>> It might be worth looking at SPI.  That could help avoid a grace-period
> >>> wait when clearing out lazy callbacks.
> >>
> >> You mean PSI?
> > 
> > Yes, pressure stall information.  Apologies for my confusion!
> 
> Ok, I will look into it, thanks.

And I bet that RCU should be using it elsewhere as well.

> >>>>> o	What other conditions, if any, should prompt motivating lazy
> >>>>> 	callbacks?  (See above for the start of a grace period motivating
> >>>>> 	lazy callbacks.)
> >>>>>
> >>>>> In short, you need to cast your net pretty wide on this one.  It has
> >>>>> not yet been very carefully explored, so there are likely to be surprises,
> >>>>> maybe even good surprises.  ;-)
> >>>>
> >>>> Cool that sounds like some good opportunity to work on something cool ;-)
> >>>
> >>> Which comes back around to your point of needing evaluation of extra
> >>> RCU work.  Lots of tradeoffs between different wakeup sources and grace
> >>> periods.  Fine-grained views into the black box will be helpful.  ;-)
> >>
> >> Thanks for the conversations. I very much like the idea of using bypass
> >> list, but I am not sure if the current segcblist structure will support
> >> both lazy and non-lazy CBs. Basically, if it contains both call_rcu()
> >> and call_rcu_lazy() CBs, that means either all of them are lazy or all
> >> of them are not right? Unless we are also talking about making
> >> additional changes to rcu_segcblist struct to accommodate both types of CBs.
> > 
> > Do you need to do anything different to lazy callbacks once they have
> > been associated with a grace period that has been started?  Your current
> > patch treats them the same.  If that approach continues to work, then
> > why do you need to track laziness downstream?
> 
> Yes, I am quite confused on this part. You are saying that just tracking
> the "lazy length" is sufficient to avoid kicking off the RCU machinery,
> and that knowing exactly which CBs are lazy and which are not is not
> needed, right? If I misunderstood that, could you clarify what you mean
> by "track laziness downstream"?

Pretty much.  What I am saying is that the concept of laziness is helpful
primarily while the callbacks are waiting for a grace period to start.
Lazy callbacks are much less aggressive about starting new grace periods
than are non-lazy callbacks.

But once the grace period has started, why treat them differently?
From what you said earlier, your testing thus far shows that the biggest
energy savings come from delaying/merging grace periods rather than the details
of callback processing within and after the grace period, right?
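
Put as a sketch, the only place the distinction needs to show up is the
enqueue-time decision (every helper name below is an assumption, purely
for illustration):

static void enqueue_cb(struct rcu_data *rdp, struct rcu_head *head, bool lazy)
{
        add_to_bypass(rdp, head);               /* Same list either way. */
        if (lazy)
                arm_lazy_flush_timer(rdp);      /* Long timeout, no GP kick. */
        else
                kick_gp_machinery(rdp);         /* Start a grace period soon. */
}

Downstream of that decision, the grace-period and callback-invocation paths
need not care which callbacks were lazy.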

And even if differences in callback processing are warranted, there are
lots of ways to make that happen, many of which do not require propagation
of per-callback laziness into and through the grace-period machinery.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 16:45                       ` Paul E. McKenney
@ 2022-05-31 18:51                         ` Joel Fernandes
  2022-05-31 19:25                           ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 18:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:
[..] 
> > Example:
> > 1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is
> > empty).
> > 2. Now say 10000 non-lazy CBs are queued. As per the comments, these
> > have to go to the bypass list to keep rcu_barrier() from breaking.
> > 3. Because this causes the bypass list to overflow, all the lazy +
> > non-lazy CBs have to be flushed to the main ->cblist.
> > 
> > If only the non-lazy CBs are flushed, rcu_barrier() might break. If all
> > are flushed, then the lazy ones lose their laziness property as RCU will
> > be immediately kicked off to process GPs on their behalf.
> 
> Exactly why is this loss of laziness a problem?  You are doing that
> grace period for the 10,000 non-lazy callbacks anyway, so what difference
> could the five lazy callbacks possibly make?

It does not make any difference, I kind of answered my own question. I was
thinking out loud in this thread (Sorry).

> > This can be fixed by making rcu_barrier() queue both a lazy and a
> > non-lazy CB, and only flushing the non-lazy CBs to the ->cblist on a
> > bypass-list overflow, I think.
> 
> I don't see anything that needs fixing.  If you are doing a grace period
> anyway, just process the lazy callbacks along with the non-lazy callbacks.
> After all, you are paying for that grace period anyway.  And handling
> the lazy callbacks with that grace period means that you don't need a
> later grace period for those five lazy callbacks.  So running the lazy
> callbacks into the grace period required by the non-lazy callbacks is
> a pure win, right?
> 
> If it is not a pure win, please explain exactly what is being lost.

Agreed. As discussed on IRC, we only need to care about incrementing the lazy
length, and the flush will drop it to 1 or 0. No need to design for partial
flushing for now, as there is no use case.

> > Or, we flush both lazy and non-lazy CBs to the ->cblist just to keep it
> > simple. I think that should be OK since if there are a lot of CBs queued
> > in a short time, I don't think there is much opportunity for power
> > savings anyway IMHO.
> 
> I believe that it will be simpler, faster, and more energy efficient to
> do it this way, flushing everything from the bypass list to ->cblist.
> Again, leaving the lazy callbacks lying around means that there must be a
> later battery-draining grace period that might not be required otherwise.

Perfect.

> > >> Currently the struct looks like this:
> > >>
> > >> struct rcu_segcblist {
> > >>         struct rcu_head *head;
> > >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> > >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> > >> #ifdef CONFIG_RCU_NOCB_CPU
> > >>         atomic_long_t len;
> > >> #else
> > >>         long len;
> > >> #endif
> > >>         long seglen[RCU_CBLIST_NSEGS];
> > >>         u8 flags;
> > >> };
> > >>
> > >> So now, it would need to be like this?
> > >>
> > >> struct rcu_segcblist {
> > >>         struct rcu_head *head;
> > >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> > >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> > >> #ifdef CONFIG_RCU_NOCB_CPU
> > >>         struct rcu_head *lazy_head;
> > >>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
> > >>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
> > >>         atomic_long_t lazy_len;
> > >> #else
> > >>         long len;
> > >> #endif
> > >>         long seglen[RCU_CBLIST_NSEGS];
> > >>         u8 flags;
> > >> };
> > > 
> > > I freely confess that I am not loving this arrangement.  Large increase
> > > in state space, but little benefit that I can see.  Again, what am I
> > > missing here?
> > 
> > I somehow thought tracking GPs separately for the lazy CBs requires
> > duplication of the rcu_head pointers/double-pointers in this struct. As
> > you pointed out, just tracking the lazy length may be sufficient.
> 
> Here is hoping!
> 
> After all, if you thought that taking care of applications that need
> expediting of grace periods is scary, well, now...

Haha... my fear is that I don't know all the applications requiring expedited
GPs, and I keep getting surprised by new RCU usages that pop up in the system,
or by new systems.

For one, a number of tools and processes use ftrace directly in the system,
and it may not be practical to chase down every tool. Some of them start
tracing randomly in the system. Handling it in the kernel itself would be
best if possible.

Productive email discussion indeed! On to writing the code :P
 
Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 18:51                         ` Joel Fernandes
@ 2022-05-31 19:25                           ` Paul E. McKenney
  2022-05-31 21:29                             ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-31 19:25 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote:
> On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Here is hoping!
> > 
> > After all, if you thought that taking care of applications that need
> > expediting of grace periods is scary, well, now...
> 
> Haha... my fear is I don't know all the applications requiring expedited GP
> and I keep getting surprised by new RCU usages that pop up in the system, or
> new systems.
> 
> For one, a number of tools and processes use ftrace directly in the system,
> and it may not be practical to chase down every tool. Some of them start
> tracing randomly in the system. Handling it in-kernel itself would be best if
> possible.

Shouldn't it be possible to hook into the kernel code that runs when
the sysfs tracing files are updated?  As in, why not both?  ;-)

> Productive email discussion indeed! On to writing the code :P

Should be fun!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 19:25                           ` Paul E. McKenney
@ 2022-05-31 21:29                             ` Joel Fernandes
  2022-05-31 22:44                               ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 21:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 12:25:32PM -0700, Paul E. McKenney wrote:
> On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote:
> > On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > Here is hoping!
> > > 
> > > After all, if you thought that taking care of applications that need
> > > expediting of grace periods is scary, well, now...
> > 
> > Haha... my fear is I don't know all the applications requiring expedited GP
> > and I keep getting surprised by new RCU usages that pop up in the system, or
> > new systems.
> > 
> > For one, a number of tools and processes use ftrace directly in the system,
> > and it may not be practical to chase down every tool. Some of them start
> > tracing randomly in the system. Handling it in-kernel itself would be best if
> > possible.
> 
> Shouldn't it be possible to hook into the kernel code that runs when
> the sysfs tracing files are updated?  As in, why not both?  ;-)

Yes, point taken on this; let us do both solutions.
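
For the in-kernel side, something roughly like this is what I am imagining
(purely a sketch; rcu_lazy_disable()/rcu_lazy_enable() and the hook point are
hypothetical, not an existing interface):

/* Hypothetical sketch, not an existing kernel interface. */
static atomic_t rcu_lazy_disabled;

/* Called from wherever tracing is switched on/off in the kernel. */
void rcu_lazy_disable(void)
{
        atomic_inc(&rcu_lazy_disabled);
}

void rcu_lazy_enable(void)
{
        atomic_dec(&rcu_lazy_disabled);
}

void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
{
        if (atomic_read(&rcu_lazy_disabled)) {
                /* Behave exactly like call_rcu() while tracing is active. */
                call_rcu(head, func);
                return;
        }
        /* ... otherwise queue to the per-CPU lazy list as before ... */
}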

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 21:29                             ` Joel Fernandes
@ 2022-05-31 22:44                               ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 22:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Tue, May 31, 2022 at 5:29 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, May 31, 2022 at 12:25:32PM -0700, Paul E. McKenney wrote:
> > On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote:
> > > On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:
> >
> > [ . . . ]
> >
> > > > Here is hoping!
> > > >
> > > > After all, if you thought that taking care of applications that need
> > > > expediting of grace periods is scary, well, now...
> > >
> > > Haha... my fear is I don't know all the applications requiring expedited GP
> > > and I keep getting surprised by new RCU usages that pop up in the system, or
> > > new systems.
> > >
> > > For one, a number of tools and processes use ftrace directly in the system,
> > > and it may not be practical to chase down every tool. Some of them start
> > > tracing randomly in the system. Handling it in-kernel itself would be best if
> > > possible.
> >
> > Shouldn't it be possible to hook into the kernel code that runs when
> > the sysfs tracing files are updated?  As in, why not both?  ;-)
>
> Yes, point taken on this; let us do both solutions.

It appears Paul has already handled both the hotplug and the rcu_barrier
ordering issues for bypass lists, which makes the bypass list an obvious
choice for solving those problems... :-)

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-30 14:54     ` Joel Fernandes
@ 2022-06-01 14:12       ` Frederic Weisbecker
  2022-06-01 19:10         ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-06-01 14:12 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, rushikesh.s.kadam, neeraj.iitr10, paulmck,
	rostedt

On Mon, May 30, 2022 at 10:54:26AM -0400, Joel Fernandes wrote:
> >> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> >> +{
> >> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> >> +	struct rcu_lazy_pcp *rlp;
> >> +
> >> +	preempt_disable();
> >> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> >> +	preempt_enable();
> >>
> > Can we get rid of such explicit disabling/enabling preemption?
> 
> Ok I'll try. Last I checked, something needs to disable preemption to
> prevent warnings with sampling the current processor ID.

raw_cpu_ptr()
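
i.e. roughly the following (just a sketch of the direction, with the
double-free check and the length accounting from your patch left out):

void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
{
        struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
        struct rcu_lazy_pcp *rlp;

        /*
         * raw_cpu_ptr() skips the smp_processor_id() debug check, so no
         * preempt_disable()/preempt_enable() pair is needed.  Getting
         * migrated afterwards is harmless: llist_add() is atomic and the
         * callback only has to land on some CPU's lazy list.
         */
        rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);

        head->func = func;
        llist_add(&head->llist_node, &rlp->head);
}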

Thanks.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12 23:56   ` Paul E. McKenney
  2022-05-14 15:08     ` Joel Fernandes
@ 2022-06-01 14:24     ` Frederic Weisbecker
  2022-06-01 16:17       ` Paul E. McKenney
  2022-06-01 19:09       ` Joel Fernandes
  1 sibling, 2 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-06-01 14:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes (Google),
	rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, rostedt

On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > +	preempt_enable();
> > +
> > +	if (debug_rcu_head_queue((void *)head)) {
> > +		// Probable double call_rcu(), just leak.
> > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > +				__func__, head);
> > +
> > +		// Mark as success and leave.
> > +		return;
> > +	}
> > +
> > +	// Queue to per-cpu llist
> > +	head->func = func;
> > +	llist_add(&head->llist_node, &rlp->head);
> 
> Suppose that there are a bunch of preemptions between the preempt_enable()
> above and this point, so that the current CPU's list has lots of
> callbacks, but zero ->count.  Can that cause a problem?
> 
> In the past, this sort of thing has been an issue for rcu_barrier()
> and friends.

Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I
might have missed that somewhere in the patchset though.

Thanks.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-06-01 14:24     ` Frederic Weisbecker
@ 2022-06-01 16:17       ` Paul E. McKenney
  2022-06-01 19:09       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-06-01 16:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes (Google),
	rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, rostedt

On Wed, Jun 01, 2022 at 04:24:54PM +0200, Frederic Weisbecker wrote:
> On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > +	preempt_enable();
> > > +
> > > +	if (debug_rcu_head_queue((void *)head)) {
> > > +		// Probable double call_rcu(), just leak.
> > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > +				__func__, head);
> > > +
> > > +		// Mark as success and leave.
> > > +		return;
> > > +	}
> > > +
> > > +	// Queue to per-cpu llist
> > > +	head->func = func;
> > > +	llist_add(&head->llist_node, &rlp->head);
> > 
> > Suppose that there are a bunch of preemptions between the preempt_enable()
> > above and this point, so that the current CPU's list has lots of
> > callbacks, but zero ->count.  Can that cause a problem?
> > 
> > In the past, this sort of thing has been an issue for rcu_barrier()
> > and friends.
> 
> Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I
> might have missed that somewhere in the patchset though.

It should, but Joel deferred rcu_barrier() handling for this initial
prototype.  So be careful when playing with it.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-06-01 14:24     ` Frederic Weisbecker
  2022-06-01 16:17       ` Paul E. McKenney
@ 2022-06-01 19:09       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-06-01 19:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt

Hi Fred,

On Wed, Jun 1, 2022 at 10:24 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > +   preempt_enable();
> > > +
> > > +   if (debug_rcu_head_queue((void *)head)) {
> > > +           // Probable double call_rcu(), just leak.
> > > +           WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > +                           __func__, head);
> > > +
> > > +           // Mark as success and leave.
> > > +           return;
> > > +   }
> > > +
> > > +   // Queue to per-cpu llist
> > > +   head->func = func;
> > > +   llist_add(&head->llist_node, &rlp->head);
> >
> > Suppose that there are a bunch of preemptions between the preempt_enable()
> > above and this point, so that the current CPU's list has lots of
> > callbacks, but zero ->count.  Can that cause a problem?
> >
> > In the past, this sort of thing has been an issue for rcu_barrier()
> > and friends.
>
> Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I
> might have missed that somewhere in the patchset though.

Yes, it should. This initial prototype did not handle it; that was mentioned
in one of the replies to the cover letter. However, in v2 I am planning to use
Paul's idea of sticking these callbacks on the bypass list. That simplifies a
lot of things that the bypass machinery already handles (hotplug, rcu_barrier,
de-offload, etc.).
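
Conceptually, the barrier side then only needs the usual entrain step to force
the bypass list out first, roughly like this sketch (the helper names below are
approximations of the existing bypass machinery, not the exact upstream
symbols):

static void rcu_barrier_entrain_cpu(struct rcu_data *rdp)
{
        /* Fold the bypass list, lazy callbacks included, into ->cblist... */
        flush_bypass_to_cblist(rdp);

        /*
         * ...then queue the barrier callback behind them, so the grace
         * period it waits on also covers every formerly-lazy callback.
         */
        enqueue_barrier_callback(rdp);
}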

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-06-01 14:12       ` Frederic Weisbecker
@ 2022-06-01 19:10         ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-06-01 19:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Uladzislau Rezki, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Paul E. McKenney, Steven Rostedt

On Wed, Jun 1, 2022 at 10:12 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Mon, May 30, 2022 at 10:54:26AM -0400, Joel Fernandes wrote:
> > >> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > >> +{
> > >> +  struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > >> +  struct rcu_lazy_pcp *rlp;
> > >> +
> > >> +  preempt_disable();
> > >> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > >> +  preempt_enable();
> > >>
> > > Can we get rid of such explicit disabling/enabling preemption?
> >
> > Ok I'll try. Last I checked, something needs to disable preemption to
> > prevent warnings with sampling the current processor ID.
>
> raw_cpu_ptr()

Indeed, thanks!

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (14 preceding siblings ...)
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
@ 2022-06-13 18:53 ` Joel Fernandes
  2022-06-13 22:48   ` Paul E. McKenney
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-06-13 18:53 UTC (permalink / raw)
  To: rcu; +Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck, rostedt

On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> Hello!
> Please find the proof of concept version of call_rcu_lazy() attached. This
> gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> Rushikesh Kadam from Intel for investigating it with me.

Just a status update: we're reworking this code to use the bypass lists, which
takes care of a lot of corner cases. I've had to change the rcuog thread code
to handle wakeups a bit differently and such. Initial testing is looking good.
I am currently working on early flushing of the lazy CBs based on memory
pressure.
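
For the memory-pressure part, the rough shape I am trying is a shrinker that
drains the lazy lists when reclaim kicks in (a sketch only; lazy_rcu_flush_all()
and lazy_rcu_pending_count() are placeholders, not the real code):

#include <linux/shrinker.h>

static unsigned long lazy_rcu_shrink_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        /* Flush every CPU's lazy list and report how many callbacks moved. */
        return lazy_rcu_flush_all();
}

static unsigned long lazy_rcu_shrink_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
{
        /* Report how many lazy callbacks are currently pending. */
        return lazy_rcu_pending_count();
}

/* Registered from the lazy init path. */
static struct shrinker lazy_rcu_shrinker = {
        .count_objects  = lazy_rcu_shrink_count,
        .scan_objects   = lazy_rcu_shrink_scan,
        .seeks          = DEFAULT_SEEKS,
};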

Should be hopefully sending v2 soon!

thanks,

 - Joel



> 
> Some numbers below:
> 
> Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> The observation is that due to a 'trickle down' effect of RCU callbacks, the
> system is very lightly loaded but constantly running few RCU callbacks very
> often. This confuses the power management hardware that the system is active,
> when it is in fact idle.
> 
> For example, when ChromeOS screen is off and user is not doing anything on the
> system, we can see big power savings.
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> CorWatt = 0.04
> 
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41
> CorWatt = 0.03
> 
> Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> can see that the display pipeline is constantly doing RCU callback queuing due
> to open/close of file descriptors associated with graphics buffers. This is
> attributed to the file_free_rcu() path which this patch series also touches.
> 
> This patch series adds a simple but effective, and lockless implementation of
> RCU callback batching. On memory pressure, timeout or queue growing too big, we
> initiate a flush of one or more per-CPU lists.
> 
> Similar results can be achieved by increasing jiffies_till_first_fqs, however
> that also has the effect of slowing down RCU. Especially I saw huge slow down
> of function graph tracer when increasing that.
> 
> One drawback of this series is, if another frequent RCU callback creeps up in
> the future, that's not lazy, then that will again hurt the power. However, I
> believe identifying and fixing those is a more reasonable approach than slowing
> RCU down for the whole system.
> 
> NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> at runtime to turn it on or off globally. It is default to on. Further, please
> use the sysctls in lazy.c for further tuning of parameters that effect the
> flushing.
> 
> Disclaimer 1: Don't boot your personal system on it yet anticipating power
> savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> I believe this series should be good for general testing.
> 
> Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> of review and agreements.
> 
> Joel Fernandes (Google) (14):
>   rcu: Add a lock-less lazy RCU implementation
>   workqueue: Add a lazy version of queue_rcu_work()
>   block/blk-ioc: Move call_rcu() to call_rcu_lazy()
>   cred: Move call_rcu() to call_rcu_lazy()
>   fs: Move call_rcu() to call_rcu_lazy() in some paths
>   kernel: Move various core kernel usages to call_rcu_lazy()
>   security: Move call_rcu() to call_rcu_lazy()
>   net/core: Move call_rcu() to call_rcu_lazy()
>   lib: Move call_rcu() to call_rcu_lazy()
>   kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
>   i915: Move call_rcu() to call_rcu_lazy()
>   rcu/kfree: remove useless monitor_todo flag
>   rcu/kfree: Fix kfree_rcu_shrink_count() return value
>   DEBUG: Toggle rcu_lazy and tune at runtime
> 
>  block/blk-ioc.c                            |   2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
>  fs/dcache.c                                |   4 +-
>  fs/eventpoll.c                             |   2 +-
>  fs/file_table.c                            |   3 +-
>  fs/inode.c                                 |   2 +-
>  include/linux/rcupdate.h                   |   6 +
>  include/linux/sched/sysctl.h               |   4 +
>  include/linux/workqueue.h                  |   1 +
>  kernel/cred.c                              |   2 +-
>  kernel/exit.c                              |   2 +-
>  kernel/pid.c                               |   2 +-
>  kernel/rcu/Kconfig                         |   8 ++
>  kernel/rcu/Makefile                        |   1 +
>  kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
>  kernel/rcu/rcu.h                           |   5 +
>  kernel/rcu/tree.c                          |  28 ++--
>  kernel/sysctl.c                            |  23 ++++
>  kernel/time/posix-timers.c                 |   2 +-
>  kernel/workqueue.c                         |  25 ++++
>  lib/radix-tree.c                           |   2 +-
>  lib/xarray.c                               |   2 +-
>  net/core/dst.c                             |   2 +-
>  security/security.c                        |   2 +-
>  security/selinux/avc.c                     |   4 +-
>  25 files changed, 255 insertions(+), 34 deletions(-)
>  create mode 100644 kernel/rcu/lazy.c
> 
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-06-13 18:53 ` Joel Fernandes
@ 2022-06-13 22:48   ` Paul E. McKenney
  2022-06-16 16:26     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-06-13 22:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote:
> On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> > Hello!
> > Please find the proof of concept version of call_rcu_lazy() attached. This
> > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > Rushikesh Kadam from Intel for investigating it with me.
> 
> Just a status update: we're reworking this code to use the bypass lists, which
> takes care of a lot of corner cases. I've had to change the rcuog thread code
> to handle wakeups a bit differently and such. Initial testing is looking good.
> I am currently working on early flushing of the lazy CBs based on memory
> pressure.
> 
> Should be hopefully sending v2 soon!

Looking forward to seeing it!

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> 
> > 
> > Some numbers below:
> > 
> > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > system is very lightly loaded but constantly running few RCU callbacks very
> > often. This confuses the power management hardware that the system is active,
> > when it is in fact idle.
> > 
> > For example, when ChromeOS screen is off and user is not doing anything on the
> > system, we can see big power savings.
> > Before:
> > Pk%pc10 = 72.13
> > PkgWatt = 0.58
> > CorWatt = 0.04
> > 
> > After:
> > Pk%pc10 = 81.28
> > PkgWatt = 0.41
> > CorWatt = 0.03
> > 
> > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > can see that the display pipeline is constantly doing RCU callback queuing due
> > to open/close of file descriptors associated with graphics buffers. This is
> > attributed to the file_free_rcu() path which this patch series also touches.
> > 
> > This patch series adds a simple but effective, and lockless implementation of
> > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > initiate a flush of one or more per-CPU lists.
> > 
> > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > of function graph tracer when increasing that.
> > 
> > One drawback of this series is, if another frequent RCU callback creeps up in
> > the future, that's not lazy, then that will again hurt the power. However, I
> > believe identifying and fixing those is a more reasonable approach than slowing
> > RCU down for the whole system.
> > 
> > NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> > at runtime to turn it on or off globally. It is default to on. Further, please
> > use the sysctls in lazy.c for further tuning of parameters that effect the
> > flushing.
> > 
> > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > I believe this series should be good for general testing.
> > 
> > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> > of review and agreements.
> > 
> > Joel Fernandes (Google) (14):
> >   rcu: Add a lock-less lazy RCU implementation
> >   workqueue: Add a lazy version of queue_rcu_work()
> >   block/blk-ioc: Move call_rcu() to call_rcu_lazy()
> >   cred: Move call_rcu() to call_rcu_lazy()
> >   fs: Move call_rcu() to call_rcu_lazy() in some paths
> >   kernel: Move various core kernel usages to call_rcu_lazy()
> >   security: Move call_rcu() to call_rcu_lazy()
> >   net/core: Move call_rcu() to call_rcu_lazy()
> >   lib: Move call_rcu() to call_rcu_lazy()
> >   kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
> >   i915: Move call_rcu() to call_rcu_lazy()
> >   rcu/kfree: remove useless monitor_todo flag
> >   rcu/kfree: Fix kfree_rcu_shrink_count() return value
> >   DEBUG: Toggle rcu_lazy and tune at runtime
> > 
> >  block/blk-ioc.c                            |   2 +-
> >  drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
> >  fs/dcache.c                                |   4 +-
> >  fs/eventpoll.c                             |   2 +-
> >  fs/file_table.c                            |   3 +-
> >  fs/inode.c                                 |   2 +-
> >  include/linux/rcupdate.h                   |   6 +
> >  include/linux/sched/sysctl.h               |   4 +
> >  include/linux/workqueue.h                  |   1 +
> >  kernel/cred.c                              |   2 +-
> >  kernel/exit.c                              |   2 +-
> >  kernel/pid.c                               |   2 +-
> >  kernel/rcu/Kconfig                         |   8 ++
> >  kernel/rcu/Makefile                        |   1 +
> >  kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
> >  kernel/rcu/rcu.h                           |   5 +
> >  kernel/rcu/tree.c                          |  28 ++--
> >  kernel/sysctl.c                            |  23 ++++
> >  kernel/time/posix-timers.c                 |   2 +-
> >  kernel/workqueue.c                         |  25 ++++
> >  lib/radix-tree.c                           |   2 +-
> >  lib/xarray.c                               |   2 +-
> >  net/core/dst.c                             |   2 +-
> >  security/security.c                        |   2 +-
> >  security/selinux/avc.c                     |   4 +-
> >  25 files changed, 255 insertions(+), 34 deletions(-)
> >  create mode 100644 kernel/rcu/lazy.c
> > 
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-06-13 22:48   ` Paul E. McKenney
@ 2022-06-16 16:26     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-06-16 16:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Mon, Jun 13, 2022 at 6:48 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote:
> > On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> > > Hello!
> > > Please find the proof of concept version of call_rcu_lazy() attached. This
> > > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > > Rushikesh Kadam from Intel for investigating it with me.
> >
> > Just a status update: we're reworking this code to use the bypass lists, which
> > takes care of a lot of corner cases. I've had to change the rcuog thread code
> > to handle wakeups a bit differently and such. Initial testing is looking good.
> > I am currently working on early flushing of the lazy CBs based on memory
> > pressure.
> >
> > Should be hopefully sending v2 soon!
>
> Looking forward to seeing it!

Looks like rebasing on Paul's dev branch is needed for me to even apply and
test this on our downstream kernels. Why, you ask? Because our downstream
folks are indeed using upstream 5.19-rc3. =) The rebase finally succeeded and
now I can test lazy bypass downstream before sending out the next RFC...

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-14 14:25                     ` Joel Fernandes
  2022-05-14 19:01                       ` Uladzislau Rezki
@ 2022-08-09  2:25                       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-08-09  2:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Steven Rostedt

On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > Never mind. I port it into 5.10
> > > > > > > >
> > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > > > > ditto.
> > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > > >
> > > > > > > No problem. The two kfree_rcu patches are not so important in this series.
> > > > > > > So I have backported them into my 5.10 kernel because the latest kernel
> > > > > > > is not so easy to bring up and run on my device :)
> > > > > >
> > > > > > Actually I was going to write here, apparently some tests are showing
> > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > > > > good to drop those for your initial testing!
> > > > > >
> > > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not
> > > > > so important for kfree_rcu() because we do batch requests there anyway.
> > > > > One thing that I would like to improve in kfree_rcu() is a better utilization
> > > > > of page slots.
> > > > >
> > > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > > >
> > > >
> > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test
> > > > case i have checked was a "static image" use case. Condition is: screen ON with
> > > > disabled all connectivity.
> > > >
> > > > 1.
> > > > First data i took is how many wakeups cause an RCU subsystem during this test case
> > > > when everything is pretty idling. Duration is 360 seconds:
> > > >
> > > > <snip>
> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > >
> > > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > > that we want something like this during testing. Appreciate it. Also,
> > > if we could dump timer wakeup reasons/callbacks, that would be awesome.
> > >
> > Please find it in the attachment. I wrote it once upon a time and make use of it
> > to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
> > so just compile it.
> >
> > How to use it:
> > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> > 2. ./perf script -i ./perf.data > foo.script
> > 3. ./perf_script_parser ./foo.script
>
> Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
> use "perf sched record" and "perf sched report --sort latency" to get wakeup
> latencies.
>
> > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > > share that with you on request as well. That is probably not in a
> > > shape for mainline though (Makefile missing and such).
> > >
> > Yep, please share!
>
> Sure, check out my bcc repo from here:
> https://github.com/joelagnel/bcc

Kind of random, but this is why I love sharing my tools with others. I
actually forgot that I checked in the code for my rcutop here, and
this email is how I found it, lol :)

Did you happen to try it out too btw? :)

Thanks,

Joel


>
> Build this project, then cd libbpf-tools and run make. This should produce a
> static binary 'rcutop' which you can push to Android. You have to build for
> ARM, which bcc should have instructions for. I have also included the rcutop
> diff at the end of this file for reference.
>
> > > > 2.
> > > > Please find attached two power plots for the same test case. One is for regular
> > > > use of call_rcu() and the other one is the "lazy" usage. There is a slight difference
> > > > in power, about 2mA. Even though it is rather small, it is detectable and consistent,
> > > > which is also important, so it proves the concept. Please note it might be more power
> > > > efficient on other arches and platforms, because of differences in HW design related
> > > > to the C-states of the CPU and the energy needed to enter/exit those deep power states.
> > >
> > > Nice! I wonder if you still have other frequent callbacks on your
> > > system that are getting queued during the tests. Could you dump the
> > > rcu_callbacks trace event and see if you have any CBs frequently
> > > called that the series did not address?
> > >
> > I have pretty much like this:
> > <snip>
> >     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
> >     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
> >     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
> >     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
> >     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> >     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> >
> > so almost everything is batched.
>
> Nice, glad to know this is happening even without the kfree_rcu() changes.
>
> > > Also, one more thing I was curious about is - do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > >
> > I did some experimenting with pinning the nocb threads to the little cluster. For
> > idle use cases I did not see any power gain. For heavy ones I see that the "big"
> > CPUs are also invoking callbacks and busy with them quite often. Probably I should
> > think of some use case where I can detect the power difference. If you have
> > something, please let me know.
>
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
>
> > > > So a front-lazy-batching is something worth to have, IMHO :)
> > >
> > > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> > > it, we can add your data about your investigation to the LPC slides
> > > as well (with attribution to you).
> > >
> > No problem; since we will give a talk at LPC, the more data we have, the
> > more convincing we are :)
>
> I forget, but you did mention you are OK with presenting with us, right? It
> would be great if you present your data when we come to Android, if you are OK
> with it. I'll start a common slide deck soon and share it so that you,
> Rushikesh, and I can add slides to it and present together.
>
> thanks,
>
>  - Joel
>
> ---8<-----------------------
>
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] rcutop
>
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  libbpf-tools/Makefile     |   1 +
>  libbpf-tools/rcutop.bpf.c |  56 ++++++++
>  libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
>  libbpf-tools/rcutop.h     |   8 ++
>  4 files changed, 353 insertions(+)
>  create mode 100644 libbpf-tools/rcutop.bpf.c
>  create mode 100644 libbpf-tools/rcutop.c
>  create mode 100644 libbpf-tools/rcutop.h
>
> diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
> index e60ec409..0d4cdff2 100644
> --- a/libbpf-tools/Makefile
> +++ b/libbpf-tools/Makefile
> @@ -42,6 +42,7 @@ APPS = \
>         klockstat \
>         ksnoop \
>         llcstat \
> +       rcutop \
>         mountsnoop \
>         numamove \
>         offcputime \
> diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
> new file mode 100644
> index 00000000..8287bbe2
> --- /dev/null
> +++ b/libbpf-tools/rcutop.bpf.c
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +/* Copyright (c) 2021 Hengqi Chen */
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_tracing.h>
> +#include "rcutop.h"
> +#include "maps.bpf.h"
> +
> +#define MAX_ENTRIES    10240
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(max_entries, MAX_ENTRIES);
> +       __type(key, void *);
> +       __type(value, int);
> +} cbs_queued SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(max_entries, MAX_ENTRIES);
> +       __type(key, void *);
> +       __type(value, int);
> +} cbs_executed SEC(".maps");
> +
> +SEC("tracepoint/rcu/rcu_callback")
> +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
> +{
> +       void *key = ctx->func;
> +       int *val = NULL;
> +       static const int zero;
> +
> +       val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
> +       if (val) {
> +               __sync_fetch_and_add(val, 1);
> +       }
> +
> +       return 0;
> +}
> +
> +SEC("tracepoint/rcu/rcu_invoke_callback")
> +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
> +{
> +       void *key = ctx->func;
> +       int *val;
> +       int zero = 0;
> +
> +       val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
> +       if (val) {
> +               __sync_fetch_and_add(val, 1);
> +       }
> +
> +       return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
> new file mode 100644
> index 00000000..35795875
> --- /dev/null
> +++ b/libbpf-tools/rcutop.c
> @@ -0,0 +1,288 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +
> +/*
> + * rcutop
> + * Copyright (c) 2022 Joel Fernandes
> + *
> + * 05-May-2022   Joel Fernandes   Created this.
> + */
> +#include <argp.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +#include "rcutop.h"
> +#include "rcutop.skel.h"
> +#include "btf_helpers.h"
> +#include "trace_helpers.h"
> +
> +#define warn(...) fprintf(stderr, __VA_ARGS__)
> +#define OUTPUT_ROWS_LIMIT 10240
> +
> +static volatile sig_atomic_t exiting = 0;
> +
> +static bool clear_screen = true;
> +static int output_rows = 20;
> +static int interval = 1;
> +static int count = 99999999;
> +static bool verbose = false;
> +
> +const char *argp_program_version = "rcutop 0.1";
> +const char *argp_program_bug_address =
> +"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
> +const char argp_program_doc[] =
> +"Show RCU callback queuing and execution stats.\n"
> +"\n"
> +"USAGE: rcutop [-h] [interval] [count]\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    rcutop            # rcu activity top, refresh every 1s\n"
> +"    rcutop 5 10       # 5s summaries, 10 times\n";
> +
> +static const struct argp_option opts[] = {
> +       { "noclear", 'C', NULL, 0, "Don't clear the screen" },
> +       { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
> +       { "verbose", 'v', NULL, 0, "Verbose debug output" },
> +       { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
> +       {},
> +};
> +
> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +       long rows;
> +       static int pos_args;
> +
> +       switch (key) {
> +               case 'C':
> +                       clear_screen = false;
> +                       break;
> +               case 'v':
> +                       verbose = true;
> +                       break;
> +               case 'h':
> +                       argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
> +                       break;
> +               case 'r':
> +                       errno = 0;
> +                       rows = strtol(arg, NULL, 10);
> +                       if (errno || rows <= 0) {
> +                               warn("invalid rows: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       output_rows = rows;
> +                       if (output_rows > OUTPUT_ROWS_LIMIT)
> +                               output_rows = OUTPUT_ROWS_LIMIT;
> +                       break;
> +               case ARGP_KEY_ARG:
> +                       errno = 0;
> +                       if (pos_args == 0) {
> +                               interval = strtol(arg, NULL, 10);
> +                               if (errno || interval <= 0) {
> +                                       warn("invalid interval\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else if (pos_args == 1) {
> +                               count = strtol(arg, NULL, 10);
> +                               if (errno || count <= 0) {
> +                                       warn("invalid count\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else {
> +                               warn("unrecognized positional argument: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       pos_args++;
> +                       break;
> +               default:
> +                       return ARGP_ERR_UNKNOWN;
> +       }
> +       return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +       if (level == LIBBPF_DEBUG && !verbose)
> +               return 0;
> +       return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +       exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +               struct rcutop_bpf *obj)
> +{
> +       void *key, **prev_key = NULL;
> +       int n, err = 0;
> +       int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +       int efd = bpf_map__fd(obj->maps.cbs_executed);
> +       const struct ksym *ksym;
> +       FILE *f;
> +       time_t t;
> +       struct tm *tm;
> +       char ts[16], buf[256];
> +
> +       f = fopen("/proc/loadavg", "r");
> +       if (f) {
> +               time(&t);
> +               tm = localtime(&t);
> +               strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +               memset(buf, 0, sizeof(buf));
> +               n = fread(buf, 1, sizeof(buf), f);
> +               if (n)
> +                       printf("%8s loadavg: %s\n", ts, buf);
> +               fclose(f);
> +       }
> +
> +       printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +       while (1) {
> +               int qcount = 0, ecount = 0;
> +
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +               if (err) {
> +                       warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               prev_key = &key;
> +
> +               bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +               ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +               printf("%-32s %-6d %-6d\n",
> +                               ksym ? ksym->name : "Unknown",
> +                               qcount, ecount);
> +       }
> +       printf("\n");
> +       prev_key = NULL;
> +       while (1) {
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               err = bpf_map_delete_elem(qfd, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               continue;
> +                       }
> +                       warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               bpf_map_delete_elem(efd, &key);
> +               prev_key = &key;
> +       }
> +
> +       return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +       static const struct argp argp = {
> +               .options = opts,
> +               .parser = parse_arg,
> +               .doc = argp_program_doc,
> +       };
> +       struct rcutop_bpf *obj;
> +       int err;
> +       struct syms_cache *syms_cache = NULL;
> +       struct ksyms *ksyms = NULL;
> +
> +       err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +       if (err)
> +               return err;
> +
> +       libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +       libbpf_set_print(libbpf_print_fn);
> +
> +       err = ensure_core_btf(&open_opts);
> +       if (err) {
> +               fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +               return 1;
> +       }
> +
> +       obj = rcutop_bpf__open_opts(&open_opts);
> +       if (!obj) {
> +               warn("failed to open BPF object\n");
> +               return 1;
> +       }
> +
> +       err = rcutop_bpf__load(obj);
> +       if (err) {
> +               warn("failed to load BPF object: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       err = rcutop_bpf__attach(obj);
> +       if (err) {
> +               warn("failed to attach BPF programs: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       ksyms = ksyms__load();
> +       if (!ksyms) {
> +               fprintf(stderr, "failed to load kallsyms\n");
> +               goto cleanup;
> +       }
> +
> +       syms_cache = syms_cache__new(0);
> +       if (!syms_cache) {
> +               fprintf(stderr, "failed to create syms_cache\n");
> +               goto cleanup;
> +       }
> +
> +       if (signal(SIGINT, sig_int) == SIG_ERR) {
> +               warn("can't set signal handler: %s\n", strerror(errno));
> +               err = 1;
> +               goto cleanup;
> +       }
> +
> +       while (1) {
> +               sleep(interval);
> +
> +               if (clear_screen) {
> +                       err = system("clear");
> +                       if (err)
> +                               goto cleanup;
> +               }
> +
> +               err = print_stat(ksyms, syms_cache, obj);
> +               if (err)
> +                       goto cleanup;
> +
> +               count--;
> +               if (exiting || !count)
> +                       goto cleanup;
> +       }
> +
> +cleanup:
> +       rcutop_bpf__destroy(obj);
> +       cleanup_core_btf(&open_opts);
> +
> +       return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX       4096
> +#define TASK_COMM_LEN  16
> +
> +#endif /* __RCUTOP_H */
> --
> 2.36.0.550.gb090851708-goog
>

> +                       if (errno || rows <= 0) {
> +                               warn("invalid rows: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       output_rows = rows;
> +                       if (output_rows > OUTPUT_ROWS_LIMIT)
> +                               output_rows = OUTPUT_ROWS_LIMIT;
> +                       break;
> +               case ARGP_KEY_ARG:
> +                       errno = 0;
> +                       if (pos_args == 0) {
> +                               interval = strtol(arg, NULL, 10);
> +                               if (errno || interval <= 0) {
> +                                       warn("invalid interval\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else if (pos_args == 1) {
> +                               count = strtol(arg, NULL, 10);
> +                               if (errno || count <= 0) {
> +                                       warn("invalid count\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else {
> +                               warn("unrecognized positional argument: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       pos_args++;
> +                       break;
> +               default:
> +                       return ARGP_ERR_UNKNOWN;
> +       }
> +       return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +       if (level == LIBBPF_DEBUG && !verbose)
> +               return 0;
> +       return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +       exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +               struct rcutop_bpf *obj)
> +{
> +       void *key, **prev_key = NULL;
> +       int n, err = 0;
> +       int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +       int efd = bpf_map__fd(obj->maps.cbs_executed);
> +       const struct ksym *ksym;
> +       FILE *f;
> +       time_t t;
> +       struct tm *tm;
> +       char ts[16], buf[256];
> +
> +       f = fopen("/proc/loadavg", "r");
> +       if (f) {
> +               time(&t);
> +               tm = localtime(&t);
> +               strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +               memset(buf, 0, sizeof(buf));
> +               n = fread(buf, 1, sizeof(buf), f);
> +               if (n)
> +                       printf("%8s loadavg: %s\n", ts, buf);
> +               fclose(f);
> +       }
> +
> +       printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +       while (1) {
> +               int qcount = 0, ecount = 0;
> +
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +               if (err) {
> +                       warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               prev_key = &key;
> +
> +               bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +               ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +               printf("%-32s %-6d %-6d\n",
> +                               ksym ? ksym->name : "Unknown",
> +                               qcount, ecount);
> +       }
> +       printf("\n");
> +       prev_key = NULL;
> +       while (1) {
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               err = bpf_map_delete_elem(qfd, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               continue;
> +                       }
> +                       warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               bpf_map_delete_elem(efd, &key);
> +               prev_key = &key;
> +       }
> +
> +       return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +       static const struct argp argp = {
> +               .options = opts,
> +               .parser = parse_arg,
> +               .doc = argp_program_doc,
> +       };
> +       struct rcutop_bpf *obj;
> +       int err;
> +       struct syms_cache *syms_cache = NULL;
> +       struct ksyms *ksyms = NULL;
> +
> +       err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +       if (err)
> +               return err;
> +
> +       libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +       libbpf_set_print(libbpf_print_fn);
> +
> +       err = ensure_core_btf(&open_opts);
> +       if (err) {
> +               fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +               return 1;
> +       }
> +
> +       obj = rcutop_bpf__open_opts(&open_opts);
> +       if (!obj) {
> +               warn("failed to open BPF object\n");
> +               return 1;
> +       }
> +
> +       err = rcutop_bpf__load(obj);
> +       if (err) {
> +               warn("failed to load BPF object: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       err = rcutop_bpf__attach(obj);
> +       if (err) {
> +               warn("failed to attach BPF programs: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       ksyms = ksyms__load();
> +       if (!ksyms) {
> +               fprintf(stderr, "failed to load kallsyms\n");
> +               goto cleanup;
> +       }
> +
> +       syms_cache = syms_cache__new(0);
> +       if (!syms_cache) {
> +               fprintf(stderr, "failed to create syms_cache\n");
> +               goto cleanup;
> +       }
> +
> +       if (signal(SIGINT, sig_int) == SIG_ERR) {
> +               warn("can't set signal handler: %s\n", strerror(errno));
> +               err = 1;
> +               goto cleanup;
> +       }
> +
> +       while (1) {
> +               sleep(interval);
> +
> +               if (clear_screen) {
> +                       err = system("clear");
> +                       if (err)
> +                               goto cleanup;
> +               }
> +
> +               err = print_stat(ksyms, syms_cache, obj);
> +               if (err)
> +                       goto cleanup;
> +
> +               count--;
> +               if (exiting || !count)
> +                       goto cleanup;
> +       }
> +
> +cleanup:
> +       rcutop_bpf__destroy(obj);
> +       cleanup_core_btf(&open_opts);
> +
> +       return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX       4096
> +#define TASK_COMM_LEN  16
> +
> +#endif /* __RCUTOP_H */
> --
> 2.36.0.550.gb090851708-goog
>
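
For anyone who wants to try the tool: it builds like the other libbpf-tools
(a sketch, assuming a bcc checkout with this patch applied and the usual
libbpf-tools build dependencies such as clang and bpftool):

  cd libbpf-tools
  make rcutop
  sudo ./rcutop 5 10      # 5s summaries, 10 times, per the usage text above

It prints per-callback-function counts of how many callbacks were queued
(rcu_callback tracepoint) versus invoked (rcu_invoke_callback tracepoint)
during each interval, then clears the maps for the next interval.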



Thread overview: 73+ messages
2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
2022-05-12 23:56   ` Paul E. McKenney
2022-05-14 15:08     ` Joel Fernandes
2022-05-14 16:34       ` Paul E. McKenney
2022-05-27 23:12         ` Joel Fernandes
2022-05-28 17:57           ` Paul E. McKenney
2022-05-30 14:48             ` Joel Fernandes
2022-05-30 16:42               ` Paul E. McKenney
2022-05-31  2:12                 ` Joel Fernandes
2022-05-31  4:26                   ` Paul E. McKenney
2022-05-31 16:11                     ` Joel Fernandes
2022-05-31 16:45                       ` Paul E. McKenney
2022-05-31 18:51                         ` Joel Fernandes
2022-05-31 19:25                           ` Paul E. McKenney
2022-05-31 21:29                             ` Joel Fernandes
2022-05-31 22:44                               ` Joel Fernandes
2022-06-01 14:24     ` Frederic Weisbecker
2022-06-01 16:17       ` Paul E. McKenney
2022-06-01 19:09       ` Joel Fernandes
2022-05-17  9:07   ` Uladzislau Rezki
2022-05-30 14:54     ` Joel Fernandes
2022-06-01 14:12       ` Frederic Weisbecker
2022-06-01 19:10         ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
2022-05-12 23:58   ` Paul E. McKenney
2022-05-14 14:44     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
2022-05-13  0:00   ` Paul E. McKenney
2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
2022-05-13  0:02   ` Paul E. McKenney
2022-05-14 14:41     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
2022-05-13  0:07   ` Paul E. McKenney
2022-05-14 14:40     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy() Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 07/14] security: Move call_rcu() " Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 08/14] net/core: " Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 09/14] lib: " Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
2022-05-13  0:12   ` Paul E. McKenney
2022-05-13 14:55     ` Uladzislau Rezki
2022-05-14 14:33       ` Joel Fernandes
2022-05-14 19:10         ` Uladzislau Rezki
2022-05-12  3:04 ` [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
2022-05-13 14:53   ` Uladzislau Rezki
2022-05-14 14:35     ` Joel Fernandes
2022-05-14 19:48       ` Uladzislau Rezki
2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
2022-05-13 14:54   ` Uladzislau Rezki
2022-05-14 14:34     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
2022-05-13  0:16   ` Paul E. McKenney
2022-05-14 14:38     ` Joel Fernandes
2022-05-14 16:21       ` Paul E. McKenney
2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
2022-05-12 13:09   ` Uladzislau Rezki
2022-05-12 13:56     ` Uladzislau Rezki
2022-05-12 14:03       ` Joel Fernandes
2022-05-12 14:37         ` Uladzislau Rezki
2022-05-12 16:09           ` Joel Fernandes
2022-05-12 16:32             ` Uladzislau Rezki
     [not found]               ` <Yn5e7w8NWzThUARb@pc638.lan>
2022-05-13 14:51                 ` Joel Fernandes
2022-05-13 15:43                   ` Uladzislau Rezki
2022-05-14 14:25                     ` Joel Fernandes
2022-05-14 19:01                       ` Uladzislau Rezki
2022-08-09  2:25                       ` Joel Fernandes
2022-05-13  0:23   ` Paul E. McKenney
2022-05-13 14:45     ` Joel Fernandes
2022-06-13 18:53 ` Joel Fernandes
2022-06-13 22:48   ` Paul E. McKenney
2022-06-16 16:26     ` Joel Fernandes
