* [RFC PATCH 00/86] Make the kernel preemptible
@ 2023-11-07 21:56 Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 01/86] Revert "riscv: support PREEMPT_DYNAMIC with static keys" Ankur Arora
                   ` (62 more replies)
  0 siblings, 63 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Hi,

We have two models of preemption: voluntary and full (and RT which is
a fuller form of full preemption.) In this series -- which is based
on Thomas' PoC (see [1]) -- we try to unify the two by letting the
scheduler enforce policy for the voluntary preemption models as well.

(Note that this is about preemption when executing in the kernel.
Userspace is always preemptible.)

Background
==

Why? Because both of these preemption mechanisms are almost entirely disjoint.
There are four main sets of preemption points in the kernel:

 1. return to user
 2. explicit preemption points (cond_resched() and its ilk)
 3. return to kernel (tick/IPI/irq at irqexit)
 4. end of non-preemptible sections at (preempt_count() == preempt_offset)

Voluntary preemption uses mechanisms 1 and 2. Full preemption
uses 1, 3 and 4. In addition, both use cond_resched_{rcu,lock,rwlock*},
which can be all things to all people because they internally
contain 2 and 4.
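
For reference, here is a rough sketch of what mechanisms 2 and 4 boil
down to today (simplified and with illustrative names, not the kernel's
literal code):

  /* 2. explicit preemption point, i.e. a simplified cond_resched() */
  static inline int explicit_preemption_point(void)
  {
          if (should_resched(0)) {   /* NEED_RESCHED set && preempt_count() == 0 */
                  preempt_schedule_common();
                  return 1;
          }
          return 0;
  }

  /*
   * 4. end of a non-preemptible section, i.e. a simplified
   *    preempt_enable() with CONFIG_PREEMPTION.
   */
  static inline void end_of_non_preemptible_section(void)
  {
          barrier();
          /* true when the count hits zero and NEED_RESCHED is set */
          if (unlikely(preempt_count_dec_and_test()))
                  __preempt_schedule();
  }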

Now since there's no ideal placement of explicit preemption points,
they tend to be randomly spread over code and accumulate over time,
as they are added when latency problems are seen. Plus, fear of
regressions makes them difficult to remove.
(Presumably, asymptotically they would spread out evenly across the
instruction stream!)

In voluntary models, the scheduler's job is to match the demand
side of preemption points (a task that needs to be scheduled) with
the supply side (a task which calls cond_resched().)

Full preemption models track the preemption count, so the scheduler
always knows if it is safe to preempt and can drive preemption
itself (ex. via dynamic preemption points in 3.)

Design
==

As Thomas outlines in [1], to unify the preemption models we want
to always have the preempt_count enabled and allow the scheduler
to drive preemption policy based on the model in effect.

Policies:

- preemption=none: run to completion
- preemption=voluntary: run to completion, unless a task of higher
  sched-class awaits
- preemption=full: optimized for low-latency. Preempt whenever a higher
  priority task awaits.

To do this, add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
scheduler to mark that a reschedule is needed but is deferred until
the task finishes executing in the kernel -- voluntary preemption,
as it were.

The TIF_NEED_RESCHED flag is evaluated at all three of the remaining
preemption points. TIF_NEED_RESCHED_LAZY only needs to be evaluated
at ret-to-user.

         ret-to-user    ret-to-kernel    preempt_count()
none           Y              N                N
voluntary      Y              Y                Y
full           Y              Y                Y
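
A rough sketch of the flag handling above (illustrative fragments only;
_TIF_NEED_RESCHED_LAZY is the new bit added by this series, and the
actual helpers in the patches differ):

  /* ret-to-user (exit_to_user_mode_loop()): both bits fold into a reschedule */
  if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
          schedule();

  /*
   * ret-to-kernel (irqentry_exit()): only the eager bit preempts, and
   * only outside of non-preemptible sections.
   */
  if (IS_ENABLED(CONFIG_PREEMPTION) && !preempt_count() &&
      test_thread_flag(TIF_NEED_RESCHED))
          preempt_schedule_irq();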


There's just one remaining issue: now that explicit preemption points are
gone, processes that spend a long time in the kernel have no way to give
up the CPU.

For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.

For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick.
(This would cause preemption either at ret-to-kernel, or if the task is in
a non-preemptible section, when it exits that section.)
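
Roughly (an illustrative sketch of the tick-time upgrade, not the actual
patch; the helper name is made up):

  /* called from the scheduler tick for the currently running task */
  static void upgrade_lazy_resched(struct task_struct *curr)
  {
          /*
           * Marked for lazy rescheduling but still running in the kernel
           * at the next tick: escalate to TIF_NEED_RESCHED so the task
           * gets preempted at ret-to-kernel, or on leaving its current
           * non-preemptible section.
           */
          if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
                  set_tsk_need_resched(curr);
  }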

Arguably this provides for much more consistent maximum latency (~2 tick
lengths + length of non-preemptible section) as compared to the old model
where the maximum latency depended on the dynamic distribution of
cond_resched() points.

(As a bonus it handles code that is preemptible but cannot call cond_resched()
 completely trivially: ex. long running Xen hypercalls, or this series
 which started this discussion:
 https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)


Status
==

What works:
 - The system seems to keep ticking over with the normal scheduling policies
   (SCHED_OTHER). The support for the realtime policies is somewhat more
   half-baked.
 - The basic performance numbers seem pretty close to the 6.6-rc7 baseline.

What's broken:
 - ARCH_NO_PREEMPT (See patch 45 "preempt: ARCH_NO_PREEMPT only preempts
   lazily")
 - Non-x86 architectures. It's trivial to support other archs (only need
   to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
   comments on the series.
   (From some testing on arm64, didn't find any surprises.)
 - livepatch: livepatch depends on using _cond_resched() to provide
   low-latency patching. That is obviously difficult with cond_resched()
   gone. We could get a similar effect by using a static_key in
   preempt_enable() (see the rough sketch after this list), but at least
   with inline locks, that might end up bloating the kernel quite a bit.
 - Documentation/ and comments mention cond_resched()
 - ftrace support for need-resched-lazy is incomplete
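
For the livepatch item, this is roughly the static_key idea (an assumption
on my part, not something implemented in this series; the key and hook
names are made up):

  /* flipped on only while a livepatch transition is in progress */
  DEFINE_STATIC_KEY_FALSE(klp_resched_key);

  #define preempt_enable_with_klp_hook()                          \
  do {                                                            \
          barrier();                                              \
          if (unlikely(preempt_count_dec_and_test()))             \
                  __preempt_schedule();                           \
          if (static_branch_unlikely(&klp_resched_key))           \
                  klp_try_switch_current();  /* hypothetical */   \
  } while (0)

The bloat concern is that with spinlock fastpaths inlined, every inlined
preempt_enable() site grows by that extra static branch.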

What needs more discussion:
 - Should cond_resched_lock() etc schedule out only for TIF_NEED_RESCHED,
   or for TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
   change to tif_need_resched(resched_t)")
 - Tracking whether a task is in userspace or in the kernel (See patch 40
   "context_tracking: add ct_state_cpu()")
 - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
   preemption")


Performance
==

Expectation:

* perf sched bench pipe

preemption               full           none

6.6-rc7              6.68 +- 0.10   6.69 +- 0.07
+series              6.69 +- 0.12   6.67 +- 0.10

This is rescheduling out of idle which should and does perform identically.

* schbench, preempt=none

  * 1 group, 16 threads each

                 6.6-rc7      +series  
                 (usecs)      (usecs)
     50.0th:         6            6  
     90.0th:         8            7  
     99.0th:        11           11  
     99.9th:        15           14  
  
  * 8 groups, 16 threads each

                6.6-rc7       +series  
                 (usecs)      (usecs)
     50.0th:         6            6  
     90.0th:         8            8  
     99.0th:        12           11  
     99.9th:        20           21  


* schbench, preempt=full

  * 1 group, 16 threads each

                6.6-rc7       +series  
                (usecs)       (usecs)
     50.0th:         6            6  
     90.0th:         8            7  
     99.0th:        11           11  
     99.9th:        14           14  


  * 8 groups, 16 threads each

                6.6-rc7       +series  
                (usecs)       (usecs)
     50.0th:         7            7  
     90.0th:         9            9  
     99.0th:        12           12  
     99.9th:        21           22  


  Not much in it either way.

* kernbench, preempt=full

  * half-load (-j 128)

           6.6-rc7                                    +series                     

  wall        149.2  +-     27.2             wall        132.8  +-     0.4
  utime      8097.1  +-     57.4             utime      8088.5  +-    14.1
  stime      1165.5  +-      9.4             stime      1159.2  +-     1.9
  %cpu       6337.6  +-   1072.8             %cpu       6959.6  +-    22.8
  csw      237618    +-   2190.6             csw      240343    +-  1386.8


  * optimal-load (-j 1024)

           6.6-rc7                                    +series                     

  wall        137.8 +-       0.0             wall       137.7  +-       0.8
  utime     11115.0 +-    3306.1             utime    11041.7  +-    3235.0
  stime      1340.0 +-     191.3             stime     1323.1  +-     179.5
  %cpu       8846.3 +-    2830.6             %cpu      9101.3  +-    2346.7
  csw     2099910   +- 2040080.0             csw    2068210    +- 2002450.0


  The preempt=full path should effectively not see any change in
  behaviour. The optimal-loads are pretty much identical.
  For the half-load, however, the +series version does much better, but that
  seems to be because of much higher run-to-run variability in the 6.6-rc7 runs.

* kernbench, preempt=none

  * half-load (-j 128)

           6.6-rc7                                    +series                     

  wall        134.5  +-      4.2             wall        133.6  +-     2.7
  utime      8093.3  +-     39.3             utime      8099.0  +-    38.9
  stime      1175.7  +-     10.6             stime      1169.1  +-     8.4
  %cpu       6893.3  +-    233.2             %cpu       6936.3  +-   142.8
  csw      240723    +-    423.0             csw      173152    +-  1126.8
                                             

  * optimal-load (-j 1024)

           6.6-rc7                                    +series                     

  wall        139.2 +-       0.3             wall       138.8  +-       0.2
  utime     11161.0 +-    3360.4             utime    11061.2  +-    3244.9
  stime      1357.6 +-     199.3             stime     1366.6  +-     216.3
  %cpu       9108.8 +-    2431.4             %cpu      9081.0  +-    2351.1
  csw     2078599   +- 2013320.0             csw    1970610    +- 1969030.0


  For both of these, the wallclock, utime, stime, etc. are pretty much
  identical. The one interesting difference is that the number of
  context switches is lower. This intuitively makes sense given that
  we reschedule threads lazily rather than rescheduling as soon as we
  encounter a cond_resched() while there's a thread waiting to be scheduled.

  The max-load numbers (not posted here) also behave similarly.


Series
==

With that, this is how the series is laid out:

 - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of the infrastructure
   it uses is built around static_calls(), and this is a simpler approach which
   doesn't need any of that (and does away with cond_resched().)

   Some of the reverted commits will be resurrected later:
       089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
       cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
       5693fa74f98a ("kcsan: Use preemption model accessors")

 - Patches 31-45: contain the scheduler changes to do this. Of these
   the critical ones are:
     patch 35 "thread_info: change to tif_need_resched(resched_t)"
     patch 41 "sched: handle resched policy in resched_curr()"
     patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models"
     patch 44 "sched: voluntary preemption"
      (this needs more work to decide when a higher sched-policy task
       should preempt a lower sched-policy task)
     patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily"

 - Patches 47-50: contain RCU related changes. RCU now works in both
   PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION.
   (Until now PREEMPTION=y => PREEMPT_RCU)

 - Patches 51-56,86: contain cond_resched() related cleanups.
     patch 54 "sched: add cond_resched_stall()" adds a new cond_resched()
     interface. Pitchforks?

 - Patches 57-86: remove cond_resched() from the tree.


Also at: github.com/terminus/linux preemption-rfc


Please review.

Thanks
Ankur

[1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/


Ankur Arora (86):
  Revert "riscv: support PREEMPT_DYNAMIC with static keys"
  Revert "sched/core: Make sched_dynamic_mutex static"
  Revert "ftrace: Use preemption model accessors for trace header
    printout"
  Revert "preempt/dynamic: Introduce preemption model accessors"
  Revert "kcsan: Use preemption model accessors"
  Revert "entry: Fix compile error in
    dynamic_irqentry_exit_cond_resched()"
  Revert "livepatch,sched: Add livepatch task switching to
    cond_resched()"
  Revert "arm64: Support PREEMPT_DYNAMIC"
  Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
  Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
    GENERIC_ENTRY"
  Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
  Revert "sched/preempt: Refactor sched_dynamic_update()"
  Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
  Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
  Revert "preempt: Restore preemption model selection configs"
  Revert "sched: Provide Kconfig support for default dynamic preempt
    mode"
  sched/preempt: remove PREEMPT_DYNAMIC from the build version
  Revert "preempt/dynamic: Fix typo in macro conditional statement"
  Revert "sched,preempt: Move preempt_dynamic to debug.c"
  Revert "static_call: Relax static_call_update() function argument
    type"
  Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
  Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
  Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
  Revert "sched: Harden PREEMPT_DYNAMIC"
  Revert "sched: Add /debug/sched_preempt"
  Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
    option"
  Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
    call"
  Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
    calls"
  Revert "preempt/dynamic: Provide cond_resched() and might_resched()
    static calls"
  Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
  x86/thread_info: add TIF_NEED_RESCHED_LAZY
  entry: handle TIF_NEED_RESCHED_LAZY
  entry/kvm: handle TIF_NEED_RESCHED_LAZY
  thread_info: accessors for TIF_NEED_RESCHED*
  thread_info: change to tif_need_resched(resched_t)
  entry: irqentry_exit only preempts TIF_NEED_RESCHED
  sched: make test_*_tsk_thread_flag() return bool
  sched: *_tsk_need_resched() now takes resched_t
  sched: handle lazy resched in set_nr_*_polling()
  context_tracking: add ct_state_cpu()
  sched: handle resched policy in resched_curr()
  sched: force preemption on tick expiration
  sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
  sched: voluntary preemption
  preempt: ARCH_NO_PREEMPT only preempts lazily
  tracing: handle lazy resched
  rcu: select PREEMPT_RCU if PREEMPT
  rcu: handle quiescent states for PREEMPT_RCU=n
  osnoise: handle quiescent states directly
  rcu: TASKS_RCU does not need to depend on PREEMPTION
  preempt: disallow !PREEMPT_COUNT or !PREEMPTION
  sched: remove CONFIG_PREEMPTION from *_needbreak()
  sched: fixup __cond_resched_*()
  sched: add cond_resched_stall()
  xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
  xarray: use cond_resched_xas*()
  coccinelle: script to remove cond_resched()
  treewide: x86: remove cond_resched()
  treewide: rcu: remove cond_resched()
  treewide: torture: remove cond_resched()
  treewide: bpf: remove cond_resched()
  treewide: trace: remove cond_resched()
  treewide: futex: remove cond_resched()
  treewide: printk: remove cond_resched()
  treewide: task_work: remove cond_resched()
  treewide: kernel: remove cond_resched()
  treewide: kernel: remove cond_reshed()
  treewide: mm: remove cond_resched()
  treewide: io_uring: remove cond_resched()
  treewide: ipc: remove cond_resched()
  treewide: lib: remove cond_resched()
  treewide: crypto: remove cond_resched()
  treewide: security: remove cond_resched()
  treewide: fs: remove cond_resched()
  treewide: virt: remove cond_resched()
  treewide: block: remove cond_resched()
  treewide: netfilter: remove cond_resched()
  treewide: net: remove cond_resched()
  treewide: net: remove cond_resched()
  treewide: sound: remove cond_resched()
  treewide: md: remove cond_resched()
  treewide: mtd: remove cond_resched()
  treewide: drm: remove cond_resched()
  treewide: net: remove cond_resched()
  treewide: drivers: remove cond_resched()
  sched: remove cond_resched()

 .../admin-guide/kernel-parameters.txt         |   7 -
 arch/Kconfig                                  |  42 +-
 arch/arm64/Kconfig                            |   1 -
 arch/arm64/include/asm/preempt.h              |  19 +-
 arch/arm64/kernel/entry-common.c              |  10 +-
 arch/riscv/Kconfig                            |   1 -
 arch/s390/include/asm/preempt.h               |   4 +-
 arch/x86/Kconfig                              |   1 -
 arch/x86/include/asm/preempt.h                |  50 +-
 arch/x86/include/asm/thread_info.h            |   6 +-
 arch/x86/kernel/alternative.c                 |  10 -
 arch/x86/kernel/cpu/sgx/encl.c                |  14 +-
 arch/x86/kernel/cpu/sgx/ioctl.c               |   3 -
 arch/x86/kernel/cpu/sgx/main.c                |   5 -
 arch/x86/kernel/cpu/sgx/virt.c                |   4 -
 arch/x86/kvm/lapic.c                          |   6 +-
 arch/x86/kvm/mmu/mmu.c                        |   2 +-
 arch/x86/kvm/svm/sev.c                        |   5 +-
 arch/x86/net/bpf_jit_comp.c                   |   1 -
 arch/x86/net/bpf_jit_comp32.c                 |   1 -
 arch/x86/xen/mmu_pv.c                         |   1 -
 block/blk-cgroup.c                            |   2 -
 block/blk-lib.c                               |  11 -
 block/blk-mq.c                                |   3 -
 block/blk-zoned.c                             |   6 -
 crypto/internal.h                             |   2 +-
 crypto/tcrypt.c                               |   5 -
 crypto/testmgr.c                              |  10 -
 drivers/accel/ivpu/ivpu_drv.c                 |   2 -
 drivers/accel/ivpu/ivpu_gem.c                 |   1 -
 drivers/accel/ivpu/ivpu_pm.c                  |   8 +-
 drivers/accel/qaic/qaic_data.c                |   2 -
 drivers/acpi/processor_idle.c                 |   2 +-
 drivers/auxdisplay/charlcd.c                  |  11 -
 drivers/base/power/domain.c                   |   1 -
 drivers/block/aoe/aoecmd.c                    |   3 +-
 drivers/block/brd.c                           |   1 -
 drivers/block/drbd/drbd_bitmap.c              |   4 -
 drivers/block/drbd/drbd_debugfs.c             |   1 -
 drivers/block/loop.c                          |   3 -
 drivers/block/xen-blkback/blkback.c           |   3 -
 drivers/block/zram/zram_drv.c                 |   2 -
 drivers/bluetooth/virtio_bt.c                 |   1 -
 drivers/char/hw_random/arm_smccc_trng.c       |   1 -
 drivers/char/lp.c                             |   2 -
 drivers/char/mem.c                            |   4 -
 drivers/char/mwave/3780i.c                    |   4 +-
 drivers/char/ppdev.c                          |   4 -
 drivers/char/random.c                         |   2 -
 drivers/char/virtio_console.c                 |   1 -
 drivers/crypto/virtio/virtio_crypto_core.c    |   1 -
 drivers/cxl/pci.c                             |   1 -
 drivers/dma-buf/selftest.c                    |   1 -
 drivers/dma-buf/st-dma-fence-chain.c          |   1 -
 drivers/fsi/fsi-sbefifo.c                     |  14 +-
 drivers/gpu/drm/bridge/samsung-dsim.c         |   2 +-
 drivers/gpu/drm/drm_buddy.c                   |   1 -
 drivers/gpu/drm/drm_gem.c                     |   1 -
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c    |   1 -
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c     |   2 -
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |   6 -
 .../drm/i915/gem/selftests/i915_gem_mman.c    |   5 -
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c   |   2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c            |   2 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c       |   4 -
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |   4 -
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   2 -
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |   2 -
 drivers/gpu/drm/i915/gt/selftest_migrate.c    |   2 -
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |   4 -
 drivers/gpu/drm/i915/i915_active.c            |   2 +-
 drivers/gpu/drm/i915/i915_gem_evict.c         |   2 -
 drivers/gpu/drm/i915/i915_gpu_error.c         |  18 +-
 drivers/gpu/drm/i915/intel_uncore.c           |   1 -
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 -
 drivers/gpu/drm/i915/selftests/i915_request.c |   2 -
 .../gpu/drm/i915/selftests/i915_selftest.c    |   3 -
 drivers/gpu/drm/i915/selftests/i915_vma.c     |   9 -
 .../gpu/drm/i915/selftests/igt_flush_test.c   |   2 -
 .../drm/i915/selftests/intel_memory_region.c  |   4 -
 drivers/gpu/drm/tests/drm_buddy_test.c        |   5 -
 drivers/gpu/drm/tests/drm_mm_test.c           |  29 -
 drivers/i2c/busses/i2c-bcm-iproc.c            |   9 +-
 drivers/i2c/busses/i2c-highlander.c           |   9 +-
 drivers/i2c/busses/i2c-ibm_iic.c              |  11 +-
 drivers/i2c/busses/i2c-mpc.c                  |   2 +-
 drivers/i2c/busses/i2c-mxs.c                  |   9 +-
 drivers/i2c/busses/scx200_acb.c               |   9 +-
 drivers/infiniband/core/umem.c                |   1 -
 drivers/infiniband/hw/hfi1/driver.c           |   1 -
 drivers/infiniband/hw/hfi1/firmware.c         |   2 +-
 drivers/infiniband/hw/hfi1/init.c             |   1 -
 drivers/infiniband/hw/hfi1/ruc.c              |   1 -
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c    |   5 +-
 drivers/infiniband/hw/qib/qib_init.c          |   1 -
 drivers/infiniband/sw/rxe/rxe_qp.c            |   3 +-
 drivers/infiniband/sw/rxe/rxe_task.c          |   4 +-
 drivers/input/evdev.c                         |   1 -
 drivers/input/keyboard/clps711x-keypad.c      |   2 +-
 drivers/input/misc/uinput.c                   |   1 -
 drivers/input/mousedev.c                      |   1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   2 -
 drivers/md/bcache/btree.c                     |   5 -
 drivers/md/bcache/journal.c                   |   2 -
 drivers/md/bcache/sysfs.c                     |   1 -
 drivers/md/bcache/writeback.c                 |   2 -
 drivers/md/dm-bufio.c                         |  14 -
 drivers/md/dm-cache-target.c                  |   4 -
 drivers/md/dm-crypt.c                         |   3 -
 drivers/md/dm-integrity.c                     |   3 -
 drivers/md/dm-kcopyd.c                        |   2 -
 drivers/md/dm-snap.c                          |   1 -
 drivers/md/dm-stats.c                         |   8 -
 drivers/md/dm-thin.c                          |   2 -
 drivers/md/dm-writecache.c                    |  11 -
 drivers/md/dm.c                               |   4 -
 drivers/md/md.c                               |   1 -
 drivers/md/raid1.c                            |   2 -
 drivers/md/raid10.c                           |   3 -
 drivers/md/raid5.c                            |   2 -
 drivers/media/i2c/vpx3220.c                   |   3 -
 drivers/media/pci/cobalt/cobalt-i2c.c         |   4 +-
 drivers/misc/bcm-vk/bcm_vk_dev.c              |   3 +-
 drivers/misc/bcm-vk/bcm_vk_msg.c              |   3 +-
 drivers/misc/genwqe/card_base.c               |   3 +-
 drivers/misc/genwqe/card_ddcb.c               |   6 -
 drivers/misc/genwqe/card_dev.c                |   2 -
 drivers/misc/vmw_balloon.c                    |   4 -
 drivers/mmc/host/mmc_spi.c                    |   3 -
 drivers/mtd/chips/cfi_cmdset_0001.c           |   6 -
 drivers/mtd/chips/cfi_cmdset_0002.c           |   1 -
 drivers/mtd/chips/cfi_util.c                  |   2 +-
 drivers/mtd/devices/spear_smi.c               |   2 +-
 drivers/mtd/devices/sst25l.c                  |   3 +-
 drivers/mtd/devices/st_spi_fsm.c              |   4 -
 drivers/mtd/inftlcore.c                       |   5 -
 drivers/mtd/lpddr/lpddr_cmds.c                |   6 +-
 drivers/mtd/mtd_blkdevs.c                     |   1 -
 drivers/mtd/nand/onenand/onenand_base.c       |  18 +-
 drivers/mtd/nand/onenand/onenand_samsung.c    |   8 +-
 drivers/mtd/nand/raw/diskonchip.c             |   4 +-
 drivers/mtd/nand/raw/fsmc_nand.c              |   3 +-
 drivers/mtd/nand/raw/hisi504_nand.c           |   2 +-
 drivers/mtd/nand/raw/nand_base.c              |   3 +-
 drivers/mtd/nand/raw/nand_legacy.c            |  17 +-
 drivers/mtd/spi-nor/core.c                    |   8 +-
 drivers/mtd/tests/mtd_test.c                  |   2 -
 drivers/mtd/tests/mtd_test.h                  |   2 +-
 drivers/mtd/tests/pagetest.c                  |   1 -
 drivers/mtd/tests/readtest.c                  |   2 -
 drivers/mtd/tests/torturetest.c               |   1 -
 drivers/mtd/ubi/attach.c                      |  10 -
 drivers/mtd/ubi/build.c                       |   2 -
 drivers/mtd/ubi/cdev.c                        |   4 -
 drivers/mtd/ubi/eba.c                         |   8 -
 drivers/mtd/ubi/misc.c                        |   2 -
 drivers/mtd/ubi/vtbl.c                        |   6 -
 drivers/mtd/ubi/wl.c                          |  13 -
 drivers/net/dummy.c                           |   1 -
 drivers/net/ethernet/broadcom/tg3.c           |   2 +-
 drivers/net/ethernet/intel/e1000/e1000_hw.c   |   3 -
 drivers/net/ethernet/mediatek/mtk_eth_soc.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx4/catas.c    |   2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c      |  13 +-
 .../ethernet/mellanox/mlx4/resource_tracker.c |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   3 +-
 drivers/net/ethernet/mellanox/mlxsw/i2c.c     |   5 -
 drivers/net/ethernet/mellanox/mlxsw/pci.c     |   2 -
 drivers/net/ethernet/pasemi/pasemi_mac.c      |   3 -
 .../ethernet/qlogic/netxen/netxen_nic_init.c  |   2 -
 .../ethernet/qlogic/qlcnic/qlcnic_83xx_init.c |   1 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_init.c  |   1 -
 .../ethernet/qlogic/qlcnic/qlcnic_minidump.c  |   2 -
 drivers/net/ethernet/sfc/falcon/falcon.c      |   6 -
 drivers/net/ifb.c                             |   1 -
 drivers/net/ipvlan/ipvlan_core.c              |   1 -
 drivers/net/macvlan.c                         |   2 -
 drivers/net/mhi_net.c                         |   4 +-
 drivers/net/netdevsim/fib.c                   |   1 -
 drivers/net/virtio_net.c                      |   2 -
 drivers/net/wireguard/ratelimiter.c           |   2 -
 drivers/net/wireguard/receive.c               |   3 -
 drivers/net/wireguard/send.c                  |   4 -
 drivers/net/wireless/broadcom/b43/lo.c        |   6 +-
 drivers/net/wireless/broadcom/b43/pio.c       |   1 -
 drivers/net/wireless/broadcom/b43legacy/phy.c |   5 -
 .../broadcom/brcm80211/brcmfmac/cfg80211.c    |   1 -
 drivers/net/wireless/cisco/airo.c             |   2 -
 .../net/wireless/intel/iwlwifi/pcie/trans.c   |   2 -
 drivers/net/wireless/marvell/mwl8k.c          |   2 -
 drivers/net/wireless/mediatek/mt76/util.c     |   1 -
 drivers/net/wwan/mhi_wwan_mbim.c              |   2 +-
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c    |   3 -
 drivers/net/xen-netback/netback.c             |   1 -
 drivers/net/xen-netback/rx.c                  |   2 -
 drivers/nvdimm/btt.c                          |   2 -
 drivers/nvme/target/zns.c                     |   2 -
 drivers/parport/parport_ip32.c                |   1 -
 drivers/parport/parport_pc.c                  |   4 -
 drivers/pci/pci-sysfs.c                       |   1 -
 drivers/pci/proc.c                            |   1 -
 .../intel/speed_select_if/isst_if_mbox_pci.c  |   4 +-
 drivers/s390/cio/css.c                        |   8 -
 drivers/scsi/NCR5380.c                        |   2 -
 drivers/scsi/megaraid.c                       |   1 -
 drivers/scsi/qedi/qedi_main.c                 |   1 -
 drivers/scsi/qla2xxx/qla_nx.c                 |   2 -
 drivers/scsi/qla2xxx/qla_sup.c                |   5 -
 drivers/scsi/qla4xxx/ql4_nx.c                 |   1 -
 drivers/scsi/xen-scsifront.c                  |   2 +-
 drivers/spi/spi-lantiq-ssc.c                  |   3 +-
 drivers/spi/spi-meson-spifc.c                 |   2 +-
 drivers/spi/spi.c                             |   2 +-
 drivers/staging/rtl8723bs/core/rtw_mlme_ext.c |   2 +-
 drivers/staging/rtl8723bs/core/rtw_pwrctrl.c  |   2 -
 drivers/tee/optee/ffa_abi.c                   |   1 -
 drivers/tee/optee/smc_abi.c                   |   1 -
 drivers/tty/hvc/hvc_console.c                 |   6 +-
 drivers/tty/tty_buffer.c                      |   3 -
 drivers/tty/tty_io.c                          |   1 -
 drivers/usb/gadget/udc/max3420_udc.c          |   1 -
 drivers/usb/host/max3421-hcd.c                |   2 +-
 drivers/usb/host/xen-hcd.c                    |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c           |   2 -
 drivers/vfio/vfio_iommu_type1.c               |   7 -
 drivers/vhost/vhost.c                         |   1 -
 drivers/video/console/vgacon.c                |   4 -
 drivers/virtio/virtio_mem.c                   |   8 -
 drivers/xen/balloon.c                         |   2 -
 drivers/xen/gntdev.c                          |   2 -
 drivers/xen/xen-scsiback.c                    |   9 +-
 fs/afs/write.c                                |   2 -
 fs/btrfs/backref.c                            |   6 -
 fs/btrfs/block-group.c                        |   3 -
 fs/btrfs/ctree.c                              |   1 -
 fs/btrfs/defrag.c                             |   1 -
 fs/btrfs/disk-io.c                            |   3 -
 fs/btrfs/extent-io-tree.c                     |   5 -
 fs/btrfs/extent-tree.c                        |   8 -
 fs/btrfs/extent_io.c                          |   9 -
 fs/btrfs/file-item.c                          |   1 -
 fs/btrfs/file.c                               |   4 -
 fs/btrfs/free-space-cache.c                   |   4 -
 fs/btrfs/inode.c                              |   9 -
 fs/btrfs/ordered-data.c                       |   2 -
 fs/btrfs/qgroup.c                             |   1 -
 fs/btrfs/reflink.c                            |   2 -
 fs/btrfs/relocation.c                         |   9 -
 fs/btrfs/scrub.c                              |   3 -
 fs/btrfs/send.c                               |   1 -
 fs/btrfs/space-info.c                         |   1 -
 fs/btrfs/tests/extent-io-tests.c              |   1 -
 fs/btrfs/transaction.c                        |   3 -
 fs/btrfs/tree-log.c                           |  12 -
 fs/btrfs/uuid-tree.c                          |   1 -
 fs/btrfs/volumes.c                            |   2 -
 fs/buffer.c                                   |   1 -
 fs/cachefiles/cache.c                         |   4 +-
 fs/cachefiles/namei.c                         |   1 -
 fs/cachefiles/volume.c                        |   1 -
 fs/ceph/addr.c                                |   1 -
 fs/dax.c                                      |  16 +-
 fs/dcache.c                                   |   2 -
 fs/dlm/ast.c                                  |   1 -
 fs/dlm/dir.c                                  |   2 -
 fs/dlm/lock.c                                 |   3 -
 fs/dlm/lowcomms.c                             |   3 -
 fs/dlm/recover.c                              |   1 -
 fs/drop_caches.c                              |   1 -
 fs/erofs/utils.c                              |   1 -
 fs/erofs/zdata.c                              |   8 +-
 fs/eventpoll.c                                |   3 -
 fs/exec.c                                     |   4 -
 fs/ext4/block_validity.c                      |   2 -
 fs/ext4/dir.c                                 |   1 -
 fs/ext4/extents.c                             |   1 -
 fs/ext4/ialloc.c                              |   1 -
 fs/ext4/inode.c                               |   1 -
 fs/ext4/mballoc.c                             |  12 +-
 fs/ext4/namei.c                               |   3 -
 fs/ext4/orphan.c                              |   1 -
 fs/ext4/super.c                               |   2 -
 fs/f2fs/checkpoint.c                          |  16 +-
 fs/f2fs/compress.c                            |   1 -
 fs/f2fs/data.c                                |   3 -
 fs/f2fs/dir.c                                 |   1 -
 fs/f2fs/extent_cache.c                        |   1 -
 fs/f2fs/f2fs.h                                |   6 +-
 fs/f2fs/file.c                                |   3 -
 fs/f2fs/node.c                                |   4 -
 fs/f2fs/super.c                               |   1 -
 fs/fat/fatent.c                               |   2 -
 fs/file.c                                     |   7 +-
 fs/fs-writeback.c                             |   3 -
 fs/gfs2/aops.c                                |   1 -
 fs/gfs2/bmap.c                                |   2 -
 fs/gfs2/glock.c                               |   2 +-
 fs/gfs2/log.c                                 |   1 -
 fs/gfs2/ops_fstype.c                          |   1 -
 fs/hpfs/buffer.c                              |   8 -
 fs/hugetlbfs/inode.c                          |   3 -
 fs/inode.c                                    |   3 -
 fs/iomap/buffered-io.c                        |   7 +-
 fs/jbd2/checkpoint.c                          |   2 -
 fs/jbd2/commit.c                              |   3 -
 fs/jbd2/recovery.c                            |   2 -
 fs/jffs2/build.c                              |   6 +-
 fs/jffs2/erase.c                              |   3 -
 fs/jffs2/gc.c                                 |   2 -
 fs/jffs2/nodelist.c                           |   1 -
 fs/jffs2/nodemgmt.c                           |  11 +-
 fs/jffs2/readinode.c                          |   2 -
 fs/jffs2/scan.c                               |   4 -
 fs/jffs2/summary.c                            |   2 -
 fs/jfs/jfs_txnmgr.c                           |  14 +-
 fs/libfs.c                                    |   5 +-
 fs/mbcache.c                                  |   1 -
 fs/namei.c                                    |   1 -
 fs/netfs/io.c                                 |   1 -
 fs/nfs/delegation.c                           |   3 -
 fs/nfs/pnfs.c                                 |   2 -
 fs/nfs/write.c                                |   4 -
 fs/nilfs2/btree.c                             |   1 -
 fs/nilfs2/inode.c                             |   1 -
 fs/nilfs2/page.c                              |   4 -
 fs/nilfs2/segment.c                           |   4 -
 fs/notify/fanotify/fanotify_user.c            |   1 -
 fs/notify/fsnotify.c                          |   1 -
 fs/ntfs/attrib.c                              |   3 -
 fs/ntfs/file.c                                |   2 -
 fs/ntfs3/file.c                               |   9 -
 fs/ntfs3/frecord.c                            |   2 -
 fs/ocfs2/alloc.c                              |   4 +-
 fs/ocfs2/cluster/tcp.c                        |   8 +-
 fs/ocfs2/dlm/dlmthread.c                      |   7 +-
 fs/ocfs2/file.c                               |  10 +-
 fs/proc/base.c                                |   1 -
 fs/proc/fd.c                                  |   1 -
 fs/proc/kcore.c                               |   1 -
 fs/proc/page.c                                |   6 -
 fs/proc/task_mmu.c                            |   7 -
 fs/quota/dquot.c                              |   1 -
 fs/reiserfs/journal.c                         |   2 -
 fs/select.c                                   |   1 -
 fs/smb/client/file.c                          |   2 -
 fs/splice.c                                   |   1 -
 fs/ubifs/budget.c                             |   1 -
 fs/ubifs/commit.c                             |   1 -
 fs/ubifs/debug.c                              |   5 -
 fs/ubifs/dir.c                                |   1 -
 fs/ubifs/gc.c                                 |   5 -
 fs/ubifs/io.c                                 |   2 -
 fs/ubifs/lprops.c                             |   2 -
 fs/ubifs/lpt_commit.c                         |   3 -
 fs/ubifs/orphan.c                             |   1 -
 fs/ubifs/recovery.c                           |   4 -
 fs/ubifs/replay.c                             |   7 -
 fs/ubifs/scan.c                               |   2 -
 fs/ubifs/shrinker.c                           |   1 -
 fs/ubifs/super.c                              |   2 -
 fs/ubifs/tnc_commit.c                         |   2 -
 fs/ubifs/tnc_misc.c                           |   1 -
 fs/userfaultfd.c                              |   9 -
 fs/verity/enable.c                            |   1 -
 fs/verity/read_metadata.c                     |   1 -
 fs/xfs/scrub/common.h                         |   7 -
 fs/xfs/scrub/xfarray.c                        |   7 -
 fs/xfs/xfs_aops.c                             |   1 -
 fs/xfs/xfs_icache.c                           |   2 -
 fs/xfs/xfs_iwalk.c                            |   1 -
 include/asm-generic/preempt.h                 |  18 +-
 include/linux/console.h                       |   2 +-
 include/linux/context_tracking_state.h        |  21 +
 include/linux/entry-common.h                  |  19 +-
 include/linux/entry-kvm.h                     |   2 +-
 include/linux/kernel.h                        |  32 +-
 include/linux/livepatch.h                     |   1 -
 include/linux/livepatch_sched.h               |  29 -
 include/linux/preempt.h                       |  44 +-
 include/linux/rcupdate.h                      |  10 +-
 include/linux/rcutree.h                       |   2 +-
 include/linux/sched.h                         | 153 ++----
 include/linux/sched/cond_resched.h            |   1 -
 include/linux/sched/idle.h                    |   8 +-
 include/linux/thread_info.h                   |  29 +-
 include/linux/trace_events.h                  |   6 +-
 include/linux/vermagic.h                      |   2 +-
 include/linux/xarray.h                        |  14 +
 init/Makefile                                 |   3 +-
 io_uring/io-wq.c                              |   4 +-
 io_uring/io_uring.c                           |  21 +-
 io_uring/kbuf.c                               |   2 -
 io_uring/sqpoll.c                             |   6 +-
 io_uring/tctx.c                               |   4 +-
 ipc/msgutil.c                                 |   3 -
 ipc/sem.c                                     |   2 -
 kernel/Kconfig.preempt                        |  70 +--
 kernel/auditsc.c                              |   2 -
 kernel/bpf/Kconfig                            |   2 +-
 kernel/bpf/arraymap.c                         |   3 -
 kernel/bpf/bpf_iter.c                         |   7 +-
 kernel/bpf/btf.c                              |   9 -
 kernel/bpf/cpumap.c                           |   2 -
 kernel/bpf/hashtab.c                          |   7 -
 kernel/bpf/syscall.c                          |   3 -
 kernel/bpf/verifier.c                         |   5 -
 kernel/cgroup/rstat.c                         |   3 +-
 kernel/dma/debug.c                            |   2 -
 kernel/entry/common.c                         |  32 +-
 kernel/entry/kvm.c                            |   4 +-
 kernel/events/core.c                          |   2 +-
 kernel/futex/core.c                           |   6 +-
 kernel/futex/pi.c                             |   6 +-
 kernel/futex/requeue.c                        |   1 -
 kernel/futex/waitwake.c                       |   2 +-
 kernel/gcov/base.c                            |   1 -
 kernel/hung_task.c                            |   6 +-
 kernel/kallsyms.c                             |   4 +-
 kernel/kcsan/kcsan_test.c                     |   5 +-
 kernel/kexec_core.c                           |   6 -
 kernel/kthread.c                              |   1 -
 kernel/livepatch/core.c                       |   1 -
 kernel/livepatch/transition.c                 | 107 +---
 kernel/locking/test-ww_mutex.c                |   4 +-
 kernel/module/main.c                          |   1 -
 kernel/printk/printk.c                        |  65 +--
 kernel/ptrace.c                               |   2 -
 kernel/rcu/Kconfig                            |   4 +-
 kernel/rcu/rcuscale.c                         |   2 -
 kernel/rcu/rcutorture.c                       |   8 +-
 kernel/rcu/tasks.h                            |   5 +-
 kernel/rcu/tree.c                             |   4 +-
 kernel/rcu/tree_exp.h                         |   4 +-
 kernel/rcu/tree_plugin.h                      |   7 +-
 kernel/rcu/tree_stall.h                       |   2 +-
 kernel/scftorture.c                           |   1 -
 kernel/sched/core.c                           | 497 +++++-------------
 kernel/sched/core_sched.c                     |   2 +-
 kernel/sched/deadline.c                       |  26 +-
 kernel/sched/debug.c                          |  67 +--
 kernel/sched/fair.c                           |  54 +-
 kernel/sched/features.h                       |  18 +
 kernel/sched/idle.c                           |   6 +-
 kernel/sched/rt.c                             |  35 +-
 kernel/sched/sched.h                          |   9 +-
 kernel/softirq.c                              |   1 -
 kernel/stop_machine.c                         |   2 +-
 kernel/task_work.c                            |   1 -
 kernel/torture.c                              |   1 -
 kernel/trace/Kconfig                          |   4 +-
 kernel/trace/ftrace.c                         |   4 -
 kernel/trace/ring_buffer.c                    |   4 -
 kernel/trace/ring_buffer_benchmark.c          |  13 -
 kernel/trace/trace.c                          |  29 +-
 kernel/trace/trace_events.c                   |   1 -
 kernel/trace/trace_osnoise.c                  |  37 +-
 kernel/trace/trace_output.c                   |  16 +-
 kernel/trace/trace_selftest.c                 |   9 -
 kernel/workqueue.c                            |  10 -
 lib/crc32test.c                               |   2 -
 lib/crypto/mpi/mpi-pow.c                      |   1 -
 lib/memcpy_kunit.c                            |   5 -
 lib/random32.c                                |   1 -
 lib/rhashtable.c                              |   2 -
 lib/test_bpf.c                                |   3 -
 lib/test_lockup.c                             |   2 +-
 lib/test_maple_tree.c                         |   8 -
 lib/test_rhashtable.c                         |  10 -
 mm/backing-dev.c                              |   8 +-
 mm/compaction.c                               |  23 +-
 mm/damon/paddr.c                              |   1 -
 mm/dmapool_test.c                             |   2 -
 mm/filemap.c                                  |  11 +-
 mm/gup.c                                      |   1 -
 mm/huge_memory.c                              |   3 -
 mm/hugetlb.c                                  |  12 -
 mm/hugetlb_cgroup.c                           |   1 -
 mm/kasan/quarantine.c                         |   6 +-
 mm/kfence/kfence_test.c                       |  22 +-
 mm/khugepaged.c                               |  10 +-
 mm/kmemleak.c                                 |   8 -
 mm/ksm.c                                      |  21 +-
 mm/madvise.c                                  |   3 -
 mm/memcontrol.c                               |   4 -
 mm/memfd.c                                    |  10 +-
 mm/memory-failure.c                           |   1 -
 mm/memory.c                                   |  12 +-
 mm/memory_hotplug.c                           |   6 -
 mm/mempolicy.c                                |   1 -
 mm/migrate.c                                  |   6 -
 mm/mincore.c                                  |   1 -
 mm/mlock.c                                    |   2 -
 mm/mm_init.c                                  |  13 +-
 mm/mmap.c                                     |   1 -
 mm/mmu_gather.c                               |   2 -
 mm/mprotect.c                                 |   1 -
 mm/mremap.c                                   |   1 -
 mm/nommu.c                                    |   1 -
 mm/page-writeback.c                           |   6 +-
 mm/page_alloc.c                               |  13 +-
 mm/page_counter.c                             |   1 -
 mm/page_ext.c                                 |   1 -
 mm/page_idle.c                                |   2 -
 mm/page_io.c                                  |   2 -
 mm/page_owner.c                               |   1 -
 mm/percpu.c                                   |   5 -
 mm/rmap.c                                     |   2 -
 mm/shmem.c                                    |  19 +-
 mm/shuffle.c                                  |   6 +-
 mm/slab.c                                     |   3 -
 mm/swap_cgroup.c                              |   4 -
 mm/swapfile.c                                 |  14 -
 mm/truncate.c                                 |   4 -
 mm/userfaultfd.c                              |   3 -
 mm/util.c                                     |   1 -
 mm/vmalloc.c                                  |   5 -
 mm/vmscan.c                                   |  29 +-
 mm/vmstat.c                                   |   4 -
 mm/workingset.c                               |   1 -
 mm/z3fold.c                                   |  15 +-
 mm/zsmalloc.c                                 |   1 -
 mm/zswap.c                                    |   1 -
 net/batman-adv/tp_meter.c                     |   2 -
 net/bpf/test_run.c                            |   1 -
 net/bridge/br_netlink.c                       |   1 -
 net/core/dev.c                                |   4 -
 net/core/neighbour.c                          |   1 -
 net/core/net_namespace.c                      |   1 -
 net/core/netclassid_cgroup.c                  |   1 -
 net/core/rtnetlink.c                          |   1 -
 net/core/sock.c                               |   2 -
 net/ipv4/inet_connection_sock.c               |   3 -
 net/ipv4/inet_diag.c                          |   1 -
 net/ipv4/inet_hashtables.c                    |   1 -
 net/ipv4/inet_timewait_sock.c                 |   1 -
 net/ipv4/inetpeer.c                           |   1 -
 net/ipv4/netfilter/arp_tables.c               |   2 -
 net/ipv4/netfilter/ip_tables.c                |   3 -
 net/ipv4/nexthop.c                            |   1 -
 net/ipv4/tcp_ipv4.c                           |   2 -
 net/ipv4/udp.c                                |   2 -
 net/ipv6/fib6_rules.c                         |   1 -
 net/ipv6/netfilter/ip6_tables.c               |   2 -
 net/ipv6/udp.c                                |   2 -
 net/mptcp/mptcp_diag.c                        |   2 -
 net/mptcp/pm_netlink.c                        |   5 -
 net/mptcp/protocol.c                          |   1 -
 net/netfilter/ipset/ip_set_core.c             |   1 -
 net/netfilter/ipvs/ip_vs_est.c                |   3 -
 net/netfilter/nf_conncount.c                  |   2 -
 net/netfilter/nf_conntrack_core.c             |   3 -
 net/netfilter/nf_conntrack_ecache.c           |   3 -
 net/netfilter/nf_tables_api.c                 |   2 -
 net/netfilter/nft_set_rbtree.c                |   2 -
 net/netfilter/x_tables.c                      |   3 +-
 net/netfilter/xt_hashlimit.c                  |   1 -
 net/netlink/af_netlink.c                      |   1 -
 net/rds/ib_recv.c                             |   2 -
 net/rds/tcp.c                                 |   2 +-
 net/rds/threads.c                             |   1 -
 net/rxrpc/call_object.c                       |   2 +-
 net/sched/sch_api.c                           |   3 -
 net/sctp/socket.c                             |   1 -
 net/socket.c                                  |   2 -
 net/sunrpc/cache.c                            |  11 +-
 net/sunrpc/sched.c                            |   2 +-
 net/sunrpc/svc_xprt.c                         |   1 -
 net/sunrpc/xprtsock.c                         |   2 -
 net/tipc/core.c                               |   2 +-
 net/tipc/topsrv.c                             |   3 -
 net/unix/af_unix.c                            |   5 +-
 net/x25/af_x25.c                              |   1 -
 scripts/coccinelle/api/cond_resched.cocci     |  53 ++
 security/keys/gc.c                            |   1 -
 security/landlock/fs.c                        |   1 -
 security/selinux/ss/hashtab.h                 |   2 -
 security/selinux/ss/policydb.c                |   6 -
 security/selinux/ss/services.c                |   1 -
 security/selinux/ss/sidtab.c                  |   1 -
 sound/arm/aaci.c                              |   2 +-
 sound/core/seq/seq_virmidi.c                  |   2 -
 sound/hda/hdac_controller.c                   |   1 -
 sound/isa/sb/emu8000_patch.c                  |   5 -
 sound/isa/sb/emu8000_pcm.c                    |   2 +-
 sound/isa/wss/wss_lib.c                       |   1 -
 sound/pci/echoaudio/echoaudio_dsp.c           |   2 -
 sound/pci/ens1370.c                           |   1 -
 sound/pci/es1968.c                            |   2 +-
 sound/pci/lola/lola.c                         |   1 -
 sound/pci/mixart/mixart_hwdep.c               |   2 +-
 sound/pci/pcxhr/pcxhr_core.c                  |   5 -
 sound/pci/vx222/vx222_ops.c                   |   2 -
 sound/x86/intel_hdmi_audio.c                  |   1 -
 virt/kvm/pfncache.c                           |   2 -
 596 files changed, 881 insertions(+), 2813 deletions(-)
 delete mode 100644 include/linux/livepatch_sched.h
 delete mode 100644 include/linux/sched/cond_resched.h
 create mode 100644 scripts/coccinelle/api/cond_resched.cocci

-- 
2.31.1



* [RFC PATCH 01/86] Revert "riscv: support PREEMPT_DYNAMIC with static keys"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 02/86] Revert "sched/core: Make sched_dynamic_mutex static" Ankur Arora
                   ` (61 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

As a consequence of making the scheduler responsible for driving
voluntary preemption, we can do away with explicit preemption points.

This means that most of the CONFIG_PREEMPT_DYNAMIC logic, which uses
static calls to switch between varieties of preemption points,
can be removed.

This reverts commit 4e90d0522a688371402ced1d1958ee7381b81f05.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/riscv/Kconfig            |  1 -
 include/asm-generic/preempt.h | 14 +-------------
 2 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 9c48fecc6719..4003436e6dad 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -134,7 +134,6 @@ config RISCV
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
-	select HAVE_PREEMPT_DYNAMIC_KEY if !XIP_KERNEL
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RETHOOK if !XIP_KERNEL
 	select HAVE_RSEQ
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 51f8f3881523..b4d43a4af5f7 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -80,21 +80,9 @@ static __always_inline bool should_resched(int preempt_offset)
 
 #ifdef CONFIG_PREEMPTION
 extern asmlinkage void preempt_schedule(void);
-extern asmlinkage void preempt_schedule_notrace(void);
-
-#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-
-void dynamic_preempt_schedule(void);
-void dynamic_preempt_schedule_notrace(void);
-#define __preempt_schedule()		dynamic_preempt_schedule()
-#define __preempt_schedule_notrace()	dynamic_preempt_schedule_notrace()
-
-#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
-
 #define __preempt_schedule() preempt_schedule()
+extern asmlinkage void preempt_schedule_notrace(void);
 #define __preempt_schedule_notrace() preempt_schedule_notrace()
-
-#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
 #endif /* CONFIG_PREEMPTION */
 
 #endif /* __ASM_PREEMPT_H */
-- 
2.31.1



* [RFC PATCH 02/86] Revert "sched/core: Make sched_dynamic_mutex static"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 01/86] Revert "riscv: support PREEMPT_DYNAMIC with static keys" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 23:04   ` Steven Rostedt
  2023-11-07 21:56 ` [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout" Ankur Arora
                   ` (60 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 9b8e17813aeccc29c2f9f2e6e68997a6eac2d26d.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 802551e0009b..ab773ea2cb34 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8746,7 +8746,7 @@ int sched_dynamic_mode(const char *str)
 #error "Unsupported PREEMPT_DYNAMIC mechanism"
 #endif
 
-static DEFINE_MUTEX(sched_dynamic_mutex);
+DEFINE_MUTEX(sched_dynamic_mutex);
 static bool klp_override;
 
 static void __sched_dynamic_update(int mode)
-- 
2.31.1



* [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 01/86] Revert "riscv: support PREEMPT_DYNAMIC with static keys" Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 02/86] Revert "sched/core: Make sched_dynamic_mutex static" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 23:10   ` Steven Rostedt
  2023-11-07 21:56 ` [RFC PATCH 04/86] Revert "preempt/dynamic: Introduce preemption model accessors" Ankur Arora
                   ` (59 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 089c02ae2771a14af2928c59c56abfb9b885a8d7.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/trace/trace.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index abaaf516fcae..7f565f0a00da 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4392,11 +4392,17 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter)
 		   entries,
 		   total,
 		   buf->cpu,
-		   preempt_model_none()      ? "server" :
-		   preempt_model_voluntary() ? "desktop" :
-		   preempt_model_full()      ? "preempt" :
-		   preempt_model_rt()        ? "preempt_rt" :
+#if defined(CONFIG_PREEMPT_NONE)
+		   "server",
+#elif defined(CONFIG_PREEMPT_VOLUNTARY)
+		   "desktop",
+#elif defined(CONFIG_PREEMPT)
+		   "preempt",
+#elif defined(CONFIG_PREEMPT_RT)
+		   "preempt_rt",
+#else
 		   "unknown",
+#endif
 		   /* These are reserved for later use */
 		   0, 0, 0, 0);
 #ifdef CONFIG_SMP
-- 
2.31.1



* [RFC PATCH 04/86] Revert "preempt/dynamic: Introduce preemption model accessors"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (2 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 23:12   ` Steven Rostedt
  2023-11-07 21:56 ` [RFC PATCH 05/86] Revert "kcsan: Use " Ankur Arora
                   ` (58 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit cfe43f478b79ba45573ca22d52d0d8823be068fa.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched.h | 41 -----------------------------------------
 kernel/sched/core.c   | 12 ------------
 2 files changed, 53 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..5bdf80136e42 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2178,47 +2178,6 @@ static inline void cond_resched_rcu(void)
 #endif
 }
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-extern bool preempt_model_none(void);
-extern bool preempt_model_voluntary(void);
-extern bool preempt_model_full(void);
-
-#else
-
-static inline bool preempt_model_none(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT_NONE);
-}
-static inline bool preempt_model_voluntary(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
-}
-static inline bool preempt_model_full(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT);
-}
-
-#endif
-
-static inline bool preempt_model_rt(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT_RT);
-}
-
-/*
- * Does the preemption model allow non-cooperative preemption?
- *
- * For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
- * CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
- * kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
- * PREEMPT_NONE model.
- */
-static inline bool preempt_model_preemptible(void)
-{
-	return preempt_model_full() || preempt_model_rt();
-}
-
 /*
  * Does a critical section need to be broken due to another
  * task waiting?: (technically does not depend on CONFIG_PREEMPTION,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab773ea2cb34..0e8764d63041 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8866,18 +8866,6 @@ static void __init preempt_dynamic_init(void)
 	}
 }
 
-#define PREEMPT_MODEL_ACCESSOR(mode) \
-	bool preempt_model_##mode(void)						 \
-	{									 \
-		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
-		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
-	}									 \
-	EXPORT_SYMBOL_GPL(preempt_model_##mode)
-
-PREEMPT_MODEL_ACCESSOR(none);
-PREEMPT_MODEL_ACCESSOR(voluntary);
-PREEMPT_MODEL_ACCESSOR(full);
-
 #else /* !CONFIG_PREEMPT_DYNAMIC */
 
 static inline void preempt_dynamic_init(void) { }
-- 
2.31.1



* [RFC PATCH 05/86] Revert "kcsan: Use preemption model accessors"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (3 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 04/86] Revert "preempt/dynamic: Introduce preemption model accessors" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()" Ankur Arora
                   ` (57 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 5693fa74f98afed5421ac0165e9e9291bde7d9e1.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/kcsan/kcsan_test.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/kcsan/kcsan_test.c b/kernel/kcsan/kcsan_test.c
index 0ddbdab5903d..6f46fd7998ce 100644
--- a/kernel/kcsan/kcsan_test.c
+++ b/kernel/kcsan/kcsan_test.c
@@ -1385,14 +1385,13 @@ static const void *nthreads_gen_params(const void *prev, char *desc)
 	else
 		nthreads *= 2;
 
-	if (!preempt_model_preemptible() ||
-	    !IS_ENABLED(CONFIG_KCSAN_INTERRUPT_WATCHER)) {
+	if (!IS_ENABLED(CONFIG_PREEMPT) || !IS_ENABLED(CONFIG_KCSAN_INTERRUPT_WATCHER)) {
 		/*
 		 * Without any preemption, keep 2 CPUs free for other tasks, one
 		 * of which is the main test case function checking for
 		 * completion or failure.
 		 */
-		const long min_unused_cpus = preempt_model_none() ? 2 : 0;
+		const long min_unused_cpus = IS_ENABLED(CONFIG_PREEMPT_NONE) ? 2 : 0;
 		const long min_required_cpus = 2 + min_unused_cpus;
 
 		if (num_online_cpus() < min_required_cpus) {
-- 
2.31.1



* [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (4 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 05/86] Revert "kcsan: Use " Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-08  7:47   ` Greg KH
  2023-11-07 21:56 ` [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()" Ankur Arora
                   ` (56 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 0a70045ed8516dfcff4b5728557e1ef3fd017c53.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/entry/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d7ee4bc3f2ba..ba684e9853c1 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -396,7 +396,7 @@ DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
 DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
 void dynamic_irqentry_exit_cond_resched(void)
 {
-	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
+	if (!static_key_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
 		return;
 	raw_irqentry_exit_cond_resched();
 }
-- 
2.31.1



* [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (5 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 23:16   ` Steven Rostedt
  2023-11-07 21:56 ` [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC" Ankur Arora
                   ` (55 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.

Note that reverting this commit reintroduces "live patches failing to
complete within a reasonable amount of time due to CPU-bound kthreads."

Unfortunately the original fix depends quite critically on PREEMPT_DYNAMIC
and on the existence of cond_resched(), so this will need an alternate fix.
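For reference, the hook being removed piggybacks on cond_resched() roughly
as follows (a condensed sketch of the code deleted below, not the full
implementation):

	/* Flipped by klp_cond_resched_{enable,disable}() around a transition. */
	DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);

	static __always_inline void klp_sched_try_switch(void)
	{
		/* Only pay for the check while a livepatch transition is pending. */
		if (static_branch_unlikely(&klp_sched_try_switch_key))
			__klp_sched_try_switch();
	}

	static inline int _cond_resched(void)
	{
		klp_sched_try_switch();	/* try to switch current's patch state */
		return __cond_resched();
	}

(With PREEMPT_DYNAMIC static calls, the same effect is obtained by
retargeting the cond_resched static call to a klp_cond_resched() wrapper.)
Once cond_resched() itself goes away, CPU-bound kthreads lose this
opportunistic transition point.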

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/livepatch.h       |   1 -
 include/linux/livepatch_sched.h |  29 ---------
 include/linux/sched.h           |  20 ++----
 kernel/livepatch/core.c         |   1 -
 kernel/livepatch/transition.c   | 107 +++++---------------------------
 kernel/sched/core.c             |  64 +++----------------
 6 files changed, 28 insertions(+), 194 deletions(-)
 delete mode 100644 include/linux/livepatch_sched.h

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 9b9b38e89563..293e29960c6e 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -13,7 +13,6 @@
 #include <linux/ftrace.h>
 #include <linux/completion.h>
 #include <linux/list.h>
-#include <linux/livepatch_sched.h>
 
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
deleted file mode 100644
index 013794fb5da0..000000000000
--- a/include/linux/livepatch_sched.h
+++ /dev/null
@@ -1,29 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _LINUX_LIVEPATCH_SCHED_H_
-#define _LINUX_LIVEPATCH_SCHED_H_
-
-#include <linux/jump_label.h>
-#include <linux/static_call_types.h>
-
-#ifdef CONFIG_LIVEPATCH
-
-void __klp_sched_try_switch(void);
-
-#if !defined(CONFIG_PREEMPT_DYNAMIC) || !defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
-
-DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
-
-static __always_inline void klp_sched_try_switch(void)
-{
-	if (static_branch_unlikely(&klp_sched_try_switch_key))
-		__klp_sched_try_switch();
-}
-
-#endif /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
-
-#else /* !CONFIG_LIVEPATCH */
-static inline void klp_sched_try_switch(void) {}
-static inline void __klp_sched_try_switch(void) {}
-#endif /* CONFIG_LIVEPATCH */
-
-#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5bdf80136e42..c5b0ef1ecfe4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,7 +36,6 @@
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
 #include <linux/rv.h>
-#include <linux/livepatch_sched.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -2087,9 +2086,6 @@ extern int __cond_resched(void);
 
 #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 
-void sched_dynamic_klp_enable(void);
-void sched_dynamic_klp_disable(void);
-
 DECLARE_STATIC_CALL(cond_resched, __cond_resched);
 
 static __always_inline int _cond_resched(void)
@@ -2098,7 +2094,6 @@ static __always_inline int _cond_resched(void)
 }
 
 #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-
 extern int dynamic_cond_resched(void);
 
 static __always_inline int _cond_resched(void)
@@ -2106,25 +2101,20 @@ static __always_inline int _cond_resched(void)
 	return dynamic_cond_resched();
 }
 
-#else /* !CONFIG_PREEMPTION */
+#else
 
 static inline int _cond_resched(void)
 {
-	klp_sched_try_switch();
 	return __cond_resched();
 }
 
-#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+#endif /* CONFIG_PREEMPT_DYNAMIC */
 
-#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
+#else
 
-static inline int _cond_resched(void)
-{
-	klp_sched_try_switch();
-	return 0;
-}
+static inline int _cond_resched(void) { return 0; }
 
-#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
+#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
 
 #define cond_resched() ({			\
 	__might_resched(__FILE__, __LINE__, 0);	\
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 61328328c474..fc851455740c 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -33,7 +33,6 @@
  *
  * - klp_ftrace_handler()
  * - klp_update_patch_state()
- * - __klp_sched_try_switch()
  */
 DEFINE_MUTEX(klp_mutex);
 
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index e54c3d60a904..70bc38f27af7 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,7 +9,6 @@
 
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
-#include <linux/static_call.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -27,25 +26,6 @@ static int klp_target_state = KLP_UNDEFINED;
 
 static unsigned int klp_signals_cnt;
 
-/*
- * When a livepatch is in progress, enable klp stack checking in
- * cond_resched().  This helps CPU-bound kthreads get patched.
- */
-#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
-
-#define klp_cond_resched_enable() sched_dynamic_klp_enable()
-#define klp_cond_resched_disable() sched_dynamic_klp_disable()
-
-#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
-
-DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
-EXPORT_SYMBOL(klp_sched_try_switch_key);
-
-#define klp_cond_resched_enable() static_branch_enable(&klp_sched_try_switch_key)
-#define klp_cond_resched_disable() static_branch_disable(&klp_sched_try_switch_key)
-
-#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
-
 /*
  * This work can be performed periodically to finish patching or unpatching any
  * "straggler" tasks which failed to transition in the first attempt.
@@ -194,8 +174,8 @@ void klp_update_patch_state(struct task_struct *task)
 	 * barrier (smp_rmb) for two cases:
 	 *
 	 * 1) Enforce the order of the TIF_PATCH_PENDING read and the
-	 *    klp_target_state read.  The corresponding write barriers are in
-	 *    klp_init_transition() and klp_reverse_transition().
+	 *    klp_target_state read.  The corresponding write barrier is in
+	 *    klp_init_transition().
 	 *
 	 * 2) Enforce the order of the TIF_PATCH_PENDING read and a future read
 	 *    of func->transition, if klp_ftrace_handler() is called later on
@@ -363,44 +343,6 @@ static bool klp_try_switch_task(struct task_struct *task)
 	return !ret;
 }
 
-void __klp_sched_try_switch(void)
-{
-	if (likely(!klp_patch_pending(current)))
-		return;
-
-	/*
-	 * This function is called from cond_resched() which is called in many
-	 * places throughout the kernel.  Using the klp_mutex here might
-	 * deadlock.
-	 *
-	 * Instead, disable preemption to prevent racing with other callers of
-	 * klp_try_switch_task().  Thanks to task_call_func() they won't be
-	 * able to switch this task while it's running.
-	 */
-	preempt_disable();
-
-	/*
-	 * Make sure current didn't get patched between the above check and
-	 * preempt_disable().
-	 */
-	if (unlikely(!klp_patch_pending(current)))
-		goto out;
-
-	/*
-	 * Enforce the order of the TIF_PATCH_PENDING read above and the
-	 * klp_target_state read in klp_try_switch_task().  The corresponding
-	 * write barriers are in klp_init_transition() and
-	 * klp_reverse_transition().
-	 */
-	smp_rmb();
-
-	klp_try_switch_task(current);
-
-out:
-	preempt_enable();
-}
-EXPORT_SYMBOL(__klp_sched_try_switch);
-
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
  * Kthreads with TIF_PATCH_PENDING set are woken up.
@@ -507,8 +449,7 @@ void klp_try_complete_transition(void)
 		return;
 	}
 
-	/* Done!  Now cleanup the data structures. */
-	klp_cond_resched_disable();
+	/* we're done, now cleanup the data structures */
 	patch = klp_transition_patch;
 	klp_complete_transition();
 
@@ -560,8 +501,6 @@ void klp_start_transition(void)
 			set_tsk_thread_flag(task, TIF_PATCH_PENDING);
 	}
 
-	klp_cond_resched_enable();
-
 	klp_signals_cnt = 0;
 }
 
@@ -617,9 +556,8 @@ void klp_init_transition(struct klp_patch *patch, int state)
 	 * see a func in transition with a task->patch_state of KLP_UNDEFINED.
 	 *
 	 * Also enforce the order of the klp_target_state write and future
-	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() and
-	 * __klp_sched_try_switch() don't set a task->patch_state to
-	 * KLP_UNDEFINED.
+	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() doesn't
+	 * set a task->patch_state to KLP_UNDEFINED.
 	 */
 	smp_wmb();
 
@@ -655,10 +593,14 @@ void klp_reverse_transition(void)
 		 klp_target_state == KLP_PATCHED ? "patching to unpatching" :
 						   "unpatching to patching");
 
+	klp_transition_patch->enabled = !klp_transition_patch->enabled;
+
+	klp_target_state = !klp_target_state;
+
 	/*
 	 * Clear all TIF_PATCH_PENDING flags to prevent races caused by
-	 * klp_update_patch_state() or __klp_sched_try_switch() running in
-	 * parallel with the reverse transition.
+	 * klp_update_patch_state() running in parallel with
+	 * klp_start_transition().
 	 */
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task)
@@ -668,28 +610,9 @@ void klp_reverse_transition(void)
 	for_each_possible_cpu(cpu)
 		clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
 
-	/*
-	 * Make sure all existing invocations of klp_update_patch_state() and
-	 * __klp_sched_try_switch() see the cleared TIF_PATCH_PENDING before
-	 * starting the reverse transition.
-	 */
+	/* Let any remaining calls to klp_update_patch_state() complete */
 	klp_synchronize_transition();
 
-	/*
-	 * All patching has stopped, now re-initialize the global variables to
-	 * prepare for the reverse transition.
-	 */
-	klp_transition_patch->enabled = !klp_transition_patch->enabled;
-	klp_target_state = !klp_target_state;
-
-	/*
-	 * Enforce the order of the klp_target_state write and the
-	 * TIF_PATCH_PENDING writes in klp_start_transition() to ensure
-	 * klp_update_patch_state() and __klp_sched_try_switch() don't set
-	 * task->patch_state to the wrong value.
-	 */
-	smp_wmb();
-
 	klp_start_transition();
 }
 
@@ -703,9 +626,9 @@ void klp_copy_process(struct task_struct *child)
 	 * the task flag up to date with the parent here.
 	 *
 	 * The operation is serialized against all klp_*_transition()
-	 * operations by the tasklist_lock. The only exceptions are
-	 * klp_update_patch_state(current) and __klp_sched_try_switch(), but we
-	 * cannot race with them because we are current.
+	 * operations by the tasklist_lock. The only exception is
+	 * klp_update_patch_state(current), but we cannot race with
+	 * that because we are current.
 	 */
 	if (test_tsk_thread_flag(current, TIF_PATCH_PENDING))
 		set_tsk_thread_flag(child, TIF_PATCH_PENDING);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0e8764d63041..b43fda3c5733 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8597,7 +8597,6 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
 static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
 int __sched dynamic_cond_resched(void)
 {
-	klp_sched_try_switch();
 	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
 		return 0;
 	return __cond_resched();
@@ -8746,17 +8745,13 @@ int sched_dynamic_mode(const char *str)
 #error "Unsupported PREEMPT_DYNAMIC mechanism"
 #endif
 
-DEFINE_MUTEX(sched_dynamic_mutex);
-static bool klp_override;
-
-static void __sched_dynamic_update(int mode)
+void sched_dynamic_update(int mode)
 {
 	/*
 	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
 	 * the ZERO state, which is invalid.
 	 */
-	if (!klp_override)
-		preempt_dynamic_enable(cond_resched);
+	preempt_dynamic_enable(cond_resched);
 	preempt_dynamic_enable(might_resched);
 	preempt_dynamic_enable(preempt_schedule);
 	preempt_dynamic_enable(preempt_schedule_notrace);
@@ -8764,79 +8759,36 @@ static void __sched_dynamic_update(int mode)
 
 	switch (mode) {
 	case preempt_dynamic_none:
-		if (!klp_override)
-			preempt_dynamic_enable(cond_resched);
+		preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
 		preempt_dynamic_disable(irqentry_exit_cond_resched);
-		if (mode != preempt_dynamic_mode)
-			pr_info("Dynamic Preempt: none\n");
+		pr_info("Dynamic Preempt: none\n");
 		break;
 
 	case preempt_dynamic_voluntary:
-		if (!klp_override)
-			preempt_dynamic_enable(cond_resched);
+		preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_enable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
 		preempt_dynamic_disable(irqentry_exit_cond_resched);
-		if (mode != preempt_dynamic_mode)
-			pr_info("Dynamic Preempt: voluntary\n");
+		pr_info("Dynamic Preempt: voluntary\n");
 		break;
 
 	case preempt_dynamic_full:
-		if (!klp_override)
-			preempt_dynamic_disable(cond_resched);
+		preempt_dynamic_disable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_enable(preempt_schedule);
 		preempt_dynamic_enable(preempt_schedule_notrace);
 		preempt_dynamic_enable(irqentry_exit_cond_resched);
-		if (mode != preempt_dynamic_mode)
-			pr_info("Dynamic Preempt: full\n");
+		pr_info("Dynamic Preempt: full\n");
 		break;
 	}
 
 	preempt_dynamic_mode = mode;
 }
 
-void sched_dynamic_update(int mode)
-{
-	mutex_lock(&sched_dynamic_mutex);
-	__sched_dynamic_update(mode);
-	mutex_unlock(&sched_dynamic_mutex);
-}
-
-#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
-
-static int klp_cond_resched(void)
-{
-	__klp_sched_try_switch();
-	return __cond_resched();
-}
-
-void sched_dynamic_klp_enable(void)
-{
-	mutex_lock(&sched_dynamic_mutex);
-
-	klp_override = true;
-	static_call_update(cond_resched, klp_cond_resched);
-
-	mutex_unlock(&sched_dynamic_mutex);
-}
-
-void sched_dynamic_klp_disable(void)
-{
-	mutex_lock(&sched_dynamic_mutex);
-
-	klp_override = false;
-	__sched_dynamic_update(preempt_dynamic_mode);
-
-	mutex_unlock(&sched_dynamic_mutex);
-}
-
-#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
-
 static int __init setup_preempt_mode(char *str)
 {
 	int mode = sched_dynamic_mode(str);
-- 
2.31.1



* [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (6 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 23:17   ` Steven Rostedt
  2023-11-08 15:44   ` Mark Rutland
  2023-11-07 21:56 ` [RFC PATCH 09/86] Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" Ankur Arora
                   ` (54 subsequent siblings)
  62 siblings, 2 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 1b2d3451ee50a0968cb9933f726e50b368ba5073.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/Kconfig               |  1 -
 arch/arm64/include/asm/preempt.h | 19 ++-----------------
 arch/arm64/kernel/entry-common.c | 10 +---------
 3 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 78f20e632712..856d7be2ee45 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -221,7 +221,6 @@ config ARM64
 	select HAVE_PERF_EVENTS_NMI if ARM64_PSEUDO_NMI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select HAVE_PREEMPT_DYNAMIC_KEY
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_FUNCTION_ARG_ACCESS_API
diff --git a/arch/arm64/include/asm/preempt.h b/arch/arm64/include/asm/preempt.h
index 0159b625cc7f..e83f0982b99c 100644
--- a/arch/arm64/include/asm/preempt.h
+++ b/arch/arm64/include/asm/preempt.h
@@ -2,7 +2,6 @@
 #ifndef __ASM_PREEMPT_H
 #define __ASM_PREEMPT_H
 
-#include <linux/jump_label.h>
 #include <linux/thread_info.h>
 
 #define PREEMPT_NEED_RESCHED	BIT(32)
@@ -81,24 +80,10 @@ static inline bool should_resched(int preempt_offset)
 }
 
 #ifdef CONFIG_PREEMPTION
-
 void preempt_schedule(void);
+#define __preempt_schedule() preempt_schedule()
 void preempt_schedule_notrace(void);
-
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
-void dynamic_preempt_schedule(void);
-#define __preempt_schedule()		dynamic_preempt_schedule()
-void dynamic_preempt_schedule_notrace(void);
-#define __preempt_schedule_notrace()	dynamic_preempt_schedule_notrace()
-
-#else /* CONFIG_PREEMPT_DYNAMIC */
-
-#define __preempt_schedule()		preempt_schedule()
-#define __preempt_schedule_notrace()	preempt_schedule_notrace()
-
-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#define __preempt_schedule_notrace() preempt_schedule_notrace()
 #endif /* CONFIG_PREEMPTION */
 
 #endif /* __ASM_PREEMPT_H */
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 0fc94207e69a..5d9c9951562b 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -225,17 +225,9 @@ static void noinstr arm64_exit_el1_dbg(struct pt_regs *regs)
 		lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
-#define need_irq_preemption() \
-	(static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
-#else
-#define need_irq_preemption()	(IS_ENABLED(CONFIG_PREEMPTION))
-#endif
-
 static void __sched arm64_preempt_schedule_irq(void)
 {
-	if (!need_irq_preemption())
+	if (!IS_ENABLED(CONFIG_PREEMPTION))
 		return;
 
 	/*
-- 
2.31.1



* [RFC PATCH 09/86] Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (7 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 10/86] Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY" Ankur Arora
                   ` (53 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 99cf983cc8bca4adb461b519664c939a565cfd4d.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/Kconfig                 | 36 ++----------------------
 arch/x86/Kconfig             |  2 +-
 include/linux/entry-common.h | 10 ++-----
 include/linux/kernel.h       |  7 +----
 include/linux/sched.h        | 10 +------
 kernel/Kconfig.preempt       |  3 +-
 kernel/entry/common.c        | 11 --------
 kernel/sched/core.c          | 53 ++----------------------------------
 8 files changed, 11 insertions(+), 121 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 12d51495caec..3eb64363b48d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1395,41 +1395,11 @@ config HAVE_STATIC_CALL_INLINE
 
 config HAVE_PREEMPT_DYNAMIC
 	bool
-
-config HAVE_PREEMPT_DYNAMIC_CALL
-	bool
 	depends on HAVE_STATIC_CALL
-	select HAVE_PREEMPT_DYNAMIC
 	help
-	  An architecture should select this if it can handle the preemption
-	  model being selected at boot time using static calls.
-
-	  Where an architecture selects HAVE_STATIC_CALL_INLINE, any call to a
-	  preemption function will be patched directly.
-
-	  Where an architecture does not select HAVE_STATIC_CALL_INLINE, any
-	  call to a preemption function will go through a trampoline, and the
-	  trampoline will be patched.
-
-	  It is strongly advised to support inline static call to avoid any
-	  overhead.
-
-config HAVE_PREEMPT_DYNAMIC_KEY
-	bool
-	depends on HAVE_ARCH_JUMP_LABEL
-	select HAVE_PREEMPT_DYNAMIC
-	help
-	  An architecture should select this if it can handle the preemption
-	  model being selected at boot time using static keys.
-
-	  Each preemption function will be given an early return based on a
-	  static key. This should have slightly lower overhead than non-inline
-	  static calls, as this effectively inlines each trampoline into the
-	  start of its callee. This may avoid redundant work, and may
-	  integrate better with CFI schemes.
-
-	  This will have greater overhead than using inline static calls as
-	  the call to the preemption function cannot be entirely elided.
+	  Select this if the architecture support boot time preempt setting
+	  on top of static calls. It is strongly advised to support inline
+	  static call to avoid any overhead.
 
 config ARCH_WANT_LD_ORPHAN_WARN
 	bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..ec71c232af32 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -270,7 +270,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_STATIC_CALL
 	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
-	select HAVE_PREEMPT_DYNAMIC_CALL
+	select HAVE_PREEMPT_DYNAMIC
 	select HAVE_RSEQ
 	select HAVE_RUST			if X86_64
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index d95ab85f96ba..a382716ea7b2 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -416,19 +416,13 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  */
 void raw_irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
 #define irqentry_exit_cond_resched_dynamic_disabled	NULL
 DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
 #define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
-#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
-void dynamic_irqentry_exit_cond_resched(void);
-#define irqentry_exit_cond_resched()	dynamic_irqentry_exit_cond_resched()
-#endif
-#else /* CONFIG_PREEMPT_DYNAMIC */
+#else
 #define irqentry_exit_cond_resched()	raw_irqentry_exit_cond_resched()
-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif
 
 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index cee8fe87e9f4..cdce553479b4 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -101,7 +101,7 @@ struct user;
 extern int __cond_resched(void);
 # define might_resched() __cond_resched()
 
-#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+#elif defined(CONFIG_PREEMPT_DYNAMIC)
 
 extern int __cond_resched(void);
 
@@ -112,11 +112,6 @@ static __always_inline void might_resched(void)
 	static_call_mod(might_resched)();
 }
 
-#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-
-extern int dynamic_might_resched(void);
-# define might_resched() dynamic_might_resched()
-
 #else
 
 # define might_resched() do { } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5b0ef1ecfe4..66f520954de5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,7 +2084,7 @@ static inline int test_tsk_need_resched(struct task_struct *tsk)
 #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
 extern int __cond_resched(void);
 
-#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+#ifdef CONFIG_PREEMPT_DYNAMIC
 
 DECLARE_STATIC_CALL(cond_resched, __cond_resched);
 
@@ -2093,14 +2093,6 @@ static __always_inline int _cond_resched(void)
 	return static_call_mod(cond_resched)();
 }
 
-#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-extern int dynamic_cond_resched(void);
-
-static __always_inline int _cond_resched(void)
-{
-	return dynamic_cond_resched();
-}
-
 #else
 
 static inline int _cond_resched(void)
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a821..ce77f0265660 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -96,9 +96,8 @@ config PREEMPTION
 config PREEMPT_DYNAMIC
 	bool "Preemption behaviour defined on boot"
 	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
-	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
 	select PREEMPT_BUILD
-	default y if HAVE_PREEMPT_DYNAMIC_CALL
+	default y
 	help
 	  This option allows to define the preemption model on the kernel
 	  command line parameter and thus override the default preemption
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index ba684e9853c1..38593049c40c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -4,7 +4,6 @@
 #include <linux/entry-common.h>
 #include <linux/resume_user_mode.h>
 #include <linux/highmem.h>
-#include <linux/jump_label.h>
 #include <linux/kmsan.h>
 #include <linux/livepatch.h>
 #include <linux/audit.h>
@@ -390,17 +389,7 @@ void raw_irqentry_exit_cond_resched(void)
 	}
 }
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
-#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
-void dynamic_irqentry_exit_cond_resched(void)
-{
-	if (!static_key_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
-		return;
-	raw_irqentry_exit_cond_resched();
-}
-#endif
 #endif
 
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b43fda3c5733..51c992105bc0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6885,32 +6885,22 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	 */
 	if (likely(!preemptible()))
 		return;
+
 	preempt_schedule_common();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #ifndef preempt_schedule_dynamic_enabled
 #define preempt_schedule_dynamic_enabled	preempt_schedule
 #define preempt_schedule_dynamic_disabled	NULL
 #endif
 DEFINE_STATIC_CALL(preempt_schedule, preempt_schedule_dynamic_enabled);
 EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
-#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-static DEFINE_STATIC_KEY_TRUE(sk_dynamic_preempt_schedule);
-void __sched notrace dynamic_preempt_schedule(void)
-{
-	if (!static_branch_unlikely(&sk_dynamic_preempt_schedule))
-		return;
-	preempt_schedule();
-}
-NOKPROBE_SYMBOL(dynamic_preempt_schedule);
-EXPORT_SYMBOL(dynamic_preempt_schedule);
-#endif
 #endif
 
+
 /**
  * preempt_schedule_notrace - preempt_schedule called by tracing
  *
@@ -6964,24 +6954,12 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
 EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #ifndef preempt_schedule_notrace_dynamic_enabled
 #define preempt_schedule_notrace_dynamic_enabled	preempt_schedule_notrace
 #define preempt_schedule_notrace_dynamic_disabled	NULL
 #endif
 DEFINE_STATIC_CALL(preempt_schedule_notrace, preempt_schedule_notrace_dynamic_enabled);
 EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
-#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-static DEFINE_STATIC_KEY_TRUE(sk_dynamic_preempt_schedule_notrace);
-void __sched notrace dynamic_preempt_schedule_notrace(void)
-{
-	if (!static_branch_unlikely(&sk_dynamic_preempt_schedule_notrace))
-		return;
-	preempt_schedule_notrace();
-}
-NOKPROBE_SYMBOL(dynamic_preempt_schedule_notrace);
-EXPORT_SYMBOL(dynamic_preempt_schedule_notrace);
-#endif
 #endif
 
 #endif /* CONFIG_PREEMPTION */
@@ -8583,7 +8561,6 @@ EXPORT_SYMBOL(__cond_resched);
 #endif
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define cond_resched_dynamic_enabled	__cond_resched
 #define cond_resched_dynamic_disabled	((void *)&__static_call_return0)
 DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
@@ -8593,25 +8570,6 @@ EXPORT_STATIC_CALL_TRAMP(cond_resched);
 #define might_resched_dynamic_disabled	((void *)&__static_call_return0)
 DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
 EXPORT_STATIC_CALL_TRAMP(might_resched);
-#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
-int __sched dynamic_cond_resched(void)
-{
-	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
-		return 0;
-	return __cond_resched();
-}
-EXPORT_SYMBOL(dynamic_cond_resched);
-
-static DEFINE_STATIC_KEY_FALSE(sk_dynamic_might_resched);
-int __sched dynamic_might_resched(void)
-{
-	if (!static_branch_unlikely(&sk_dynamic_might_resched))
-		return 0;
-	return __cond_resched();
-}
-EXPORT_SYMBOL(dynamic_might_resched);
-#endif
 #endif
 
 /*
@@ -8735,15 +8693,8 @@ int sched_dynamic_mode(const char *str)
 	return -EINVAL;
 }
 
-#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define preempt_dynamic_enable(f)	static_call_update(f, f##_dynamic_enabled)
 #define preempt_dynamic_disable(f)	static_call_update(f, f##_dynamic_disabled)
-#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-#define preempt_dynamic_enable(f)	static_key_enable(&sk_dynamic_##f.key)
-#define preempt_dynamic_disable(f)	static_key_disable(&sk_dynamic_##f.key)
-#else
-#error "Unsupported PREEMPT_DYNAMIC mechanism"
-#endif
 
 void sched_dynamic_update(int mode)
 {
-- 
2.31.1



* [RFC PATCH 10/86] Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (8 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 09/86] Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 11/86] Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers" Ankur Arora
                   ` (52 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 33c64734be3461222a8aa27d3dadc477ebca62de.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/Kconfig        | 1 +
 kernel/sched/core.c | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 3eb64363b48d..afe6785fd3e2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1396,6 +1396,7 @@ config HAVE_STATIC_CALL_INLINE
 config HAVE_PREEMPT_DYNAMIC
 	bool
 	depends on HAVE_STATIC_CALL
+	depends on GENERIC_ENTRY
 	help
 	  Select this if the architecture support boot time preempt setting
 	  on top of static calls. It is strongly advised to support inline
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51c992105bc0..686e89d4ebb7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8636,9 +8636,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 
-#ifdef CONFIG_GENERIC_ENTRY
 #include <linux/entry-common.h>
-#endif
 
 /*
  * SC:cond_resched
-- 
2.31.1



* [RFC PATCH 11/86] Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (9 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 10/86] Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 12/86] Revert "sched/preempt: Refactor sched_dynamic_update()" Ankur Arora
                   ` (51 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 4624a14f4daa8ab4578d274555fd8847254ce339.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/entry-common.h |  9 +++------
 kernel/entry/common.c        | 12 ++++++++----
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index a382716ea7b2..6567e99e079e 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -414,14 +414,11 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  *
  * Conditional reschedule with additional sanity checks.
  */
-void raw_irqentry_exit_cond_resched(void);
+void irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
+#define irqentry_exit_cond_resched_dynamic_enabled	irqentry_exit_cond_resched
 #define irqentry_exit_cond_resched_dynamic_disabled	NULL
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
-#define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
-#else
-#define irqentry_exit_cond_resched()	raw_irqentry_exit_cond_resched()
+DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
 #endif
 
 /**
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 38593049c40c..b0b7be0705e0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -377,7 +377,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	return ret;
 }
 
-void raw_irqentry_exit_cond_resched(void)
+void irqentry_exit_cond_resched(void)
 {
 	if (!preempt_count()) {
 		/* Sanity check RCU and thread stack */
@@ -389,7 +389,7 @@ void raw_irqentry_exit_cond_resched(void)
 	}
 }
 #ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
+DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
 #endif
 
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
@@ -417,9 +417,13 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}
 
 		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION))
+		if (IS_ENABLED(CONFIG_PREEMPTION)) {
+#ifdef CONFIG_PREEMPT_DYNAMIC
+			static_call(irqentry_exit_cond_resched)();
+#else
 			irqentry_exit_cond_resched();
-
+#endif
+		}
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
-- 
2.31.1



* [RFC PATCH 12/86] Revert "sched/preempt: Refactor sched_dynamic_update()"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (10 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 11/86] Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 21:56 ` [RFC PATCH 13/86] Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later" Ankur Arora
                   ` (50 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 8a69fe0be143b0a1af829f85f0e9a1ae7d6a04db.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/preempt.h | 10 +++---
 include/linux/entry-common.h   |  2 --
 kernel/sched/core.c            | 59 +++++++++++++---------------------
 3 files changed, 26 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 2d13f25b1bd8..495faed1c76c 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -109,18 +109,16 @@ static __always_inline bool should_resched(int preempt_offset)
 extern asmlinkage void preempt_schedule(void);
 extern asmlinkage void preempt_schedule_thunk(void);
 
-#define preempt_schedule_dynamic_enabled	preempt_schedule_thunk
-#define preempt_schedule_dynamic_disabled	NULL
+#define __preempt_schedule_func preempt_schedule_thunk
 
 extern asmlinkage void preempt_schedule_notrace(void);
 extern asmlinkage void preempt_schedule_notrace_thunk(void);
 
-#define preempt_schedule_notrace_dynamic_enabled	preempt_schedule_notrace_thunk
-#define preempt_schedule_notrace_dynamic_disabled	NULL
+#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 
-DECLARE_STATIC_CALL(preempt_schedule, preempt_schedule_dynamic_enabled);
+DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
 
 #define __preempt_schedule() \
 do { \
@@ -128,7 +126,7 @@ do { \
 	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \
 } while (0)
 
-DECLARE_STATIC_CALL(preempt_schedule_notrace, preempt_schedule_notrace_dynamic_enabled);
+DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
 
 #define __preempt_schedule_notrace() \
 do { \
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 6567e99e079e..49e9fe9489b6 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -416,8 +416,6 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  */
 void irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#define irqentry_exit_cond_resched_dynamic_enabled	irqentry_exit_cond_resched
-#define irqentry_exit_cond_resched_dynamic_disabled	NULL
 DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 686e89d4ebb7..2268d9e23635 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6892,11 +6892,7 @@ NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#ifndef preempt_schedule_dynamic_enabled
-#define preempt_schedule_dynamic_enabled	preempt_schedule
-#define preempt_schedule_dynamic_disabled	NULL
-#endif
-DEFINE_STATIC_CALL(preempt_schedule, preempt_schedule_dynamic_enabled);
+DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
 EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
 #endif
 
@@ -6954,11 +6950,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
 EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#ifndef preempt_schedule_notrace_dynamic_enabled
-#define preempt_schedule_notrace_dynamic_enabled	preempt_schedule_notrace
-#define preempt_schedule_notrace_dynamic_disabled	NULL
-#endif
-DEFINE_STATIC_CALL(preempt_schedule_notrace, preempt_schedule_notrace_dynamic_enabled);
+DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
 EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
 #endif
 
@@ -8561,13 +8553,9 @@ EXPORT_SYMBOL(__cond_resched);
 #endif
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
-#define cond_resched_dynamic_enabled	__cond_resched
-#define cond_resched_dynamic_disabled	((void *)&__static_call_return0)
 DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
 EXPORT_STATIC_CALL_TRAMP(cond_resched);
 
-#define might_resched_dynamic_enabled	__cond_resched
-#define might_resched_dynamic_disabled	((void *)&__static_call_return0)
 DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
 EXPORT_STATIC_CALL_TRAMP(might_resched);
 #endif
@@ -8691,46 +8679,43 @@ int sched_dynamic_mode(const char *str)
 	return -EINVAL;
 }
 
-#define preempt_dynamic_enable(f)	static_call_update(f, f##_dynamic_enabled)
-#define preempt_dynamic_disable(f)	static_call_update(f, f##_dynamic_disabled)
-
 void sched_dynamic_update(int mode)
 {
 	/*
 	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
 	 * the ZERO state, which is invalid.
 	 */
-	preempt_dynamic_enable(cond_resched);
-	preempt_dynamic_enable(might_resched);
-	preempt_dynamic_enable(preempt_schedule);
-	preempt_dynamic_enable(preempt_schedule_notrace);
-	preempt_dynamic_enable(irqentry_exit_cond_resched);
+	static_call_update(cond_resched, __cond_resched);
+	static_call_update(might_resched, __cond_resched);
+	static_call_update(preempt_schedule, __preempt_schedule_func);
+	static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+	static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
 
 	switch (mode) {
 	case preempt_dynamic_none:
-		preempt_dynamic_enable(cond_resched);
-		preempt_dynamic_disable(might_resched);
-		preempt_dynamic_disable(preempt_schedule);
-		preempt_dynamic_disable(preempt_schedule_notrace);
-		preempt_dynamic_disable(irqentry_exit_cond_resched);
+		static_call_update(cond_resched, __cond_resched);
+		static_call_update(might_resched, (void *)&__static_call_return0);
+		static_call_update(preempt_schedule, NULL);
+		static_call_update(preempt_schedule_notrace, NULL);
+		static_call_update(irqentry_exit_cond_resched, NULL);
 		pr_info("Dynamic Preempt: none\n");
 		break;
 
 	case preempt_dynamic_voluntary:
-		preempt_dynamic_enable(cond_resched);
-		preempt_dynamic_enable(might_resched);
-		preempt_dynamic_disable(preempt_schedule);
-		preempt_dynamic_disable(preempt_schedule_notrace);
-		preempt_dynamic_disable(irqentry_exit_cond_resched);
+		static_call_update(cond_resched, __cond_resched);
+		static_call_update(might_resched, __cond_resched);
+		static_call_update(preempt_schedule, NULL);
+		static_call_update(preempt_schedule_notrace, NULL);
+		static_call_update(irqentry_exit_cond_resched, NULL);
 		pr_info("Dynamic Preempt: voluntary\n");
 		break;
 
 	case preempt_dynamic_full:
-		preempt_dynamic_disable(cond_resched);
-		preempt_dynamic_disable(might_resched);
-		preempt_dynamic_enable(preempt_schedule);
-		preempt_dynamic_enable(preempt_schedule_notrace);
-		preempt_dynamic_enable(irqentry_exit_cond_resched);
+		static_call_update(cond_resched, (void *)&__static_call_return0);
+		static_call_update(might_resched, (void *)&__static_call_return0);
+		static_call_update(preempt_schedule, __preempt_schedule_func);
+		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
 		pr_info("Dynamic Preempt: full\n");
 		break;
 	}
-- 
2.31.1



* [RFC PATCH 13/86] Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (11 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 12/86] Revert "sched/preempt: Refactor sched_dynamic_update()" Ankur Arora
@ 2023-11-07 21:56 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 14/86] Revert "preempt/dynamic: Fix setup_preempt_mode() return value" Ankur Arora
                   ` (49 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 4c7485584d48f60b1e742c7c6a3a1fa503d48d97.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 272 ++++++++++++++++++++++----------------------
 1 file changed, 136 insertions(+), 136 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2268d9e23635..f8bbddd729db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6956,6 +6956,142 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
 
 #endif /* CONFIG_PREEMPTION */
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+#include <linux/entry-common.h>
+
+/*
+ * SC:cond_resched
+ * SC:might_resched
+ * SC:preempt_schedule
+ * SC:preempt_schedule_notrace
+ * SC:irqentry_exit_cond_resched
+ *
+ *
+ * NONE:
+ *   cond_resched               <- __cond_resched
+ *   might_resched              <- RET0
+ *   preempt_schedule           <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * VOLUNTARY:
+ *   cond_resched               <- __cond_resched
+ *   might_resched              <- __cond_resched
+ *   preempt_schedule           <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * FULL:
+ *   cond_resched               <- RET0
+ *   might_resched              <- RET0
+ *   preempt_schedule           <- preempt_schedule
+ *   preempt_schedule_notrace   <- preempt_schedule_notrace
+ *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ */
+
+enum {
+	preempt_dynamic_undefined = -1,
+	preempt_dynamic_none,
+	preempt_dynamic_voluntary,
+	preempt_dynamic_full,
+};
+
+int preempt_dynamic_mode = preempt_dynamic_undefined;
+
+int sched_dynamic_mode(const char *str)
+{
+	if (!strcmp(str, "none"))
+		return preempt_dynamic_none;
+
+	if (!strcmp(str, "voluntary"))
+		return preempt_dynamic_voluntary;
+
+	if (!strcmp(str, "full"))
+		return preempt_dynamic_full;
+
+	return -EINVAL;
+}
+
+void sched_dynamic_update(int mode)
+{
+	/*
+	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
+	 * the ZERO state, which is invalid.
+	 */
+	static_call_update(cond_resched, __cond_resched);
+	static_call_update(might_resched, __cond_resched);
+	static_call_update(preempt_schedule, __preempt_schedule_func);
+	static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+	static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+
+	switch (mode) {
+	case preempt_dynamic_none:
+		static_call_update(cond_resched, __cond_resched);
+		static_call_update(might_resched, (void *)&__static_call_return0);
+		static_call_update(preempt_schedule, NULL);
+		static_call_update(preempt_schedule_notrace, NULL);
+		static_call_update(irqentry_exit_cond_resched, NULL);
+		pr_info("Dynamic Preempt: none\n");
+		break;
+
+	case preempt_dynamic_voluntary:
+		static_call_update(cond_resched, __cond_resched);
+		static_call_update(might_resched, __cond_resched);
+		static_call_update(preempt_schedule, NULL);
+		static_call_update(preempt_schedule_notrace, NULL);
+		static_call_update(irqentry_exit_cond_resched, NULL);
+		pr_info("Dynamic Preempt: voluntary\n");
+		break;
+
+	case preempt_dynamic_full:
+		static_call_update(cond_resched, (void *)&__static_call_return0);
+		static_call_update(might_resched, (void *)&__static_call_return0);
+		static_call_update(preempt_schedule, __preempt_schedule_func);
+		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+		pr_info("Dynamic Preempt: full\n");
+		break;
+	}
+
+	preempt_dynamic_mode = mode;
+}
+
+static int __init setup_preempt_mode(char *str)
+{
+	int mode = sched_dynamic_mode(str);
+	if (mode < 0) {
+		pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
+		return 0;
+	}
+
+	sched_dynamic_update(mode);
+	return 1;
+}
+__setup("preempt=", setup_preempt_mode);
+
+static void __init preempt_dynamic_init(void)
+{
+	if (preempt_dynamic_mode == preempt_dynamic_undefined) {
+		if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
+			sched_dynamic_update(preempt_dynamic_none);
+		} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
+			sched_dynamic_update(preempt_dynamic_voluntary);
+		} else {
+			/* Default static call setting, nothing to do */
+			WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
+			preempt_dynamic_mode = preempt_dynamic_full;
+			pr_info("Dynamic Preempt: full\n");
+		}
+	}
+}
+
+#else /* !CONFIG_PREEMPT_DYNAMIC */
+
+static inline void preempt_dynamic_init(void) { }
+
+#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
+
 /*
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.
@@ -8622,142 +8758,6 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
 }
 EXPORT_SYMBOL(__cond_resched_rwlock_write);
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-#include <linux/entry-common.h>
-
-/*
- * SC:cond_resched
- * SC:might_resched
- * SC:preempt_schedule
- * SC:preempt_schedule_notrace
- * SC:irqentry_exit_cond_resched
- *
- *
- * NONE:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- RET0
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
- *
- * VOLUNTARY:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- __cond_resched
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
- *
- * FULL:
- *   cond_resched               <- RET0
- *   might_resched              <- RET0
- *   preempt_schedule           <- preempt_schedule
- *   preempt_schedule_notrace   <- preempt_schedule_notrace
- *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
- */
-
-enum {
-	preempt_dynamic_undefined = -1,
-	preempt_dynamic_none,
-	preempt_dynamic_voluntary,
-	preempt_dynamic_full,
-};
-
-int preempt_dynamic_mode = preempt_dynamic_undefined;
-
-int sched_dynamic_mode(const char *str)
-{
-	if (!strcmp(str, "none"))
-		return preempt_dynamic_none;
-
-	if (!strcmp(str, "voluntary"))
-		return preempt_dynamic_voluntary;
-
-	if (!strcmp(str, "full"))
-		return preempt_dynamic_full;
-
-	return -EINVAL;
-}
-
-void sched_dynamic_update(int mode)
-{
-	/*
-	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
-	 * the ZERO state, which is invalid.
-	 */
-	static_call_update(cond_resched, __cond_resched);
-	static_call_update(might_resched, __cond_resched);
-	static_call_update(preempt_schedule, __preempt_schedule_func);
-	static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-	static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-
-	switch (mode) {
-	case preempt_dynamic_none:
-		static_call_update(cond_resched, __cond_resched);
-		static_call_update(might_resched, (void *)&__static_call_return0);
-		static_call_update(preempt_schedule, NULL);
-		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
-		pr_info("Dynamic Preempt: none\n");
-		break;
-
-	case preempt_dynamic_voluntary:
-		static_call_update(cond_resched, __cond_resched);
-		static_call_update(might_resched, __cond_resched);
-		static_call_update(preempt_schedule, NULL);
-		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
-		pr_info("Dynamic Preempt: voluntary\n");
-		break;
-
-	case preempt_dynamic_full:
-		static_call_update(cond_resched, (void *)&__static_call_return0);
-		static_call_update(might_resched, (void *)&__static_call_return0);
-		static_call_update(preempt_schedule, __preempt_schedule_func);
-		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: full\n");
-		break;
-	}
-
-	preempt_dynamic_mode = mode;
-}
-
-static int __init setup_preempt_mode(char *str)
-{
-	int mode = sched_dynamic_mode(str);
-	if (mode < 0) {
-		pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
-		return 0;
-	}
-
-	sched_dynamic_update(mode);
-	return 1;
-}
-__setup("preempt=", setup_preempt_mode);
-
-static void __init preempt_dynamic_init(void)
-{
-	if (preempt_dynamic_mode == preempt_dynamic_undefined) {
-		if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
-			sched_dynamic_update(preempt_dynamic_none);
-		} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
-			sched_dynamic_update(preempt_dynamic_voluntary);
-		} else {
-			/* Default static call setting, nothing to do */
-			WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
-			preempt_dynamic_mode = preempt_dynamic_full;
-			pr_info("Dynamic Preempt: full\n");
-		}
-	}
-}
-
-#else /* !CONFIG_PREEMPT_DYNAMIC */
-
-static inline void preempt_dynamic_init(void) { }
-
-#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
-
 /**
  * yield - yield the current processor to other threads.
  *
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 14/86] Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (12 preceding siblings ...)
  2023-11-07 21:56 ` [RFC PATCH 13/86] Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 23:20   ` Steven Rostedt
  2023-11-07 21:57 ` [RFC PATCH 15/86] Revert "preempt: Restore preemption model selection configs" Ankur Arora
                   ` (48 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8bbddd729db..50e1133cacc9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7062,11 +7062,11 @@ static int __init setup_preempt_mode(char *str)
 	int mode = sched_dynamic_mode(str);
 	if (mode < 0) {
 		pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
-		return 0;
+		return 1;
 	}
 
 	sched_dynamic_update(mode);
-	return 1;
+	return 0;
 }
 __setup("preempt=", setup_preempt_mode);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 15/86] Revert "preempt: Restore preemption model selection configs"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (13 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 14/86] Revert "preempt/dynamic: Fix setup_preempt_mode() return value" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 16/86] Revert "sched: Provide Kconfig support for default dynamic preempt mode" Ankur Arora
                   ` (47 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This is a partial revert of commit a8b76910e465d718effce0cad306a21fa4f3526b.

There have been some structural changes to init/Makefile, so we leave it
as is.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/kernel.h   |  2 +-
 include/linux/vermagic.h |  2 +-
 kernel/Kconfig.preempt   | 42 ++++++++++++++++++++--------------------
 kernel/sched/core.c      |  6 +++---
 4 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index cdce553479b4..b9121007fd0b 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -96,7 +96,7 @@
 struct completion;
 struct user;
 
-#ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
+#ifdef CONFIG_PREEMPT_VOLUNTARY
 
 extern int __cond_resched(void);
 # define might_resched() __cond_resched()
diff --git a/include/linux/vermagic.h b/include/linux/vermagic.h
index a54046bf37e5..e710e3762c52 100644
--- a/include/linux/vermagic.h
+++ b/include/linux/vermagic.h
@@ -15,7 +15,7 @@
 #else
 #define MODULE_VERMAGIC_SMP ""
 #endif
-#ifdef CONFIG_PREEMPT_BUILD
+#ifdef CONFIG_PREEMPT
 #define MODULE_VERMAGIC_PREEMPT "preempt "
 #elif defined(CONFIG_PREEMPT_RT)
 #define MODULE_VERMAGIC_PREEMPT "preempt_rt "
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index ce77f0265660..60f1bfc3c7b2 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -1,23 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-config PREEMPT_NONE_BUILD
-	bool
-
-config PREEMPT_VOLUNTARY_BUILD
-	bool
-
-config PREEMPT_BUILD
-	bool
-	select PREEMPTION
-	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
-
 choice
 	prompt "Preemption Model"
-	default PREEMPT_NONE
+	default PREEMPT_NONE_BEHAVIOUR
 
-config PREEMPT_NONE
+config PREEMPT_NONE_BEHAVIOUR
 	bool "No Forced Preemption (Server)"
-	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
+	select PREEMPT_NONE if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
 	  throughput. It will still provide good latencies most of the
@@ -29,10 +18,10 @@ config PREEMPT_NONE
 	  raw processing power of the kernel, irrespective of scheduling
 	  latencies.
 
-config PREEMPT_VOLUNTARY
+config PREEMPT_VOLUNTARY_BEHAVIOUR
 	bool "Voluntary Kernel Preemption (Desktop)"
 	depends on !ARCH_NO_PREEMPT
-	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
+	select PREEMPT_VOLUNTARY if !PREEMPT_DYNAMIC
 	help
 	  This option reduces the latency of the kernel by adding more
 	  "explicit preemption points" to the kernel code. These new
@@ -48,10 +37,10 @@ config PREEMPT_VOLUNTARY
 
 	  Select this if you are building a kernel for a desktop system.
 
-config PREEMPT
+config PREEMPT_BEHAVIOUR
 	bool "Preemptible Kernel (Low-Latency Desktop)"
 	depends on !ARCH_NO_PREEMPT
-	select PREEMPT_BUILD
+	select PREEMPT
 	help
 	  This option reduces the latency of the kernel by making
 	  all kernel code (that is not executing in a critical section)
@@ -69,7 +58,7 @@ config PREEMPT
 
 config PREEMPT_RT
 	bool "Fully Preemptible Kernel (Real-Time)"
-	depends on EXPERT && ARCH_SUPPORTS_RT
+	depends on EXPERT && ARCH_SUPPORTS_RT && !PREEMPT_DYNAMIC
 	select PREEMPTION
 	help
 	  This option turns the kernel into a real-time kernel by replacing
@@ -86,6 +75,17 @@ config PREEMPT_RT
 
 endchoice
 
+config PREEMPT_NONE
+	bool
+
+config PREEMPT_VOLUNTARY
+	bool
+
+config PREEMPT
+	bool
+	select PREEMPTION
+	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
+
 config PREEMPT_COUNT
        bool
 
@@ -95,8 +95,8 @@ config PREEMPTION
 
 config PREEMPT_DYNAMIC
 	bool "Preemption behaviour defined on boot"
-	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
-	select PREEMPT_BUILD
+	depends on HAVE_PREEMPT_DYNAMIC
+	select PREEMPT
 	default y
 	help
 	  This option allows to define the preemption model on the kernel
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 50e1133cacc9..d3828d90bf84 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7073,13 +7073,13 @@ __setup("preempt=", setup_preempt_mode);
 static void __init preempt_dynamic_init(void)
 {
 	if (preempt_dynamic_mode == preempt_dynamic_undefined) {
-		if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
+		if (IS_ENABLED(CONFIG_PREEMPT_NONE_BEHAVIOUR)) {
 			sched_dynamic_update(preempt_dynamic_none);
-		} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
+		} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY_BEHAVIOUR)) {
 			sched_dynamic_update(preempt_dynamic_voluntary);
 		} else {
 			/* Default static call setting, nothing to do */
-			WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
+			WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT_BEHAVIOUR));
 			preempt_dynamic_mode = preempt_dynamic_full;
 			pr_info("Dynamic Preempt: full\n");
 		}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 16/86] Revert "sched: Provide Kconfig support for default dynamic preempt mode"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (14 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 15/86] Revert "preempt: Restore preemption model selection configs" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 17/86] sched/preempt: remove PREEMPT_DYNAMIC from the build version Ankur Arora
                   ` (46 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit c597bfddc9e9e8a63817252b67c3ca0e544ace26.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/Kconfig.preempt | 32 +++++++++-----------------------
 kernel/sched/core.c    | 29 +++--------------------------
 2 files changed, 12 insertions(+), 49 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 60f1bfc3c7b2..5876e30c5740 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -2,11 +2,10 @@
 
 choice
 	prompt "Preemption Model"
-	default PREEMPT_NONE_BEHAVIOUR
+	default PREEMPT_NONE
 
-config PREEMPT_NONE_BEHAVIOUR
+config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
-	select PREEMPT_NONE if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
 	  throughput. It will still provide good latencies most of the
@@ -18,10 +17,9 @@ config PREEMPT_NONE_BEHAVIOUR
 	  raw processing power of the kernel, irrespective of scheduling
 	  latencies.
 
-config PREEMPT_VOLUNTARY_BEHAVIOUR
+config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
 	depends on !ARCH_NO_PREEMPT
-	select PREEMPT_VOLUNTARY if !PREEMPT_DYNAMIC
 	help
 	  This option reduces the latency of the kernel by adding more
 	  "explicit preemption points" to the kernel code. These new
@@ -37,10 +35,12 @@ config PREEMPT_VOLUNTARY_BEHAVIOUR
 
 	  Select this if you are building a kernel for a desktop system.
 
-config PREEMPT_BEHAVIOUR
+config PREEMPT
 	bool "Preemptible Kernel (Low-Latency Desktop)"
 	depends on !ARCH_NO_PREEMPT
-	select PREEMPT
+	select PREEMPTION
+	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
+	select PREEMPT_DYNAMIC if HAVE_PREEMPT_DYNAMIC
 	help
 	  This option reduces the latency of the kernel by making
 	  all kernel code (that is not executing in a critical section)
@@ -58,7 +58,7 @@ config PREEMPT_BEHAVIOUR
 
 config PREEMPT_RT
 	bool "Fully Preemptible Kernel (Real-Time)"
-	depends on EXPERT && ARCH_SUPPORTS_RT && !PREEMPT_DYNAMIC
+	depends on EXPERT && ARCH_SUPPORTS_RT
 	select PREEMPTION
 	help
 	  This option turns the kernel into a real-time kernel by replacing
@@ -75,17 +75,6 @@ config PREEMPT_RT
 
 endchoice
 
-config PREEMPT_NONE
-	bool
-
-config PREEMPT_VOLUNTARY
-	bool
-
-config PREEMPT
-	bool
-	select PREEMPTION
-	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
-
 config PREEMPT_COUNT
        bool
 
@@ -94,10 +83,7 @@ config PREEMPTION
        select PREEMPT_COUNT
 
 config PREEMPT_DYNAMIC
-	bool "Preemption behaviour defined on boot"
-	depends on HAVE_PREEMPT_DYNAMIC
-	select PREEMPT
-	default y
+	bool
 	help
 	  This option allows to define the preemption model on the kernel
 	  command line parameter and thus override the default preemption
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d3828d90bf84..12f255e038ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6991,13 +6991,12 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
  */
 
 enum {
-	preempt_dynamic_undefined = -1,
-	preempt_dynamic_none,
+	preempt_dynamic_none = 0,
 	preempt_dynamic_voluntary,
 	preempt_dynamic_full,
 };
 
-int preempt_dynamic_mode = preempt_dynamic_undefined;
+int preempt_dynamic_mode = preempt_dynamic_full;
 
 int sched_dynamic_mode(const char *str)
 {
@@ -7070,27 +7069,7 @@ static int __init setup_preempt_mode(char *str)
 }
 __setup("preempt=", setup_preempt_mode);
 
-static void __init preempt_dynamic_init(void)
-{
-	if (preempt_dynamic_mode == preempt_dynamic_undefined) {
-		if (IS_ENABLED(CONFIG_PREEMPT_NONE_BEHAVIOUR)) {
-			sched_dynamic_update(preempt_dynamic_none);
-		} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY_BEHAVIOUR)) {
-			sched_dynamic_update(preempt_dynamic_voluntary);
-		} else {
-			/* Default static call setting, nothing to do */
-			WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT_BEHAVIOUR));
-			preempt_dynamic_mode = preempt_dynamic_full;
-			pr_info("Dynamic Preempt: full\n");
-		}
-	}
-}
-
-#else /* !CONFIG_PREEMPT_DYNAMIC */
-
-static inline void preempt_dynamic_init(void) { }
-
-#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
+#endif /* CONFIG_PREEMPT_DYNAMIC */
 
 /*
  * This is the entry point to schedule() from kernel preemption
@@ -9966,8 +9945,6 @@ void __init sched_init(void)
 
 	init_uclamp();
 
-	preempt_dynamic_init();
-
 	scheduler_running = 1;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 17/86] sched/preempt: remove PREEMPT_DYNAMIC from the build version
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (15 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 16/86] Revert "sched: Provide Kconfig support for default dynamic preempt mode" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 18/86] Revert "preempt/dynamic: Fix typo in macro conditional statement" Ankur Arora
                   ` (45 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

As the PREEMPT_DYNAMIC logic is going away, also remove PREEMPT_DYNAMIC
from the generated build version and go back to the original string.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 init/Makefile | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/init/Makefile b/init/Makefile
index ec557ada3c12..385fd80fa2ef 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -24,8 +24,7 @@ mounts-$(CONFIG_BLK_DEV_INITRD)	+= do_mounts_initrd.o
 #
 
 smp-flag-$(CONFIG_SMP)			:= SMP
-preempt-flag-$(CONFIG_PREEMPT_BUILD)	:= PREEMPT
-preempt-flag-$(CONFIG_PREEMPT_DYNAMIC)	:= PREEMPT_DYNAMIC
+preempt-flag-$(CONFIG_PREEMPT)          := PREEMPT
 preempt-flag-$(CONFIG_PREEMPT_RT)	:= PREEMPT_RT
 
 build-version = $(or $(KBUILD_BUILD_VERSION), $(build-version-auto))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 18/86] Revert "preempt/dynamic: Fix typo in macro conditional statement"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (16 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 17/86] sched/preempt: remove PREEMPT_DYNAMIC from the build version Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 19/86] Revert "sched,preempt: Move preempt_dynamic to debug.c" Ankur Arora
                   ` (44 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 0c89d87d1d43d9fa268d1dc489518564d58bf497.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/entry/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index b0b7be0705e0..d866c49dc015 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -418,7 +418,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 
 		instrumentation_begin();
 		if (IS_ENABLED(CONFIG_PREEMPTION)) {
-#ifdef CONFIG_PREEMPT_DYNAMIC
+#ifdef CONFIG_PREEMT_DYNAMIC
 			static_call(irqentry_exit_cond_resched)();
 #else
 			irqentry_exit_cond_resched();
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 19/86] Revert "sched,preempt: Move preempt_dynamic to debug.c"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (17 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 18/86] Revert "preempt/dynamic: Fix typo in macro conditional statement" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 20/86] Revert "static_call: Relax static_call_update() function argument type" Ankur Arora
                   ` (43 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 1011dcce99f8026d48fdd7b9cc259e32a8b472be.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c  | 77 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/debug.c | 67 +-------------------------------------
 kernel/sched/sched.h |  6 ----
 3 files changed, 75 insertions(+), 75 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12f255e038ed..abc95dfe0ab4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6996,9 +6996,9 @@ enum {
 	preempt_dynamic_full,
 };
 
-int preempt_dynamic_mode = preempt_dynamic_full;
+static int preempt_dynamic_mode = preempt_dynamic_full;
 
-int sched_dynamic_mode(const char *str)
+static int sched_dynamic_mode(const char *str)
 {
 	if (!strcmp(str, "none"))
 		return preempt_dynamic_none;
@@ -7012,7 +7012,7 @@ int sched_dynamic_mode(const char *str)
 	return -EINVAL;
 }
 
-void sched_dynamic_update(int mode)
+static void sched_dynamic_update(int mode)
 {
 	/*
 	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
@@ -7069,8 +7069,79 @@ static int __init setup_preempt_mode(char *str)
 }
 __setup("preempt=", setup_preempt_mode);
 
+#ifdef CONFIG_SCHED_DEBUG
+
+static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	int mode;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+	mode = sched_dynamic_mode(strstrip(buf));
+	if (mode < 0)
+		return mode;
+
+	sched_dynamic_update(mode);
+
+	*ppos += cnt;
+
+	return cnt;
+}
+
+static int sched_dynamic_show(struct seq_file *m, void *v)
+{
+	static const char * preempt_modes[] = {
+		"none", "voluntary", "full"
+	};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(preempt_modes); i++) {
+		if (preempt_dynamic_mode == i)
+			seq_puts(m, "(");
+		seq_puts(m, preempt_modes[i]);
+		if (preempt_dynamic_mode == i)
+			seq_puts(m, ")");
+
+		seq_puts(m, " ");
+	}
+
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static int sched_dynamic_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_dynamic_show, NULL);
+}
+
+static const struct file_operations sched_dynamic_fops = {
+	.open		= sched_dynamic_open,
+	.write		= sched_dynamic_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+extern struct dentry *debugfs_sched;
+
+static __init int sched_init_debug_dynamic(void)
+{
+	debugfs_create_file("sched_preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
+	return 0;
+}
+late_initcall(sched_init_debug_dynamic);
+
+#endif /* CONFIG_SCHED_DEBUG */
 #endif /* CONFIG_PREEMPT_DYNAMIC */
 
+
 /*
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4c3d0d9f3db6..67d6c35fc5a4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -216,68 +216,6 @@ static const struct file_operations sched_scaling_fops = {
 
 #endif /* SMP */
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
-				   size_t cnt, loff_t *ppos)
-{
-	char buf[16];
-	int mode;
-
-	if (cnt > 15)
-		cnt = 15;
-
-	if (copy_from_user(&buf, ubuf, cnt))
-		return -EFAULT;
-
-	buf[cnt] = 0;
-	mode = sched_dynamic_mode(strstrip(buf));
-	if (mode < 0)
-		return mode;
-
-	sched_dynamic_update(mode);
-
-	*ppos += cnt;
-
-	return cnt;
-}
-
-static int sched_dynamic_show(struct seq_file *m, void *v)
-{
-	static const char * preempt_modes[] = {
-		"none", "voluntary", "full"
-	};
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(preempt_modes); i++) {
-		if (preempt_dynamic_mode == i)
-			seq_puts(m, "(");
-		seq_puts(m, preempt_modes[i]);
-		if (preempt_dynamic_mode == i)
-			seq_puts(m, ")");
-
-		seq_puts(m, " ");
-	}
-
-	seq_puts(m, "\n");
-	return 0;
-}
-
-static int sched_dynamic_open(struct inode *inode, struct file *filp)
-{
-	return single_open(filp, sched_dynamic_show, NULL);
-}
-
-static const struct file_operations sched_dynamic_fops = {
-	.open		= sched_dynamic_open,
-	.write		= sched_dynamic_write,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
-#endif /* CONFIG_PREEMPT_DYNAMIC */
-
 __read_mostly bool sched_debug_verbose;
 
 #ifdef CONFIG_SMP
@@ -333,7 +271,7 @@ static const struct file_operations sched_debug_fops = {
 	.release	= seq_release,
 };
 
-static struct dentry *debugfs_sched;
+struct dentry *debugfs_sched;
 
 static __init int sched_init_debug(void)
 {
@@ -343,9 +281,6 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
 	debugfs_create_file_unsafe("verbose", 0644, debugfs_sched, &sched_debug_verbose, &sched_verbose_fops);
-#ifdef CONFIG_PREEMPT_DYNAMIC
-	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
-#endif
 
 	debugfs_create_u32("base_slice_ns", 0644, debugfs_sched, &sysctl_sched_base_slice);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04846272409c..9e1329a4e890 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3274,12 +3274,6 @@ extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *w
 
 extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-extern int preempt_dynamic_mode;
-extern int sched_dynamic_mode(const char *str);
-extern void sched_dynamic_update(int mode);
-#endif
-
 static inline void update_current_exec_runtime(struct task_struct *curr,
 						u64 now, u64 delta_exec)
 {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 20/86] Revert "static_call: Relax static_call_update() function argument type"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (18 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 19/86] Revert "sched,preempt: Move preempt_dynamic to debug.c" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 21/86] Revert "sched/core: Use -EINVAL in sched_dynamic_mode()" Ankur Arora
                   ` (42 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This is a partial revert of commit 9432bbd969c667fc9c4b1c140c5a745ff2a7b540.

We keep the static_call_update() type-matching logic, since it is used
elsewhere.
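
For illustration, the kind of type checking being kept can be sketched in
plain GNU C. The names below (update_call(), my_cond_resched, ret0_stub)
are made up for this sketch and are not the kernel's static_call API;
the point is only that routing the update through a typeof()-typed
temporary turns a mismatched prototype into a compile-time diagnostic:

#include <stdio.h>

/*
 * Toy stand-in for static_call_update()'s typeof()-based type check.
 * Illustrative only, not the kernel's static_call machinery.
 */
#define update_call(slot, func)			\
	do {					\
		typeof(slot) __f = (func);	\
		(slot) = __f;			\
	} while (0)

static int cond_resched_impl(void) { return 1; }
static int ret0_stub(void)         { return 0; }	/* stands in for RET0 */

static int (*my_cond_resched)(void) = cond_resched_impl;

int main(void)
{
	printf("%d\n", my_cond_resched());	 /* 1: real implementation */
	update_call(my_cond_resched, ret0_stub); /* swap in the stub, type-checked */
	printf("%d\n", my_cond_resched());	 /* 0: stubbed out */
	return 0;
}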

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index abc95dfe0ab4..e0bbc2b0b11e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7027,25 +7027,25 @@ static void sched_dynamic_update(int mode)
 	switch (mode) {
 	case preempt_dynamic_none:
 		static_call_update(cond_resched, __cond_resched);
-		static_call_update(might_resched, (void *)&__static_call_return0);
-		static_call_update(preempt_schedule, NULL);
-		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
+		static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
+		static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
+		static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
+		static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
 		pr_info("Dynamic Preempt: none\n");
 		break;
 
 	case preempt_dynamic_voluntary:
 		static_call_update(cond_resched, __cond_resched);
 		static_call_update(might_resched, __cond_resched);
-		static_call_update(preempt_schedule, NULL);
-		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
+		static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
+		static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
+		static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
 		pr_info("Dynamic Preempt: voluntary\n");
 		break;
 
 	case preempt_dynamic_full:
-		static_call_update(cond_resched, (void *)&__static_call_return0);
-		static_call_update(might_resched, (void *)&__static_call_return0);
+		static_call_update(cond_resched, (typeof(&__cond_resched)) __static_call_return0);
+		static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
 		static_call_update(preempt_schedule, __preempt_schedule_func);
 		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
 		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 21/86] Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (19 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 20/86] Revert "static_call: Relax static_call_update() function argument type" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 22/86] Revert "sched/core: Stop using magic values " Ankur Arora
                   ` (41 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit c4681f3f1cfcfde0c95ff72f0bdb43f9ffd7f00e.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e0bbc2b0b11e..673de11272fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7009,7 +7009,7 @@ static int sched_dynamic_mode(const char *str)
 	if (!strcmp(str, "full"))
 		return preempt_dynamic_full;
 
-	return -EINVAL;
+	return -1;
 }
 
 static void sched_dynamic_update(int mode)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 22/86] Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (20 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 21/86] Revert "sched/core: Use -EINVAL in sched_dynamic_mode()" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 23/86] Revert "sched,x86: Allow !PREEMPT_DYNAMIC" Ankur Arora
                   ` (40 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 7e1b2eb74928b2478fd0630ce6c664334b480d00.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 673de11272fa..bbd19b8ff3e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7001,13 +7001,13 @@ static int preempt_dynamic_mode = preempt_dynamic_full;
 static int sched_dynamic_mode(const char *str)
 {
 	if (!strcmp(str, "none"))
-		return preempt_dynamic_none;
+		return 0;
 
 	if (!strcmp(str, "voluntary"))
-		return preempt_dynamic_voluntary;
+		return 1;
 
 	if (!strcmp(str, "full"))
-		return preempt_dynamic_full;
+		return 2;
 
 	return -1;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 23/86] Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (21 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 22/86] Revert "sched/core: Stop using magic values " Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 24/86] Revert "sched: Harden PREEMPT_DYNAMIC" Ankur Arora
                   ` (39 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit c5e6fc08feb2b88dc5dac2f3c817e1c2a4cafda4.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/preempt.h | 24 ++++++------------------
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 495faed1c76c..49d2f0396be4 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -111,13 +111,6 @@ extern asmlinkage void preempt_schedule_thunk(void);
 
 #define __preempt_schedule_func preempt_schedule_thunk
 
-extern asmlinkage void preempt_schedule_notrace(void);
-extern asmlinkage void preempt_schedule_notrace_thunk(void);
-
-#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
-
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
 DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
 
 #define __preempt_schedule() \
@@ -126,6 +119,11 @@ do { \
 	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \
 } while (0)
 
+extern asmlinkage void preempt_schedule_notrace(void);
+extern asmlinkage void preempt_schedule_notrace_thunk(void);
+
+#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
+
 DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
 
 #define __preempt_schedule_notrace() \
@@ -134,16 +132,6 @@ do { \
 	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : ASM_CALL_CONSTRAINT); \
 } while (0)
 
-#else /* PREEMPT_DYNAMIC */
-
-#define __preempt_schedule() \
-	asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT);
-
-#define __preempt_schedule_notrace() \
-	asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT);
-
-#endif /* PREEMPT_DYNAMIC */
-
-#endif /* PREEMPTION */
+#endif
 
 #endif /* __ASM_PREEMPT_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 24/86] Revert "sched: Harden PREEMPT_DYNAMIC"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (22 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 23/86] Revert "sched,x86: Allow !PREEMPT_DYNAMIC" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 25/86] Revert "sched: Add /debug/sched_preempt" Ankur Arora
                   ` (38 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit ef72661e28c64ad610f89acc2832ec67b27ba438.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/preempt.h | 4 ++--
 include/linux/kernel.h         | 2 +-
 include/linux/sched.h          | 2 +-
 kernel/sched/core.c            | 8 ++++----
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 49d2f0396be4..967879366d27 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -115,7 +115,7 @@ DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
 
 #define __preempt_schedule() \
 do { \
-	__STATIC_CALL_MOD_ADDRESSABLE(preempt_schedule); \
+	__ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule)); \
 	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \
 } while (0)
 
@@ -128,7 +128,7 @@ DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
 
 #define __preempt_schedule_notrace() \
 do { \
-	__STATIC_CALL_MOD_ADDRESSABLE(preempt_schedule_notrace); \
+	__ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule_notrace)); \
 	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : ASM_CALL_CONSTRAINT); \
 } while (0)
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index b9121007fd0b..5f99720d0cca 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -109,7 +109,7 @@ DECLARE_STATIC_CALL(might_resched, __cond_resched);
 
 static __always_inline void might_resched(void)
 {
-	static_call_mod(might_resched)();
+	static_call(might_resched)();
 }
 
 #else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66f520954de5..2b1f3008c90e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2090,7 +2090,7 @@ DECLARE_STATIC_CALL(cond_resched, __cond_resched);
 
 static __always_inline int _cond_resched(void)
 {
-	return static_call_mod(cond_resched)();
+	return static_call(cond_resched)();
 }
 
 #else
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bbd19b8ff3e9..7ea22244c540 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6893,7 +6893,7 @@ EXPORT_SYMBOL(preempt_schedule);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
-EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
+EXPORT_STATIC_CALL(preempt_schedule);
 #endif
 
 
@@ -6951,7 +6951,7 @@ EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
+EXPORT_STATIC_CALL(preempt_schedule_notrace);
 #endif
 
 #endif /* CONFIG_PREEMPTION */
@@ -8740,10 +8740,10 @@ EXPORT_SYMBOL(__cond_resched);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
-EXPORT_STATIC_CALL_TRAMP(cond_resched);
+EXPORT_STATIC_CALL(cond_resched);
 
 DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
-EXPORT_STATIC_CALL_TRAMP(might_resched);
+EXPORT_STATIC_CALL(might_resched);
 #endif
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 25/86] Revert "sched: Add /debug/sched_preempt"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (23 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 24/86] Revert "sched: Harden PREEMPT_DYNAMIC" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 26/86] Revert "preempt/dynamic: Support dynamic preempt with preempt= boot option" Ankur Arora
                   ` (37 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit e59e10f8ef63d42fbb99776a5a112841e798b3b5.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 137 +++-----------------------------------------
 1 file changed, 9 insertions(+), 128 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ea22244c540..b8dacc7feb47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6989,156 +6989,37 @@ EXPORT_STATIC_CALL(preempt_schedule_notrace);
  *   preempt_schedule_notrace   <- preempt_schedule_notrace
  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
  */
-
-enum {
-	preempt_dynamic_none = 0,
-	preempt_dynamic_voluntary,
-	preempt_dynamic_full,
-};
-
-static int preempt_dynamic_mode = preempt_dynamic_full;
-
-static int sched_dynamic_mode(const char *str)
+static int __init setup_preempt_mode(char *str)
 {
-	if (!strcmp(str, "none"))
-		return 0;
-
-	if (!strcmp(str, "voluntary"))
-		return 1;
-
-	if (!strcmp(str, "full"))
-		return 2;
-
-	return -1;
-}
-
-static void sched_dynamic_update(int mode)
-{
-	/*
-	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
-	 * the ZERO state, which is invalid.
-	 */
-	static_call_update(cond_resched, __cond_resched);
-	static_call_update(might_resched, __cond_resched);
-	static_call_update(preempt_schedule, __preempt_schedule_func);
-	static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-	static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-
-	switch (mode) {
-	case preempt_dynamic_none:
+	if (!strcmp(str, "none")) {
 		static_call_update(cond_resched, __cond_resched);
 		static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
 		static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
 		static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
 		static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
-		pr_info("Dynamic Preempt: none\n");
-		break;
-
-	case preempt_dynamic_voluntary:
+		pr_info("Dynamic Preempt: %s\n", str);
+	} else if (!strcmp(str, "voluntary")) {
 		static_call_update(cond_resched, __cond_resched);
 		static_call_update(might_resched, __cond_resched);
 		static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
 		static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
 		static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
-		pr_info("Dynamic Preempt: voluntary\n");
-		break;
-
-	case preempt_dynamic_full:
+		pr_info("Dynamic Preempt: %s\n", str);
+	} else if (!strcmp(str, "full")) {
 		static_call_update(cond_resched, (typeof(&__cond_resched)) __static_call_return0);
 		static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
 		static_call_update(preempt_schedule, __preempt_schedule_func);
 		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
 		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: full\n");
-		break;
-	}
-
-	preempt_dynamic_mode = mode;
-}
-
-static int __init setup_preempt_mode(char *str)
-{
-	int mode = sched_dynamic_mode(str);
-	if (mode < 0) {
-		pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
+		pr_info("Dynamic Preempt: %s\n", str);
+	} else {
+		pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default to full\n", str);
 		return 1;
 	}
-
-	sched_dynamic_update(mode);
 	return 0;
 }
 __setup("preempt=", setup_preempt_mode);
 
-#ifdef CONFIG_SCHED_DEBUG
-
-static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
-				   size_t cnt, loff_t *ppos)
-{
-	char buf[16];
-	int mode;
-
-	if (cnt > 15)
-		cnt = 15;
-
-	if (copy_from_user(&buf, ubuf, cnt))
-		return -EFAULT;
-
-	buf[cnt] = 0;
-	mode = sched_dynamic_mode(strstrip(buf));
-	if (mode < 0)
-		return mode;
-
-	sched_dynamic_update(mode);
-
-	*ppos += cnt;
-
-	return cnt;
-}
-
-static int sched_dynamic_show(struct seq_file *m, void *v)
-{
-	static const char * preempt_modes[] = {
-		"none", "voluntary", "full"
-	};
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(preempt_modes); i++) {
-		if (preempt_dynamic_mode == i)
-			seq_puts(m, "(");
-		seq_puts(m, preempt_modes[i]);
-		if (preempt_dynamic_mode == i)
-			seq_puts(m, ")");
-
-		seq_puts(m, " ");
-	}
-
-	seq_puts(m, "\n");
-	return 0;
-}
-
-static int sched_dynamic_open(struct inode *inode, struct file *filp)
-{
-	return single_open(filp, sched_dynamic_show, NULL);
-}
-
-static const struct file_operations sched_dynamic_fops = {
-	.open		= sched_dynamic_open,
-	.write		= sched_dynamic_write,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
-extern struct dentry *debugfs_sched;
-
-static __init int sched_init_debug_dynamic(void)
-{
-	debugfs_create_file("sched_preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
-	return 0;
-}
-late_initcall(sched_init_debug_dynamic);
-
-#endif /* CONFIG_SCHED_DEBUG */
 #endif /* CONFIG_PREEMPT_DYNAMIC */
 
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 26/86] Revert "preempt/dynamic: Support dynamic preempt with preempt= boot option"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (24 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 25/86] Revert "sched: Add /debug/sched_preempt" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 27/86] Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static call" Ankur Arora
                   ` (36 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 826bfeb37bb4302ee6042f330c4c0c757152bdb8.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 68 +--------------------------------------------
 1 file changed, 1 insertion(+), 67 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8dacc7feb47..51df0b62f519 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6954,75 +6954,9 @@ DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
 EXPORT_STATIC_CALL(preempt_schedule_notrace);
 #endif
 
+
 #endif /* CONFIG_PREEMPTION */
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-#include <linux/entry-common.h>
-
-/*
- * SC:cond_resched
- * SC:might_resched
- * SC:preempt_schedule
- * SC:preempt_schedule_notrace
- * SC:irqentry_exit_cond_resched
- *
- *
- * NONE:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- RET0
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
- *
- * VOLUNTARY:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- __cond_resched
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
- *
- * FULL:
- *   cond_resched               <- RET0
- *   might_resched              <- RET0
- *   preempt_schedule           <- preempt_schedule
- *   preempt_schedule_notrace   <- preempt_schedule_notrace
- *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
- */
-static int __init setup_preempt_mode(char *str)
-{
-	if (!strcmp(str, "none")) {
-		static_call_update(cond_resched, __cond_resched);
-		static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
-		static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
-		static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
-		static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
-		pr_info("Dynamic Preempt: %s\n", str);
-	} else if (!strcmp(str, "voluntary")) {
-		static_call_update(cond_resched, __cond_resched);
-		static_call_update(might_resched, __cond_resched);
-		static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
-		static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
-		static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
-		pr_info("Dynamic Preempt: %s\n", str);
-	} else if (!strcmp(str, "full")) {
-		static_call_update(cond_resched, (typeof(&__cond_resched)) __static_call_return0);
-		static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
-		static_call_update(preempt_schedule, __preempt_schedule_func);
-		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: %s\n", str);
-	} else {
-		pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default to full\n", str);
-		return 1;
-	}
-	return 0;
-}
-__setup("preempt=", setup_preempt_mode);
-
-#endif /* CONFIG_PREEMPT_DYNAMIC */
-
-
 /*
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 27/86] Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static call"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (25 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 26/86] Revert "preempt/dynamic: Support dynamic preempt with preempt= boot option" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 28/86] Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static calls" Ankur Arora
                   ` (35 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 40607ee97e4eec5655cc0f76a720bdc4c63a6434.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/entry-common.h |  4 ----
 kernel/entry/common.c        | 10 +---------
 2 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49e9fe9489b6..fb2e349a17d2 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -2,7 +2,6 @@
 #ifndef __LINUX_ENTRYCOMMON_H
 #define __LINUX_ENTRYCOMMON_H
 
-#include <linux/static_call_types.h>
 #include <linux/ptrace.h>
 #include <linux/syscalls.h>
 #include <linux/seccomp.h>
@@ -415,9 +414,6 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  * Conditional reschedule with additional sanity checks.
  */
 void irqentry_exit_cond_resched(void);
-#ifdef CONFIG_PREEMPT_DYNAMIC
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-#endif
 
 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d866c49dc015..194c349b8be7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -388,9 +388,6 @@ void irqentry_exit_cond_resched(void)
 			preempt_schedule_irq();
 	}
 }
-#ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
-#endif
 
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
@@ -417,13 +414,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}
 
 		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION)) {
-#ifdef CONFIG_PREEMT_DYNAMIC
-			static_call(irqentry_exit_cond_resched)();
-#else
+		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
-#endif
-		}
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 28/86] Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static calls"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (26 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 27/86] Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static call" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 29/86] Revert "preempt/dynamic: Provide cond_resched() and might_resched() " Ankur Arora
                   ` (34 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 2c9a98d3bc808717ab63ad928a2b568967775388.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/preempt.h | 34 ++++++++--------------------------
 kernel/sched/core.c            | 12 ------------
 2 files changed, 8 insertions(+), 38 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 967879366d27..a7bbe15145a5 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -7,7 +7,6 @@
 #include <asm/current.h>
 
 #include <linux/thread_info.h>
-#include <linux/static_call_types.h>
 
 /* We use the MSB mostly because its available */
 #define PREEMPT_NEED_RESCHED	0x80000000
@@ -105,33 +104,16 @@ static __always_inline bool should_resched(int preempt_offset)
 }
 
 #ifdef CONFIG_PREEMPTION
+  extern asmlinkage void preempt_schedule_thunk(void);
+# define __preempt_schedule() \
+	asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT)
 
-extern asmlinkage void preempt_schedule(void);
-extern asmlinkage void preempt_schedule_thunk(void);
-
-#define __preempt_schedule_func preempt_schedule_thunk
-
-DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
-
-#define __preempt_schedule() \
-do { \
-	__ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule)); \
-	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \
-} while (0)
-
-extern asmlinkage void preempt_schedule_notrace(void);
-extern asmlinkage void preempt_schedule_notrace_thunk(void);
-
-#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
-
-DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-
-#define __preempt_schedule_notrace() \
-do { \
-	__ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule_notrace)); \
-	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : ASM_CALL_CONSTRAINT); \
-} while (0)
+  extern asmlinkage void preempt_schedule(void);
+  extern asmlinkage void preempt_schedule_notrace_thunk(void);
+# define __preempt_schedule_notrace() \
+	asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT)
 
+  extern asmlinkage void preempt_schedule_notrace(void);
 #endif
 
 #endif /* __ASM_PREEMPT_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51df0b62f519..2e191992109b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6891,12 +6891,6 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
-EXPORT_STATIC_CALL(preempt_schedule);
-#endif
-
-
 /**
  * preempt_schedule_notrace - preempt_schedule called by tracing
  *
@@ -6949,12 +6943,6 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
 }
 EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-EXPORT_STATIC_CALL(preempt_schedule_notrace);
-#endif
-
-
 #endif /* CONFIG_PREEMPTION */
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 29/86] Revert "preempt/dynamic: Provide cond_resched() and might_resched() static calls"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (27 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 28/86] Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static calls" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 30/86] Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC" Ankur Arora
                   ` (33 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit b965f1ddb47daa5b8b2e2bc9c921431236830367.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/kernel.h | 22 +++-------------------
 include/linux/sched.h  | 27 +++------------------------
 kernel/sched/core.c    | 14 +++-----------
 3 files changed, 9 insertions(+), 54 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5f99720d0cca..cf077cd69643 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -30,7 +30,6 @@
 #include <linux/printk.h>
 #include <linux/build_bug.h>
 #include <linux/sprintf.h>
-#include <linux/static_call_types.h>
 #include <linux/instruction_pointer.h>
 #include <asm/byteorder.h>
 
@@ -97,26 +96,11 @@ struct completion;
 struct user;
 
 #ifdef CONFIG_PREEMPT_VOLUNTARY
-
-extern int __cond_resched(void);
-# define might_resched() __cond_resched()
-
-#elif defined(CONFIG_PREEMPT_DYNAMIC)
-
-extern int __cond_resched(void);
-
-DECLARE_STATIC_CALL(might_resched, __cond_resched);
-
-static __always_inline void might_resched(void)
-{
-	static_call(might_resched)();
-}
-
+extern int _cond_resched(void);
+# define might_resched() _cond_resched()
 #else
-
 # define might_resched() do { } while (0)
-
-#endif /* CONFIG_PREEMPT_* */
+#endif
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 extern void __might_resched(const char *file, int line, unsigned int offsets);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b1f3008c90e..95d47783ff6e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2081,32 +2081,11 @@ static inline int test_tsk_need_resched(struct task_struct *tsk)
  * value indicates whether a reschedule was done in fact.
  * cond_resched_lock() will drop the spinlock before scheduling,
  */
-#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
-extern int __cond_resched(void);
-
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-DECLARE_STATIC_CALL(cond_resched, __cond_resched);
-
-static __always_inline int _cond_resched(void)
-{
-	return static_call(cond_resched)();
-}
-
+#ifndef CONFIG_PREEMPTION
+extern int _cond_resched(void);
 #else
-
-static inline int _cond_resched(void)
-{
-	return __cond_resched();
-}
-
-#endif /* CONFIG_PREEMPT_DYNAMIC */
-
-#else
-
 static inline int _cond_resched(void) { return 0; }
-
-#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
+#endif
 
 #define cond_resched() ({			\
 	__might_resched(__FILE__, __LINE__, 0);	\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e191992109b..5a0bf43975d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8515,8 +8515,8 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
-#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
-int __sched __cond_resched(void)
+#ifndef CONFIG_PREEMPTION
+int __sched _cond_resched(void)
 {
 	if (should_resched(0)) {
 		preempt_schedule_common();
@@ -8538,15 +8538,7 @@ int __sched __cond_resched(void)
 #endif
 	return 0;
 }
-EXPORT_SYMBOL(__cond_resched);
-#endif
-
-#ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
-EXPORT_STATIC_CALL(cond_resched);
-
-DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
-EXPORT_STATIC_CALL(might_resched);
+EXPORT_SYMBOL(_cond_resched);
 #endif
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 30/86] Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (28 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 29/86] Revert "preempt/dynamic: Provide cond_resched() and might_resched() " Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 31/86] x86/thread_info: add TIF_NEED_RESCHED_LAZY Ankur Arora
                   ` (32 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

This reverts commit 6ef869e0647439af0fc28dde162d33320d4e1dd7.

Also remove the CONFIG_PREEMPT_DYNAMIC-guarded inclusion of
linux/entry-common.h, which seems to have been missed somewhere
along the way.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 .../admin-guide/kernel-parameters.txt         |  7 -------
 arch/Kconfig                                  |  9 ---------
 arch/x86/Kconfig                              |  1 -
 kernel/Kconfig.preempt                        | 19 -------------------
 kernel/sched/core.c                           |  6 ------
 5 files changed, 42 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a1731a0f0ef..93b60558a78f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4577,13 +4577,6 @@
 			Format: {"off"}
 			Disable Hardware Transactional Memory
 
-	preempt=	[KNL]
-			Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
-			none - Limited to cond_resched() calls
-			voluntary - Limited to cond_resched() and might_sleep() calls
-			full - Any section that isn't explicitly preempt disabled
-			       can be preempted anytime.
-
 	print-fatal-signals=
 			[KNL] debug: print fatal signals
 
diff --git a/arch/Kconfig b/arch/Kconfig
index afe6785fd3e2..05ce60036ecc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1393,15 +1393,6 @@ config HAVE_STATIC_CALL_INLINE
 	depends on HAVE_STATIC_CALL
 	select OBJTOOL
 
-config HAVE_PREEMPT_DYNAMIC
-	bool
-	depends on HAVE_STATIC_CALL
-	depends on GENERIC_ENTRY
-	help
-	  Select this if the architecture support boot time preempt setting
-	  on top of static calls. It is strongly advised to support inline
-	  static call to avoid any overhead.
-
 config ARCH_WANT_LD_ORPHAN_WARN
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ec71c232af32..76e418bf469d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -270,7 +270,6 @@ config X86
 	select HAVE_STACK_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_STATIC_CALL
 	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
-	select HAVE_PREEMPT_DYNAMIC
 	select HAVE_RSEQ
 	select HAVE_RUST			if X86_64
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 5876e30c5740..715e7aebb9d8 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -40,7 +40,6 @@ config PREEMPT
 	depends on !ARCH_NO_PREEMPT
 	select PREEMPTION
 	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
-	select PREEMPT_DYNAMIC if HAVE_PREEMPT_DYNAMIC
 	help
 	  This option reduces the latency of the kernel by making
 	  all kernel code (that is not executing in a critical section)
@@ -82,24 +81,6 @@ config PREEMPTION
        bool
        select PREEMPT_COUNT
 
-config PREEMPT_DYNAMIC
-	bool
-	help
-	  This option allows to define the preemption model on the kernel
-	  command line parameter and thus override the default preemption
-	  model defined during compile time.
-
-	  The feature is primarily interesting for Linux distributions which
-	  provide a pre-built kernel binary to reduce the number of kernel
-	  flavors they offer while still offering different usecases.
-
-	  The runtime overhead is negligible with HAVE_STATIC_CALL_INLINE enabled
-	  but if runtime patching is not available for the specific architecture
-	  then the potential overhead should be considered.
-
-	  Interesting if you want the same pre-built kernel should be used for
-	  both Server and Desktop workloads.
-
 config SCHED_CORE
 	bool "Core Scheduling for SMT"
 	depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a0bf43975d4..e30007c11722 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -65,12 +65,6 @@
 #include <linux/wait_api.h>
 #include <linux/workqueue_api.h>
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-# ifdef CONFIG_GENERIC_ENTRY
-#  include <linux/entry-common.h>
-# endif
-#endif
-
 #include <uapi/linux/sched/types.h>
 
 #include <asm/irq_regs.h>
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 31/86] x86/thread_info: add TIF_NEED_RESCHED_LAZY
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (29 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 30/86] Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC" Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 23:26   ` Steven Rostedt
  2023-11-07 21:57 ` [RFC PATCH 32/86] entry: handle TIF_NEED_RESCHED_LAZY Ankur Arora
                   ` (31 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Add a new flag, TIF_NEED_RESCHED_LAZY, which together with
TIF_NEED_RESCHED gives the scheduler two levels of rescheduling
priority: TIF_NEED_RESCHED means that rescheduling happens at the
next opportunity; TIF_NEED_RESCHED_LAZY notes that a reschedule is
needed but does not impose any other constraints on the scheduler.
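
For reference, a minimal stand-alone model of the resulting layout --
user-space C rather than kernel code, with the bit numbers taken from
the hunk below -- showing how the _TIF_* masks derive from the bit
positions:

  #include <stdio.h>

  /* Bit numbers as laid out in the hunk below (x86 thread_info.h). */
  #define TIF_NEED_RESCHED        3   /* rescheduling necessary */
  #define TIF_NEED_RESCHED_LAZY   4   /* lazy rescheduling */

  #define _TIF_NEED_RESCHED       (1UL << TIF_NEED_RESCHED)
  #define _TIF_NEED_RESCHED_LAZY  (1UL << TIF_NEED_RESCHED_LAZY)

  int main(void)
  {
          unsigned long flags = 0;

          /* A lazy request only sets its own bit; it is noticed at ret-to-user. */
          flags |= _TIF_NEED_RESCHED_LAZY;

          printf("eager pending: %d, lazy pending: %d\n",
                 !!(flags & _TIF_NEED_RESCHED),
                 !!(flags & _TIF_NEED_RESCHED_LAZY));
          return 0;
  }

This prints "eager pending: 0, lazy pending: 1".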

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/thread_info.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d63b02940747..114d12120051 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,8 +81,9 @@ struct thread_info {
 #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
-#define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY	4	/* Lazy rescheduling */
+#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
+#define TIF_SSBD		6	/* Speculative store bypass disable */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -104,6 +105,7 @@ struct thread_info {
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 32/86] entry: handle TIF_NEED_RESCHED_LAZY
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (30 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 31/86] x86/thread_info: add TIF_NEED_RESCHED_LAZY Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 33/86] entry/kvm: " Ankur Arora
                   ` (30 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

The scheduling policy for TIF_NEED_RESCHED_LAZY is to run to
completion. Scheduling in exit_to_user_mode_loop() satisfies that.

Scheduling while exiting to userspace also guarantees that the task
being scheduled away is entirely clear of any kernel encumbrances
that cannot span a preemption.

Ordinarily we don't need this extra protection: the preempt_count
check is always available. However, cases where preempt_count might
not be wholly dependable (ARCH_NO_PREEMPT configurations) will make
use of this.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/entry-common.h | 2 +-
 kernel/entry/common.c        | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fb2e349a17d2..7a56440442df 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -59,7 +59,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 194c349b8be7..0d055c39690b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -154,7 +154,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & _TIF_NEED_RESCHED)
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
 			schedule();
 
 		if (ti_work & _TIF_UPROBE)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 33/86] entry/kvm: handle TIF_NEED_RESCHED_LAZY
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (31 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 32/86] entry: handle TIF_NEED_RESCHED_LAZY Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 34/86] thread_info: accessors for TIF_NEED_RESCHED* Ankur Arora
                   ` (29 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Executing in xfer_to_guest_mode_work() we are free of kernel
entanglements that cannot span preemption.

So, handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.

Also, while we are at it, remove the explicit need_resched() check
in the exit condition: with both resched bits in
XFER_TO_GUEST_MODE_WORK it is already covered by the loop condition.
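
To spell that out, a small stand-alone illustration (user-space C;
the mask is reduced to the bits of interest): once both resched bits
are part of XFER_TO_GUEST_MODE_WORK, anything need_resched() could
catch is already caught by the mask test in the loop condition:

  #include <stdio.h>

  /* Reduced mask for illustration; the real one carries more bits. */
  #define _TIF_SIGPENDING         (1UL << 2)
  #define _TIF_NEED_RESCHED       (1UL << 3)
  #define _TIF_NEED_RESCHED_LAZY  (1UL << 4)

  #define XFER_TO_GUEST_MODE_WORK \
          (_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)

  int main(void)
  {
          unsigned long ti_work = _TIF_NEED_RESCHED_LAZY;

          int resched = !!(ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY));
          int work    = !!(ti_work & XFER_TO_GUEST_MODE_WORK);

          /* resched set implies work set: "|| need_resched()" adds nothing. */
          printf("need_resched=%d work_pending=%d\n", resched, work);
          return 0;
  }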

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/entry-kvm.h | 2 +-
 kernel/entry/kvm.c        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 6813171afccb..674a622c91be 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
 
 #define XFER_TO_GUEST_MODE_WORK						\
 	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
-	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+	 _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
 
 struct kvm_vcpu;
 
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd..8485f63863af 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 			return -EINTR;
 		}
 
-		if (ti_work & _TIF_NEED_RESCHED)
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
 			schedule();
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
@@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 			return ret;
 
 		ti_work = read_thread_flags();
-	} while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+	} while (ti_work & XFER_TO_GUEST_MODE_WORK);
 	return 0;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 34/86] thread_info: accessors for TIF_NEED_RESCHED*
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (32 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 33/86] entry/kvm: " Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  8:58   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 35/86] thread_info: change to tif_need_resched(resched_t) Ankur Arora
                   ` (28 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Add tif_resched() which will be used as an accessor for TIF_NEED_RESCHED
and TIF_NEED_RESCHED_LAZY. The intent is to force the caller to make an
explicit choice of how eagerly they want a reschedule.

This interface will be used almost entirely from core kernel code, so
forcing a choice shouldn't be too onerous.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

---

1) Adding an enum for an interface that doesn't do all that much seems
   to be overkill. This could have been an int/bool etc, but that seemed
   much less clear and thus more error prone.

2) Also, there's no fallback path for architectures that don't define
   TIF_NEED_RESCHED_LAZY. That's because arch support is easy to add
   (modulo ARCH_NO_PREEMPT, discussed in a different patch), so it will
   be simpler to do that than to think through what seemed like a
   slightly convoluted alternative model.
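
As a sanity check on the arithmetic, a stand-alone user-space model
(bit numbers assumed from the earlier x86 patch, not kernel code) of
how the accessor maps a resched_t onto the two TIF bits:

  #include <assert.h>

  #define TIF_NEED_RESCHED              3
  #define TIF_NEED_RESCHED_LAZY         4
  #define TIF_NEED_RESCHED_LAZY_OFFSET  (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)

  typedef enum {
          RESCHED_eager = 0,
          RESCHED_lazy = TIF_NEED_RESCHED_LAZY_OFFSET,
  } resched_t;

  static int tif_resched(resched_t r)
  {
          return TIF_NEED_RESCHED + r;
  }

  int main(void)
  {
          assert(tif_resched(RESCHED_eager) == TIF_NEED_RESCHED);
          assert(tif_resched(RESCHED_lazy) == TIF_NEED_RESCHED_LAZY);
          return 0;
  }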

---
 include/linux/thread_info.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f4..4eb22b13bf64 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,27 @@ enum syscall_work_bit {
 
 #include <asm/thread_info.h>
 
+#ifndef TIF_NEED_RESCHED_LAZY
+#error "Arch needs to define TIF_NEED_RESCHED_LAZY"
+#endif
+
+#define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
+
+typedef enum {
+	RESCHED_eager = 0,
+	RESCHED_lazy = TIF_NEED_RESCHED_LAZY_OFFSET,
+} resched_t;
+
+static inline int tif_resched(resched_t r)
+{
+	return TIF_NEED_RESCHED + r;
+}
+
+static inline int _tif_resched(resched_t r)
+{
+	return 1 << tif_resched(r);
+}
+
 #ifdef __KERNEL__
 
 #ifndef arch_set_restart_data
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 35/86] thread_info: change to tif_need_resched(resched_t)
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (33 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 34/86] thread_info: accessors for TIF_NEED_RESCHED* Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:00   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 36/86] entry: irqentry_exit only preempts TIF_NEED_RESCHED Ankur Arora
                   ` (27 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

tif_need_resched() now takes a parameter specifying the resched
type: RESCHED_lazy when we allow the running task to run to
completion before eventually scheduling at a userspace boundary,
and RESCHED_eager for the next safe preemption point.

need_resched(), which is used by non-core code, now checks for the
presence of either need-resched bit. Given that need_resched() (and,
to a lesser extent, tif_need_resched()) is used extensively in the
kernel, it is worth noting the common uses and how they will change:

 - idle: we always want to schedule out of idle whenever there is
   any work. So the appropriate check is for both conditions.
   (Currently we use need_resched() in most places, while the
   interfaces defined in sched/idle.h use tif_need_resched().)

   However, as discussed in later commits it is critical that
   when scheduling out of idle, we always reschedule with
   RESCHED_eager (which maps to TIF_NEED_RESCHED.) This suggests
   that idle code everywhere should instead just do:

        while (!tif_need_resched(RESCHED_eager)) { ... }

   or similar. That is true, but we have a lot of idle code and it
   does not seem to make sense to expose scheduler implementation
   details all over.

 - uses in conjunction with preempt_count(): we only ever want to
   fold or make preemption decisions based on TIF_NEED_RESCHED, not
   TIF_NEED_RESCHED_LAZY.  So, related logic needs to use
   tif_need_resched(RESCHED_eager).

 - code that relinquishes resources temporarily (locks, irq, etc)
   checks for should_resched() and would preempt if TIF_NEED_RESCHED
   were set due to the (preempt_count() == offset) check.
   The hand-rolled versions typically check for need_resched(),
   which is a wider check.

   In either case the final arbiter is preempt_schedule(), which
   via preemptible() does the narrower check.

   Would it make sense to schedule out for both the need-resched
   flags?
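
To make the distinction above concrete, a minimal stand-alone model
(user-space C; flag layout assumed from the earlier patches):
need_resched() reports either bit, while the preempt_count()-style
checks only honour the eager bit:

  #include <stdbool.h>
  #include <stdio.h>

  #define TIF_NEED_RESCHED        3
  #define TIF_NEED_RESCHED_LAZY   4

  typedef enum { RESCHED_eager = 0, RESCHED_lazy = 1 } resched_t;

  static unsigned long ti_flags;

  static bool tif_need_resched(resched_t r)
  {
          return ti_flags & (1UL << (TIF_NEED_RESCHED + r));
  }

  /* Wide check: idle and most non-core code. */
  static bool need_resched(void)
  {
          return tif_need_resched(RESCHED_eager) || tif_need_resched(RESCHED_lazy);
  }

  /* Narrow check: what the preempt_count()-based paths care about. */
  static bool should_resched(int preempt_count)
  {
          return preempt_count == 0 && tif_need_resched(RESCHED_eager);
  }

  int main(void)
  {
          ti_flags = 1UL << TIF_NEED_RESCHED_LAZY;

          /* Prints 1 0: a lazy request is visible to need_resched() but
           * never drives the preempt_count() based preemption paths. */
          printf("need_resched()=%d should_resched(0)=%d\n",
                 need_resched(), should_resched(0));
          return 0;
  }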

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/s390/include/asm/preempt.h | 4 ++--
 drivers/acpi/processor_idle.c   | 2 +-
 include/asm-generic/preempt.h   | 4 ++--
 include/linux/preempt.h         | 2 +-
 include/linux/sched.h           | 4 +++-
 include/linux/sched/idle.h      | 8 ++++----
 include/linux/thread_info.h     | 8 ++++----
 kernel/sched/idle.c             | 2 +-
 kernel/trace/trace.c            | 2 +-
 9 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/arch/s390/include/asm/preempt.h b/arch/s390/include/asm/preempt.h
index bf15da0fedbc..4dddefae1387 100644
--- a/arch/s390/include/asm/preempt.h
+++ b/arch/s390/include/asm/preempt.h
@@ -114,13 +114,13 @@ static inline void __preempt_count_sub(int val)
 
 static inline bool __preempt_count_dec_and_test(void)
 {
-	return !--S390_lowcore.preempt_count && tif_need_resched();
+	return !--S390_lowcore.preempt_count && tif_need_resched(RESCHED_eager);
 }
 
 static inline bool should_resched(int preempt_offset)
 {
 	return unlikely(preempt_count() == preempt_offset &&
-			tif_need_resched());
+			tif_need_resched(RESCHED_eager));
 }
 
 #endif /* CONFIG_HAVE_MARCH_Z196_FEATURES */
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 3a34a8c425fe..1a69f082833e 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -108,7 +108,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
  */
 static void __cpuidle acpi_safe_halt(void)
 {
-	if (!tif_need_resched()) {
+	if (!need_resched()) {
 		raw_safe_halt();
 		raw_local_irq_disable();
 	}
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index b4d43a4af5f7..4f4abcc5981d 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -66,7 +66,7 @@ static __always_inline bool __preempt_count_dec_and_test(void)
 	 * operations; we cannot use PREEMPT_NEED_RESCHED because it might get
 	 * lost.
 	 */
-	return !--*preempt_count_ptr() && tif_need_resched();
+	return !--*preempt_count_ptr() && tif_need_resched(RESCHED_eager);
 }
 
 /*
@@ -75,7 +75,7 @@ static __always_inline bool __preempt_count_dec_and_test(void)
 static __always_inline bool should_resched(int preempt_offset)
 {
 	return unlikely(preempt_count() == preempt_offset &&
-			tif_need_resched());
+			tif_need_resched(RESCHED_eager));
 }
 
 #ifdef CONFIG_PREEMPTION
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 1424670df161..0abc6a673c41 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -301,7 +301,7 @@ do { \
 } while (0)
 #define preempt_fold_need_resched() \
 do { \
-	if (tif_need_resched()) \
+	if (tif_need_resched(RESCHED_eager)) \
 		set_preempt_need_resched(); \
 } while (0)
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 95d47783ff6e..5f0d7341cb88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2172,9 +2172,11 @@ static inline int rwlock_needbreak(rwlock_t *lock)
 
 static __always_inline bool need_resched(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(tif_need_resched(RESCHED_eager) ||
+			tif_need_resched(RESCHED_lazy));
 }
 
+
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
  */
diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 478084f9105e..719416fe8ddc 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -63,7 +63,7 @@ static __always_inline bool __must_check current_set_polling_and_test(void)
 	 */
 	smp_mb__after_atomic();
 
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 
 static __always_inline bool __must_check current_clr_polling_and_test(void)
@@ -76,7 +76,7 @@ static __always_inline bool __must_check current_clr_polling_and_test(void)
 	 */
 	smp_mb__after_atomic();
 
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 
 #else
@@ -85,11 +85,11 @@ static inline void __current_clr_polling(void) { }
 
 static inline bool __must_check current_set_polling_and_test(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 static inline bool __must_check current_clr_polling_and_test(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 #endif
 
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 4eb22b13bf64..be5333a2c832 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -200,17 +200,17 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
 
 #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
 
-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_need_resched(resched_t r)
 {
-	return arch_test_bit(TIF_NEED_RESCHED,
+	return arch_test_bit(tif_resched(r),
 			     (unsigned long *)(&current_thread_info()->flags));
 }
 
 #else
 
-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_need_resched(resched_t r)
 {
-	return test_bit(TIF_NEED_RESCHED,
+	return test_bit(tif_resched(r),
 			(unsigned long *)(&current_thread_info()->flags));
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5007b25c5bc6..d4a55448e459 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -57,7 +57,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 	ct_cpuidle_enter();
 
 	raw_local_irq_enable();
-	while (!tif_need_resched() &&
+	while (!need_resched() &&
 	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
 	raw_local_irq_disable();
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7f565f0a00da..7f067ad9cf50 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2720,7 +2720,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
 	if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
 		trace_flags |= TRACE_FLAG_BH_OFF;
 
-	if (tif_need_resched())
+	if (tif_need_resched(RESCHED_eager))
 		trace_flags |= TRACE_FLAG_NEED_RESCHED;
 	if (test_preempt_need_resched())
 		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 36/86] entry: irqentry_exit only preempts TIF_NEED_RESCHED
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (34 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 35/86] thread_info: change to tif_need_resched(resched_t) Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:01   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 37/86] sched: make test_*_tsk_thread_flag() return bool Ankur Arora
                   ` (26 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

The scheduling policy for RESCHED_lazy (TIF_NEED_RESCHED_LAZY) is
to let anything running in the kernel run to completion.
Accordingly, when deciding whether to call preempt_schedule_irq(),
narrow the check to tif_need_resched(RESCHED_eager).

Also add a comment about why we need to check at all, given that we
have already checked the preempt_count().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/entry/common.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 0d055c39690b..6433e6c77185 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -384,7 +384,15 @@ void irqentry_exit_cond_resched(void)
 		rcu_irq_exit_check_preempt();
 		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
 			WARN_ON_ONCE(!on_thread_stack());
-		if (need_resched())
+
+		/*
+		 * If the scheduler really wants us to preempt while returning
+		 * to kernel, it would set TIF_NEED_RESCHED.
+		 * On some archs the flag gets folded in preempt_count, and
+		 * thus would be covered in the conditional above, but not all
+		 * archs do that, so check explicitly.
+		 */
+		if (tif_need_resched(RESCHED_eager))
 			preempt_schedule_irq();
 	}
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 37/86] sched: make test_*_tsk_thread_flag() return bool
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (35 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 36/86] entry: irqentry_exit only preempts TIF_NEED_RESCHED Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:02   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 38/86] sched: *_tsk_need_resched() now takes resched_t Ankur Arora
                   ` (25 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

All users of test_*_tsk_thread_flag() treat the result value
as boolean. This is also true for the underlying test_and_*_bit()
operations.

Change the return type to bool.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f0d7341cb88..12d0626601a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2045,17 +2045,17 @@ static inline void update_tsk_thread_flag(struct task_struct *tsk, int flag,
 	update_ti_thread_flag(task_thread_info(tsk), flag, value);
 }
 
-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_ti_thread_flag(task_thread_info(tsk), flag);
 }
@@ -2070,7 +2070,7 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
 	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
 }
 
-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
 {
 	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 38/86] sched: *_tsk_need_resched() now takes resched_t
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (36 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 37/86] sched: make test_*_tsk_thread_flag() return bool Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:03   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 39/86] sched: handle lazy resched in set_nr_*_polling() Ankur Arora
                   ` (24 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

*_tsk_need_resched() need to test for the specific need-resched
flag.

The only users are RCU and the scheduler. For RCU we always want
to schedule at the earliest opportunity and that is always
RESCHED_eager.

For the scheduler, keep everything as RESCHED_eager for now.
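
A stand-alone sketch (user-space C, bit numbers assumed from the
earlier x86 patch) of the asymmetry in the hunk below: setting is
per-variant, while clearing drops both bits, since an actual
reschedule satisfies either kind of request:

  #include <stdio.h>

  #define TIF_NEED_RESCHED        3
  #define TIF_NEED_RESCHED_LAZY   4

  typedef enum { RESCHED_eager = 0, RESCHED_lazy = 1 } resched_t;

  static unsigned long ti_flags;

  static void set_need_resched(resched_t r)
  {
          ti_flags |= 1UL << (TIF_NEED_RESCHED + r);
  }

  /* Clearing is not variant specific: a reschedule satisfies either request. */
  static void clear_need_resched(void)
  {
          ti_flags &= ~((1UL << TIF_NEED_RESCHED) |
                        (1UL << TIF_NEED_RESCHED_LAZY));
  }

  int main(void)
  {
          set_need_resched(RESCHED_lazy);
          set_need_resched(RESCHED_eager);
          clear_need_resched();
          printf("flags after clear: %#lx\n", ti_flags);  /* 0 */
          return 0;
  }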

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched.h    | 17 ++++++++++++-----
 kernel/rcu/tree.c        |  4 ++--
 kernel/rcu/tree_exp.h    |  4 ++--
 kernel/rcu/tree_plugin.h |  4 ++--
 kernel/rcu/tree_stall.h  |  2 +-
 kernel/sched/core.c      |  9 +++++----
 kernel/sched/deadline.c  |  4 ++--
 kernel/sched/fair.c      |  2 +-
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  4 ++--
 10 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12d0626601a0..6dd206b2ef50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2060,19 +2060,26 @@ static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
 	return test_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
-static inline void set_tsk_need_resched(struct task_struct *tsk)
+static inline void set_tsk_need_resched(struct task_struct *tsk, resched_t lazy)
 {
-	set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+	set_tsk_thread_flag(tsk, tif_resched(lazy));
 }
 
 static inline void clear_tsk_need_resched(struct task_struct *tsk)
 {
-	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+	clear_tsk_thread_flag(tsk, tif_resched(RESCHED_eager));
+	clear_tsk_thread_flag(tsk, tif_resched(RESCHED_lazy));
 }
 
-static inline bool test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk, resched_t lazy)
 {
-	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
+	return unlikely(test_tsk_thread_flag(tsk, tif_resched(lazy)));
+}
+
+static inline bool test_tsk_need_resched_any(struct task_struct *tsk)
+{
+	return test_tsk_need_resched(tsk, RESCHED_eager) ||
+			test_tsk_need_resched(tsk, RESCHED_lazy);
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cb1caefa8bd0..a7776ae78900 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2231,7 +2231,7 @@ void rcu_sched_clock_irq(int user)
 	if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
 		/* Idle and userspace execution already are quiescent states. */
 		if (!rcu_is_cpu_rrupt_from_idle() && !user) {
-			set_tsk_need_resched(current);
+			set_tsk_need_resched(current, RESCHED_eager);
 			set_preempt_need_resched();
 		}
 		__this_cpu_write(rcu_data.rcu_urgent_qs, false);
@@ -2379,7 +2379,7 @@ static __latent_entropy void rcu_core(void)
 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && (!(preempt_count() & PREEMPT_MASK))) {
 		rcu_preempt_deferred_qs(current);
 	} else if (rcu_preempt_need_deferred_qs(current)) {
-		set_tsk_need_resched(current);
+		set_tsk_need_resched(current, RESCHED_eager);
 		set_preempt_need_resched();
 	}
 
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 8239b39d945b..a4a23ac1115b 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -755,7 +755,7 @@ static void rcu_exp_handler(void *unused)
 			rcu_report_exp_rdp(rdp);
 		} else {
 			WRITE_ONCE(rdp->cpu_no_qs.b.exp, true);
-			set_tsk_need_resched(t);
+			set_tsk_need_resched(t, RESCHED_eager);
 			set_preempt_need_resched();
 		}
 		return;
@@ -856,7 +856,7 @@ static void rcu_exp_need_qs(void)
 	__this_cpu_write(rcu_data.cpu_no_qs.b.exp, true);
 	/* Store .exp before .rcu_urgent_qs. */
 	smp_store_release(this_cpu_ptr(&rcu_data.rcu_urgent_qs), true);
-	set_tsk_need_resched(current);
+	set_tsk_need_resched(current, RESCHED_eager);
 	set_preempt_need_resched();
 }
 
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 41021080ad25..f87191e008ff 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -658,7 +658,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 			// Also if no expediting and no possible deboosting,
 			// slow is OK.  Plus nohz_full CPUs eventually get
 			// tick enabled.
-			set_tsk_need_resched(current);
+			set_tsk_need_resched(current, RESCHED_eager);
 			set_preempt_need_resched();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
 			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
@@ -725,7 +725,7 @@ static void rcu_flavor_sched_clock_irq(int user)
 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 		/* No QS, force context switch if deferred. */
 		if (rcu_preempt_need_deferred_qs(t)) {
-			set_tsk_need_resched(t);
+			set_tsk_need_resched(t, RESCHED_eager);
 			set_preempt_need_resched();
 		}
 	} else if (rcu_preempt_need_deferred_qs(t)) {
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 6f06dc12904a..b74b7b04cf35 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -705,7 +705,7 @@ static void print_cpu_stall(unsigned long gps)
 	 * progress and it could be we're stuck in kernel space without context
 	 * switches for an entirely unreasonable amount of time.
 	 */
-	set_tsk_need_resched(current);
+	set_tsk_need_resched(current, RESCHED_eager);
 	set_preempt_need_resched();
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e30007c11722..e2215c417323 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -927,7 +927,7 @@ static bool set_nr_if_polling(struct task_struct *p)
 #else
 static inline bool set_nr_and_not_polling(struct task_struct *p)
 {
-	set_tsk_need_resched(p);
+	set_tsk_need_resched(p, RESCHED_eager);
 	return true;
 }
 
@@ -1039,13 +1039,13 @@ void resched_curr(struct rq *rq)
 
 	lockdep_assert_rq_held(rq);
 
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(curr, RESCHED_eager))
 		return;
 
 	cpu = cpu_of(rq);
 
 	if (cpu == smp_processor_id()) {
-		set_tsk_need_resched(curr);
+		set_tsk_need_resched(curr, RESCHED_eager);
 		set_preempt_need_resched();
 		return;
 	}
@@ -2223,7 +2223,8 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+	if (task_on_rq_queued(rq->curr) &&
+	    test_tsk_need_resched(rq->curr, RESCHED_eager))
 		rq_clock_skip_update(rq);
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 58b542bf2893..e6815c3bd2f0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1953,7 +1953,7 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 	 * let us try to decide what's the best thing to do...
 	 */
 	if ((p->dl.deadline == rq->curr->dl.deadline) &&
-	    !test_tsk_need_resched(rq->curr))
+	    !test_tsk_need_resched(rq->curr, RESCHED_eager))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
 }
@@ -2467,7 +2467,7 @@ static void pull_dl_task(struct rq *this_rq)
 static void task_woken_dl(struct rq *rq, struct task_struct *p)
 {
 	if (!task_on_cpu(rq, p) &&
-	    !test_tsk_need_resched(rq->curr) &&
+	    !test_tsk_need_resched(rq->curr, RESCHED_eager) &&
 	    p->nr_cpus_allowed > 1 &&
 	    dl_task(rq->curr) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..4d86c618ffa2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8087,7 +8087,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(curr, RESCHED_eager))
 		return;
 
 	/* Idle tasks are by definition preempted by non-idle tasks. */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d4a55448e459..eacd204e2879 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -329,7 +329,7 @@ static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *timer)
 	struct idle_timer *it = container_of(timer, struct idle_timer, timer);
 
 	WRITE_ONCE(it->done, 1);
-	set_tsk_need_resched(current);
+	set_tsk_need_resched(current, RESCHED_eager);
 
 	return HRTIMER_NORESTART;
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0597ba0f85ff..a79ce6746dd0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1735,7 +1735,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 	 * to move current somewhere else, making room for our non-migratable
 	 * task.
 	 */
-	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr, RESCHED_eager))
 		check_preempt_equal_prio(rq, p);
 #endif
 }
@@ -2466,7 +2466,7 @@ static void pull_rt_task(struct rq *this_rq)
 static void task_woken_rt(struct rq *rq, struct task_struct *p)
 {
 	bool need_to_push = !task_on_cpu(rq, p) &&
-			    !test_tsk_need_resched(rq->curr) &&
+			    !test_tsk_need_resched(rq->curr, RESCHED_eager) &&
 			    p->nr_cpus_allowed > 1 &&
 			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
 			    (rq->curr->nr_cpus_allowed < 2 ||
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 39/86] sched: handle lazy resched in set_nr_*_polling()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (37 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 38/86] sched: *_tsk_need_resched() now takes resched_t Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:15   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 40/86] context_tracking: add ct_state_cpu() Ankur Arora
                   ` (23 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

To trigger a reschedule on a target runqueue a few things need
to happen first:

  1. set_tsk_need_resched(target_rq->curr, RESCHED_eager)
  2. ensure that the target CPU sees the need-resched bit
  3. preempt_fold_need_resched()

Most of this is done via some combination of: resched_curr(),
set_nr_if_polling(), and set_nr_and_not_polling().

Update the last two to also handle TIF_NEED_RESCHED_LAZY.

One thing to note is that TIF_NEED_RESCHED_LAZY has run-to-completion
semantics, so unlike TIF_NEED_RESCHED, we don't need to ensure that
the target CPU sees it immediately, and of course there is no preempt
folding.
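
For illustration, a stand-alone model of the polling handshake
described above (C11 atomics; TIF_POLLING_NRFLAG's bit number is made
up here, and the eager-only IPI decision that the series keeps in the
caller is folded into one helper): the fetch_or reports whether the
target was polling, and only an eager request on a non-polling target
warrants an IPI:

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define TIF_NEED_RESCHED        3
  #define TIF_NEED_RESCHED_LAZY   4
  #define TIF_POLLING_NRFLAG      21   /* illustrative bit number */

  typedef enum { RESCHED_eager = 0, RESCHED_lazy = 1 } resched_t;

  static _Atomic unsigned long ti_flags;

  /* Set the requested need-resched bit; report whether the target was
   * *not* polling, i.e. whether an IPI might be needed. */
  static bool set_nr_and_not_polling(resched_t rs)
  {
          unsigned long old;

          old = atomic_fetch_or(&ti_flags, 1UL << (TIF_NEED_RESCHED + rs));
          return !(old & (1UL << TIF_POLLING_NRFLAG));
  }

  static bool needs_ipi(resched_t rs)
  {
          bool not_polling = set_nr_and_not_polling(rs);

          /* Only an eager request on a non-polling target warrants an IPI. */
          return rs == RESCHED_eager && not_polling;
  }

  int main(void)
  {
          atomic_store(&ti_flags, 0);
          printf("lazy:  send IPI? %d\n", needs_ipi(RESCHED_lazy));   /* 0 */

          atomic_store(&ti_flags, 0);
          printf("eager: send IPI? %d\n", needs_ipi(RESCHED_eager));  /* 1 */

          atomic_store(&ti_flags, 1UL << TIF_POLLING_NRFLAG);
          printf("eager, polling target: send IPI? %d\n",
                 needs_ipi(RESCHED_eager));                           /* 0 */
          return 0;
  }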

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e2215c417323..01df5ac2982c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -892,14 +892,15 @@ static inline void hrtick_rq_init(struct rq *rq)
 
 #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
 /*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
  * this avoids any races wrt polling state changes and thereby avoids
  * spurious IPIs.
  */
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
 {
 	struct thread_info *ti = task_thread_info(p);
-	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+
+	return !(fetch_or(&ti->flags, _tif_resched(rs)) & _TIF_POLLING_NRFLAG);
 }
 
 /*
@@ -916,7 +917,7 @@ static bool set_nr_if_polling(struct task_struct *p)
 	for (;;) {
 		if (!(val & _TIF_POLLING_NRFLAG))
 			return false;
-		if (val & _TIF_NEED_RESCHED)
+		if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
 			return true;
 		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
 			break;
@@ -925,9 +926,9 @@ static bool set_nr_if_polling(struct task_struct *p)
 }
 
 #else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
 {
-	set_tsk_need_resched(p, RESCHED_eager);
+	set_tsk_need_resched(p, rs);
 	return true;
 }
 
@@ -1050,7 +1051,7 @@ void resched_curr(struct rq *rq)
 		return;
 	}
 
-	if (set_nr_and_not_polling(curr))
+	if (set_nr_and_not_polling(curr, RESCHED_eager))
 		smp_send_reschedule(cpu);
 	else
 		trace_sched_wake_idle_without_ipi(cpu);
@@ -1126,7 +1127,7 @@ static void wake_up_idle_cpu(int cpu)
 	if (cpu == smp_processor_id())
 		return;
 
-	if (set_nr_and_not_polling(rq->idle))
+	if (set_nr_and_not_polling(rq->idle, RESCHED_eager))
 		smp_send_reschedule(cpu);
 	else
 		trace_sched_wake_idle_without_ipi(cpu);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 40/86] context_tracking: add ct_state_cpu()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (38 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 39/86] sched: handle lazy resched in set_nr_*_polling() Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:16   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 41/86] sched: handle resched policy in resched_curr() Ankur Arora
                   ` (22 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

While making up its mind about whether to reschedule a target
runqueue eagerly or lazily, resched_curr() needs to know if the
target is executing in the kernel or in userspace.

Add ct_state_cpu().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

---
Using context-tracking for this seems like overkill. Is there a better
way to achieve this? One problem with depending on user_enter() is that
it happens much too late for our purposes. From the scheduler's
point of view the exit state has effectively transitioned once the
task leaves exit_to_user_mode_loop(), so we will see stale state in
the window where the task is done with exit_to_user_mode_loop() but
has not yet executed user_enter().

---
 include/linux/context_tracking_state.h | 21 +++++++++++++++++++++
 kernel/Kconfig.preempt                 |  1 +
 2 files changed, 22 insertions(+)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index bbff5f7f8803..6a8f1c7ba105 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -53,6 +53,13 @@ static __always_inline int __ct_state(void)
 {
 	return raw_atomic_read(this_cpu_ptr(&context_tracking.state)) & CT_STATE_MASK;
 }
+
+static __always_inline int __ct_state_cpu(int cpu)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+
+	return atomic_read(&ct->state) & CT_STATE_MASK;
+}
 #endif
 
 #ifdef CONFIG_CONTEXT_TRACKING_IDLE
@@ -139,6 +146,20 @@ static __always_inline int ct_state(void)
 	return ret;
 }
 
+static __always_inline int ct_state_cpu(int cpu)
+{
+	int ret;
+
+	if (!context_tracking_enabled_cpu(cpu))
+		return CONTEXT_DISABLED;
+
+	preempt_disable();
+	ret = __ct_state_cpu(cpu);
+	preempt_enable();
+
+	return ret;
+}
+
 #else
 static __always_inline bool context_tracking_enabled(void) { return false; }
 static __always_inline bool context_tracking_enabled_cpu(int cpu) { return false; }
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 715e7aebb9d8..aa87b5cd3ecc 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,6 +80,7 @@ config PREEMPT_COUNT
 config PREEMPTION
        bool
        select PREEMPT_COUNT
+       select CONTEXT_TRACKING_USER
 
 config SCHED_CORE
 	bool "Core Scheduling for SMT"
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 41/86] sched: handle resched policy in resched_curr()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (39 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 40/86] context_tracking: add ct_state_cpu() Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:36   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 42/86] sched: force preemption on tick expiration Ankur Arora
                   ` (21 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

One of the last ports of call before rescheduling is triggered
is resched_curr().

Its task is to set TIF_NEED_RESCHED and then either fold it into the
preempt_count (if the target is the local CPU), or send a resched-IPI
so the target CPU folds it in.
To handle TIF_NEED_RESCHED_LAZY -- since the reschedule is not
imminent -- it only needs to set the appropriate bit.

Move all of the underlying mechanism into __resched_curr(), and define
resched_curr() to handle the policy of when to set which need-resched
variant.

For now the approach is to run to completion (TIF_NEED_RESCHED_LAZY)
with the following exceptions (summarized in the sketch after this
list), where we always want to reschedule at the next preemptible
point (TIF_NEED_RESCHED):

 - idle: if we are polling in idle, then set_nr_if_polling() will do
   the right thing. When not polling, we force TIF_NEED_RESCHED
   and send a resched-IPI if needed.

 - the target CPU is in userspace: run to completion semantics are
   only for kernel tasks

 - running under the full preemption model
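
For reference, the policy distilled into a stand-alone helper. This is
a sketch only -- resched_policy() is not part of the patch, and the
real resched_curr() below also deals with polling idle and with stale
ct_state:

	static resched_t resched_policy(struct rq *rq)
	{
		int context;

		/* Full preemption: always reschedule eagerly. */
		if (IS_ENABLED(CONFIG_PREEMPT))
			return RESCHED_eager;

		/* The idle task has nothing useful to finish. */
		if (rq->curr->sched_class == &idle_sched_class)
			return RESCHED_eager;

		/* Userspace or guest: run-to-completion only covers kernel code. */
		context = ct_state_cpu(cpu_of(rq));
		if (context == CONTEXT_USER || context == CONTEXT_GUEST)
			return RESCHED_eager;

		/* Executing in the kernel: defer to ret-to-user. */
		return RESCHED_lazy;
	}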

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 70 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01df5ac2982c..f65bf3ce0e9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
 }
 
 /*
- * resched_curr - mark rq's current task 'to be rescheduled now'.
+ * __resched_curr - mark rq's current task 'to be rescheduled'.
  *
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
+ * On UP this means the setting of the need_resched flag, on SMP, for
+ * eager resched it might also involve a cross-CPU call to trigger
+ * the scheduler on the target CPU.
  */
-void resched_curr(struct rq *rq)
+void __resched_curr(struct rq *rq, resched_t rs)
 {
 	struct task_struct *curr = rq->curr;
 	int cpu;
@@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
 	cpu = cpu_of(rq);
 
 	if (cpu == smp_processor_id()) {
-		set_tsk_need_resched(curr, RESCHED_eager);
-		set_preempt_need_resched();
+		set_tsk_need_resched(curr, rs);
+		if (rs == RESCHED_eager)
+			set_preempt_need_resched();
 		return;
 	}
 
-	if (set_nr_and_not_polling(curr, RESCHED_eager))
-		smp_send_reschedule(cpu);
-	else
+	if (set_nr_and_not_polling(curr, rs)) {
+		if (rs == RESCHED_eager)
+			smp_send_reschedule(cpu);
+	} else if (rs == RESCHED_eager)
 		trace_sched_wake_idle_without_ipi(cpu);
 }
 
+/*
+ * resched_curr - mark rq's current task 'to be rescheduled' eagerly
+ * or lazily according to the current policy.
+ *
+ * Always schedule eagerly, if:
+ *
+ *  - running under full preemption
+ *
+ *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
+ *    force TIF_NEED_RESCHED to be set and send a resched IPI.
+ *    (the polling case has already set TIF_NEED_RESCHED via
+ *     set_nr_if_polling()).
+ *
+ *  - in userspace: run to completion semantics are only for kernel tasks
+ *
+ * Otherwise (regardless of priority), run to completion.
+ */
+void resched_curr(struct rq *rq)
+{
+	resched_t rs = RESCHED_lazy;
+	int context;
+
+	if (IS_ENABLED(CONFIG_PREEMPT) ||
+	    (rq->curr->sched_class == &idle_sched_class)) {
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+	/*
+	 * We might race with the target CPU while checking its ct_state:
+	 *
+	 * 1. The task might have just entered the kernel, but has not yet
+	 * called user_exit(). We will see stale state (CONTEXT_USER) and
+	 * send an unnecessary resched-IPI.
+	 *
+	 * 2. The user task is through with exit_to_user_mode_loop() but has
+	 * not yet called user_enter().
+	 *
+	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
+	 * schedule it lazily. There's obviously nothing that will handle
+	 * this need-resched bit until the thread enters the kernel next.
+	 *
+	 * The scheduler will still do tick accounting, but a potentially
+	 * higher priority task waited to be scheduled for a user tick,
+	 * instead of execution time in the kernel.
+	 */
+	context = ct_state_cpu(cpu_of(rq));
+	if ((context == CONTEXT_USER) ||
+	    (context == CONTEXT_GUEST)) {
+
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+resched:
+	__resched_curr(rq, rs);
+}
+
 void resched_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 42/86] sched: force preemption on tick expiration
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (40 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 41/86] sched: handle resched policy in resched_curr() Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:56   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 43/86] sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models Ankur Arora
                   ` (20 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

The kernel can have long-running tasks which don't pass through
preemption points for prolonged periods and so will never see
the scheduler's polite TIF_NEED_RESCHED_LAZY.

Force a reschedule at the next tick by upgrading to TIF_NEED_RESCHED,
which will get folded into the preempt_count and force a reschedule at
the next safe preemption point.

TODO: deadline scheduler.
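
The tick-side upgrade then reduces to the following (mirroring the
fair and rt hunks below):

	/* If the lazy bit set earlier was ignored, go eager on this tick. */
	if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
		__resched_curr(rq, RESCHED_eager);
	else
		resched_curr(rq);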

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/fair.c  | 32 +++++++++++++++++++++++---------
 kernel/sched/rt.c    |  7 ++++++-
 kernel/sched/sched.h |  1 +
 3 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d86c618ffa2..fe7e5e9b2207 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,11 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
  * this is probably good enough.
  */
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void update_deadline(struct cfs_rq *cfs_rq,
+			    struct sched_entity *se, bool tick)
 {
+	struct rq *rq = rq_of(cfs_rq);
+
 	if ((s64)(se->vruntime - se->deadline) < 0)
 		return;
 
@@ -1033,13 +1036,19 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
 
+	if (cfs_rq->nr_running < 2)
+		return;
+
 	/*
-	 * The task has consumed its request, reschedule.
+	 * The task has consumed its request, reschedule; eagerly
+	 * if it ignored our last lazy reschedule.
 	 */
-	if (cfs_rq->nr_running > 1) {
-		resched_curr(rq_of(cfs_rq));
-		clear_buddies(cfs_rq, se);
-	}
+	if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+		__resched_curr(rq, RESCHED_eager);
+	else
+		resched_curr(rq);
+
+	clear_buddies(cfs_rq, se);
 }
 
 #include "pelt.h"
@@ -1147,7 +1156,7 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 /*
  * Update the current task's runtime statistics.
  */
-static void update_curr(struct cfs_rq *cfs_rq)
+static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
 {
 	struct sched_entity *curr = cfs_rq->curr;
 	u64 now = rq_clock_task(rq_of(cfs_rq));
@@ -1174,7 +1183,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
-	update_deadline(cfs_rq, curr);
+	update_deadline(cfs_rq, curr, tick);
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -1188,6 +1197,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+	__update_curr(cfs_rq, false);
+}
+
 static void update_curr_fair(struct rq *rq)
 {
 	update_curr(cfs_rq_of(&rq->curr->se));
@@ -5309,7 +5323,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Update run-time statistics of the 'current'.
 	 */
-	update_curr(cfs_rq);
+	__update_curr(cfs_rq, true);
 
 	/*
 	 * Ensure that runnable average is periodically updated.
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a79ce6746dd0..5fdb93f1b87e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2664,7 +2664,12 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	for_each_sched_rt_entity(rt_se) {
 		if (rt_se->run_list.prev != rt_se->run_list.next) {
 			requeue_task_rt(rq, p, 0);
-			resched_curr(rq);
+
+			if (test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+				__resched_curr(rq, RESCHED_eager);
+			else
+				resched_curr(rq);
+
 			return;
 		}
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9e1329a4e890..e29a8897f573 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2434,6 +2434,7 @@ extern void init_sched_fair_class(void);
 
 extern void reweight_task(struct task_struct *p, int prio);
 
+extern void __resched_curr(struct rq *rq, resched_t rs);
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 43/86] sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (41 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 42/86] sched: force preemption on tick expiration Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  9:58   ` Peter Zijlstra
  2023-11-07 21:57 ` [RFC PATCH 44/86] sched: voluntary preemption Ankur Arora
                   ` (19 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

The scheduler uses PREEMPT_COUNT and PREEMPTION to drive
preemption: the first to demarcate non-preemptible sections and
the second for the actual mechanics of preemption.

Enable both for voluntary preemption models.

In addition, define a new scheduler feature FORCE_PREEMPT which
can now be used to distinguish between voluntary and full
preemption models at runtime.
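
With this, the check in resched_curr() becomes a runtime one (taken
from the hunk below), and -- like any other sched feature -- it should
be togglable through the usual sched_feat debugfs interface (assumed
here to be /sys/kernel/debug/sched/features):

	/* resched_curr(): runtime policy check instead of IS_ENABLED(CONFIG_PREEMPT) */
	if (sched_feat(FORCE_PREEMPT) ||
	    (rq->curr->sched_class == &idle_sched_class)) {
		rs = RESCHED_eager;
		goto resched;
	}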

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 init/Makefile           |  2 +-
 kernel/Kconfig.preempt  | 12 ++++++++----
 kernel/entry/common.c   |  3 +--
 kernel/sched/core.c     | 26 +++++++++++---------------
 kernel/sched/features.h |  6 ++++++
 5 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/init/Makefile b/init/Makefile
index 385fd80fa2ef..99e480f24cf3 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -24,7 +24,7 @@ mounts-$(CONFIG_BLK_DEV_INITRD)	+= do_mounts_initrd.o
 #
 
 smp-flag-$(CONFIG_SMP)			:= SMP
-preempt-flag-$(CONFIG_PREEMPT)          := PREEMPT
+preempt-flag-$(CONFIG_PREEMPTION)       := PREEMPT_DYNAMIC
 preempt-flag-$(CONFIG_PREEMPT_RT)	:= PREEMPT_RT
 
 build-version = $(or $(KBUILD_BUILD_VERSION), $(build-version-auto))
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index aa87b5cd3ecc..074fe5e253b5 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -6,20 +6,23 @@ choice
 
 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
+	select PREEMPTION
 	help
 	  This is the traditional Linux preemption model, geared towards
 	  throughput. It will still provide good latencies most of the
-	  time, but there are no guarantees and occasional longer delays
-	  are possible.
+	  time, but occasional delays are possible.
 
 	  Select this option if you are building a kernel for a server or
 	  scientific/computation system, or if you want to maximize the
 	  raw processing power of the kernel, irrespective of scheduling
-	  latencies.
+	  latencies. Unless your architecture actively disables preemption,
+	  you can always switch to one of the other preemption models
+	  at runtime.
 
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
 	depends on !ARCH_NO_PREEMPT
+	select PREEMPTION
 	help
 	  This option reduces the latency of the kernel by adding more
 	  "explicit preemption points" to the kernel code. These new
@@ -53,7 +56,8 @@ config PREEMPT
 
 	  Select this if you are building a kernel for a desktop or
 	  embedded system with latency requirements in the milliseconds
-	  range.
+	  range. You can always switch to one of the lower preemption options
+	  at runtime.
 
 config PREEMPT_RT
 	bool "Fully Preemptible Kernel (Real-Time)"
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6433e6c77185..f7f2efabb5b5 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -422,8 +422,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}
 
 		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION))
-			irqentry_exit_cond_resched();
+		irqentry_exit_cond_resched();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f65bf3ce0e9d..2a50a64255c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1065,7 +1065,7 @@ void __resched_curr(struct rq *rq, resched_t rs)
  *
  * Always schedule eagerly, if:
  *
- *  - running under full preemption
+ *  - running under full preemption (sched_feat(FORCE_PREEMPT))
  *
  *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
  *    force TIF_NEED_RESCHED to be set and send a resched IPI.
@@ -1081,7 +1081,7 @@ void resched_curr(struct rq *rq)
 	resched_t rs = RESCHED_lazy;
 	int context;
 
-	if (IS_ENABLED(CONFIG_PREEMPT) ||
+	if (sched_feat(FORCE_PREEMPT) ||
 	    (rq->curr->sched_class == &idle_sched_class)) {
 		rs = RESCHED_eager;
 		goto resched;
@@ -1108,7 +1108,6 @@ void resched_curr(struct rq *rq)
 	context = ct_state_cpu(cpu_of(rq));
 	if ((context == CONTEXT_USER) ||
 	    (context == CONTEXT_GUEST)) {
-
 		rs = RESCHED_eager;
 		goto resched;
 	}
@@ -6597,20 +6596,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  *
  *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
  *
- *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
- *      paths. For example, see arch/x86/entry_64.S.
+ *   2. TIF_NEED_RESCHED flag is checked on interrupt and TIF_NEED_RESCHED[_LAZY]
+ *      flags on userspace return paths. For example, see arch/x86/entry_64.S.
  *
- *      To drive preemption between tasks, the scheduler sets the flag in timer
- *      interrupt handler scheduler_tick().
+ *      To drive preemption between tasks, the scheduler sets one of these
+ *      flags in timer interrupt handler scheduler_tick().
  *
  *   3. Wakeups don't really cause entry into schedule(). They add a
  *      task to the run-queue and that's it.
  *
- *      Now, if the new task added to the run-queue preempts the current
- *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
- *      called on the nearest possible occasion:
- *
- *       - If the kernel is preemptible (CONFIG_PREEMPTION=y):
+ *      - Now, if the new task added to the run-queue preempts the current
+ *        task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
+ *        called on the nearest possible occasion:
  *
  *         - in syscall or exception context, at the next outmost
  *           preempt_enable(). (this might be as soon as the wake_up()'s
@@ -6619,10 +6616,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  *         - in IRQ context, return from interrupt-handler to
  *           preemptible context
  *
- *       - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
- *         then at the next:
+ *      - If the new task preempts the current task, but the scheduling
+ *        policy is only preempt voluntarily, then at the next:
  *
- *          - cond_resched() call
  *          - explicit schedule() call
  *          - return from syscall or exception to user-space
  *          - return from interrupt-handler to user-space
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..9b4c2967b2b7 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,3 +89,9 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(HZ_BW, true)
+
+#if defined(CONFIG_PREEMPT)
+SCHED_FEAT(FORCE_PREEMPT, true)
+#else
+SCHED_FEAT(FORCE_PREEMPT, false)
+#endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 44/86] sched: voluntary preemption
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (42 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 43/86] sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 45/86] preempt: ARCH_NO_PREEMPT only preempts lazily Ankur Arora
                   ` (18 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

The 'none' preemption model allows tasks to run to completion in kernel
context. For voluntary preemption, additionally allow preemption by
higher scheduling classes.

To do this, resched_curr() now takes a parameter specifying whether the
resched is on behalf of a scheduling class above that of the runqueue's
current task, and reschedules eagerly if so.
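
For example, in check_preempt_curr() (taken from the hunk below) the
cross-class wakeup passes above=true, while same-class callers pass
false and stay lazy:

	if (p->sched_class == rq->curr->sched_class)
		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
		resched_curr(rq, true);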

Also define a scheduler feature PREEMPT_PRIORITY which can be used to
toggle the voluntary preemption model at runtime.

TODO: Both RT and deadline work, but I'm almost certainly not doing all
the right things for either.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/Kconfig.preempt    | 19 ++++++-------------
 kernel/sched/core.c       | 28 +++++++++++++++++-----------
 kernel/sched/core_sched.c |  2 +-
 kernel/sched/deadline.c   | 22 +++++++++++-----------
 kernel/sched/fair.c       | 18 +++++++++---------
 kernel/sched/features.h   |  5 +++++
 kernel/sched/idle.c       |  2 +-
 kernel/sched/rt.c         | 26 +++++++++++++-------------
 kernel/sched/sched.h      |  2 +-
 9 files changed, 64 insertions(+), 60 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 074fe5e253b5..e16114b679e3 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -20,23 +20,16 @@ config PREEMPT_NONE
 	  at runtime.
 
 config PREEMPT_VOLUNTARY
-	bool "Voluntary Kernel Preemption (Desktop)"
+	bool "Voluntary Kernel Preemption"
 	depends on !ARCH_NO_PREEMPT
 	select PREEMPTION
 	help
-	  This option reduces the latency of the kernel by adding more
-	  "explicit preemption points" to the kernel code. These new
-	  preemption points have been selected to reduce the maximum
-	  latency of rescheduling, providing faster application reactions,
-	  at the cost of slightly lower throughput.
+	  This option reduces the latency of the kernel by allowing
+	  processes in higher scheduling policy classes to preempt ones
+	  lower down.
 
-	  This allows reaction to interactive events by allowing a
-	  low priority process to voluntarily preempt itself even if it
-	  is in kernel mode executing a system call. This allows
-	  applications to run more 'smoothly' even when the system is
-	  under load.
-
-	  Select this if you are building a kernel for a desktop system.
+	  Higher priority processes in the same scheduling policy class
+	  do not preempt others in the same class.
 
 config PREEMPT
 	bool "Preemptible Kernel (Low-Latency Desktop)"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a50a64255c6..3fa78e8afb7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -256,7 +256,7 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	if (!(flags & DEQUEUE_SAVE) && rq->nr_running == 1 &&
 	    rq->core->core_forceidle_count && rq->curr == rq->idle)
-		resched_curr(rq);
+		resched_curr(rq, false);
 }
 
 static int sched_task_is_throttled(struct task_struct *p, int cpu)
@@ -1074,9 +1074,12 @@ void __resched_curr(struct rq *rq, resched_t rs)
  *
  *  - in userspace: run to completion semantics are only for kernel tasks
  *
- * Otherwise (regardless of priority), run to completion.
+ *  - running under voluntary preemption (sched_feat(PREEMPT_PRIORITY))
+ *    and a task from a sched_class above wants the CPU
+ *
+ * Otherwise, run to completion.
  */
-void resched_curr(struct rq *rq)
+void resched_curr(struct rq *rq, bool above)
 {
 	resched_t rs = RESCHED_lazy;
 	int context;
@@ -1112,6 +1115,9 @@ void resched_curr(struct rq *rq)
 		goto resched;
 	}
 
+	if (sched_feat(PREEMPT_PRIORITY) && above)
+		rs = RESCHED_eager;
+
 resched:
 	__resched_curr(rq, rs);
 }
@@ -1123,7 +1129,7 @@ void resched_cpu(int cpu)
 
 	raw_spin_rq_lock_irqsave(rq, flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
-		resched_curr(rq);
+		resched_curr(rq, true);
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 }
 
@@ -2277,7 +2283,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 	if (p->sched_class == rq->curr->sched_class)
 		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
 	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
-		resched_curr(rq);
+		resched_curr(rq, true);
 
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
@@ -2764,7 +2770,7 @@ int push_cpu_stop(void *arg)
 		deactivate_task(rq, p, 0);
 		set_task_cpu(p, lowest_rq->cpu);
 		activate_task(lowest_rq, p, 0);
-		resched_curr(lowest_rq);
+		resched_curr(lowest_rq, true);
 	}
 
 	double_unlock_balance(rq, lowest_rq);
@@ -3999,7 +4005,7 @@ void wake_up_if_idle(int cpu)
 	if (is_idle_task(rcu_dereference(rq->curr))) {
 		guard(rq_lock_irqsave)(rq);
 		if (is_idle_task(rq->curr))
-			resched_curr(rq);
+			resched_curr(rq, true);
 	}
 }
 
@@ -6333,7 +6339,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			continue;
 		}
 
-		resched_curr(rq_i);
+		resched_curr(rq_i, false);
 	}
 
 out_set_next:
@@ -6388,7 +6394,7 @@ static bool try_steal_cookie(int this, int that)
 		set_task_cpu(p, this);
 		activate_task(dst, p, 0);
 
-		resched_curr(dst);
+		resched_curr(dst, false);
 
 		success = true;
 		break;
@@ -8743,7 +8749,7 @@ int __sched yield_to(struct task_struct *p, bool preempt)
 		 * fairness.
 		 */
 		if (preempt && rq != p_rq)
-			resched_curr(p_rq);
+			resched_curr(p_rq, true);
 	}
 
 out_unlock:
@@ -10300,7 +10306,7 @@ void sched_move_task(struct task_struct *tsk)
 		 * throttled one but it's still the running task. Trigger a
 		 * resched to make sure that task can still run.
 		 */
-		resched_curr(rq);
+		resched_curr(rq, true);
 	}
 
 unlock:
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index a57fd8f27498..32f234f2a210 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -89,7 +89,7 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
 	 * next scheduling edge, rather than always forcing a reschedule here.
 	 */
 	if (task_on_cpu(rq, p))
-		resched_curr(rq);
+		resched_curr(rq, false);
 
 	task_rq_unlock(rq, p, &rf);
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e6815c3bd2f0..ecb47b5e9588 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1177,7 +1177,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	if (dl_task(rq->curr))
 		check_preempt_curr_dl(rq, p, 0);
 	else
-		resched_curr(rq);
+		resched_curr(rq, false);
 
 #ifdef CONFIG_SMP
 	/*
@@ -1367,7 +1367,7 @@ static void update_curr_dl(struct rq *rq)
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
 
 		if (!is_leftmost(curr, &rq->dl))
-			resched_curr(rq);
+			resched_curr(rq, false);
 	}
 
 	/*
@@ -1914,7 +1914,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	    cpudl_find(&rq->rd->cpudl, p, NULL))
 		return;
 
-	resched_curr(rq);
+	resched_curr(rq, false);
 }
 
 static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
@@ -1943,7 +1943,7 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
 	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
-		resched_curr(rq);
+		resched_curr(rq, false);
 		return;
 	}
 
@@ -2307,7 +2307,7 @@ static int push_dl_task(struct rq *rq)
 	if (dl_task(rq->curr) &&
 	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
 	    rq->curr->nr_cpus_allowed > 1) {
-		resched_curr(rq);
+		resched_curr(rq, false);
 		return 0;
 	}
 
@@ -2353,7 +2353,7 @@ static int push_dl_task(struct rq *rq)
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
 
-	resched_curr(later_rq);
+	resched_curr(later_rq, false);
 
 	double_unlock_balance(rq, later_rq);
 
@@ -2457,7 +2457,7 @@ static void pull_dl_task(struct rq *this_rq)
 	}
 
 	if (resched)
-		resched_curr(this_rq);
+		resched_curr(this_rq, false);
 }
 
 /*
@@ -2654,7 +2654,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		if (dl_task(rq->curr))
 			check_preempt_curr_dl(rq, p, 0);
 		else
-			resched_curr(rq);
+			resched_curr(rq, false);
 	} else {
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 	}
@@ -2687,7 +2687,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 		 * runqueue.
 		 */
 		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline))
-			resched_curr(rq);
+			resched_curr(rq, false);
 	} else {
 		/*
 		 * Current may not be deadline in case p was throttled but we
@@ -2697,14 +2697,14 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 		 */
 		if (!dl_task(rq->curr) ||
 		    dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
-			resched_curr(rq);
+			resched_curr(rq, false);
 	}
 #else
 	/*
 	 * We don't know if p has a earlier or later deadline, so let's blindly
 	 * set a (maybe not needed) rescheduling point.
 	 */
-	resched_curr(rq);
+	resched_curr(rq, false);
 #endif
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe7e5e9b2207..448fe36e7bbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1046,7 +1046,7 @@ static void update_deadline(struct cfs_rq *cfs_rq,
 	if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
 		__resched_curr(rq, RESCHED_eager);
 	else
-		resched_curr(rq);
+		resched_curr(rq, false);
 
 	clear_buddies(cfs_rq, se);
 }
@@ -5337,7 +5337,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr(rq_of(cfs_rq));
+		resched_curr(rq_of(cfs_rq), false);
 		return;
 	}
 	/*
@@ -5483,7 +5483,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	 * hierarchy can be throttled
 	 */
 	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
-		resched_curr(rq_of(cfs_rq));
+		resched_curr(rq_of(cfs_rq), false);
 }
 
 static __always_inline
@@ -5743,7 +5743,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
-		resched_curr(rq);
+		resched_curr(rq, false);
 }
 
 #ifdef CONFIG_SMP
@@ -6448,7 +6448,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
 		if (delta < 0) {
 			if (task_current(rq, p))
-				resched_curr(rq);
+				resched_curr(rq, false);
 			return;
 		}
 		hrtick_start(rq, delta);
@@ -8143,7 +8143,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
-	resched_curr(rq);
+	resched_curr(rq, false);
 }
 
 #ifdef CONFIG_SMP
@@ -12294,7 +12294,7 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 	 */
 	if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
-		resched_curr(rq);
+		resched_curr(rq, false);
 }
 
 /*
@@ -12459,7 +12459,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 	 */
 	if (task_current(rq, p)) {
 		if (p->prio > oldprio)
-			resched_curr(rq);
+			resched_curr(rq, false);
 	} else
 		check_preempt_curr(rq, p, 0);
 }
@@ -12561,7 +12561,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
 		 * if we can still preempt the current task.
 		 */
 		if (task_current(rq, p))
-			resched_curr(rq);
+			resched_curr(rq, false);
 		else
 			check_preempt_curr(rq, p, 0);
 	}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9b4c2967b2b7..9bf30732b03f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -92,6 +92,11 @@ SCHED_FEAT(HZ_BW, true)
 
 #if defined(CONFIG_PREEMPT)
 SCHED_FEAT(FORCE_PREEMPT, true)
+SCHED_FEAT(PREEMPT_PRIORITY, true)
+#elif defined(CONFIG_PREEMPT_VOLUNTARY)
+SCHED_FEAT(FORCE_PREEMPT, false)
+SCHED_FEAT(PREEMPT_PRIORITY, true)
 #else
 SCHED_FEAT(FORCE_PREEMPT, false)
+SCHED_FEAT(PREEMPT_PRIORITY, false)
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index eacd204e2879..3ef039869be9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -403,7 +403,7 @@ balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  */
 static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	resched_curr(rq);
+	resched_curr(rq, true);
 }
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 5fdb93f1b87e..8d87e42d30d8 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -589,7 +589,7 @@ static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 			enqueue_rt_entity(rt_se, 0);
 
 		if (rt_rq->highest_prio.curr < curr->prio)
-			resched_curr(rq);
+			resched_curr(rq, false);
 	}
 }
 
@@ -682,7 +682,7 @@ static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 		return;
 
 	enqueue_top_rt_rq(rt_rq);
-	resched_curr(rq);
+	resched_curr(rq, false);
 }
 
 static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
@@ -1076,7 +1076,7 @@ static void update_curr_rt(struct rq *rq)
 			rt_rq->rt_time += delta_exec;
 			exceeded = sched_rt_runtime_exceeded(rt_rq);
 			if (exceeded)
-				resched_curr(rq);
+				resched_curr(rq, false);
 			raw_spin_unlock(&rt_rq->rt_runtime_lock);
 			if (exceeded)
 				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
@@ -1691,7 +1691,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 	 * to try and push the current task away:
 	 */
 	requeue_task_rt(rq, p, 1);
-	resched_curr(rq);
+	resched_curr(rq, false);
 }
 
 static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
@@ -1718,7 +1718,7 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (p->prio < rq->curr->prio) {
-		resched_curr(rq);
+		resched_curr(rq, false);
 		return;
 	}
 
@@ -2074,7 +2074,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	 * just reschedule current.
 	 */
 	if (unlikely(next_task->prio < rq->curr->prio)) {
-		resched_curr(rq);
+		resched_curr(rq, false);
 		return 0;
 	}
 
@@ -2162,7 +2162,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	deactivate_task(rq, next_task, 0);
 	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
-	resched_curr(lowest_rq);
+	resched_curr(lowest_rq, false);
 	ret = 1;
 
 	double_unlock_balance(rq, lowest_rq);
@@ -2456,7 +2456,7 @@ static void pull_rt_task(struct rq *this_rq)
 	}
 
 	if (resched)
-		resched_curr(this_rq);
+		resched_curr(this_rq, false);
 }
 
 /*
@@ -2555,7 +2555,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 			rt_queue_push_tasks(rq);
 #endif /* CONFIG_SMP */
 		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
-			resched_curr(rq);
+			resched_curr(rq, false);
 	}
 }
 
@@ -2583,11 +2583,11 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * then reschedule.
 		 */
 		if (p->prio > rq->rt.highest_prio.curr)
-			resched_curr(rq);
+			resched_curr(rq, false);
 #else
 		/* For UP simply resched on drop of prio */
 		if (oldprio < p->prio)
-			resched_curr(rq);
+			resched_curr(rq, false);
 #endif /* CONFIG_SMP */
 	} else {
 		/*
@@ -2596,7 +2596,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * then reschedule.
 		 */
 		if (p->prio < rq->curr->prio)
-			resched_curr(rq);
+			resched_curr(rq, false);
 	}
 }
 
@@ -2668,7 +2668,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 			if (test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
 				__resched_curr(rq, RESCHED_eager);
 			else
-				resched_curr(rq);
+				resched_curr(rq, false);
 
 			return;
 		}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e29a8897f573..9a745dd7482f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2435,7 +2435,7 @@ extern void init_sched_fair_class(void);
 extern void reweight_task(struct task_struct *p, int prio);
 
 extern void __resched_curr(struct rq *rq, resched_t rs);
-extern void resched_curr(struct rq *rq);
+extern void resched_curr(struct rq *rq, bool above);
 extern void resched_cpu(int cpu);
 
 extern struct rt_bandwidth def_rt_bandwidth;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 45/86] preempt: ARCH_NO_PREEMPT only preempts lazily
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (43 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 44/86] sched: voluntary preemption Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  0:07   ` Steven Rostedt
  2023-11-07 21:57 ` [RFC PATCH 46/86] tracing: handle lazy resched Ankur Arora
                   ` (17 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Note: this commit is badly broken. Only here for discussion.

Configurations with ARCH_NO_PREEMPT support preempt_count, but might
not have been tested well enough under PREEMPTION to support it: they
might not be demarcating all the necessary non-preemptible sections.

One way to handle this is by limiting them to PREEMPT_NONE mode, not
doing any tick enforcement and limiting preemption to happen only at
the user boundary.

Unfortunately, this is only a partial solution because eager
rescheduling could still happen (say, due to RCU wanting an
expedited quiescent state). And, because we do not trust the
preempt_count accounting, this would mean preemption inside an
unmarked critical section.

I suppose we could disable that (say by selecting PREEMPTION=n),
but then the only avenue for driving scheduling between kernel
contexts (when there is no ongoing userspace work) would be
explicit calls to schedule().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c     | 12 ++++++++++--
 kernel/sched/features.h |  7 +++++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3fa78e8afb7d..bf5df2b866df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1059,6 +1059,14 @@ void __resched_curr(struct rq *rq, resched_t rs)
 		trace_sched_wake_idle_without_ipi(cpu);
 }
 
+#ifndef CONFIG_ARCH_NO_PREEMPT
+#define force_preempt() sched_feat(FORCE_PREEMPT)
+#define preempt_priority() sched_feat(PREEMPT_PRIORITY)
+#else
+#define force_preempt() false
+#define preempt_priority() false
+#endif
+
 /*
  * resched_curr - mark rq's current task 'to be rescheduled' eagerly
  * or lazily according to the current policy.
@@ -1084,7 +1092,7 @@ void resched_curr(struct rq *rq, bool above)
 	resched_t rs = RESCHED_lazy;
 	int context;
 
-	if (sched_feat(FORCE_PREEMPT) ||
+	if (force_preempt() ||
 	    (rq->curr->sched_class == &idle_sched_class)) {
 		rs = RESCHED_eager;
 		goto resched;
@@ -1115,7 +1123,7 @@ void resched_curr(struct rq *rq, bool above)
 		goto resched;
 	}
 
-	if (sched_feat(PREEMPT_PRIORITY) && above)
+	if (preempt_priority() && above)
 		rs = RESCHED_eager;
 
 resched:
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9bf30732b03f..2575d018b181 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,6 +90,12 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(HZ_BW, true)
 
+#ifndef CONFIG_ARCH_NO_PREEMPT
+/*
+ * Architectures with CONFIG_ARCH_NO_PREEMPT cannot safely preempt.
+ * So even though they enable CONFIG_PREEMPTION, they never have the
+ * option to dynamically switch preemption models.
+ */
 #if defined(CONFIG_PREEMPT)
 SCHED_FEAT(FORCE_PREEMPT, true)
 SCHED_FEAT(PREEMPT_PRIORITY, true)
@@ -100,3 +106,4 @@ SCHED_FEAT(PREEMPT_PRIORITY, true)
 SCHED_FEAT(FORCE_PREEMPT, false)
 SCHED_FEAT(PREEMPT_PRIORITY, false)
 #endif
+#endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 46/86] tracing: handle lazy resched
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (44 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 45/86] preempt: ARCH_NO_PREEMPT only preempts lazily Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  0:19   ` Steven Rostedt
  2023-11-07 21:57 ` [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT Ankur Arora
                   ` (16 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Tracing support.

Note: this is quite incomplete.
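
For reference, the need-resched column in the latency format then reads
(derived from the trace_output.c switch below; '.' is the pre-existing
default for no flag set):

	'B' : NEED_RESCHED | NEED_RESCHED_LAZY | PREEMPT_RESCHED
	'N' : NEED_RESCHED | PREEMPT_RESCHED
	'L' : NEED_RESCHED_LAZY | PREEMPT_RESCHED
	'b' : NEED_RESCHED | NEED_RESCHED_LAZY
	'n' : NEED_RESCHED
	'l' : NEED_RESCHED_LAZY
	'p' : PREEMPT_RESCHED
	'.' : none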

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/trace_events.h |  6 +++---
 kernel/trace/trace.c         |  2 ++
 kernel/trace/trace_output.c  | 16 ++++++++++++++--
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 21ae37e49319..355d25d5e398 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -178,7 +178,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status);
 
 enum trace_flag_type {
 	TRACE_FLAG_IRQS_OFF		= 0x01,
-	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
+	TRACE_FLAG_NEED_RESCHED_LAZY    = 0x02,
 	TRACE_FLAG_NEED_RESCHED		= 0x04,
 	TRACE_FLAG_HARDIRQ		= 0x08,
 	TRACE_FLAG_SOFTIRQ		= 0x10,
@@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_ctx(void)
 
 static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
 {
-	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+	return tracing_gen_ctx_irq_test(0);
 }
 static inline unsigned int tracing_gen_ctx(void)
 {
-	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+	return tracing_gen_ctx_irq_test(0);
 }
 #endif
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7f067ad9cf50..0776dba32c2d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
 
 	if (tif_need_resched(RESCHED_eager))
 		trace_flags |= TRACE_FLAG_NEED_RESCHED;
+	if (tif_need_resched(RESCHED_lazy))
+		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
 	if (test_preempt_need_resched())
 		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
 	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index db575094c498..c251a44ad8ac 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
 		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
 		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
 		bh_off ? 'b' :
-		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
 		'.';
 
-	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
 				TRACE_FLAG_PREEMPT_RESCHED)) {
+	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+		need_resched = 'B';
+		break;
 	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
 		need_resched = 'N';
 		break;
+	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+		need_resched = 'L';
+		break;
+	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+		need_resched = 'b';
+		break;
 	case TRACE_FLAG_NEED_RESCHED:
 		need_resched = 'n';
 		break;
+	case TRACE_FLAG_NEED_RESCHED_LAZY:
+		need_resched = 'l';
+		break;
 	case TRACE_FLAG_PREEMPT_RESCHED:
 		need_resched = 'p';
 		break;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (45 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 46/86] tracing: handle lazy resched Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-08  0:27   ` Steven Rostedt
  2023-11-08 12:15   ` Julian Anastasov
  2023-11-07 21:57 ` [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n Ankur Arora
                   ` (15 subsequent siblings)
  62 siblings, 2 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

With PREEMPTION being always-on, some configurations might prefer
the stronger forward-progress guarantees provided by PREEMPT_RCU=n
as compared to PREEMPT_RCU=y.

So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE, and
PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.

Note that the preemption model can be changed at runtime (modulo
configurations with ARCH_NO_PREEMPT), but the RCU configuration
is statically compiled.
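
In tabular form, the proposed mapping (per the Kconfig hunks below;
TINY_RCU still covers the !SMP case):

	preemption model     PREEMPT_RCU
	none, voluntary      n  (TREE_RCU, or TINY_RCU on UP)
	full (PREEMPT)       y
	PREEMPT_RT           y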

Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

---
CC-note: Paul had flagged some code that might be impacted
with the proposed RCU changes:

1. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
   unchanged, but I must defer to the include/net/ip_vs.h people.

2. I need to check with the BPF folks on the BPF verifier's
   definition of BTF_ID(func, rcu_read_unlock_strict).

3. I must defer to others on the mm/pgtable-generic.c file's
   #ifdef that depends on CONFIG_PREEMPT_RCU.

Detailed here:
 https://lore.kernel.org/lkml/a375674b-de27-4965-a4bf-e0679229e28e@paulmck-laptop/

---
 include/linux/rcutree.h | 2 +-
 kernel/rcu/Kconfig      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 126f6b418f6a..75aaa6294421 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -104,7 +104,7 @@ extern int rcu_scheduler_active;
 void rcu_end_inkernel_boot(void);
 bool rcu_inkernel_boot_has_ended(void);
 bool rcu_is_watching(void);
-#ifndef CONFIG_PREEMPTION
+#ifndef CONFIG_PREEMPT
 void rcu_all_qs(void);
 #endif
 
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bdd7eadb33d8..a808cb29ab7c 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU
 
 config PREEMPT_RCU
 	bool
-	default y if PREEMPTION
+	default y if PREEMPT || PREEMPT_RT
 	select TREE_RCU
 	help
 	  This option selects the RCU implementation that is
@@ -31,7 +31,7 @@ config PREEMPT_RCU
 
 config TINY_RCU
 	bool
-	default y if !PREEMPTION && !SMP
+	default y if !PREEMPT && !SMP
 	help
 	  This option selects the RCU implementation that is
 	  designed for UP systems from which real-time response
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (46 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-21  0:38   ` Paul E. McKenney
  2023-11-21  3:55   ` Z qiang
  2023-11-07 21:57 ` [RFC PATCH 49/86] osnoise: handle quiescent states directly Ankur Arora
                   ` (14 subsequent siblings)
  62 siblings, 2 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

cond_resched() is used to provide urgent quiescent states for
read-side critical sections on PREEMPT_RCU=n configurations.
This was necessary because, lacking preempt_count, there was no
way for the tick handler to know if we were executing in an RCU
read-side critical section or not.

An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
reliably report quiescent states.

Accordingly, evaluate preempt_count() based quiescence in
rcu_flavor_sched_clock_irq().
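
The condition added below treats the CPU as quiescent whenever the
interrupted context holds no preempt or softirq counts (masks as in
<linux/preempt.h>); a sketch:

	/*
	 * Not inside preempt_disable() and not in softirq context, so --
	 * with PREEMPT_RCU=n, where readers map onto preempt/bh disabled
	 * regions -- the CPU cannot be in an RCU read-side critical
	 * section and a quiescent state can be reported.
	 */
	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))
		/* report a quiescent state */;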

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/rcu/tree_plugin.h |  3 ++-
 kernel/sched/core.c      | 15 +--------------
 2 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index f87191e008ff..618f055f8028 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
  */
 static void rcu_flavor_sched_clock_irq(int user)
 {
-	if (user || rcu_is_cpu_rrupt_from_idle()) {
+	if (user || rcu_is_cpu_rrupt_from_idle() ||
+	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 
 		/*
 		 * Get here if this CPU took its interrupt from user
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf5df2b866df..15db5fb7acc7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
 		preempt_schedule_common();
 		return 1;
 	}
-	/*
-	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
-	 * whether the current CPU is in an RCU read-side critical section,
-	 * so the tick can report quiescent states even for CPUs looping
-	 * in kernel context.  In contrast, in non-preemptible kernels,
-	 * RCU readers leave no in-memory hints, which means that CPU-bound
-	 * processes executing in kernel context might never report an
-	 * RCU quiescent state.  Therefore, the following code causes
-	 * cond_resched() to report a quiescent state, but only when RCU
-	 * is in urgent need of one.
-	 */
-#ifndef CONFIG_PREEMPT_RCU
-	rcu_all_qs();
-#endif
+
 	return 0;
 }
 EXPORT_SYMBOL(_cond_resched);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 49/86] osnoise: handle quiescent states directly
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (47 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 50/86] rcu: TASKS_RCU does not need to depend on PREEMPTION Ankur Arora
                   ` (13 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

To reduce RCU noise for the stopped-tick case, osnoise currently issues
explicit quiescent states for PREEMPT_RCU=y, and depends on
cond_resched() (and thus rcu_all_qs()) to handle PREEMPT_RCU=n.

With cond_resched() going away, issue explicit quiescent states for
all configurations.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/trace/trace_osnoise.c | 37 ++++++++++++------------------------
 1 file changed, 12 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index bd0d01d00fb9..db38934c4242 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1531,34 +1531,21 @@ static int run_osnoise(void)
 
 		/*
 		 * In some cases, notably when running on a nohz_full CPU with
-		 * a stopped tick PREEMPT_RCU has no way to account for QSs.
-		 * This will eventually cause unwarranted noise as PREEMPT_RCU
-		 * will force preemption as the means of ending the current
-		 * grace period. We avoid this problem by calling
-		 * rcu_momentary_dyntick_idle(), which performs a zero duration
-		 * EQS allowing PREEMPT_RCU to end the current grace period.
-		 * This call shouldn't be wrapped inside an RCU critical
-		 * section.
-		 *
-		 * Note that in non PREEMPT_RCU kernels QSs are handled through
-		 * cond_resched()
+		 * a stopped tick RCU has no way to account for QSs. This will
+		 * eventually cause unwarranted noise as RCU forces preemption
+		 * as the means of ending the current grace period.
+		 * We avoid this problem by calling rcu_momentary_dyntick_idle(),
+		 * which performs a zero duration EQS allowing RCU to end the
+		 * current grace period. This call shouldn't be wrapped inside
+		 * an RCU critical section.
 		 */
-		if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
-			if (!disable_irq)
-				local_irq_disable();
+		if (!disable_irq)
+			local_irq_disable();
 
-			rcu_momentary_dyntick_idle();
+		rcu_momentary_dyntick_idle();
 
-			if (!disable_irq)
-				local_irq_enable();
-		}
-
-		/*
-		 * For the non-preemptive kernel config: let threads runs, if
-		 * they so wish, unless set not do to so.
-		 */
-		if (!disable_irq && !disable_preemption)
-			cond_resched();
+		if (!disable_irq)
+			local_irq_enable();
 
 		last_sample = sample;
 		last_int_count = int_count;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 50/86] rcu: TASKS_RCU does not need to depend on PREEMPTION
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (48 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 49/86] osnoise: handle quiescent states directly Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-21  0:38   ` Paul E. McKenney
  2023-11-07 21:57 ` [RFC PATCH 51/86] preempt: disallow !PREEMPT_COUNT or !PREEMPTION Ankur Arora
                   ` (12 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

With PREEMPTION being always enabled, we don't need TASKS_RCU
to be explicitly conditioned on it.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/Kconfig             | 4 ++--
 include/linux/rcupdate.h | 4 ----
 kernel/bpf/Kconfig       | 2 +-
 kernel/trace/Kconfig     | 4 ++--
 4 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 05ce60036ecc..f5179b24072c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -55,7 +55,7 @@ config KPROBES
 	depends on MODULES
 	depends on HAVE_KPROBES
 	select KALLSYMS
-	select TASKS_RCU if PREEMPTION
+	select TASKS_RCU
 	help
 	  Kprobes allows you to trap at almost any kernel address and
 	  execute a callback function.  register_kprobe() establishes
@@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
 config OPTPROBES
 	def_bool y
 	depends on KPROBES && HAVE_OPTPROBES
-	select TASKS_RCU if PREEMPTION
+	select TASKS_RCU
 
 config KPROBES_ON_FTRACE
 	def_bool y
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5e5f920ade90..7246ee602b0b 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -171,10 +171,6 @@ static inline void rcu_nocb_flush_deferred_wakeup(void) { }
 	} while (0)
 void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
-# else
-# define rcu_tasks_classic_qs(t, preempt) do { } while (0)
-# define call_rcu_tasks call_rcu
-# define synchronize_rcu_tasks synchronize_rcu
 # endif
 
 # ifdef CONFIG_TASKS_TRACE_RCU
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 6a906ff93006..e3231b28e2a0 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -27,7 +27,7 @@ config BPF_SYSCALL
 	bool "Enable bpf() system call"
 	select BPF
 	select IRQ_WORK
-	select TASKS_RCU if PREEMPTION
+	select TASKS_RCU
 	select TASKS_TRACE_RCU
 	select BINARY_PRINTF
 	select NET_SOCK_MSG if NET
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..e090387b1c2d 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -163,7 +163,7 @@ config TRACING
 	select BINARY_PRINTF
 	select EVENT_TRACING
 	select TRACE_CLOCK
-	select TASKS_RCU if PREEMPTION
+	select TASKS_RCU
 
 config GENERIC_TRACER
 	bool
@@ -204,7 +204,7 @@ config FUNCTION_TRACER
 	select GENERIC_TRACER
 	select CONTEXT_SWITCH_TRACER
 	select GLOB
-	select TASKS_RCU if PREEMPTION
+	select TASKS_RCU
 	select TASKS_RUDE_RCU
 	help
 	  Enable the kernel to trace every kernel function. This is done
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 51/86] preempt: disallow !PREEMPT_COUNT or !PREEMPTION
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (49 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 50/86] rcu: TASKS_RCU does not need to depend on PREEMPTION Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 52/86] sched: remove CONFIG_PREEMPTION from *_needbreak() Ankur Arora
                   ` (11 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

PREEMPT_COUNT and PREEMPTION are selected for all preemption models.
Mark configurations which might not have either as invalid.

Also stub cond_resched() since we don't actually need it for anything.
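
To illustrate the caller-visible semantics (a sketch of the intended
behaviour, not the generated code; the per-CPU flag and function are
made up): with the count always on, every preempt_enable() is a
potential preemption point, whatever the preemption model.

static DEFINE_PER_CPU(int, busy_flag);

static void touch_percpu_state(void)
{
	preempt_disable();		/* preempt_count_inc() + barrier() */
	this_cpu_write(busy_flag, 1);
	/* ... per-CPU work; no migration or preemption here ... */
	this_cpu_write(busy_flag, 0);
	preempt_enable();		/* count dec; may call __preempt_schedule() */
}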

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/kernel.h  | 11 ++---------
 include/linux/preempt.h | 42 +++--------------------------------------
 include/linux/sched.h   |  4 +---
 3 files changed, 6 insertions(+), 51 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index cf077cd69643..a48900d8b409 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -95,13 +95,6 @@
 struct completion;
 struct user;
 
-#ifdef CONFIG_PREEMPT_VOLUNTARY
-extern int _cond_resched(void);
-# define might_resched() _cond_resched()
-#else
-# define might_resched() do { } while (0)
-#endif
-
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 extern void __might_resched(const char *file, int line, unsigned int offsets);
 extern void __might_sleep(const char *file, int line);
@@ -121,7 +114,7 @@ extern void __cant_migrate(const char *file, int line);
  * supposed to.
  */
 # define might_sleep() \
-	do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
+	do { __might_sleep(__FILE__, __LINE__); } while (0)
 /**
  * cant_sleep - annotation for functions that cannot sleep
  *
@@ -163,7 +156,7 @@ extern void __cant_migrate(const char *file, int line);
   static inline void __might_resched(const char *file, int line,
 				     unsigned int offsets) { }
 static inline void __might_sleep(const char *file, int line) { }
-# define might_sleep() do { might_resched(); } while (0)
+# define might_sleep() do { } while (0)
 # define cant_sleep() do { } while (0)
 # define cant_migrate()		do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 0abc6a673c41..dc5125b9c36b 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -197,7 +197,9 @@ extern void preempt_count_sub(int val);
 #define preempt_count_inc() preempt_count_add(1)
 #define preempt_count_dec() preempt_count_sub(1)
 
-#ifdef CONFIG_PREEMPT_COUNT
+#if !defined(CONFIG_PREEMPTION) || !defined(CONFIG_PREEMPT_COUNT)
+#error "Configurations with !CONFIG_PREEMPTION or !CONFIG_PREEMPT_COUNT are not supported."
+#endif
 
 #define preempt_disable() \
 do { \
@@ -215,7 +217,6 @@ do { \
 
 #define preemptible()	(preempt_count() == 0 && !irqs_disabled())
 
-#ifdef CONFIG_PREEMPTION
 #define preempt_enable() \
 do { \
 	barrier(); \
@@ -236,22 +237,6 @@ do { \
 		__preempt_schedule(); \
 } while (0)
 
-#else /* !CONFIG_PREEMPTION */
-#define preempt_enable() \
-do { \
-	barrier(); \
-	preempt_count_dec(); \
-} while (0)
-
-#define preempt_enable_notrace() \
-do { \
-	barrier(); \
-	__preempt_count_dec(); \
-} while (0)
-
-#define preempt_check_resched() do { } while (0)
-#endif /* CONFIG_PREEMPTION */
-
 #define preempt_disable_notrace() \
 do { \
 	__preempt_count_inc(); \
@@ -264,27 +249,6 @@ do { \
 	__preempt_count_dec(); \
 } while (0)
 
-#else /* !CONFIG_PREEMPT_COUNT */
-
-/*
- * Even if we don't have any preemption, we need preempt disable/enable
- * to be barriers, so that we don't have things like get_user/put_user
- * that can cause faults and scheduling migrate into our preempt-protected
- * region.
- */
-#define preempt_disable()			barrier()
-#define sched_preempt_enable_no_resched()	barrier()
-#define preempt_enable_no_resched()		barrier()
-#define preempt_enable()			barrier()
-#define preempt_check_resched()			do { } while (0)
-
-#define preempt_disable_notrace()		barrier()
-#define preempt_enable_no_resched_notrace()	barrier()
-#define preempt_enable_notrace()		barrier()
-#define preemptible()				0
-
-#endif /* CONFIG_PREEMPT_COUNT */
-
 #ifdef MODULE
 /*
  * Modules have no business playing preemption tricks.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6dd206b2ef50..4dabd9530f98 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2088,9 +2088,7 @@ static inline bool test_tsk_need_resched_any(struct task_struct *tsk)
  * value indicates whether a reschedule was done in fact.
  * cond_resched_lock() will drop the spinlock before scheduling,
  */
-#ifndef CONFIG_PREEMPTION
-extern int _cond_resched(void);
-#else
+#ifdef CONFIG_PREEMPTION
 static inline int _cond_resched(void) { return 0; }
 #endif
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 52/86] sched: remove CONFIG_PREEMPTION from *_needbreak()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (50 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 51/86] preempt: disallow !PREEMPT_COUNT or !PREEMPTION Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 53/86] sched: fixup __cond_resched_*() Ankur Arora
                   ` (10 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Since CONFIG_PREEMPTION is always enabled we can remove the clutter.
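
For context, a sketch of the kind of caller that relies on
spin_needbreak() reporting contention regardless of the preemption
model (the queue structure and helpers are made up):

static void drain_queue(struct foo_queue *q)
{
	spin_lock(&q->lock);
	while (!list_empty(&q->items)) {
		process_one_item(q);
		if (spin_needbreak(&q->lock)) {
			/* Let the waiter in; the unlock itself is now
			 * also a preemption point. */
			spin_unlock(&q->lock);
			spin_lock(&q->lock);
		}
	}
	spin_unlock(&q->lock);
}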

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched.h | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4dabd9530f98..6ba4371761c4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2146,16 +2146,13 @@ static inline void cond_resched_rcu(void)
 
 /*
  * Does a critical section need to be broken due to another
- * task waiting?: (technically does not depend on CONFIG_PREEMPTION,
- * but a general need for low latency)
+ * task waiting?: this should really depend on whether we have
+ * sched_feat(FORCE_PREEMPT) or not but that is not visible
+ * outside the scheduler.
  */
 static inline int spin_needbreak(spinlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
 	return spin_is_contended(lock);
-#else
-	return 0;
-#endif
 }
 
 /*
@@ -2163,16 +2160,10 @@ static inline int spin_needbreak(spinlock_t *lock)
  * Returns non-zero if there is another task waiting on the rwlock.
  * Returns zero if the lock is not contended or the system / underlying
  * rwlock implementation does not support contention detection.
- * Technically does not depend on CONFIG_PREEMPTION, but a general need
- * for low latency.
  */
 static inline int rwlock_needbreak(rwlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
 	return rwlock_is_contended(lock);
-#else
-	return 0;
-#endif
 }
 
 static __always_inline bool need_resched(void)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 53/86] sched: fixup __cond_resched_*()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (51 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 52/86] sched: remove CONFIG_PREEMPTION from *_needbreak() Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 54/86] sched: add cond_resched_stall() Ankur Arora
                   ` (9 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Remove the call to _cond_resched(). The rescheduling happens
implicitly when we give up the lock.
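
As a sketch of the caller-side pattern (the table structure and
clear_slot() are made up), the unlock inside cond_resched_lock() is
now itself the preemption point:

static void clear_table(struct foo_table *t)
{
	unsigned long i;

	spin_lock(&t->lock);
	for (i = 0; i < t->nr_slots; i++) {
		clear_slot(t, i);
		if ((i & 127) == 0)
			/* drop the lock, maybe schedule, reacquire */
			cond_resched_lock(&t->lock);
	}
	spin_unlock(&t->lock);
}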

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15db5fb7acc7..e1b0759ed3ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8595,12 +8595,8 @@ EXPORT_SYMBOL(_cond_resched);
 #endif
 
 /*
- * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
- * call schedule, and on return reacquire the lock.
- *
- * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
- * operations here to prevent schedule() from being called twice (once via
- * spin_unlock(), once by hand).
+ * __cond_resched_lock() - if a reschedule is pending, drop the given lock
+ * (implicitly calling schedule), and reacquire the lock.
  */
 int __cond_resched_lock(spinlock_t *lock)
 {
@@ -8611,7 +8607,7 @@ int __cond_resched_lock(spinlock_t *lock)
 
 	if (spin_needbreak(lock) || resched) {
 		spin_unlock(lock);
-		if (!_cond_resched())
+		if (!resched)
 			cpu_relax();
 		ret = 1;
 		spin_lock(lock);
@@ -8629,7 +8625,7 @@ int __cond_resched_rwlock_read(rwlock_t *lock)
 
 	if (rwlock_needbreak(lock) || resched) {
 		read_unlock(lock);
-		if (!_cond_resched())
+		if (!resched)
 			cpu_relax();
 		ret = 1;
 		read_lock(lock);
@@ -8647,7 +8643,7 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
 
 	if (rwlock_needbreak(lock) || resched) {
 		write_unlock(lock);
-		if (!_cond_resched())
+		if (!resched)
 			cpu_relax();
 		ret = 1;
 		write_lock(lock);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 54/86] sched: add cond_resched_stall()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (52 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 53/86] sched: fixup __cond_resched_*() Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-09 11:19   ` Thomas Gleixner
  2023-11-07 21:57 ` [RFC PATCH 55/86] xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq() Ankur Arora
                   ` (8 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

The kernel has a lot of instances of cond_resched() where it is used
as an alternative to spinning in a tight-loop while waiting to
retry an operation, or while waiting for a device state to change.

Unfortunately, because the scheduler is unlikely to have an
interminable supply of runnable tasks on the runqueue, this just
amounts to spinning in a tight-loop with a cond_resched().
(When running in a fully preemptible kernel, cond_resched()
calls are stubbed out so it amounts to even less.)

In sum, cond_resched() in error handling/retry contexts might
be useful in avoiding softlockup splats, but is not very good at
error handling. Ideally, these should be replaced with some kind
of timed or event wait.

For now add cond_resched_stall(), which tries to schedule if
possible, and failing that executes a cpu_relax().
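
As an example of the intended use, a hypothetical device-ready wait
loop (a timed or event wait would still be preferable):

static int foo_wait_ready(struct foo_dev *dev)
{
	int retries = 1000;

	while (!foo_dev_ready(dev)) {
		if (--retries == 0)
			return -ETIMEDOUT;
		/* Schedules if a reschedule is pending, else cpu_relax(). */
		cond_resched_stall();
	}
	return 0;
}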

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched.h |  6 ++++++
 kernel/sched/core.c   | 12 ++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6ba4371761c4..199f8f7211f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2100,6 +2100,7 @@ static inline int _cond_resched(void) { return 0; }
 extern int __cond_resched_lock(spinlock_t *lock);
 extern int __cond_resched_rwlock_read(rwlock_t *lock);
 extern int __cond_resched_rwlock_write(rwlock_t *lock);
+extern int __cond_resched_stall(void);
 
 #define MIGHT_RESCHED_RCU_SHIFT		8
 #define MIGHT_RESCHED_PREEMPT_MASK	((1U << MIGHT_RESCHED_RCU_SHIFT) - 1)
@@ -2135,6 +2136,11 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
+#define cond_resched_stall() ({					\
+	__might_resched(__FILE__, __LINE__, 0);			\
+	__cond_resched_stall();					\
+})
+
 static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1b0759ed3ab..ea00e8489ebb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8652,6 +8652,18 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
 }
 EXPORT_SYMBOL(__cond_resched_rwlock_write);
 
+int __cond_resched_stall(void)
+{
+	if (tif_need_resched(RESCHED_eager)) {
+		__preempt_schedule();
+		return 1;
+	} else {
+		cpu_relax();
+		return 0;
+	}
+}
+EXPORT_SYMBOL(__cond_resched_stall);
+
 /**
  * yield - yield the current processor to other threads.
  *
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 55/86] xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (53 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 54/86] sched: add cond_resched_stall() Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 21:57 ` [RFC PATCH 56/86] xarray: use cond_resched_xas*() Ankur Arora
                   ` (7 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

xarray code has a common open-coded pattern where we do a flush,
release a lock and/or irq (allowing rescheduling to happen) and
reacquire the resource.

Add helpers to do that. Also remove the cond_resched() call which,
with always-on CONFIG_PREEMPTION, is not needed anymore.
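
A sketch of the intended usage of the RCU variant (the scan function
is made up; the real conversions are in the next patch):

static unsigned long count_present(struct address_space *mapping,
				   pgoff_t start, pgoff_t end)
{
	XA_STATE(xas, &mapping->i_pages, start);
	unsigned long nr = 0;
	void *entry;

	rcu_read_lock();
	xas_for_each(&xas, entry, end) {
		if (!xa_is_value(entry))
			nr++;
		/* Pause the walk and exit the RCU read-side critical
		 * section if a reschedule is needed. */
		cond_resched_xas_rcu(&xas);
	}
	rcu_read_unlock();

	return nr;
}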

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/xarray.h | 14 ++++++++++++++
 kernel/sched/core.c    | 17 +++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index cb571dfcf4b1..30b1181219a3 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1883,4 +1883,18 @@ static inline void *xas_next(struct xa_state *xas)
 	return xa_entry(xas->xa, node, xas->xa_offset);
 }
 
+/**
+ * cond_resched_xas_rcu - if a reschedule is needed, allow RCU to
+ * end this read-side critical section, potentially rescheduling,
+ * and begin another.
+ */
+static inline void cond_resched_xas_rcu(struct xa_state *xas)
+{
+	if (need_resched()) {
+		xas_pause(xas);
+		cond_resched_rcu();
+	}
+}
+extern void cond_resched_xas_lock_irq(struct xa_state *xas);
+
 #endif /* _LINUX_XARRAY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ea00e8489ebb..3467a3a7d4bf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8664,6 +8664,23 @@ int __cond_resched_stall(void)
 }
 EXPORT_SYMBOL(__cond_resched_stall);
 
+/**
+ * cond_resched_xas_lock_irq - safely drop the xarray lock, enable IRQs
+ * (which might cause a reschedule), and reacquire the lock.
+ */
+void cond_resched_xas_lock_irq(struct xa_state *xas)
+{
+	lockdep_assert_irqs_disabled();
+
+	xas_pause(xas);
+	xas_unlock_irq(xas);
+
+	__might_resched(__FILE__, __LINE__, 0);
+
+	xas_lock_irq(xas);
+}
+EXPORT_SYMBOL(cond_resched_xas_lock_irq);
+
 /**
  * yield - yield the current processor to other threads.
  *
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 56/86] xarray: use cond_resched_xas*()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (54 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 55/86] xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq() Ankur Arora
@ 2023-11-07 21:57 ` Ankur Arora
  2023-11-07 23:01 ` [RFC PATCH 00/86] Make the kernel preemptible Steven Rostedt
                   ` (6 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 21:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Replace the open-coded xarray pattern (flush, release the resource
to allow rescheduling to happen, reacquire) with the appropriate
helper.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 fs/dax.c            | 15 +++------------
 mm/filemap.c        |  5 +----
 mm/khugepaged.c     |  5 +----
 mm/memfd.c          | 10 ++--------
 mm/page-writeback.c |  5 +----
 mm/shmem.c          | 10 ++--------
 6 files changed, 10 insertions(+), 40 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 8fafecbe42b1..93cf6e8d8990 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -726,10 +726,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
 		if (++scanned % XA_CHECK_SCHED)
 			continue;
 
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
-		cond_resched();
-		xas_lock_irq(&xas);
+		cond_resched_xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
 	return page;
@@ -784,10 +781,7 @@ static int __dax_clear_dirty_range(struct address_space *mapping,
 		if (++scanned % XA_CHECK_SCHED)
 			continue;
 
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
-		cond_resched();
-		xas_lock_irq(&xas);
+		cond_resched_xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
 
@@ -1052,10 +1046,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		if (++scanned % XA_CHECK_SCHED)
 			continue;
 
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
-		cond_resched();
-		xas_lock_irq(&xas);
+		cond_resched_xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
 	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
diff --git a/mm/filemap.c b/mm/filemap.c
index f0a15ce1bd1b..dc4dcc5eaf5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4210,10 +4210,7 @@ static void filemap_cachestat(struct address_space *mapping,
 			cs->nr_writeback += nr_pages;
 
 resched:
-		if (need_resched()) {
-			xas_pause(&xas);
-			cond_resched_rcu();
-		}
+		cond_resched_xas_rcu(&xas);
 	}
 	rcu_read_unlock();
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88433cc25d8a..4025225ef434 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2290,10 +2290,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 
 		present++;
 
-		if (need_resched()) {
-			xas_pause(&xas);
-			cond_resched_rcu();
-		}
+		cond_resched_xas_rcu(&xas);
 	}
 	rcu_read_unlock();
 
diff --git a/mm/memfd.c b/mm/memfd.c
index 2dba2cb6f0d0..5c92f7317dbe 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -55,10 +55,7 @@ static void memfd_tag_pins(struct xa_state *xas)
 			continue;
 		latency = 0;
 
-		xas_pause(xas);
-		xas_unlock_irq(xas);
-		cond_resched();
-		xas_lock_irq(xas);
+		cond_resched_xas_lock_irq(xas);
 	}
 	xas_unlock_irq(xas);
 }
@@ -123,10 +120,7 @@ static int memfd_wait_for_pins(struct address_space *mapping)
 				continue;
 			latency = 0;
 
-			xas_pause(&xas);
-			xas_unlock_irq(&xas);
-			cond_resched();
-			xas_lock_irq(&xas);
+			cond_resched_xas_lock_irq(&xas);
 		}
 		xas_unlock_irq(&xas);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b8d3d7040a50..61a190b9d83c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2351,10 +2351,7 @@ void tag_pages_for_writeback(struct address_space *mapping,
 		if (++tagged % XA_CHECK_SCHED)
 			continue;
 
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
-		cond_resched();
-		xas_lock_irq(&xas);
+		cond_resched_xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 69595d341882..112172031b2c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -882,10 +882,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 			swapped++;
 		if (xas.xa_index == max)
 			break;
-		if (need_resched()) {
-			xas_pause(&xas);
-			cond_resched_rcu();
-		}
+		cond_resched_xas_rcu(&xas);
 	}
 
 	rcu_read_unlock();
@@ -1299,10 +1296,7 @@ static int shmem_find_swap_entries(struct address_space *mapping,
 		if (!folio_batch_add(fbatch, folio))
 			break;
 
-		if (need_resched()) {
-			xas_pause(&xas);
-			cond_resched_rcu();
-		}
+		cond_resched_xas_rcu(&xas);
 	}
 	rcu_read_unlock();
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (55 preceding siblings ...)
  2023-11-07 21:57 ` [RFC PATCH 56/86] xarray: use cond_resched_xas*() Ankur Arora
@ 2023-11-07 23:01 ` Steven Rostedt
  2023-11-07 23:43   ` Ankur Arora
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                   ` (5 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:56:46 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Hi,

Hi Ankur,

Thanks for doing this!

> 
> We have two models of preemption: voluntary and full (and RT which is
> a fuller form of full preemption.) In this series -- which is based
> on Thomas' PoC (see [1]), we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.

I would say there's "NONE" which is really just a "voluntary" but with
fewer preemption points ;-) But still should be mentioned, otherwise people
may get confused.

> 
> (Note that this is about preemption when executing in the kernel.
> Userspace is always preemptible.)
> 


> Design
> ==
> 
> As Thomas outlines in [1], to unify the preemption models we
> want to: always have the preempt_count enabled and allow the scheduler
> to drive preemption policy based on the model in effect.
> 
> Policies:
> 
> - preemption=none: run to completion
> - preemption=voluntary: run to completion, unless a task of higher
>   sched-class awaits
> - preemption=full: optimized for low-latency. Preempt whenever a higher
>   priority task awaits.
> 
> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
> scheduler to mark that a reschedule is needed, but is deferred until
> the task finishes executing in the kernel -- voluntary preemption
> as it were.
> 
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
> 
>          ret-to-user    ret-to-kernel    preempt_count()
> none           Y              N                N
> voluntary      Y              Y                Y
> full           Y              Y                Y

Wait. The above is for when RESCHED_LAZY is to preempt, right?

Then, shouldn't voluntary be:

 voluntary      Y              N                N

For LAZY, but 

 voluntary      Y              Y                Y

For NEED_RESCHED (without lazy)

That is, the only difference between voluntary and none (as you describe
above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED,
but on none, it still sets NEED_RESCHED_LAZY?

> 
> 
> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spread a long time in the kernel have no way to give
> up the CPU.

I wonder if this needs to be solved with a user space knob, to trigger
the time that "NEED_RESCHED" will force a schedule?

> 
> For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.
> 
> For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
> if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick.
> (This would cause preemption either at ret-to-kernel, or if the task is in
> a non-preemptible section, when it exits that section.)
> 
> Arguably this provides for much more consistent maximum latency (~2 tick
> lengths + length of non-preemptible section) as compared to the old model
> where the maximum latency depended on the dynamic distribution of
> cond_resched() points.

Again, this is why I think we probably want a knob for users to adjust
this. By default it will be set to "tick", but if not, then we need to add
another timer to trigger before then. And this would only be available with
HRTIMERS of course ;-)
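
Just to make that concrete, something like the below (names made up,
hrtimer plumbing left out), where 0 means "upgrade at the next tick":

static unsigned int sysctl_resched_lazy_upgrade_us;

static struct ctl_table resched_lazy_table[] = {
	{
		.procname	= "resched_lazy_upgrade_us",
		.data		= &sysctl_resched_lazy_upgrade_us,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,
	},
	{ }
};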

> 
> (As a bonus it handles code that is preemptible but cannot call
> cond_resched() completely trivially: ex. long running Xen hypercalls, or
> this series which started this discussion:
>  https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)
> 
> 
> Status
> ==
> 
> What works:
>  - The system seems to keep ticking over with the normal scheduling
> policies (SCHED_OTHER). The support for the realtime policies is somewhat
> more half baked.)
>  - The basic performance numbers seem pretty close to 6.6-rc7 baseline
> 
> What's broken:
>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>    lazily")
>  - Non-x86 architectures. It's trivial to support other archs (only need
>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>    comments on the series.
>    (From some testing on arm64, didn't find any surprises.)


>  - livepatch: livepatch depends on using _cond_resched() to provide
>    low-latency patching. That is obviously difficult with cond_resched()
>    gone. We could get a similar effect by using a static_key in
>    preempt_enable() but at least with inline locks, that might be end
>    up bloating the kernel quite a bit.

Maybe if we have that timer, livepatch could set it to be temporarily
shorter?

>  - Documentation/ and comments mention cond_resched()

>  - ftrace support for need-resched-lazy is incomplete

Shouldn't be a problem.

> 
> What needs more discussion:
>  - Should cond_resched_lock() etc be scheduling out for TIF_NEED_RESCHED
>    only or both TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
>    change to tif_need_resched(resched_t)")

I would say NEED_RESCHED only, then it would match the description of the
different models above.

>  - Tracking whether a task in userspace or in the kernel (See patch-40
>    "context_tracking: add ct_state_cpu()")
>  - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
>    preemption")
> 
> 
> Performance
> ==
> 


>   * optimal-load (-j 1024)
> 
>            6.6-rc7                                    +series
>         
> 
>   wall        139.2 +-       0.3             wall       138.8 +-       0.2
>   utime     11161.0 +-    3360.4             utime    11061.2 +-    3244.9
>   stime      1357.6 +-     199.3             stime     1366.6 +-     216.3
>   %cpu       9108.8 +-    2431.4             %cpu      9081.0 +-    2351.1
>   csw     2078599   +- 2013320.0             csw    1970610   +- 1969030.0
> 
> 
>   For both of these the wallclock, utime, stime etc are pretty much
>   identical. The one interesting difference is that the number of
>   context switches are fewer. This intuitively makes sense given that
>   we reschedule threads lazily rather than rescheduling if we encounter
>   a cond_resched() and there's a thread wanting to be scheduled.
> 
>   The max-load numbers (not posted here) also behave similarly.

It would be interesting to run any "latency sensitive" benchmarks.

I wonder how cyclictest would work under each model with and without this
patch?

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 02/86] Revert "sched/core: Make sched_dynamic_mutex static"
  2023-11-07 21:56 ` [RFC PATCH 02/86] Revert "sched/core: Make sched_dynamic_mutex static" Ankur Arora
@ 2023-11-07 23:04   ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:04 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:56:48 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> This reverts commit 9b8e17813aeccc29c2f9f2e6e68997a6eac2d26d.

Please explain why it's being reverted.

-- Steve

> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 802551e0009b..ab773ea2cb34 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8746,7 +8746,7 @@ int sched_dynamic_mode(const char *str)
>  #error "Unsupported PREEMPT_DYNAMIC mechanism"
>  #endif
>  
> -static DEFINE_MUTEX(sched_dynamic_mutex);
> +DEFINE_MUTEX(sched_dynamic_mutex);
>  static bool klp_override;
>  
>  static void __sched_dynamic_update(int mode)


^ permalink raw reply	[flat|nested] 250+ messages in thread

* [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (56 preceding siblings ...)
  2023-11-07 23:01 ` [RFC PATCH 00/86] Make the kernel preemptible Steven Rostedt
@ 2023-11-07 23:07 ` Ankur Arora
  2023-11-07 23:07   ` [RFC PATCH 58/86] treewide: x86: " Ankur Arora
                     ` (30 more replies)
  2023-11-08  4:08 ` [RFC PATCH 00/86] Make the kernel preemptible Christoph Lameter
                   ` (4 subsequent siblings)
  62 siblings, 31 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Julia Lawall, Nicolas Palix

Rudimentary script to remove the straightforward subset of
cond_resched() and its allies:

1)  if (need_resched())
	  cond_resched()

2)  expression*;
    cond_resched();  /* or in the reverse order */

3)  if (expression)
	statement
    cond_resched();  /* or in the reverse order */

The last two patterns depend on the control flow level to ensure
that the complex cond_resched() patterns (ex. conditioned ones)
are left alone and we only pick up ones which are minimally
related to the neighbouring code.
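
As an illustration, pattern 2 turns a (made up) loop like the one
below into the version without the naked cond_resched(); the script
can be driven through the usual coccicheck machinery, e.g.
"make coccicheck MODE=patch COCCI=scripts/coccinelle/api/cond_resched.cocci".

	/* before */
	for (i = 0; i < nr; i++) {
		process(pages[i]);
		cond_resched();
	}

	/* after */
	for (i = 0; i < nr; i++) {
		process(pages[i]);
	}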

Cc: Julia Lawall <Julia.Lawall@inria.fr>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 scripts/coccinelle/api/cond_resched.cocci

diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
new file mode 100644
index 000000000000..bf43768a8f8c
--- /dev/null
+++ b/scripts/coccinelle/api/cond_resched.cocci
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/// Remove naked cond_resched() statements
+///
+//# Remove cond_resched() statements when:
+//#   - executing at the same control flow level as the previous or the
+//#     next statement (this lets us avoid complicated conditionals in
+//#     the neighbourhood.)
+//#   - they are of the form "if (need_resched()) cond_resched()" which
+//#     is always safe.
+//#
+//# Coccinelle generally takes care of comments in the immediate neighbourhood
+//# but might need to handle other comments alluding to rescheduling.
+//#
+virtual patch
+virtual context
+
+@ r1 @
+identifier r;
+@@
+
+(
+ r = cond_resched();
+|
+-if (need_resched())
+-	cond_resched();
+)
+
+@ r2 @
+expression E;
+statement S,T;
+@@
+(
+ E;
+|
+ if (E) S
+|
+ if (E) S else T
+|
+)
+-cond_resched();
+
+@ r3 @
+expression E;
+statement S,T;
+@@
+-cond_resched();
+(
+ E;
+|
+ if (E) S
+|
+ if (E) S else T
+)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 58/86] treewide: x86: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
@ 2023-11-07 23:07   ` Ankur Arora
  2023-11-07 23:07   ` [RFC PATCH 59/86] treewide: rcu: " Ankur Arora
                     ` (29 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Sean Christopherson, Paolo Bonzini, David S. Miller, David Ahern

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.
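
(Minimal illustrations of the three sets, with made-up call sites:)

	/* set-1: courtesy scheduling point */
	for (i = 0; i < nr_pages; i++) {
		flush_dcache_page(pages[i]);
		cond_resched();
	}

	/* set-2: open-coded cond_resched_lock() */
	spin_unlock(&lock);
	cond_resched();
	spin_lock(&lock);

	/* set-3: retry loop using cond_resched() as a poor man's wait */
	while (!device_ready(dev))
		cond_resched();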

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well
be -- for most workloads the scheduler does not have an interminable
supply of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the instances of cond_resched() here are from set-1 or set-2.
Remove them.

There's one set-3 case where kvm_recalculate_apic_map() sees an
unexpected APIC-id, where we now use cond_resched_stall() to delay
the retry.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Thomas Gleixner <tglx@linutronix.de> 
Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Borislav Petkov <bp@alien8.de> 
Cc: Dave Hansen <dave.hansen@linux.intel.com> 
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> 
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: David Ahern <dsahern@kernel.org> 
Cc: x86@kernel.org 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c   | 10 ----------
 arch/x86/kernel/cpu/sgx/encl.c  | 14 +++++++-------
 arch/x86/kernel/cpu/sgx/ioctl.c |  3 ---
 arch/x86/kernel/cpu/sgx/main.c  |  5 -----
 arch/x86/kernel/cpu/sgx/virt.c  |  4 ----
 arch/x86/kvm/lapic.c            |  6 +++++-
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/svm/sev.c          |  5 +++--
 arch/x86/net/bpf_jit_comp.c     |  1 -
 arch/x86/net/bpf_jit_comp32.c   |  1 -
 arch/x86/xen/mmu_pv.c           |  1 -
 11 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 73be3931e4f0..3d0b6a606852 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2189,16 +2189,6 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 */
 	atomic_set_release(&bp_desc.refs, 1);
 
-	/*
-	 * Function tracing can enable thousands of places that need to be
-	 * updated. This can take quite some time, and with full kernel debugging
-	 * enabled, this could cause the softlockup watchdog to trigger.
-	 * This function gets called every 256 entries added to be patched.
-	 * Call cond_resched() here to make sure that other tasks can get scheduled
-	 * while processing all the functions being patched.
-	 */
-	cond_resched();
-
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
 	 * nr_entries and handler are correctly ordered wrt. patching.
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..05afb4e2f552 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -549,14 +549,15 @@ int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 			break;
 		}
 
-		/* Reschedule on every XA_CHECK_SCHED iteration. */
+		/*
+		 * Drop the lock every XA_CHECK_SCHED iteration so the
+		 * scheduler can preempt if needed.
+		 */
 		if (!(++count % XA_CHECK_SCHED)) {
 			xas_pause(&xas);
 			xas_unlock(&xas);
 			mutex_unlock(&encl->lock);
 
-			cond_resched();
-
 			mutex_lock(&encl->lock);
 			xas_lock(&xas);
 		}
@@ -723,16 +724,15 @@ void sgx_encl_release(struct kref *ref)
 		}
 
 		kfree(entry);
+
 		/*
-		 * Invoke scheduler on every XA_CHECK_SCHED iteration
-		 * to prevent soft lockups.
+		 * Drop the lock every XA_CHECK_SCHED iteration so the
+		 * scheduler can preempt if needed.
 		 */
 		if (!(++count % XA_CHECK_SCHED)) {
 			xas_pause(&xas);
 			xas_unlock(&xas);
 
-			cond_resched();
-
 			xas_lock(&xas);
 		}
 	}
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 5d390df21440..2b899569bb60 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -439,9 +439,6 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
 			break;
 		}
 
-		if (need_resched())
-			cond_resched();
-
 		ret = sgx_encl_add_page(encl, add_arg.src + c, add_arg.offset + c,
 					&secinfo, add_arg.flags);
 		if (ret)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..f8bd01e56b72 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -98,8 +98,6 @@ static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
 			list_move_tail(&page->list, &dirty);
 			left_dirty++;
 		}
-
-		cond_resched();
 	}
 
 	list_splice(&dirty, dirty_page_list);
@@ -413,8 +411,6 @@ static int ksgxd(void *p)
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
 			sgx_reclaim_pages();
-
-		cond_resched();
 	}
 
 	return 0;
@@ -581,7 +577,6 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 		}
 
 		sgx_reclaim_pages();
-		cond_resched();
 	}
 
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 7aaa3652e31d..6ce0983c6249 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -175,7 +175,6 @@ static long sgx_vepc_remove_all(struct sgx_vepc *vepc)
 				return -EBUSY;
 			}
 		}
-		cond_resched();
 	}
 
 	/*
@@ -204,7 +203,6 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 			continue;
 
 		xa_erase(&vepc->page_array, index);
-		cond_resched();
 	}
 
 	/*
@@ -223,7 +221,6 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 			list_add_tail(&epc_page->list, &secs_pages);
 
 		xa_erase(&vepc->page_array, index);
-		cond_resched();
 	}
 
 	/*
@@ -245,7 +242,6 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 
 		if (sgx_vepc_free_page(epc_page))
 			list_add_tail(&epc_page->list, &secs_pages);
-		cond_resched();
 	}
 
 	if (!list_empty(&secs_pages))
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3e977dbbf993..dd87a8214c80 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -435,7 +435,11 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
 			kvfree(new);
 			new = NULL;
 			if (r == -E2BIG) {
-				cond_resched();
+				/*
+				 * A vCPU was just added or enabled its APIC.
+				 * Give things time to settle before retrying.
+				 */
+				cond_resched_stall();
 				goto retry;
 			}
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f7901cb4d2fa..58efaca73dd4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6431,8 +6431,8 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 	}
 
 	if (need_topup_split_caches_or_resched(kvm)) {
+		/* The preemption point in write_unlock() reschedules if needed. */
 		write_unlock(&kvm->mmu_lock);
-		cond_resched();
 		/*
 		 * If the topup succeeds, return -EAGAIN to indicate that the
 		 * rmap iterator should be restarted because the MMU lock was
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4900c078045a..a98f29692a29 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -476,7 +476,6 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages)
 		page_virtual = kmap_local_page(pages[i]);
 		clflush_cache_range(page_virtual, PAGE_SIZE);
 		kunmap_local(page_virtual);
-		cond_resched();
 	}
 }
 
@@ -2157,12 +2156,14 @@ void sev_vm_destroy(struct kvm *kvm)
 	/*
 	 * if userspace was terminated before unregistering the memory regions
 	 * then lets unpin all the registered memory.
+	 *
+	 * This might be a while but we are preemptible so the scheduler can
+	 * always preempt if needed.
 	 */
 	if (!list_empty(head)) {
 		list_for_each_safe(pos, q, head) {
 			__unregister_enc_region_locked(kvm,
 				list_entry(pos, struct enc_region, list));
-			cond_resched();
 		}
 	}
 
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index a5930042139d..bae5b39810bb 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -2819,7 +2819,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 			prog->aux->extable = (void *) image + roundup(proglen, align);
 		}
 		oldproglen = proglen;
-		cond_resched();
 	}
 
 	if (bpf_jit_enable > 1)
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 429a89c5468b..03566f031b23 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -2594,7 +2594,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 			}
 		}
 		oldproglen = proglen;
-		cond_resched();
 	}
 
 	if (bpf_jit_enable > 1)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index b6830554ff69..a046cde342b1 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2510,7 +2510,6 @@ int xen_remap_pfn(struct vm_area_struct *vma, unsigned long addr,
 		addr += range;
 		if (err_ptr)
 			err_ptr += batch;
-		cond_resched();
 	}
 out:
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 59/86] treewide: rcu: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
  2023-11-07 23:07   ` [RFC PATCH 58/86] treewide: x86: " Ankur Arora
@ 2023-11-07 23:07   ` Ankur Arora
  2023-11-21  1:01     ` Paul E. McKenney
  2023-11-07 23:07   ` [RFC PATCH 60/86] treewide: torture: " Ankur Arora
                     ` (28 subsequent siblings)
  30 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Frederic Weisbecker

All the cond_resched() calls in the RCU interfaces here exist to
drive preemption once the task has reported a potential quiescent
state, or to exit the grace period. With PREEMPTION=y that should
happen implicitly.

So we can remove these.
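
For instance, a long-running RCU kthread loop now only needs to
report the quiescent state (a sketch; the work function is made up):

	while (!kthread_should_stop()) {
		scan_one_batch();
		/* Reports a Tasks-RCU quiescent state; no explicit
		 * cond_resched() needed with PREEMPTION=y. */
		cond_resched_tasks_rcu_qs();
	}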

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: "Paul E. McKenney" <paulmck@kernel.org> 
Cc: Frederic Weisbecker <frederic@kernel.org> 
Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Juri Lelli <juri.lelli@redhat.com> 
Cc: Vincent Guittot <vincent.guittot@linaro.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/rcupdate.h | 6 ++----
 include/linux/sched.h    | 7 ++++++-
 kernel/hung_task.c       | 6 +++---
 kernel/rcu/tasks.h       | 5 +----
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7246ee602b0b..58f8c7faaa52 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -238,14 +238,12 @@ static inline bool rcu_trace_implies_rcu_gp(void) { return true; }
 /**
  * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
  *
- * This macro resembles cond_resched(), except that it is defined to
- * report potential quiescent states to RCU-tasks even if the cond_resched()
- * machinery were to be shut off, as some advocate for PREEMPTION kernels.
+ * This macro resembles cond_resched(), in that it reports potential
+ * quiescent states to RCU-tasks.
  */
 #define cond_resched_tasks_rcu_qs() \
 do { \
 	rcu_tasks_qs(current, false); \
-	cond_resched(); \
 } while (0)
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 199f8f7211f2..bae6eed534dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2145,7 +2145,12 @@ static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
 	rcu_read_unlock();
-	cond_resched();
+
+	/*
+	 * Might reschedule here as we exit the RCU read-side
+	 * critical section.
+	 */
+
 	rcu_read_lock();
 #endif
 }
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 9a24574988d2..4bdfad08a2e8 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -153,8 +153,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
  * To avoid extending the RCU grace period for an unbounded amount of time,
  * periodically exit the critical section and enter a new one.
  *
- * For preemptible RCU it is sufficient to call rcu_read_unlock in order
- * to exit the grace period. For classic RCU, a reschedule is required.
+ * Under a preemptive kernel, or with preemptible RCU, it is sufficient to
+ * call rcu_read_unlock in order to exit the grace period.
  */
 static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
 {
@@ -163,7 +163,7 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
 	get_task_struct(g);
 	get_task_struct(t);
 	rcu_read_unlock();
-	cond_resched();
+
 	rcu_read_lock();
 	can_cont = pid_alive(g) && pid_alive(t);
 	put_task_struct(t);
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 8d65f7d576a3..fa1d9aa31b36 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -541,7 +541,6 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 		local_bh_disable();
 		rhp->func(rhp);
 		local_bh_enable();
-		cond_resched();
 	}
 	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
 	rcu_segcblist_add_len(&rtpcp->cblist, -len);
@@ -974,10 +973,8 @@ static void check_all_holdout_tasks(struct list_head *hop,
 {
 	struct task_struct *t, *t1;
 
-	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list) {
+	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list)
 		check_holdout_task(t, needreport, firstreport);
-		cond_resched();
-	}
 }
 
 /* Finish off the Tasks-RCU grace period. */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 60/86] treewide: torture: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
  2023-11-07 23:07   ` [RFC PATCH 58/86] treewide: x86: " Ankur Arora
  2023-11-07 23:07   ` [RFC PATCH 59/86] treewide: rcu: " Ankur Arora
@ 2023-11-07 23:07   ` Ankur Arora
  2023-11-21  1:02     ` Paul E. McKenney
  2023-11-07 23:07   ` [RFC PATCH 61/86] treewide: bpf: " Ankur Arora
                     ` (27 subsequent siblings)
  30 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Davidlohr Bueso, Josh Triplett, Frederic Weisbecker

Some cases changed to cond_resched_stall() to avoid changing
the behaviour of the test too drastically.

Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/rcu/rcuscale.c   | 2 --
 kernel/rcu/rcutorture.c | 8 ++++----
 kernel/scftorture.c     | 1 -
 kernel/torture.c        | 1 -
 4 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index ffdb30495e3c..737620bbec83 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -672,8 +672,6 @@ kfree_scale_thread(void *arg)
 			else
 				kfree_rcu(alloc_ptr, rh);
 		}
-
-		cond_resched();
 	} while (!torture_must_stop() && ++loop < kfree_loops);
 
 	if (atomic_inc_return(&n_kfree_scale_thread_ended) >= kfree_nrealthreads) {
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index ade42d6a9d9b..158d58710b51 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -81,7 +81,7 @@ torture_param(int, fqs_stutter, 3, "Wait time between fqs bursts (s)");
 torture_param(int, fwd_progress, 1, "Number of grace-period forward progress tasks (0 to disable)");
 torture_param(int, fwd_progress_div, 4, "Fraction of CPU stall to wait");
 torture_param(int, fwd_progress_holdoff, 60, "Time between forward-progress tests (s)");
-torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched() behind need_resched()");
+torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched_stall() behind need_resched()");
 torture_param(bool, gp_cond, false, "Use conditional/async GP wait primitives");
 torture_param(bool, gp_cond_exp, false, "Use conditional/async expedited GP wait primitives");
 torture_param(bool, gp_cond_full, false, "Use conditional/async full-state GP wait primitives");
@@ -2611,7 +2611,7 @@ static void rcu_torture_fwd_prog_cond_resched(unsigned long iter)
 		return;
 	}
 	// No userspace emulation: CB invocation throttles call_rcu()
-	cond_resched();
+	cond_resched_stall();
 }
 
 /*
@@ -2691,7 +2691,7 @@ static void rcu_torture_fwd_prog_nr(struct rcu_fwd *rfp,
 		udelay(10);
 		cur_ops->readunlock(idx);
 		if (!fwd_progress_need_resched || need_resched())
-			cond_resched();
+			cond_resched_stall();
 	}
 	(*tested_tries)++;
 	if (!time_before(jiffies, stopat) &&
@@ -3232,7 +3232,7 @@ static int rcu_torture_read_exit(void *unused)
 				errexit = true;
 				break;
 			}
-			cond_resched();
+			cond_resched_stall();
 			kthread_stop(tsp);
 			n_read_exits++;
 		}
diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 59032aaccd18..24192fe01125 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -487,7 +487,6 @@ static int scftorture_invoker(void *arg)
 			set_cpus_allowed_ptr(current, cpumask_of(cpu));
 			was_offline = false;
 		}
-		cond_resched();
 		stutter_wait("scftorture_invoker");
 	} while (!torture_must_stop());
 
diff --git a/kernel/torture.c b/kernel/torture.c
index b28b05bbef02..0c0224c76275 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -747,7 +747,6 @@ bool stutter_wait(const char *title)
 			while (READ_ONCE(stutter_pause_test)) {
 				if (!(i++ & 0xffff))
 					torture_hrtimeout_us(10, 0, NULL);
-				cond_resched();
 			}
 		} else {
 			torture_hrtimeout_jiffies(round_jiffies_relative(HZ), NULL);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 61/86] treewide: bpf: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (2 preceding siblings ...)
  2023-11-07 23:07   ` [RFC PATCH 60/86] treewide: torture: " Ankur Arora
@ 2023-11-07 23:07   ` Ankur Arora
  2023-11-07 23:07   ` [RFC PATCH 62/86] treewide: trace: " Ankur Arora
                     ` (26 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, bpf

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the uses of cond_resched() here are from set-1, so we can trivially
remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/bpf/arraymap.c | 3 ---
 kernel/bpf/bpf_iter.c | 7 +------
 kernel/bpf/btf.c      | 9 ---------
 kernel/bpf/cpumap.c   | 2 --
 kernel/bpf/hashtab.c  | 7 -------
 kernel/bpf/syscall.c  | 3 ---
 kernel/bpf/verifier.c | 5 -----
 7 files changed, 1 insertion(+), 35 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2058e89b5ddd..cb0d626038b4 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -25,7 +25,6 @@ static void bpf_array_free_percpu(struct bpf_array *array)
 
 	for (i = 0; i < array->map.max_entries; i++) {
 		free_percpu(array->pptrs[i]);
-		cond_resched();
 	}
 }
 
@@ -42,7 +41,6 @@ static int bpf_array_alloc_percpu(struct bpf_array *array)
 			return -ENOMEM;
 		}
 		array->pptrs[i] = ptr;
-		cond_resched();
 	}
 
 	return 0;
@@ -423,7 +421,6 @@ static void array_map_free(struct bpf_map *map)
 
 				for_each_possible_cpu(cpu) {
 					bpf_obj_free_fields(map->record, per_cpu_ptr(pptr, cpu));
-					cond_resched();
 				}
 			}
 		} else {
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 96856f130cbf..dfb24f76ccf7 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -73,7 +73,7 @@ static inline bool bpf_iter_target_support_resched(const struct bpf_iter_target_
 	return tinfo->reg_info->feature & BPF_ITER_RESCHED;
 }
 
-static bool bpf_iter_support_resched(struct seq_file *seq)
+static bool __maybe_unused bpf_iter_support_resched(struct seq_file *seq)
 {
 	struct bpf_iter_priv_data *iter_priv;
 
@@ -97,7 +97,6 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 	struct seq_file *seq = file->private_data;
 	size_t n, offs, copied = 0;
 	int err = 0, num_objs = 0;
-	bool can_resched;
 	void *p;
 
 	mutex_lock(&seq->lock);
@@ -150,7 +149,6 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 		goto done;
 	}
 
-	can_resched = bpf_iter_support_resched(seq);
 	while (1) {
 		loff_t pos = seq->index;
 
@@ -196,9 +194,6 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 			}
 			break;
 		}
-
-		if (can_resched)
-			cond_resched();
 	}
 stop:
 	offs = seq->count;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8090d7fb11ef..fe560f80e230 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5361,8 +5361,6 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
 		if (!__btf_type_is_struct(t))
 			continue;
 
-		cond_resched();
-
 		for_each_member(j, t, member) {
 			if (btf_id_set_contains(&aof.set, member->type))
 				goto parse;
@@ -5427,8 +5425,6 @@ static int btf_check_type_tags(struct btf_verifier_env *env,
 		if (!btf_type_is_modifier(t))
 			continue;
 
-		cond_resched();
-
 		in_tags = btf_type_is_type_tag(t);
 		while (btf_type_is_modifier(t)) {
 			if (!chain_limit--) {
@@ -8296,11 +8292,6 @@ bpf_core_add_cands(struct bpf_cand_cache *cands, const struct btf *targ_btf,
 		if (!targ_name)
 			continue;
 
-		/* the resched point is before strncmp to make sure that search
-		 * for non-existing name will have a chance to schedule().
-		 */
-		cond_resched();
-
 		if (strncmp(cands->name, targ_name, cands->name_len) != 0)
 			continue;
 
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index e42a1bdb7f53..0aed2a6ef262 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -290,8 +290,6 @@ static int cpu_map_kthread_run(void *data)
 			} else {
 				__set_current_state(TASK_RUNNING);
 			}
-		} else {
-			sched = cond_resched();
 		}
 
 		/*
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index a8c7e1c5abfa..17ed14d2dd44 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -142,7 +142,6 @@ static void htab_init_buckets(struct bpf_htab *htab)
 		raw_spin_lock_init(&htab->buckets[i].raw_lock);
 		lockdep_set_class(&htab->buckets[i].raw_lock,
 					  &htab->lockdep_key);
-		cond_resched();
 	}
 }
 
@@ -232,7 +231,6 @@ static void htab_free_prealloced_timers(struct bpf_htab *htab)
 
 		elem = get_htab_elem(htab, i);
 		bpf_obj_free_timer(htab->map.record, elem->key + round_up(htab->map.key_size, 8));
-		cond_resched();
 	}
 }
 
@@ -255,13 +253,10 @@ static void htab_free_prealloced_fields(struct bpf_htab *htab)
 
 			for_each_possible_cpu(cpu) {
 				bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
-				cond_resched();
 			}
 		} else {
 			bpf_obj_free_fields(htab->map.record, elem->key + round_up(htab->map.key_size, 8));
-			cond_resched();
 		}
-		cond_resched();
 	}
 }
 
@@ -278,7 +273,6 @@ static void htab_free_elems(struct bpf_htab *htab)
 		pptr = htab_elem_get_ptr(get_htab_elem(htab, i),
 					 htab->map.key_size);
 		free_percpu(pptr);
-		cond_resched();
 	}
 free_elems:
 	bpf_map_area_free(htab->elems);
@@ -337,7 +331,6 @@ static int prealloc_init(struct bpf_htab *htab)
 			goto free_elems;
 		htab_elem_set_ptr(get_htab_elem(htab, i), htab->map.key_size,
 				  pptr);
-		cond_resched();
 	}
 
 skip_percpu_elems:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d77b2f8b9364..8762c3d678be 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1695,7 +1695,6 @@ int generic_map_delete_batch(struct bpf_map *map,
 		bpf_enable_instrumentation();
 		if (err)
 			break;
-		cond_resched();
 	}
 	if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
 		err = -EFAULT;
@@ -1752,7 +1751,6 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
 
 		if (err)
 			break;
-		cond_resched();
 	}
 
 	if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
@@ -1849,7 +1847,6 @@ int generic_map_lookup_batch(struct bpf_map *map,
 		swap(prev_key, key);
 		retry = MAP_LOOKUP_RETRIES;
 		cp++;
-		cond_resched();
 	}
 
 	if (err == -EFAULT)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 873ade146f3d..25e6f318c561 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16489,9 +16489,6 @@ static int do_check(struct bpf_verifier_env *env)
 		if (signal_pending(current))
 			return -EAGAIN;
 
-		if (need_resched())
-			cond_resched();
-
 		if (env->log.level & BPF_LOG_LEVEL2 && do_print_state) {
 			verbose(env, "\nfrom %d to %d%s:",
 				env->prev_insn_idx, env->insn_idx,
@@ -18017,7 +18014,6 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			err = -ENOTSUPP;
 			goto out_free;
 		}
-		cond_resched();
 	}
 
 	/* at this point all bpf functions were successfully JITed
@@ -18061,7 +18057,6 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			err = -ENOTSUPP;
 			goto out_free;
 		}
-		cond_resched();
 	}
 
 	/* finally lock prog and jit images for all functions and
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 62/86] treewide: trace: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (3 preceding siblings ...)
  2023-11-07 23:07   ` [RFC PATCH 61/86] treewide: bpf: " Ankur Arora
@ 2023-11-07 23:07   ` Ankur Arora
  2023-11-07 23:07   ` [RFC PATCH 63/86] treewide: futex: " Ankur Arora
                     ` (25 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Masami Hiramatsu, Mark Rutland

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the cond_resched() calls here are from set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/trace/ftrace.c                |  4 ----
 kernel/trace/ring_buffer.c           |  4 ----
 kernel/trace/ring_buffer_benchmark.c | 13 -------------
 kernel/trace/trace.c                 | 11 -----------
 kernel/trace/trace_events.c          |  1 -
 kernel/trace/trace_selftest.c        |  9 ---------
 6 files changed, 42 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..096ebb608610 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2723,7 +2723,6 @@ void __weak ftrace_replace_code(int mod_flags)
 	struct dyn_ftrace *rec;
 	struct ftrace_page *pg;
 	bool enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
-	int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
 	int failed;
 
 	if (unlikely(ftrace_disabled))
@@ -2740,8 +2739,6 @@ void __weak ftrace_replace_code(int mod_flags)
 			/* Stop processing */
 			return;
 		}
-		if (schedulable)
-			cond_resched();
 	} while_for_each_ftrace_rec();
 }
 
@@ -4363,7 +4360,6 @@ match_records(struct ftrace_hash *hash, char *func, int len, char *mod)
 			}
 			found = 1;
 		}
-		cond_resched();
 	} while_for_each_ftrace_rec();
  out_unlock:
 	mutex_unlock(&ftrace_lock);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 515cafdb18d9..5c5eb6a8c7db 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1996,8 +1996,6 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
 	tmp_iter_page = first_page;
 
 	do {
-		cond_resched();
-
 		to_remove_page = tmp_iter_page;
 		rb_inc_page(&tmp_iter_page);
 
@@ -2206,8 +2204,6 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 				err = -ENOMEM;
 				goto out_err;
 			}
-
-			cond_resched();
 		}
 
 		cpus_read_lock();
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index aef34673d79d..8d1c23d135cb 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -267,19 +267,6 @@ static void ring_buffer_producer(void)
 		if (consumer && !(cnt % wakeup_interval))
 			wake_up_process(consumer);
 
-#ifndef CONFIG_PREEMPTION
-		/*
-		 * If we are a non preempt kernel, the 10 seconds run will
-		 * stop everything while it runs. Instead, we will call
-		 * cond_resched and also add any time that was lost by a
-		 * reschedule.
-		 *
-		 * Do a cond resched at the same frequency we would wake up
-		 * the reader.
-		 */
-		if (cnt % wakeup_interval)
-			cond_resched();
-#endif
 	} while (ktime_before(end_time, timeout) && !break_test());
 	trace_printk("End ring buffer hammer\n");
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0776dba32c2d..1efb69423818 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2052,13 +2052,6 @@ static int do_run_tracer_selftest(struct tracer *type)
 {
 	int ret;
 
-	/*
-	 * Tests can take a long time, especially if they are run one after the
-	 * other, as does happen during bootup when all the tracers are
-	 * registered. This could cause the soft lockup watchdog to trigger.
-	 */
-	cond_resched();
-
 	tracing_selftest_running = true;
 	ret = run_tracer_selftest(type);
 	tracing_selftest_running = false;
@@ -2083,10 +2076,6 @@ static __init int init_trace_selftests(void)
 
 	tracing_selftest_running = true;
 	list_for_each_entry_safe(p, n, &postponed_selftests, list) {
-		/* This loop can take minutes when sanitizers are enabled, so
-		 * lets make sure we allow RCU processing.
-		 */
-		cond_resched();
 		ret = run_tracer_selftest(p->type);
 		/* If the test fails, then warn and remove from available_tracers */
 		if (ret < 0) {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f49d6ddb6342..91951d038ba4 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2770,7 +2770,6 @@ void trace_event_eval_update(struct trace_eval_map **map, int len)
 				update_event_fields(call, map[i]);
 			}
 		}
-		cond_resched();
 	}
 	up_write(&trace_event_sem);
 }
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 529590499b1f..07cfad8ce16f 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -848,11 +848,6 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 	}
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
-	/*
-	 * These tests can take some time to run. Make sure on non PREEMPT
-	 * kernels, we do not trigger the softlockup detector.
-	 */
-	cond_resched();
 
 	tracing_reset_online_cpus(&tr->array_buffer);
 	set_graph_array(tr);
@@ -875,8 +870,6 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 	if (ret)
 		goto out;
 
-	cond_resched();
-
 	ret = register_ftrace_graph(&fgraph_ops);
 	if (ret) {
 		warn_failed_init_tracer(trace, ret);
@@ -899,8 +892,6 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 	if (ret)
 		goto out;
 
-	cond_resched();
-
 	tracing_start();
 
 	if (!ret && !count) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 63/86] treewide: futex: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (4 preceding siblings ...)
  2023-11-07 23:07   ` [RFC PATCH 62/86] treewide: trace: " Ankur Arora
@ 2023-11-07 23:07   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 64/86] treewide: printk: " Ankur Arora
                     ` (24 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Darren Hart, Davidlohr Bueso, André Almeida

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most cases here are from set-3. Replace with cond_resched_stall().
There were a few cases (__fixup_pi_state_owner() and futex_requeue())
where we had given up a spinlock or mutex, so any resched that was
needed would have happened already.

Replace with cpu_relax() in one case, with nothing in the other.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Thomas Gleixner <tglx@linutronix.de> 
Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Darren Hart <dvhart@infradead.org> 
Cc: Davidlohr Bueso <dave@stgolabs.net> 
Cc: "André Almeida" <andrealmeid@igalia.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/futex/core.c     | 6 +-----
 kernel/futex/pi.c       | 6 +++---
 kernel/futex/requeue.c  | 1 -
 kernel/futex/waitwake.c | 2 +-
 4 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index f10587d1d481..4821931fb19d 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -724,7 +724,7 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
 			goto retry;
 
 		case -EAGAIN:
-			cond_resched();
+			cond_resched_stall();
 			goto retry;
 
 		default:
@@ -822,8 +822,6 @@ static void exit_robust_list(struct task_struct *curr)
 		 */
 		if (!--limit)
 			break;
-
-		cond_resched();
 	}
 
 	if (pending) {
@@ -922,8 +920,6 @@ static void compat_exit_robust_list(struct task_struct *curr)
 		 */
 		if (!--limit)
 			break;
-
-		cond_resched();
 	}
 	if (pending) {
 		void __user *uaddr = futex_uaddr(pending, futex_offset);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index ce2889f12375..e3f6ca4cd875 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -809,7 +809,7 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 		break;
 
 	case -EAGAIN:
-		cond_resched();
+		cpu_relax();
 		err = 0;
 		break;
 
@@ -981,7 +981,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 			 * this task might loop forever, aka. live lock.
 			 */
 			wait_for_owner_exiting(ret, exiting);
-			cond_resched();
+			cond_resched_stall();
 			goto retry;
 		default:
 			goto out_unlock_put_key;
@@ -1219,7 +1219,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	return ret;
 
 pi_retry:
-	cond_resched();
+	cond_resched_stall();
 	goto retry;
 
 pi_faulted:
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a6a4cc..9f916162ef6e 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -560,7 +560,6 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 			 * this task might loop forever, aka. live lock.
 			 */
 			wait_for_owner_exiting(ret, exiting);
-			cond_resched();
 			goto retry;
 		default:
 			goto out_unlock;
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index ba01b9408203..801b1ec3625a 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -277,7 +277,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 				return ret;
 		}
 
-		cond_resched();
+		cond_resched_stall();
 		if (!(flags & FLAGS_SHARED))
 			goto retry_private;
 		goto retry;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 64/86] treewide: printk: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (5 preceding siblings ...)
  2023-11-07 23:07   ` [RFC PATCH 63/86] treewide: futex: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 65/86] treewide: task_work: " Ankur Arora
                     ` (23 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Greg Kroah-Hartman, Petr Mladek, John Ogness, Sergey Senozhatsky

The printk code goes to great lengths to ensure that there are no
scheduling stalls which would cause softlockup/RCU splats and make
things worse.

With PREEMPT_COUNT=y and PREEMPTION=y, this should be a non-issue as
the scheduler can determine when this logic can be preempted.

So, remove cond_resched() and related code.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 
Cc: Petr Mladek <pmladek@suse.com> 
Cc: Steven Rostedt <rostedt@goodmis.org> 
Cc: John Ogness <john.ogness@linutronix.de> 
Cc: Sergey Senozhatsky <senozhatsky@chromium.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/console.h |  2 +-
 kernel/printk/printk.c  | 65 +++++++++--------------------------------
 2 files changed, 15 insertions(+), 52 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 7de11c763eb3..db418dab5674 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -347,7 +347,7 @@ extern int unregister_console(struct console *);
 extern void console_lock(void);
 extern int console_trylock(void);
 extern void console_unlock(void);
-extern void console_conditional_schedule(void);
+static inline void console_conditional_schedule(void) { }
 extern void console_unblank(void);
 extern void console_flush_on_panic(enum con_flush_mode mode);
 extern struct tty_driver *console_device(int *);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 0b3af1529778..2708d9f499a3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -375,9 +375,6 @@ static int preferred_console = -1;
 int console_set_on_cmdline;
 EXPORT_SYMBOL(console_set_on_cmdline);
 
-/* Flag: console code may call schedule() */
-static int console_may_schedule;
-
 enum con_msg_format_flags {
 	MSG_FORMAT_DEFAULT	= 0,
 	MSG_FORMAT_SYSLOG	= (1 << 0),
@@ -2651,7 +2648,6 @@ void console_lock(void)
 
 	down_console_sem();
 	console_locked = 1;
-	console_may_schedule = 1;
 }
 EXPORT_SYMBOL(console_lock);
 
@@ -2671,7 +2667,6 @@ int console_trylock(void)
 	if (down_trylock_console_sem())
 		return 0;
 	console_locked = 1;
-	console_may_schedule = 0;
 	return 1;
 }
 EXPORT_SYMBOL(console_trylock);
@@ -2922,9 +2917,6 @@ static bool console_emit_next_record(struct console *con, bool *handover, int co
 /*
  * Print out all remaining records to all consoles.
  *
- * @do_cond_resched is set by the caller. It can be true only in schedulable
- * context.
- *
  * @next_seq is set to the sequence number after the last available record.
  * The value is valid only when this function returns true. It means that all
  * usable consoles are completely flushed.
@@ -2942,7 +2934,7 @@ static bool console_emit_next_record(struct console *con, bool *handover, int co
  *
  * Requires the console_lock.
  */
-static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handover)
+static bool console_flush_all(u64 *next_seq, bool *handover)
 {
 	bool any_usable = false;
 	struct console *con;
@@ -2983,9 +2975,6 @@ static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handove
 			/* Allow panic_cpu to take over the consoles safely. */
 			if (other_cpu_in_panic())
 				goto abandon;
-
-			if (do_cond_resched)
-				cond_resched();
 		}
 		console_srcu_read_unlock(cookie);
 	} while (any_progress);
@@ -3011,28 +3000,26 @@ static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handove
  */
 void console_unlock(void)
 {
-	bool do_cond_resched;
 	bool handover;
 	bool flushed;
 	u64 next_seq;
 
 	/*
-	 * Console drivers are called with interrupts disabled, so
-	 * @console_may_schedule should be cleared before; however, we may
+	 * Console drivers are called with interrupts disabled, so in
+	 * general we cannot schedule. There are also cases where we will
 	 * end up dumping a lot of lines, for example, if called from
-	 * console registration path, and should invoke cond_resched()
-	 * between lines if allowable.  Not doing so can cause a very long
-	 * scheduling stall on a slow console leading to RCU stall and
-	 * softlockup warnings which exacerbate the issue with more
-	 * messages practically incapacitating the system. Therefore, create
-	 * a local to use for the printing loop.
+	 * console registration path.
+	 *
+	 * Not scheduling while working on a slow console could lead to
+	 * RCU stalls and softlockup warnings which exacerbate the issue
+	 * with more messages practically incapacitating the system.
+	 *
+	 * However, most of the console code is preemptible, so the scheduler
+	 * should be able to preempt us and make forward progress.
 	 */
-	do_cond_resched = console_may_schedule;
 
 	do {
-		console_may_schedule = 0;
-
-		flushed = console_flush_all(do_cond_resched, &next_seq, &handover);
+		flushed = console_flush_all(&next_seq, &handover);
 		if (!handover)
 			__console_unlock();
 
@@ -3055,22 +3042,6 @@ void console_unlock(void)
 }
 EXPORT_SYMBOL(console_unlock);
 
-/**
- * console_conditional_schedule - yield the CPU if required
- *
- * If the console code is currently allowed to sleep, and
- * if this CPU should yield the CPU to another task, do
- * so here.
- *
- * Must be called within console_lock();.
- */
-void __sched console_conditional_schedule(void)
-{
-	if (console_may_schedule)
-		cond_resched();
-}
-EXPORT_SYMBOL(console_conditional_schedule);
-
 void console_unblank(void)
 {
 	bool found_unblank = false;
@@ -3118,7 +3089,6 @@ void console_unblank(void)
 		console_lock();
 
 	console_locked = 1;
-	console_may_schedule = 0;
 
 	cookie = console_srcu_read_lock();
 	for_each_console_srcu(c) {
@@ -3154,13 +3124,6 @@ void console_flush_on_panic(enum con_flush_mode mode)
 	 *   - semaphores are not NMI-safe
 	 */
 
-	/*
-	 * If another context is holding the console lock,
-	 * @console_may_schedule might be set. Clear it so that
-	 * this context does not call cond_resched() while flushing.
-	 */
-	console_may_schedule = 0;
-
 	if (mode == CONSOLE_REPLAY_ALL) {
 		struct console *c;
 		int cookie;
@@ -3179,7 +3142,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
 		console_srcu_read_unlock(cookie);
 	}
 
-	console_flush_all(false, &next_seq, &handover);
+	console_flush_all(&next_seq, &handover);
 }
 
 /*
@@ -3364,7 +3327,7 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
 			 * Flush all consoles and set the console to start at
 			 * the next unprinted sequence number.
 			 */
-			if (!console_flush_all(true, &newcon->seq, &handover)) {
+			if (!console_flush_all(&newcon->seq, &handover)) {
 				/*
 				 * Flushing failed. Just choose the lowest
 				 * sequence of the enabled boot consoles.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 65/86] treewide: task_work: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (6 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 64/86] treewide: printk: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 66/86] treewide: kernel: " Ankur Arora
                     ` (22 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Oleg Nesterov, Jens Axboe

The cond_resched() call was added in commit f341861fb0b7 ("task_work:
add a scheduling point in task_work_run()") because of softlockups
when processes with a large number of open sockets would exit.

Given the always-on PREEMPTION, we should be able to remove it
without much concern. However, task_work_run() does get called
from some "interesting" places: one of them being the
exit_to_user_loop() itself.

That means that if TIF_NEED_RESCHED (or TIF_NEED_RESCHED_LAZY) were
to be set once we were already inside a potentially long-running
task_work_run() call, then we would ignore the need-resched flags and
there would be no call to schedule().

However, in that case, the next timer tick should cause rescheduling
in irqentry_exit_cond_resched(), since by then the TIF_NEED_RESCHED
flag would be set (even if the original flag were TIF_NEED_RESCHED_LAZY,
the tick would upgrade it).
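
The escalation described in the last paragraph could be sketched
roughly as below. This is illustrative only -- the helper name is
hypothetical and its placement in the tick path is assumed, not taken
from this series:

  #include <linux/sched.h>

  /*
   * Sketch: at tick time, promote a lazy reschedule request on the
   * current task to TIF_NEED_RESCHED so it is honoured at the next
   * irqexit (irqentry_exit_cond_resched()).
   */
  static void tick_upgrade_lazy_resched(struct task_struct *curr)
  {
          if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
                  set_tsk_need_resched(curr);
  }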

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/task_work.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..6a891465c8e1 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -179,7 +179,6 @@ void task_work_run(void)
 			next = work->next;
 			work->func(work);
 			work = next;
-			cond_resched();
 		} while (work);
 	}
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 66/86] treewide: kernel: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (7 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 65/86] treewide: task_work: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-17 18:14     ` Luis Chamberlain
  2023-11-07 23:08   ` [RFC PATCH 67/86] treewide: kernel: remove cond_reshed() Ankur Arora
                     ` (21 subsequent siblings)
  30 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Tejun Heo, Zefan Li, Johannes Weiner, Peter Oberparleiter,
	Eric Biederman, Will Deacon, Luis Chamberlain, Oleg Nesterov

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All of these are from set-1 except for the retry loops in
task_function_call() or the mutex testing logic.

Replace these with cond_resched_stall(). The others can be removed.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Tejun Heo <tj@kernel.org> 
Cc: Zefan Li <lizefan.x@bytedance.com> 
Cc: Johannes Weiner <hannes@cmpxchg.org> 
Cc: Peter Oberparleiter <oberpar@linux.ibm.com> 
Cc: Eric Biederman <ebiederm@xmission.com> 
Cc: Will Deacon <will@kernel.org> 
Cc: Luis Chamberlain <mcgrof@kernel.org> 
Cc: Oleg Nesterov <oleg@redhat.com> 
Cc: Juri Lelli <juri.lelli@redhat.com> 
Cc: Vincent Guittot <vincent.guittot@linaro.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched/cond_resched.h | 1 -
 kernel/auditsc.c                   | 2 --
 kernel/cgroup/rstat.c              | 3 +--
 kernel/dma/debug.c                 | 2 --
 kernel/events/core.c               | 2 +-
 kernel/gcov/base.c                 | 1 -
 kernel/kallsyms.c                  | 4 +---
 kernel/kexec_core.c                | 6 ------
 kernel/locking/test-ww_mutex.c     | 4 ++--
 kernel/module/main.c               | 1 -
 kernel/ptrace.c                    | 2 --
 kernel/sched/core.c                | 1 -
 kernel/sched/fair.c                | 4 ----
 13 files changed, 5 insertions(+), 28 deletions(-)
 delete mode 100644 include/linux/sched/cond_resched.h

diff --git a/include/linux/sched/cond_resched.h b/include/linux/sched/cond_resched.h
deleted file mode 100644
index 227f5be81bcd..000000000000
--- a/include/linux/sched/cond_resched.h
+++ /dev/null
@@ -1 +0,0 @@
-#include <linux/sched.h>
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 6f0d6fb6523f..47abfc1e6c75 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2460,8 +2460,6 @@ void __audit_inode_child(struct inode *parent,
 		}
 	}
 
-	cond_resched();
-
 	/* is there a matching child entry? */
 	list_for_each_entry(n, &context->names_list, list) {
 		/* can only match entries that have a name */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..d61dc98d1d2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -210,8 +210,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
 			spin_unlock_irq(&cgroup_rstat_lock);
-			if (!cond_resched())
-				cpu_relax();
+			cond_resched_stall();
 			spin_lock_irq(&cgroup_rstat_lock);
 		}
 	}
diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 06366acd27b0..fb8e7aed9751 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -543,8 +543,6 @@ void debug_dma_dump_mappings(struct device *dev)
 			}
 		}
 		spin_unlock_irqrestore(&bucket->lock, flags);
-
-		cond_resched();
 	}
 }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a2f2a9525d72..02330c190472 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -125,7 +125,7 @@ task_function_call(struct task_struct *p, remote_function_f func, void *info)
 		if (ret != -EAGAIN)
 			break;
 
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	return ret;
diff --git a/kernel/gcov/base.c b/kernel/gcov/base.c
index 073a3738c5e6..3c22a15065b3 100644
--- a/kernel/gcov/base.c
+++ b/kernel/gcov/base.c
@@ -43,7 +43,6 @@ void gcov_enable_events(void)
 	/* Perform event callback for previously registered entries. */
 	while ((info = gcov_info_next(info))) {
 		gcov_event(GCOV_ADD, info);
-		cond_resched();
 	}
 
 	mutex_unlock(&gcov_lock);
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 18edd57b5fe8..a3c5ce9246cd 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -19,7 +19,7 @@
 #include <linux/kdb.h>
 #include <linux/err.h>
 #include <linux/proc_fs.h>
-#include <linux/sched.h>	/* for cond_resched */
+#include <linux/sched.h>
 #include <linux/ctype.h>
 #include <linux/slab.h>
 #include <linux/filter.h>
@@ -295,7 +295,6 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, unsigned long),
 		ret = fn(data, namebuf, kallsyms_sym_address(i));
 		if (ret != 0)
 			return ret;
-		cond_resched();
 	}
 	return 0;
 }
@@ -312,7 +311,6 @@ int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
 
 	for (i = start; !ret && i <= end; i++) {
 		ret = fn(data, kallsyms_sym_address(get_symbol_seq(i)));
-		cond_resched();
 	}
 
 	return ret;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..40699ea33034 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -452,8 +452,6 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
 	while (hole_end <= crashk_res.end) {
 		unsigned long i;
 
-		cond_resched();
-
 		if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
 			break;
 		/* See if I overlap any of the segments */
@@ -832,8 +830,6 @@ static int kimage_load_normal_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
-
-		cond_resched();
 	}
 out:
 	return result;
@@ -900,8 +896,6 @@ static int kimage_load_crash_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
-
-		cond_resched();
 	}
 out:
 	return result;
diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c
index 93cca6e69860..b1bb683274f8 100644
--- a/kernel/locking/test-ww_mutex.c
+++ b/kernel/locking/test-ww_mutex.c
@@ -46,7 +46,7 @@ static void test_mutex_work(struct work_struct *work)
 
 	if (mtx->flags & TEST_MTX_TRY) {
 		while (!ww_mutex_trylock(&mtx->mutex, NULL))
-			cond_resched();
+			cond_resched_stall();
 	} else {
 		ww_mutex_lock(&mtx->mutex, NULL);
 	}
@@ -84,7 +84,7 @@ static int __test_mutex(unsigned int flags)
 				ret = -EINVAL;
 				break;
 			}
-			cond_resched();
+			cond_resched_stall();
 		} while (time_before(jiffies, timeout));
 	} else {
 		ret = wait_for_completion_timeout(&mtx.done, TIMEOUT);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 98fedfdb8db5..03f6fcfa87f8 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1908,7 +1908,6 @@ static int copy_chunked_from_user(void *dst, const void __user *usrc, unsigned l
 
 		if (copy_from_user(dst, usrc, n) != 0)
 			return -EFAULT;
-		cond_resched();
 		dst += n;
 		usrc += n;
 		len -= n;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..83a65a3c614a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -798,8 +798,6 @@ static int ptrace_peek_siginfo(struct task_struct *child,
 
 		if (signal_pending(current))
 			break;
-
-		cond_resched();
 	}
 
 	if (i > 0)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3467a3a7d4bf..691b50791e04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -25,7 +25,6 @@
 #include <linux/refcount_api.h>
 #include <linux/topology.h>
 #include <linux/sched/clock.h>
-#include <linux/sched/cond_resched.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/debug.h>
 #include <linux/sched/hotplug.h>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 448fe36e7bbb..4e67e88282a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -33,7 +33,6 @@
 #include <linux/refcount_api.h>
 #include <linux/topology.h>
 #include <linux/sched/clock.h>
-#include <linux/sched/cond_resched.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/nohz.h>
@@ -51,8 +50,6 @@
 
 #include <asm/switch_to.h>
 
-#include <linux/sched/cond_resched.h>
-
 #include "sched.h"
 #include "stats.h"
 #include "autogroup.h"
@@ -3374,7 +3371,6 @@ static void task_numa_work(struct callback_head *work)
 			if (pages <= 0 || virtpages <= 0)
 				goto out;
 
-			cond_resched();
 		} while (end != vma->vm_end);
 	} for_each_vma(vmi, vma);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 67/86] treewide: kernel: remove cond_reshed()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (8 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 66/86] treewide: kernel: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 68/86] treewide: mm: remove cond_resched() Ankur Arora
                     ` (20 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Tejun Heo, Lai Jiangshan, Nicholas Piggin

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All of these are set-1 or set-2. Replace the call in stop_one_cpu()
with cond_resched_stall() to allow it a chance to schedule.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Tejun Heo <tj@kernel.org> 
Cc: Lai Jiangshan <jiangshanlai@gmail.com> 
Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: Nicholas Piggin <npiggin@gmail.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/kthread.c      |  1 -
 kernel/softirq.c      |  1 -
 kernel/stop_machine.c |  2 +-
 kernel/workqueue.c    | 10 ----------
 4 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 1eea53050bab..e111eebee240 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -830,7 +830,6 @@ int kthread_worker_fn(void *worker_ptr)
 		schedule();
 
 	try_to_freeze();
-	cond_resched();
 	goto repeat;
 }
 EXPORT_SYMBOL_GPL(kthread_worker_fn);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 210cf5f8d92c..c80237cbcb3d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -920,7 +920,6 @@ static void run_ksoftirqd(unsigned int cpu)
 		 */
 		__do_softirq();
 		ksoftirqd_run_end();
-		cond_resched();
 		return;
 	}
 	ksoftirqd_run_end();
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index cedb17ba158a..1929fe8ecd70 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -148,7 +148,7 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
 	 * In case @cpu == smp_proccessor_id() we can avoid a sleep+wakeup
 	 * cycle by doing a preemption:
 	 */
-	cond_resched();
+	cond_resched_stall();
 	wait_for_completion(&done.completion);
 	return done.ret;
 }
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a3522b70218d..be5080e1b7d6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2646,16 +2646,6 @@ __acquires(&pool->lock)
 		dump_stack();
 	}
 
-	/*
-	 * The following prevents a kworker from hogging CPU on !PREEMPTION
-	 * kernels, where a requeueing work item waiting for something to
-	 * happen could deadlock with stop_machine as such work item could
-	 * indefinitely requeue itself while all other CPUs are trapped in
-	 * stop_machine. At the same time, report a quiescent RCU state so
-	 * the same condition doesn't freeze RCU.
-	 */
-	cond_resched();
-
 	raw_spin_lock_irq(&pool->lock);
 
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 68/86] treewide: mm: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (9 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 67/86] treewide: kernel: remove cond_reshed() Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-08  1:28     ` Sergey Senozhatsky
  2023-11-07 23:08   ` [RFC PATCH 69/86] treewide: io_uring: " Ankur Arora
                     ` (19 subsequent siblings)
  30 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	SeongJae Park, Mike Kravetz, Muchun Song, Andrey Ryabinin,
	Marco Elver, Catalin Marinas, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Naoya Horiguchi, Miaohe Lin,
	David Hildenbrand, Oscar Salvador, Mike Rapoport, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Hugh Dickins, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Vitaly Wool, Minchan Kim,
	Sergey Senozhatsky, Seth Jennings, Dan Streetman

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the cond_resched() cases here are from set-1: we are
executing in long loops and want to see if rescheduling is
needed.

Now the scheduler can handle rescheduling for those.

There are a few set-2 cases where we give up a lock and reacquire
it.  The unlock will take care of the preemption, but maybe there
should be a cpu_relax() before reacquiring?

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
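
For the set-2 pattern in question, a minimal sketch (generic lock,
not a specific call site) of what a cpu_relax() before reacquiring
would look like:

  #include <linux/processor.h>
  #include <linux/spinlock.h>

  /*
   * Sketch only. With preempt_count always enabled, the unlock is
   * already a preemption point, so the explicit cond_resched() adds
   * nothing. At most, a cpu_relax() might give another CPU a chance
   * at the lock before we retake it.
   */
  static void drop_relax_reacquire(spinlock_t *lock)
  {
          spin_unlock_irq(lock);
          cpu_relax();            /* maybe */
          spin_lock_irq(lock);
  }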

Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: SeongJae Park <sj@kernel.org> 
Cc: "Matthew Wilcox 
Cc: Mike Kravetz <mike.kravetz@oracle.com> 
Cc: Muchun Song <muchun.song@linux.dev> 
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> 
Cc: Marco Elver <elver@google.com> 
Cc: Catalin Marinas <catalin.marinas@arm.com> 
Cc: Johannes Weiner <hannes@cmpxchg.org> 
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Roman Gushchin <roman.gushchin@linux.dev> 
Cc: Shakeel Butt <shakeelb@google.com> 
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> 
Cc: Miaohe Lin <linmiaohe@huawei.com> 
Cc: David Hildenbrand <david@redhat.com> 
Cc: Oscar Salvador <osalvador@suse.de> 
Cc: Mike Rapoport <rppt@kernel.org> 
Cc: Will Deacon <will@kernel.org> 
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> 
Cc: Nick Piggin <npiggin@gmail.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Dennis Zhou <dennis@kernel.org> 
Cc: Tejun Heo <tj@kernel.org> 
Cc: Christoph Lameter <cl@linux.com> 
Cc: Hugh Dickins <hughd@google.com> 
Cc: Pekka Enberg <penberg@kernel.org> 
Cc: David Rientjes <rientjes@google.com> 
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> 
Cc: Vlastimil Babka <vbabka@suse.cz> 
Cc: Vitaly Wool <vitaly.wool@konsulko.com> 
Cc: Minchan Kim <minchan@kernel.org> 
Cc: Sergey Senozhatsky <senozhatsky@chromium.org> 
Cc: Seth Jennings <sjenning@redhat.com> 
Cc: Dan Streetman <ddstreet@ieee.org> 
Cc: linux-mm@kvack.org

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/backing-dev.c        |  8 +++++++-
 mm/compaction.c         | 23 ++++++-----------------
 mm/damon/paddr.c        |  1 -
 mm/dmapool_test.c       |  2 --
 mm/filemap.c            |  6 ------
 mm/gup.c                |  1 -
 mm/huge_memory.c        |  3 ---
 mm/hugetlb.c            | 12 ------------
 mm/hugetlb_cgroup.c     |  1 -
 mm/kasan/quarantine.c   |  6 ++++--
 mm/kfence/kfence_test.c | 22 +---------------------
 mm/khugepaged.c         |  5 -----
 mm/kmemleak.c           |  8 --------
 mm/ksm.c                | 21 ++++-----------------
 mm/madvise.c            |  3 ---
 mm/memcontrol.c         |  4 ----
 mm/memory-failure.c     |  1 -
 mm/memory.c             | 12 +-----------
 mm/memory_hotplug.c     |  6 ------
 mm/mempolicy.c          |  1 -
 mm/migrate.c            |  6 ------
 mm/mincore.c            |  1 -
 mm/mlock.c              |  2 --
 mm/mm_init.c            | 13 +++----------
 mm/mmap.c               |  1 -
 mm/mmu_gather.c         |  2 --
 mm/mprotect.c           |  1 -
 mm/mremap.c             |  1 -
 mm/nommu.c              |  1 -
 mm/page-writeback.c     |  1 -
 mm/page_alloc.c         | 13 ++-----------
 mm/page_counter.c       |  1 -
 mm/page_ext.c           |  1 -
 mm/page_idle.c          |  2 --
 mm/page_io.c            |  2 --
 mm/page_owner.c         |  1 -
 mm/percpu.c             |  5 -----
 mm/rmap.c               |  2 --
 mm/shmem.c              |  9 ---------
 mm/shuffle.c            |  6 ++++--
 mm/slab.c               |  3 ---
 mm/swap_cgroup.c        |  4 ----
 mm/swapfile.c           | 14 --------------
 mm/truncate.c           |  4 ----
 mm/userfaultfd.c        |  3 ---
 mm/util.c               |  1 -
 mm/vmalloc.c            |  5 -----
 mm/vmscan.c             | 29 ++---------------------------
 mm/vmstat.c             |  4 ----
 mm/workingset.c         |  1 -
 mm/z3fold.c             | 15 ++++-----------
 mm/zsmalloc.c           |  1 -
 mm/zswap.c              |  1 -
 53 files changed, 38 insertions(+), 264 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 1e3447bccdb1..22ca90addb35 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -816,8 +816,14 @@ static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
 			continue;
 
 		spin_unlock_irq(&cgwb_lock);
+
+		/*
+		 * cleanup_offline_cgwb() can implicitly reschedule
+		 * on unlock when needed, so just loop here.
+		 */
 		while (cleanup_offline_cgwb(wb))
-			cond_resched();
+			;
+
 		spin_lock_irq(&cgwb_lock);
 
 		wb_put(wb);
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..5bca34760fec 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -395,8 +395,6 @@ static void __reset_isolation_suitable(struct zone *zone)
 	 */
 	for (; migrate_pfn < free_pfn; migrate_pfn += pageblock_nr_pages,
 					free_pfn -= pageblock_nr_pages) {
-		cond_resched();
-
 		/* Update the migrate PFN */
 		if (__reset_isolation_pfn(zone, migrate_pfn, true, source_set) &&
 		    migrate_pfn < reset_migrate) {
@@ -571,8 +569,6 @@ static bool compact_unlock_should_abort(spinlock_t *lock,
 		return true;
 	}
 
-	cond_resched();
-
 	return false;
 }
 
@@ -874,8 +870,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			return -EINTR;
 	}
 
-	cond_resched();
-
 	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
 		skip_on_failure = true;
 		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
@@ -923,8 +917,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 				goto fatal_pending;
 			}
-
-			cond_resched();
 		}
 
 		nr_scanned++;
@@ -1681,11 +1673,10 @@ static void isolate_freepages(struct compact_control *cc)
 		unsigned long nr_isolated;
 
 		/*
-		 * This can iterate a massively long zone without finding any
-		 * suitable migration targets, so periodically check resched.
+		 * We can iterate over a massively long zone without finding
+		 * any suitable migration targets. Since we don't disable
+		 * preemption while doing so, expect to be preempted.
 		 */
-		if (!(block_start_pfn % (COMPACT_CLUSTER_MAX * pageblock_nr_pages)))
-			cond_resched();
 
 		page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
 									zone);
@@ -2006,12 +1997,10 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 			block_end_pfn += pageblock_nr_pages) {
 
 		/*
-		 * This can potentially iterate a massively long zone with
-		 * many pageblocks unsuitable, so periodically check if we
-		 * need to schedule.
+		 * We can potentially iterate a massively long zone with
+		 * many pageblocks unsuitable. Since we don't disable
+		 * preemption while doing so, expect to be preempted.
 		 */
-		if (!(low_pfn % (COMPACT_CLUSTER_MAX * pageblock_nr_pages)))
-			cond_resched();
 
 		page = pageblock_pfn_to_page(block_start_pfn,
 						block_end_pfn, cc->zone);
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 909db25efb35..97eed5e0f89b 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -251,7 +251,6 @@ static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
 		folio_put(folio);
 	}
 	applied = reclaim_pages(&folio_list);
-	cond_resched();
 	return applied * PAGE_SIZE;
 }
 
diff --git a/mm/dmapool_test.c b/mm/dmapool_test.c
index 370fb9e209ef..c519475310e4 100644
--- a/mm/dmapool_test.c
+++ b/mm/dmapool_test.c
@@ -82,8 +82,6 @@ static int dmapool_test_block(const struct dmapool_parms *parms)
 		ret = dmapool_test_alloc(p, blocks);
 		if (ret)
 			goto free_pool;
-		if (need_resched())
-			cond_resched();
 	}
 	end_time = ktime_get();
 
diff --git a/mm/filemap.c b/mm/filemap.c
index dc4dcc5eaf5e..e3c9cf5b33b4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -526,7 +526,6 @@ static void __filemap_fdatawait_range(struct address_space *mapping,
 			folio_clear_error(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -2636,8 +2635,6 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 	folio_batch_init(&fbatch);
 
 	do {
-		cond_resched();
-
 		/*
 		 * If we've already successfully copied some data, then we
 		 * can no longer safely return -EIOCBQUEUED. Hence mark
@@ -2910,8 +2907,6 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
 	folio_batch_init(&fbatch);
 
 	do {
-		cond_resched();
-
 		if (*ppos >= i_size_read(in->f_mapping->host))
 			break;
 
@@ -3984,7 +3979,6 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 			if (unlikely(status < 0))
 				break;
 		}
-		cond_resched();
 
 		if (unlikely(status == 0)) {
 			/*
diff --git a/mm/gup.c b/mm/gup.c
index 2f8a2d89fde1..f6d913e97d71 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1232,7 +1232,6 @@ static long __get_user_pages(struct mm_struct *mm,
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page || PTR_ERR(page) == -EMLINK) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..6d48ee94a8c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2954,7 +2954,6 @@ static void split_huge_pages_all(void)
 			folio_unlock(folio);
 next:
 			folio_put(folio);
-			cond_resched();
 		}
 	}
 
@@ -3044,7 +3043,6 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		folio_unlock(folio);
 next:
 		folio_put(folio);
-		cond_resched();
 	}
 	mmap_read_unlock(mm);
 	mmput(mm);
@@ -3101,7 +3099,6 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		folio_unlock(folio);
 next:
 		folio_put(folio);
-		cond_resched();
 	}
 
 	filp_close(candidate, NULL);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1301ba7b2c9a..d611d256ebc2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1830,8 +1830,6 @@ static void free_hpage_workfn(struct work_struct *work)
 		h = size_to_hstate(page_size(page));
 
 		__update_and_free_hugetlb_folio(h, page_folio(page));
-
-		cond_resched();
 	}
 }
 static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
@@ -1869,7 +1867,6 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 	list_for_each_entry_safe(page, t_page, list, lru) {
 		folio = page_folio(page);
 		update_and_free_hugetlb_folio(h, folio, false);
-		cond_resched();
 	}
 }
 
@@ -2319,7 +2316,6 @@ int dissolve_free_huge_page(struct page *page)
 		 */
 		if (unlikely(!folio_test_hugetlb_freed(folio))) {
 			spin_unlock_irq(&hugetlb_lock);
-			cond_resched();
 
 			/*
 			 * Theoretically, we should return -EBUSY when we
@@ -2563,7 +2559,6 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 			break;
 		}
 		list_add(&folio->lru, &surplus_list);
-		cond_resched();
 	}
 	allocated += i;
 
@@ -2961,7 +2956,6 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 		 * we retry.
 		 */
 		spin_unlock_irq(&hugetlb_lock);
-		cond_resched();
 		goto retry;
 	} else {
 		/*
@@ -3233,7 +3227,6 @@ static void __init gather_bootmem_prealloc(void)
 		 * other side-effects, like CommitLimit going negative.
 		 */
 		adjust_managed_page_count(page, pages_per_huge_page(h));
-		cond_resched();
 	}
 }
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
@@ -3255,7 +3248,6 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 				break;
 			free_huge_folio(folio); /* free it into the hugepage allocator */
 		}
-		cond_resched();
 	}
 	if (i == h->max_huge_pages_node[nid])
 		return;
@@ -3317,7 +3309,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 					 &node_states[N_MEMORY],
 					 node_alloc_noretry))
 			break;
-		cond_resched();
 	}
 	if (i < h->max_huge_pages) {
 		char buf[32];
@@ -3536,9 +3527,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		 */
 		spin_unlock_irq(&hugetlb_lock);
 
-		/* yield cpu to avoid soft lockup */
-		cond_resched();
-
 		ret = alloc_pool_huge_page(h, nodes_allowed,
 						node_alloc_noretry);
 		spin_lock_irq(&hugetlb_lock);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index dedd2edb076e..a4441f328752 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -235,7 +235,6 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 			spin_unlock_irq(&hugetlb_lock);
 		}
-		cond_resched();
 	} while (hugetlb_cgroup_have_usage(h_cg));
 }
 
diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index 152dca73f398..1a1edadbeb39 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -374,9 +374,11 @@ void kasan_quarantine_remove_cache(struct kmem_cache *cache)
 		if (qlist_empty(&global_quarantine[i]))
 			continue;
 		qlist_move_cache(&global_quarantine[i], &to_free, cache);
-		/* Scanning whole quarantine can take a while. */
+		/*
+		 * Scanning the whole quarantine can take a while, so drop the
+		 * lock periodically to allow preemption.
+		 */
 		raw_spin_unlock_irqrestore(&quarantine_lock, flags);
-		cond_resched();
 		raw_spin_lock_irqsave(&quarantine_lock, flags);
 	}
 	raw_spin_unlock_irqrestore(&quarantine_lock, flags);
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index 95b2b84c296d..29fbc24046b9 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -244,7 +244,7 @@ enum allocation_policy {
 static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocation_policy policy)
 {
 	void *alloc;
-	unsigned long timeout, resched_after;
+	unsigned long timeout;
 	const char *policy_name;
 
 	switch (policy) {
@@ -265,17 +265,6 @@ static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocat
 	kunit_info(test, "%s: size=%zu, gfp=%x, policy=%s, cache=%i\n", __func__, size, gfp,
 		   policy_name, !!test_cache);
 
-	/*
-	 * 100x the sample interval should be more than enough to ensure we get
-	 * a KFENCE allocation eventually.
-	 */
-	timeout = jiffies + msecs_to_jiffies(100 * kfence_sample_interval);
-	/*
-	 * Especially for non-preemption kernels, ensure the allocation-gate
-	 * timer can catch up: after @resched_after, every failed allocation
-	 * attempt yields, to ensure the allocation-gate timer is scheduled.
-	 */
-	resched_after = jiffies + msecs_to_jiffies(kfence_sample_interval);
 	do {
 		if (test_cache)
 			alloc = kmem_cache_alloc(test_cache, gfp);
@@ -307,8 +296,6 @@ static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocat
 
 		test_free(alloc);
 
-		if (time_after(jiffies, resched_after))
-			cond_resched();
 	} while (time_before(jiffies, timeout));
 
 	KUNIT_ASSERT_TRUE_MSG(test, false, "failed to allocate from KFENCE");
@@ -628,7 +615,6 @@ static void test_gfpzero(struct kunit *test)
 			kunit_warn(test, "giving up ... cannot get same object back\n");
 			return;
 		}
-		cond_resched();
 	}
 
 	for (i = 0; i < size; i++)
@@ -755,12 +741,6 @@ static void test_memcache_alloc_bulk(struct kunit *test)
 			}
 		}
 		kmem_cache_free_bulk(test_cache, num, objects);
-		/*
-		 * kmem_cache_alloc_bulk() disables interrupts, and calling it
-		 * in a tight loop may not give KFENCE a chance to switch the
-		 * static branch. Call cond_resched() to let KFENCE chime in.
-		 */
-		cond_resched();
 	} while (!pass && time_before(jiffies, timeout));
 
 	KUNIT_EXPECT_TRUE(test, pass);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4025225ef434..ebec87db5cc1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2361,7 +2361,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	for_each_vma(vmi, vma) {
 		unsigned long hstart, hend;
 
-		cond_resched();
 		if (unlikely(hpage_collapse_test_exit(mm))) {
 			progress++;
 			break;
@@ -2382,7 +2381,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		while (khugepaged_scan.address < hend) {
 			bool mmap_locked = true;
 
-			cond_resched();
 			if (unlikely(hpage_collapse_test_exit(mm)))
 				goto breakouterloop;
 
@@ -2488,8 +2486,6 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 	lru_add_drain_all();
 
 	while (true) {
-		cond_resched();
-
 		if (unlikely(kthread_should_stop() || try_to_freeze()))
 			break;
 
@@ -2721,7 +2717,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		int result = SCAN_FAIL;
 
 		if (!mmap_locked) {
-			cond_resched();
 			mmap_read_lock(mm);
 			mmap_locked = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 54c2c90d3abc..9092941cb259 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1394,7 +1394,6 @@ static void scan_large_block(void *start, void *end)
 		next = min(start + MAX_SCAN_SIZE, end);
 		scan_block(start, next, NULL);
 		start = next;
-		cond_resched();
 	}
 }
 #endif
@@ -1439,7 +1438,6 @@ static void scan_object(struct kmemleak_object *object)
 				break;
 
 			raw_spin_unlock_irqrestore(&object->lock, flags);
-			cond_resched();
 			raw_spin_lock_irqsave(&object->lock, flags);
 		} while (object->flags & OBJECT_ALLOCATED);
 	} else
@@ -1466,8 +1464,6 @@ static void scan_gray_list(void)
 	 */
 	object = list_entry(gray_list.next, typeof(*object), gray_list);
 	while (&object->gray_list != &gray_list) {
-		cond_resched();
-
 		/* may add new objects to the list */
 		if (!scan_should_stop())
 			scan_object(object);
@@ -1501,7 +1497,6 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
 	raw_spin_unlock_irq(&kmemleak_lock);
 
 	rcu_read_unlock();
-	cond_resched();
 	rcu_read_lock();
 
 	raw_spin_lock_irq(&kmemleak_lock);
@@ -1584,9 +1579,6 @@ static void kmemleak_scan(void)
 		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 			struct page *page = pfn_to_online_page(pfn);
 
-			if (!(pfn & 63))
-				cond_resched();
-
 			if (!page)
 				continue;
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 981af9c72e7a..df5bca0af731 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -492,7 +492,6 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_v
 	do {
 		int ksm_page;
 
-		cond_resched();
 		ksm_page = walk_page_range_vma(vma, addr, addr + 1, ops, NULL);
 		if (WARN_ON_ONCE(ksm_page < 0))
 			return ksm_page;
@@ -686,7 +685,6 @@ static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node)
 		stable_node->rmap_hlist_len--;
 		put_anon_vma(rmap_item->anon_vma);
 		rmap_item->address &= PAGE_MASK;
-		cond_resched();
 	}
 
 	/*
@@ -813,6 +811,10 @@ static struct page *get_ksm_page(struct ksm_stable_node *stable_node,
  */
 static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
 {
+	/*
+	 * We are called from many long loops, and for the most part don't
+	 * disable preemption. So expect to be preempted occasionally.
+	 */
 	if (rmap_item->address & STABLE_FLAG) {
 		struct ksm_stable_node *stable_node;
 		struct page *page;
@@ -858,7 +860,6 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
 		rmap_item->address &= PAGE_MASK;
 	}
 out:
-	cond_resched();		/* we're called from many long loops */
 }
 
 static void remove_trailing_rmap_items(struct ksm_rmap_item **rmap_list)
@@ -1000,13 +1001,11 @@ static int remove_all_stable_nodes(void)
 				err = -EBUSY;
 				break;	/* proceed to next nid */
 			}
-			cond_resched();
 		}
 	}
 	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
 		if (remove_stable_node(stable_node))
 			err = -EBUSY;
-		cond_resched();
 	}
 	return err;
 }
@@ -1452,7 +1451,6 @@ static struct page *stable_node_dup(struct ksm_stable_node **_stable_node_dup,
 
 	hlist_for_each_entry_safe(dup, hlist_safe,
 				  &stable_node->hlist, hlist_dup) {
-		cond_resched();
 		/*
 		 * We must walk all stable_node_dup to prune the stale
 		 * stable nodes during lookup.
@@ -1654,7 +1652,6 @@ static struct page *stable_tree_search(struct page *page)
 		struct page *tree_page;
 		int ret;
 
-		cond_resched();
 		stable_node = rb_entry(*new, struct ksm_stable_node, node);
 		stable_node_any = NULL;
 		tree_page = chain_prune(&stable_node_dup, &stable_node,	root);
@@ -1899,7 +1896,6 @@ static struct ksm_stable_node *stable_tree_insert(struct page *kpage)
 		struct page *tree_page;
 		int ret;
 
-		cond_resched();
 		stable_node = rb_entry(*new, struct ksm_stable_node, node);
 		stable_node_any = NULL;
 		tree_page = chain(&stable_node_dup, stable_node, root);
@@ -2016,7 +2012,6 @@ struct ksm_rmap_item *unstable_tree_search_insert(struct ksm_rmap_item *rmap_ite
 		struct page *tree_page;
 		int ret;
 
-		cond_resched();
 		tree_rmap_item = rb_entry(*new, struct ksm_rmap_item, node);
 		tree_page = get_mergeable_page(tree_rmap_item);
 		if (!tree_page)
@@ -2350,7 +2345,6 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 						    GET_KSM_PAGE_NOLOCK);
 				if (page)
 					put_page(page);
-				cond_resched();
 			}
 		}
 
@@ -2396,7 +2390,6 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
 			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
-				cond_resched();
 				continue;
 			}
 			if (is_zone_device_page(*page))
@@ -2418,7 +2411,6 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 next_page:
 			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
-			cond_resched();
 		}
 	}
 
@@ -2489,7 +2481,6 @@ static void ksm_do_scan(unsigned int scan_npages)
 	unsigned int npages = scan_npages;
 
 	while (npages-- && likely(!freezing(current))) {
-		cond_resched();
 		rmap_item = scan_get_next_rmap_item(&page);
 		if (!rmap_item)
 			return;
@@ -2858,7 +2849,6 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		cond_resched();
 		if (!anon_vma_trylock_read(anon_vma)) {
 			if (rwc->try_lock) {
 				rwc->contended = true;
@@ -2870,7 +2860,6 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 					       0, ULONG_MAX) {
 			unsigned long addr;
 
-			cond_resched();
 			vma = vmac->vma;
 
 			/* Ignore the stable/unstable/sqnr flags */
@@ -3046,14 +3035,12 @@ static void ksm_check_stable_tree(unsigned long start_pfn,
 				node = rb_first(root_stable_tree + nid);
 			else
 				node = rb_next(node);
-			cond_resched();
 		}
 	}
 	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
 		if (stable_node->kpfn >= start_pfn &&
 		    stable_node->kpfn < end_pfn)
 			remove_node_from_stable_tree(stable_node);
-		cond_resched();
 	}
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 4dded5d27e7e..3aa53f2e70e2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -225,7 +225,6 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	if (ptep)
 		pte_unmap_unlock(ptep, ptl);
 	swap_read_unplug(splug);
-	cond_resched();
 
 	return 0;
 }
@@ -531,7 +530,6 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 	if (pageout)
 		reclaim_pages(&folio_list);
-	cond_resched();
 
 	return 0;
 }
@@ -755,7 +753,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		arch_leave_lazy_mmu_mode();
 		pte_unmap_unlock(start_pte, ptl);
 	}
-	cond_resched();
 
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5b009b233ab8..4bccab7df97f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5650,7 +5650,6 @@ static int mem_cgroup_do_precharge(unsigned long count)
 		if (ret)
 			return ret;
 		mc.precharge++;
-		cond_resched();
 	}
 	return 0;
 }
@@ -6035,7 +6034,6 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
 			mc.precharge++;	/* increment precharge temporarily */
 	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
 
 	return 0;
 }
@@ -6303,7 +6301,6 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		}
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
 
 	if (addr != end) {
 		/*
@@ -6345,7 +6342,6 @@ static void mem_cgroup_move_charge(void)
 		 * feature anyway, so it wouldn't be a big problem.
 		 */
 		__mem_cgroup_clear_mc();
-		cond_resched();
 		goto retry;
 	}
 	/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..f291bb06c37c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -804,7 +804,6 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	}
 	pte_unmap_unlock(mapped_pte, ptl);
 out:
-	cond_resched();
 	return ret;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 517221f01303..faa36db93f80 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,7 +1104,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	pte_unmap_unlock(orig_src_pte, src_ptl);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
-	cond_resched();
 
 	if (ret == -EIO) {
 		VM_WARN_ON_ONCE(!entry.val);
@@ -1573,7 +1572,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
 		if (addr != next)
 			pmd--;
-	} while (pmd++, cond_resched(), addr != end);
+	} while (pmd++, addr != end);
 
 	return addr;
 }
@@ -1601,7 +1600,6 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 			continue;
 		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
 next:
-		cond_resched();
 	} while (pud++, addr = next, addr != end);
 
 	return addr;
@@ -5926,7 +5924,6 @@ static inline int process_huge_page(
 		l = n;
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
-			cond_resched();
 			ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
 			if (ret)
 				return ret;
@@ -5937,7 +5934,6 @@ static inline int process_huge_page(
 		l = pages_per_huge_page - n;
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
-			cond_resched();
 			ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
 			if (ret)
 				return ret;
@@ -5951,11 +5947,9 @@ static inline int process_huge_page(
 		int left_idx = base + i;
 		int right_idx = base + 2 * l - 1 - i;
 
-		cond_resched();
 		ret = process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
 		if (ret)
 			return ret;
-		cond_resched();
 		ret = process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
 		if (ret)
 			return ret;
@@ -5973,7 +5967,6 @@ static void clear_gigantic_page(struct page *page,
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page; i++) {
 		p = nth_page(page, i);
-		cond_resched();
 		clear_user_highpage(p, addr + i * PAGE_SIZE);
 	}
 }
@@ -6013,7 +6006,6 @@ static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
 		dst_page = folio_page(dst, i);
 		src_page = folio_page(src, i);
 
-		cond_resched();
 		if (copy_mc_user_highpage(dst_page, src_page,
 					  addr + i*PAGE_SIZE, vma)) {
 			memory_failure_queue(page_to_pfn(src_page), 0);
@@ -6085,8 +6077,6 @@ long copy_folio_from_user(struct folio *dst_folio,
 			break;
 
 		flush_dcache_page(subpage);
-
-		cond_resched();
 	}
 	return ret_val;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1b03f4ec6fd2..2a621f00db1a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -402,7 +402,6 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 					 params->pgmap);
 		if (err)
 			break;
-		cond_resched();
 	}
 	vmemmap_populate_print_last();
 	return err;
@@ -532,8 +531,6 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
 
 	/* Poison struct pages because they are now uninitialized again. */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
-		cond_resched();
-
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages =
 			min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
@@ -580,7 +577,6 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 	}
 
 	for (; pfn < end_pfn; pfn += cur_nr_pages) {
-		cond_resched();
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
@@ -1957,8 +1953,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 				goto failed_removal_isolated;
 			}
 
-			cond_resched();
-
 			ret = scan_movable_pages(pfn, end_pfn, &pfn);
 			if (!ret) {
 				/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 29ebf1e7898c..fa201f89568e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -554,7 +554,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 			break;
 	}
 	pte_unmap_unlock(mapped_pte, ptl);
-	cond_resched();
 
 	return addr != end ? -EIO : 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 06086dc9da28..6b0d0d4f07d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1528,8 +1528,6 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
 
 			nr_pages = folio_nr_pages(folio);
 
-			cond_resched();
-
 			/*
 			 * Migratability of hugepages depends on architectures and
 			 * their size.  This check is necessary because some callers
@@ -1633,8 +1631,6 @@ static int migrate_pages_batch(struct list_head *from,
 			is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
 			nr_pages = folio_nr_pages(folio);
 
-			cond_resched();
-
 			/*
 			 * Large folio migration might be unsupported or
 			 * the allocation might be failed so we should retry
@@ -1754,8 +1750,6 @@ static int migrate_pages_batch(struct list_head *from,
 			is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
 			nr_pages = folio_nr_pages(folio);
 
-			cond_resched();
-
 			rc = migrate_folio_move(put_new_folio, private,
 						folio, dst, mode,
 						reason, ret_folios);
diff --git a/mm/mincore.c b/mm/mincore.c
index dad3622cc963..46a1716621d1 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -151,7 +151,6 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_unmap_unlock(ptep - 1, ptl);
 out:
 	walk->private += nr;
-	cond_resched();
 	return 0;
 }
 
diff --git a/mm/mlock.c b/mm/mlock.c
index 06bdfab83b58..746ca30145b5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -351,7 +351,6 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_unmap(start_pte);
 out:
 	spin_unlock(ptl);
-	cond_resched();
 	return 0;
 }
 
@@ -696,7 +695,6 @@ static int apply_mlockall_flags(int flags)
 		/* Ignore errors */
 		mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
 			    newflags);
-		cond_resched();
 	}
 out:
 	return 0;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..88d27009800e 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -892,10 +892,8 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 		 * such that unmovable allocations won't be scattered all
 		 * over the place during system boot.
 		 */
-		if (pageblock_aligned(pfn)) {
+		if (pageblock_aligned(pfn))
 			set_pageblock_migratetype(page, migratetype);
-			cond_resched();
-		}
 		pfn++;
 	}
 }
@@ -996,10 +994,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
 	 * because this is done early in section_activate()
 	 */
-	if (pageblock_aligned(pfn)) {
+	if (pageblock_aligned(pfn))
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		cond_resched();
-	}
 
 	/*
 	 * ZONE_DEVICE pages are released directly to the driver page allocator
@@ -2163,10 +2159,8 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 	 * Initialize and free pages in MAX_ORDER sized increments so that we
 	 * can avoid introducing any issues with the buddy allocator.
 	 */
-	while (spfn < end_pfn) {
+	while (spfn < end_pfn)
 		deferred_init_maxorder(&i, zone, &spfn, &epfn);
-		cond_resched();
-	}
 }
 
 /* An arch may override for more concurrency. */
@@ -2365,7 +2359,6 @@ void set_zone_contiguous(struct zone *zone)
 		if (!__pageblock_pfn_to_page(block_start_pfn,
 					     block_end_pfn, zone))
 			return;
-		cond_resched();
 	}
 
 	/* We confirm that there is no hole */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9e018d8dd7d6..436c255f4f45 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3253,7 +3253,6 @@ void exit_mmap(struct mm_struct *mm)
 			nr_accounted += vma_pages(vma);
 		remove_vma(vma, true);
 		count++;
-		cond_resched();
 	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
 
 	BUG_ON(count != mm->map_count);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 4f559f4ddd21..dbf660a14469 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -98,8 +98,6 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 			free_pages_and_swap_cache(pages, nr);
 			pages += nr;
 			batch->nr -= nr;
-
-			cond_resched();
 		} while (batch->nr);
 	}
 	tlb->active = &tlb->local;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b94fbb45d5c7..45af8b1aac59 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -423,7 +423,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 			goto again;
 		pages += ret;
 next:
-		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
 	if (range.start)
diff --git a/mm/mremap.c b/mm/mremap.c
index 382e81c33fc4..26f06349558e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -514,7 +514,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
-		cond_resched();
 		/*
 		 * If extent is PUD-sized try to speed up the move by moving at the
 		 * PUD level if possible.
diff --git a/mm/nommu.c b/mm/nommu.c
index 7f9e9e5a0e12..54cb28e9919d 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1525,7 +1525,6 @@ void exit_mmap(struct mm_struct *mm)
 	for_each_vma(vmi, vma) {
 		cleanup_vma_from_mm(vma);
 		delete_vma(mm, vma);
-		cond_resched();
 	}
 	__mt_destroy(&mm->mm_mt);
 	mmap_write_unlock(mm);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 61a190b9d83c..582cb5a72467 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2510,7 +2510,6 @@ int write_cache_pages(struct address_space *mapping,
 			}
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 85741403948f..c7e7a236de3d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3418,8 +3418,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTFAIL);
 
-	cond_resched();
-
 	return NULL;
 }
 
@@ -3617,8 +3615,6 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	unsigned int noreclaim_flag;
 	unsigned long progress;
 
-	cond_resched();
-
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	fs_reclaim_acquire(gfp_mask);
@@ -3630,8 +3626,6 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(gfp_mask);
 
-	cond_resched();
-
 	return progress;
 }
 
@@ -3852,13 +3846,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * Memory allocation/reclaim might be called from a WQ context and the
 	 * current implementation of the WQ concurrency control doesn't
 	 * recognize that a particular WQ is congested if the worker thread is
-	 * looping without ever sleeping. Therefore we have to do a short sleep
-	 * here rather than calling cond_resched().
+	 * looping without ever sleeping. Therefore do a short sleep here.
 	 */
 	if (current->flags & PF_WQ_WORKER)
 		schedule_timeout_uninterruptible(1);
-	else
-		cond_resched();
+
 	return ret;
 }
 
@@ -4162,7 +4154,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		cond_resched();
 		goto retry;
 	}
 fail:
diff --git a/mm/page_counter.c b/mm/page_counter.c
index db20d6452b71..c15befd5b02a 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -196,7 +196,6 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages)
 			return 0;
 
 		counter->max = old;
-		cond_resched();
 	}
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4548fcc66d74..855271588c8c 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -472,7 +472,6 @@ void __init page_ext_init(void)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;
-			cond_resched();
 		}
 	}
 	hotplug_memory_notifier(page_ext_callback, DEFAULT_CALLBACK_PRI);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 41ea77f22011..694eb1b14a66 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -151,7 +151,6 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
 			out++;
-		cond_resched();
 	}
 	return (char *)out - buf;
 }
@@ -188,7 +187,6 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
 			in++;
-		cond_resched();
 	}
 	return (char *)in - buf;
 }
diff --git a/mm/page_io.c b/mm/page_io.c
index fe4c21af23f2..02bbc8165400 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -106,8 +106,6 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 		unsigned block_in_page;
 		sector_t first_block;
 
-		cond_resched();
-
 		first_block = probe_block;
 		ret = bmap(inode, &first_block);
 		if (ret || !first_block)
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 4e2723e1b300..72278db2f01c 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -680,7 +680,6 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
 ext_put_continue:
 			page_ext_put(page_ext);
 		}
-		cond_resched();
 	}
 
 	pr_info("Node %d, zone %8s: page owner found early allocated %lu pages\n",
diff --git a/mm/percpu.c b/mm/percpu.c
index a7665de8485f..538b63f399ae 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2015,7 +2015,6 @@ static void pcpu_balance_free(bool empty_only)
 			spin_unlock_irq(&pcpu_lock);
 		}
 		pcpu_destroy_chunk(chunk);
-		cond_resched();
 	}
 	spin_lock_irq(&pcpu_lock);
 }
@@ -2083,7 +2082,6 @@ static void pcpu_balance_populated(void)
 
 			spin_unlock_irq(&pcpu_lock);
 			ret = pcpu_populate_chunk(chunk, rs, rs + nr, gfp);
-			cond_resched();
 			spin_lock_irq(&pcpu_lock);
 			if (!ret) {
 				nr_to_pop -= nr;
@@ -2101,7 +2099,6 @@ static void pcpu_balance_populated(void)
 		/* ran out of chunks to populate, create a new one and retry */
 		spin_unlock_irq(&pcpu_lock);
 		chunk = pcpu_create_chunk(gfp);
-		cond_resched();
 		spin_lock_irq(&pcpu_lock);
 		if (chunk) {
 			pcpu_chunk_relocate(chunk, -1);
@@ -2186,7 +2183,6 @@ static void pcpu_reclaim_populated(void)
 
 			spin_unlock_irq(&pcpu_lock);
 			pcpu_depopulate_chunk(chunk, i + 1, end + 1);
-			cond_resched();
 			spin_lock_irq(&pcpu_lock);
 
 			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
@@ -2203,7 +2199,6 @@ static void pcpu_reclaim_populated(void)
 			pcpu_post_unmap_tlb_flush(chunk,
 						  freed_page_start,
 						  freed_page_end);
-			cond_resched();
 			spin_lock_irq(&pcpu_lock);
 		}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 9f795b93cf40..c7aec4516309 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2434,7 +2434,6 @@ static void rmap_walk_anon(struct folio *folio,
 		unsigned long address = vma_address(&folio->page, vma);
 
 		VM_BUG_ON_VMA(address == -EFAULT, vma);
-		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
 			continue;
@@ -2495,7 +2494,6 @@ static void rmap_walk_file(struct folio *folio,
 		unsigned long address = vma_address(&folio->page, vma);
 
 		VM_BUG_ON_VMA(address == -EFAULT, vma);
-		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
 			continue;
diff --git a/mm/shmem.c b/mm/shmem.c
index 112172031b2c..0280fe449ad8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -939,7 +939,6 @@ void shmem_unlock_mapping(struct address_space *mapping)
 	       filemap_get_folios(mapping, &index, ~0UL, &fbatch)) {
 		check_move_unevictable_folios(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -1017,7 +1016,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	/*
@@ -1058,8 +1056,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
 	index = start;
 	while (index < end) {
-		cond_resched();
-
 		if (!find_get_entries(mapping, &index, end - 1, &fbatch,
 				indices)) {
 			/* If all gone or hole-punch or unfalloc, we're done */
@@ -1394,7 +1390,6 @@ int shmem_unuse(unsigned int type)
 		mutex_unlock(&shmem_swaplist_mutex);
 
 		error = shmem_unuse_inode(&info->vfs_inode, type);
-		cond_resched();
 
 		mutex_lock(&shmem_swaplist_mutex);
 		next = list_next_entry(info, swaplist);
@@ -2832,7 +2827,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			error = -EFAULT;
 			break;
 		}
-		cond_resched();
 	}
 
 	*ppos = ((loff_t) index << PAGE_SHIFT) + offset;
@@ -2986,8 +2980,6 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
 		in->f_ra.prev_pos = *ppos;
 		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 			break;
-
-		cond_resched();
 	} while (len);
 
 	if (folio)
@@ -3155,7 +3147,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		folio_mark_dirty(folio);
 		folio_unlock(folio);
 		folio_put(folio);
-		cond_resched();
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
diff --git a/mm/shuffle.c b/mm/shuffle.c
index fb1393b8b3a9..f78f201c773b 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -136,10 +136,12 @@ void __meminit __shuffle_zone(struct zone *z)
 
 		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
 
-		/* take it easy on the zone lock */
+		/*
+		 * Drop the zone lock occasionally to allow the scheduler to
+		 * preempt us if needed.
+		 */
 		if ((i % (100 * order_pages)) == 0) {
 			spin_unlock_irqrestore(&z->lock, flags);
-			cond_resched();
 			spin_lock_irqsave(&z->lock, flags);
 		}
 	}
diff --git a/mm/slab.c b/mm/slab.c
index 9ad3d0f2d1a5..7681d2cb5e64 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2196,8 +2196,6 @@ static int drain_freelist(struct kmem_cache *cache,
 		raw_spin_unlock_irq(&n->list_lock);
 		slab_destroy(cache, slab);
 		nr_freed++;
-
-		cond_resched();
 	}
 out:
 	return nr_freed;
@@ -3853,7 +3851,6 @@ static void cache_reap(struct work_struct *w)
 			STATS_ADD_REAPED(searchp, freed);
 		}
 next:
-		cond_resched();
 	}
 	check_irq_on();
 	mutex_unlock(&slab_mutex);
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index db6c4a26cf59..20d2aefbefd6 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -50,8 +50,6 @@ static int swap_cgroup_prepare(int type)
 			goto not_enough_page;
 		ctrl->map[idx] = page;
 
-		if (!(idx % SWAP_CLUSTER_MAX))
-			cond_resched();
 	}
 	return 0;
 not_enough_page:
@@ -223,8 +221,6 @@ void swap_cgroup_swapoff(int type)
 			struct page *page = map[i];
 			if (page)
 				__free_page(page);
-			if (!(i % SWAP_CLUSTER_MAX))
-				cond_resched();
 		}
 		vfree(map);
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..27db3dcec1a2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -190,7 +190,6 @@ static int discard_swap(struct swap_info_struct *si)
 				nr_blocks, GFP_KERNEL);
 		if (err)
 			return err;
-		cond_resched();
 	}
 
 	for (se = next_se(se); se; se = next_se(se)) {
@@ -201,8 +200,6 @@ static int discard_swap(struct swap_info_struct *si)
 				nr_blocks, GFP_KERNEL);
 		if (err)
 			break;
-
-		cond_resched();
 	}
 	return err;		/* That will often be -EOPNOTSUPP */
 }
@@ -864,7 +861,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 				goto checks;
 			}
 			if (unlikely(--latency_ration < 0)) {
-				cond_resched();
 				latency_ration = LATENCY_LIMIT;
 			}
 		}
@@ -931,7 +927,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		if (n_ret)
 			goto done;
 		spin_unlock(&si->lock);
-		cond_resched();
 		spin_lock(&si->lock);
 		latency_ration = LATENCY_LIMIT;
 	}
@@ -974,7 +969,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	spin_unlock(&si->lock);
 	while (++offset <= READ_ONCE(si->highest_bit)) {
 		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
 			latency_ration = LATENCY_LIMIT;
 			scanned_many = true;
 		}
@@ -984,7 +978,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	offset = si->lowest_bit;
 	while (offset < scan_base) {
 		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
 			latency_ration = LATENCY_LIMIT;
 			scanned_many = true;
 		}
@@ -1099,7 +1092,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 		spin_unlock(&si->lock);
 		if (n_ret || size == SWAPFILE_CLUSTER)
 			goto check_out;
-		cond_resched();
 
 		spin_lock(&swap_avail_lock);
 nextsi:
@@ -1914,7 +1906,6 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		cond_resched();
 		next = pmd_addr_end(addr, end);
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
@@ -1997,8 +1988,6 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
 			if (ret)
 				break;
 		}
-
-		cond_resched();
 	}
 	mmap_read_unlock(mm);
 	return ret;
@@ -2025,8 +2014,6 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 		count = READ_ONCE(si->swap_map[i]);
 		if (count && swap_count(count) != SWAP_MAP_BAD)
 			break;
-		if ((i % LATENCY_LIMIT) == 0)
-			cond_resched();
 	}
 
 	if (i == si->max)
@@ -2079,7 +2066,6 @@ static int try_to_unuse(unsigned int type)
 		 * Make sure that we aren't completely killing
 		 * interactive performance.
 		 */
-		cond_resched();
 		spin_lock(&mmlist_lock);
 	}
 	spin_unlock(&mmlist_lock);
diff --git a/mm/truncate.c b/mm/truncate.c
index 8e3aa9e8618e..9efcec90f24d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -369,7 +369,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		for (i = 0; i < folio_batch_count(&fbatch); i++)
 			folio_unlock(fbatch.folios[i]);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
@@ -399,7 +398,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 	index = start;
 	while (index < end) {
-		cond_resched();
 		if (!find_get_entries(mapping, &index, end - 1, &fbatch,
 				indices)) {
 			/* If all gone from start onwards, we're done */
@@ -533,7 +531,6 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	return count;
 }
@@ -677,7 +674,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	/*
 	 * For DAX we invalidate page tables after invalidating page cache.  We
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 96d9eae5c7cc..89127f6b8bd7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -459,8 +459,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 		hugetlb_vma_unlock_read(dst_vma);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-		cond_resched();
-
 		if (unlikely(err == -ENOENT)) {
 			mmap_read_unlock(dst_mm);
 			BUG_ON(!folio);
@@ -677,7 +675,6 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
 
 		err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr,
 				       src_addr, flags, &folio);
-		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
 			void *kaddr;
diff --git a/mm/util.c b/mm/util.c
index 8cbbfd3a3d59..3bc08be921fa 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -796,7 +796,6 @@ void folio_copy(struct folio *dst, struct folio *src)
 		copy_highpage(folio_page(dst, i), folio_page(src, i));
 		if (++i == nr)
 			break;
-		cond_resched();
 	}
 }
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a3fedb3ee0db..7d2b76cde1a7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -351,8 +351,6 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		vunmap_pte_range(pmd, addr, next, mask);
-
-		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 }
 
@@ -2840,7 +2838,6 @@ void vfree(const void *addr)
 		 * can be freed as an array of order-0 allocations
 		 */
 		__free_page(page);
-		cond_resched();
 	}
 	atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
 	kvfree(vm->pages);
@@ -3035,7 +3032,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 							pages + nr_allocated);
 
 			nr_allocated += nr;
-			cond_resched();
 
 			/*
 			 * If zero or pages were obtained partly,
@@ -3091,7 +3087,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		for (i = 0; i < (1U << order); i++)
 			pages[nr_allocated + i] = page + i;
 
-		cond_resched();
 		nr_allocated += 1U << order;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..e12f9fd27002 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -905,8 +905,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
 		total_scan -= shrinkctl->nr_scanned;
 		scanned += shrinkctl->nr_scanned;
-
-		cond_resched();
 	}
 
 	/*
@@ -1074,7 +1072,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 
 	up_read(&shrinker_rwsem);
 out:
-	cond_resched();
 	return freed;
 }
 
@@ -1204,7 +1201,6 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	 */
 	if (!current_is_kswapd() &&
 	    current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
-		cond_resched();
 		return;
 	}
 
@@ -1232,7 +1228,6 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 		fallthrough;
 	case VMSCAN_THROTTLE_NOPROGRESS:
 		if (skip_throttle_noprogress(pgdat)) {
-			cond_resched();
 			return;
 		}
 
@@ -1715,7 +1710,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	struct swap_iocb *plug = NULL;
 
 	memset(stat, 0, sizeof(*stat));
-	cond_resched();
 	do_demote_pass = can_demote(pgdat->node_id, sc);
 
 retry:
@@ -1726,8 +1720,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		bool dirty, writeback;
 		unsigned int nr_pages;
 
-		cond_resched();
-
 		folio = lru_to_folio(folio_list);
 		list_del(&folio->lru);
 
@@ -2719,7 +2711,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	while (!list_empty(&l_hold)) {
 		struct folio *folio;
 
-		cond_resched();
 		folio = lru_to_folio(&l_hold);
 		list_del(&folio->lru);
 
@@ -4319,8 +4310,6 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_
 			reset_batch_size(lruvec, walk);
 			spin_unlock_irq(&lruvec->lru_lock);
 		}
-
-		cond_resched();
 	} while (err == -EAGAIN);
 }
 
@@ -4455,7 +4444,6 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
 			continue;
 
 		spin_unlock_irq(&lruvec->lru_lock);
-		cond_resched();
 		goto restart;
 	}
 
@@ -4616,8 +4604,6 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 			mem_cgroup_iter_break(NULL, memcg);
 			return;
 		}
-
-		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 
 	/*
@@ -5378,8 +5364,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 
 		if (sc->nr_reclaimed >= nr_to_reclaim)
 			break;
-
-		cond_resched();
 	}
 
 	/* whether try_to_inc_max_seq() was successful */
@@ -5718,14 +5702,11 @@ static void lru_gen_change_state(bool enabled)
 
 			while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
 				spin_unlock_irq(&lruvec->lru_lock);
-				cond_resched();
 				spin_lock_irq(&lruvec->lru_lock);
 			}
 
 			spin_unlock_irq(&lruvec->lru_lock);
 		}
-
-		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 unlock:
 	mutex_unlock(&state_mutex);
@@ -6026,8 +6007,6 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
 
 		if (!evict_folios(lruvec, sc, swappiness))
 			return 0;
-
-		cond_resched();
 	}
 
 	return -EINTR;
@@ -6321,8 +6300,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			}
 		}
 
-		cond_resched();
-
 		if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
 			continue;
 
@@ -6473,10 +6450,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		 * This loop can become CPU-bound when target memcgs
 		 * aren't eligible for reclaim - either because they
 		 * don't have any reclaimable pages, or because their
-		 * memory is explicitly protected. Avoid soft lockups.
+		 * memory is explicitly protected. We don't disable
+		 * preemption, so expect to be preempted.
 		 */
-		cond_resched();
-
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
@@ -8024,7 +8000,6 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
 					   sc.gfp_mask);
 
-	cond_resched();
 	psi_memstall_enter(&pflags);
 	fs_reclaim_acquire(sc.gfp_mask);
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..de61cc004865 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -835,7 +835,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 #ifdef CONFIG_NUMA
 
 		if (do_pagesets) {
-			cond_resched();
 			/*
 			 * Deal with draining the remote pageset of this
 			 * processor
@@ -1525,7 +1524,6 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			}
 			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
 			spin_unlock_irq(&zone->lock);
-			cond_resched();
 			spin_lock_irq(&zone->lock);
 		}
 		seq_putc(m, '\n');
@@ -2041,8 +2039,6 @@ static void vmstat_shepherd(struct work_struct *w)
 
 		if (!delayed_work_pending(dw) && need_update(cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
-
-		cond_resched();
 	}
 	cpus_read_unlock();
 
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..ba94e5fb8390 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -750,7 +750,6 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	}
 	ret = LRU_REMOVED_RETRY;
 out:
-	cond_resched();
 	spin_lock_irq(lru_lock);
 	return ret;
 }
diff --git a/mm/z3fold.c b/mm/z3fold.c
index 7c76b396b74c..2614236c2212 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -175,7 +175,7 @@ enum z3fold_handle_flags {
 /*
  * Forward declarations
  */
-static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t, bool);
+static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t);
 static void compact_page_work(struct work_struct *w);
 
 /*****************
@@ -504,7 +504,6 @@ static void free_pages_work(struct work_struct *w)
 		spin_unlock(&pool->stale_lock);
 		cancel_work_sync(&zhdr->work);
 		free_z3fold_page(page, false);
-		cond_resched();
 		spin_lock(&pool->stale_lock);
 	}
 	spin_unlock(&pool->stale_lock);
@@ -629,7 +628,7 @@ static struct z3fold_header *compact_single_buddy(struct z3fold_header *zhdr)
 		short chunks = size_to_chunks(sz);
 		void *q;
 
-		new_zhdr = __z3fold_alloc(pool, sz, false);
+		new_zhdr = __z3fold_alloc(pool, sz);
 		if (!new_zhdr)
 			return NULL;
 
@@ -783,7 +782,7 @@ static void compact_page_work(struct work_struct *w)
 
 /* returns _locked_ z3fold page header or NULL */
 static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
-						size_t size, bool can_sleep)
+						   size_t size)
 {
 	struct z3fold_header *zhdr = NULL;
 	struct page *page;
@@ -811,8 +810,6 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
 			spin_unlock(&pool->lock);
 			zhdr = NULL;
 			migrate_enable();
-			if (can_sleep)
-				cond_resched();
 			goto lookup;
 		}
 		list_del_init(&zhdr->buddy);
@@ -825,8 +822,6 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
 			z3fold_page_unlock(zhdr);
 			zhdr = NULL;
 			migrate_enable();
-			if (can_sleep)
-				cond_resched();
 			goto lookup;
 		}
 
@@ -869,8 +864,6 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
 			    test_bit(PAGE_CLAIMED, &page->private)) {
 				z3fold_page_unlock(zhdr);
 				zhdr = NULL;
-				if (can_sleep)
-					cond_resched();
 				continue;
 			}
 			kref_get(&zhdr->refcount);
@@ -1016,7 +1009,7 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t size, gfp_t gfp,
 		bud = HEADLESS;
 	else {
 retry:
-		zhdr = __z3fold_alloc(pool, size, can_sleep);
+		zhdr = __z3fold_alloc(pool, size);
 		if (zhdr) {
 			bud = get_free_buddy(zhdr, chunks);
 			if (bud == HEADLESS) {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b58f957429f0..e6fe6522c845 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
 			dst_zspage = NULL;
 
 			spin_unlock(&pool->lock);
-			cond_resched();
 			spin_lock(&pool->lock);
 		}
 	}
diff --git a/mm/zswap.c b/mm/zswap.c
index 37d2b1cb2ecb..ad6d67ebbf70 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -704,7 +704,6 @@ static void shrink_worker(struct work_struct *w)
 			if (++failures == MAX_RECLAIM_RETRIES)
 				break;
 		}
-		cond_resched();
 	} while (!zswap_can_accept());
 	zswap_pool_put(pool);
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 69/86] treewide: io_uring: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (10 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 68/86] treewide: mm: remove cond_resched() Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 70/86] treewide: ipc: " Ankur Arora
                     ` (18 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Jens Axboe, Pavel Begunkov

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.
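
As an illustration, a typical set-2 open-coded variant looks roughly
like the sketch below. This is a hedged example for this cover text
only, not code from this series; my_lock, my_list, my_obj and
process_one() are made-up names.

	#include <linux/list.h>
	#include <linux/sched.h>
	#include <linux/spinlock.h>

	struct my_obj { struct list_head node; };
	static LIST_HEAD(my_list);
	static DEFINE_SPINLOCK(my_lock);

	static void process_one(struct my_obj *obj) { /* placeholder */ }

	static void drain_my_list(void)
	{
		struct my_obj *obj;

		spin_lock(&my_lock);
		list_for_each_entry(obj, &my_list, node) {
			process_one(obj);	/* potentially slow */
			if (need_resched() || spin_needbreak(&my_lock)) {
				spin_unlock(&my_lock);
				cond_resched();	/* the call this series removes */
				spin_lock(&my_lock);
			}
		}
		spin_unlock(&my_lock);
	}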

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given it's NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
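
Going only by that description, a minimal sketch of such a helper
could look like the following. This is an assumption made for
illustration; the actual cond_resched_stall() is introduced earlier
in this series and may well differ in detail.

	#include <linux/preempt.h>
	#include <linux/sched.h>

	/*
	 * Hedged sketch: reschedule if it is safe to do so, otherwise
	 * just relax the CPU (cpu_relax() comes from asm/processor.h).
	 */
	static inline int cond_resched_stall(void)
	{
		if (need_resched() && preemptible()) {
			schedule();
			return 1;
		}

		cpu_relax();
		return 0;
	}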

All of the uses of cond_resched() are from set-1 or set-2.
Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 io_uring/io-wq.c    |  4 +---
 io_uring/io_uring.c | 21 ++++++++++++---------
 io_uring/kbuf.c     |  2 --
 io_uring/sqpoll.c   |  6 ++++--
 io_uring/tctx.c     |  4 +---
 5 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..fcaf9161be03 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -532,10 +532,8 @@ static struct io_wq_work *io_get_next_work(struct io_wq_acct *acct,
 static void io_assign_current_work(struct io_worker *worker,
 				   struct io_wq_work *work)
 {
-	if (work) {
+	if (work)
 		io_run_task_work();
-		cond_resched();
-	}
 
 	raw_spin_lock(&worker->lock);
 	worker->cur_work = work;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8d1bc6cdfe71..547b7c6bdc68 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1203,9 +1203,14 @@ static unsigned int handle_tw_list(struct llist_node *node,
 		node = next;
 		count++;
 		if (unlikely(need_resched())) {
+
+			/*
+			 * Depending on whether we have PREEMPT_RCU or not, the
+			 * mutex_unlock() or percpu_ref_put() should cause us to
+			 * reschedule.
+			 */
 			ctx_flush_and_put(*ctx, ts);
 			*ctx = NULL;
-			cond_resched();
 		}
 	}
 
@@ -1611,7 +1616,6 @@ static __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
 		 */
 		if (need_resched()) {
 			mutex_unlock(&ctx->uring_lock);
-			cond_resched();
 			mutex_lock(&ctx->uring_lock);
 		}
 	}
@@ -1977,7 +1981,6 @@ void io_wq_submit_work(struct io_wq_work *work)
 				break;
 			if (io_wq_worker_stopped())
 				break;
-			cond_resched();
 			continue;
 		}
 
@@ -2649,7 +2652,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			ret = 0;
 			break;
 		}
-		cond_resched();
 	} while (1);
 
 	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
@@ -3096,8 +3098,12 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 			io_move_task_work_from_local(ctx);
 
+		/*
+		 * io_uring_try_cancel_requests() will reschedule when needed
+		 * in the mutex_unlock().
+		 */
 		while (io_uring_try_cancel_requests(ctx, NULL, true))
-			cond_resched();
+			;
 
 		if (ctx->sq_data) {
 			struct io_sq_data *sqd = ctx->sq_data;
@@ -3313,7 +3319,6 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 		while (!wq_list_empty(&ctx->iopoll_list)) {
 			io_iopoll_try_reap_events(ctx);
 			ret = true;
-			cond_resched();
 		}
 	}
 
@@ -3382,10 +3387,8 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
 								     cancel_all);
 		}
 
-		if (loop) {
-			cond_resched();
+		if (loop)
 			continue;
-		}
 
 		prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
 		io_run_task_work();
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 9123138aa9f4..ef94a7c76d9a 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -246,7 +246,6 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 		list_move(&nxt->list, &ctx->io_buffers_cache);
 		if (++i == nbufs)
 			return i;
-		cond_resched();
 	}
 
 	return i;
@@ -421,7 +420,6 @@ static int io_add_buffers(struct io_ring_ctx *ctx, struct io_provide_buf *pbuf,
 		buf->bgid = pbuf->bgid;
 		addr += pbuf->len;
 		bid++;
-		cond_resched();
 	}
 
 	return i ? 0 : -ENOMEM;
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index bd6c2c7959a5..b297b7b8047e 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -212,7 +212,6 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd)
 		mutex_unlock(&sqd->lock);
 		if (signal_pending(current))
 			did_sig = get_signal(&ksig);
-		cond_resched();
 		mutex_lock(&sqd->lock);
 	}
 	return did_sig || test_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
@@ -258,8 +257,11 @@ static int io_sq_thread(void *data)
 			if (sqt_spin)
 				timeout = jiffies + sqd->sq_thread_idle;
 			if (unlikely(need_resched())) {
+				/*
+				 * Drop the mutex and reacquire so a reschedule can
+				 * happen on unlock.
+				 */
 				mutex_unlock(&sqd->lock);
-				cond_resched();
 				mutex_lock(&sqd->lock);
 			}
 			continue;
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index c043fe93a3f2..1bf58f01e50c 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -181,10 +181,8 @@ __cold void io_uring_clean_tctx(struct io_uring_task *tctx)
 	struct io_tctx_node *node;
 	unsigned long index;
 
-	xa_for_each(&tctx->xa, index, node) {
+	xa_for_each(&tctx->xa, index, node)
 		io_uring_del_tctx_node(index);
-		cond_resched();
-	}
 	if (wq) {
 		/*
 		 * Must be after io_uring_del_tctx_node() (removes nodes under
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 70/86] treewide: ipc: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (11 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 69/86] treewide: io_uring: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 71/86] treewide: lib: " Ankur Arora
                     ` (17 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Davidlohr Bueso, Christophe JAILLET, Manfred Spraul, Jann Horn

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
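
As a rough illustration (an approximation for this discussion, not the
definition added earlier in the series), the intended semantics are
roughly:

  #include <linux/preempt.h>
  #include <linux/sched.h>
  #include <asm/processor.h>

  /* Approximate semantics only: schedule if we can, relax the CPU if not. */
  static inline void cond_resched_stall_sketch(void)
  {
  	if (preemptible() && need_resched())
  		schedule();
  	else
  		cpu_relax();
  }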

All calls to cond_resched() are from set-1, from potentially long
running loops. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Davidlohr Bueso <dave@stgolabs.net> 
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> 
Cc: Manfred Spraul <manfred@colorfullife.com> 
Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: Jann Horn <jannh@google.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 ipc/msgutil.c | 3 ---
 ipc/sem.c     | 2 --
 2 files changed, 5 deletions(-)

diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index d0a0e877cadd..d9d1b7957bb6 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -62,8 +62,6 @@ static struct msg_msg *alloc_msg(size_t len)
 	while (len > 0) {
 		struct msg_msgseg *seg;
 
-		cond_resched();
-
 		alen = min(len, DATALEN_SEG);
 		seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT);
 		if (seg == NULL)
@@ -177,7 +175,6 @@ void free_msg(struct msg_msg *msg)
 	while (seg != NULL) {
 		struct msg_msgseg *tmp = seg->next;
 
-		cond_resched();
 		kfree(seg);
 		seg = tmp;
 	}
diff --git a/ipc/sem.c b/ipc/sem.c
index a39cdc7bf88f..e12ab01161f6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2350,8 +2350,6 @@ void exit_sem(struct task_struct *tsk)
 		int semid, i;
 		DEFINE_WAKE_Q(wake_q);
 
-		cond_resched();
-
 		rcu_read_lock();
 		un = list_entry_rcu(ulp->list_proc.next,
 				    struct sem_undo, list_proc);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (12 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 70/86] treewide: ipc: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-08  9:15     ` Herbert Xu
  2023-11-08 19:15     ` Kees Cook
  2023-11-07 23:08   ` [RFC PATCH 72/86] treewide: crypto: " Ankur Arora
                     ` (16 subsequent siblings)
  30 siblings, 2 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Herbert Xu, David S. Miller, Kees Cook, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
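
As an example of the "timed or event wait" direction for set-3 loops
(a sketch, not code from this series; try_to_make_progress() is just a
stand-in for whatever operation is being retried):

  #include <linux/delay.h>
  #include <linux/errno.h>

  int try_to_make_progress(void);	/* stand-in for the retried operation */

  static int retry_with_timed_wait(void)
  {
  	int err;

  	/* Back off for a bounded interval instead of cond_resched(). */
  	while ((err = try_to_make_progress()) == -EAGAIN)
  		usleep_range(50, 100);

  	return err;
  }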

Almost all the cond_resched() calls are from set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Kees Cook <keescook@chromium.org> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 lib/crc32test.c          |  2 --
 lib/crypto/mpi/mpi-pow.c |  1 -
 lib/memcpy_kunit.c       |  5 -----
 lib/random32.c           |  1 -
 lib/rhashtable.c         |  2 --
 lib/test_bpf.c           |  3 ---
 lib/test_lockup.c        |  2 +-
 lib/test_maple_tree.c    |  8 --------
 lib/test_rhashtable.c    | 10 ----------
 9 files changed, 1 insertion(+), 33 deletions(-)

diff --git a/lib/crc32test.c b/lib/crc32test.c
index 9b4af79412c4..3eee90482e9a 100644
--- a/lib/crc32test.c
+++ b/lib/crc32test.c
@@ -729,7 +729,6 @@ static int __init crc32c_combine_test(void)
 			      crc_full == test[i].crc32c_le))
 				errors++;
 			runs++;
-			cond_resched();
 		}
 	}
 
@@ -817,7 +816,6 @@ static int __init crc32_combine_test(void)
 			      crc_full == test[i].crc_le))
 				errors++;
 			runs++;
-			cond_resched();
 		}
 	}
 
diff --git a/lib/crypto/mpi/mpi-pow.c b/lib/crypto/mpi/mpi-pow.c
index 2fd7a46d55ec..074534900b7e 100644
--- a/lib/crypto/mpi/mpi-pow.c
+++ b/lib/crypto/mpi/mpi-pow.c
@@ -242,7 +242,6 @@ int mpi_powm(MPI res, MPI base, MPI exp, MPI mod)
 				}
 				e <<= 1;
 				c--;
-				cond_resched();
 			}
 
 			i--;
diff --git a/lib/memcpy_kunit.c b/lib/memcpy_kunit.c
index 440aee705ccc..c2a6b09fe93a 100644
--- a/lib/memcpy_kunit.c
+++ b/lib/memcpy_kunit.c
@@ -361,8 +361,6 @@ static void copy_large_test(struct kunit *test, bool use_memmove)
 			/* Zero out what we copied for the next cycle. */
 			memset(large_dst + offset, 0, bytes);
 		}
-		/* Avoid stall warnings if this loop gets slow. */
-		cond_resched();
 	}
 }
 
@@ -489,9 +487,6 @@ static void memmove_overlap_test(struct kunit *test)
 			for (int s_off = s_start; s_off < s_end;
 			     s_off = next_step(s_off, s_start, s_end, window_step))
 				inner_loop(test, bytes, d_off, s_off);
-
-			/* Avoid stall warnings. */
-			cond_resched();
 		}
 	}
 }
diff --git a/lib/random32.c b/lib/random32.c
index 32060b852668..10bc804d99d6 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -287,7 +287,6 @@ static int __init prandom_state_selftest(void)
 			errors++;
 
 		runs++;
-		cond_resched();
 	}
 
 	if (errors)
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6ae2ba8e06a2..5ff0f521bf29 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -328,7 +328,6 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
 		err = rhashtable_rehash_chain(ht, old_hash);
 		if (err)
 			return err;
-		cond_resched();
 	}
 
 	/* Publish the new table pointer. */
@@ -1147,7 +1146,6 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
 		for (i = 0; i < tbl->size; i++) {
 			struct rhash_head *pos, *next;
 
-			cond_resched();
 			for (pos = rht_ptr_exclusive(rht_bucket(tbl, i)),
 			     next = !rht_is_a_nulls(pos) ?
 					rht_dereference(pos->next, ht) : NULL;
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index ecde4216201e..15b4d32712d8 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -14758,7 +14758,6 @@ static __init int test_skb_segment(void)
 	for (i = 0; i < ARRAY_SIZE(skb_segment_tests); i++) {
 		const struct skb_segment_test *test = &skb_segment_tests[i];
 
-		cond_resched();
 		if (exclude_test(i))
 			continue;
 
@@ -14787,7 +14786,6 @@ static __init int test_bpf(void)
 		struct bpf_prog *fp;
 		int err;
 
-		cond_resched();
 		if (exclude_test(i))
 			continue;
 
@@ -15171,7 +15169,6 @@ static __init int test_tail_calls(struct bpf_array *progs)
 		u64 duration;
 		int ret;
 
-		cond_resched();
 		if (exclude_test(i))
 			continue;
 
diff --git a/lib/test_lockup.c b/lib/test_lockup.c
index c3fd87d6c2dd..9af5d34c98f6 100644
--- a/lib/test_lockup.c
+++ b/lib/test_lockup.c
@@ -381,7 +381,7 @@ static void test_lockup(bool master)
 			touch_nmi_watchdog();
 
 		if (call_cond_resched)
-			cond_resched();
+			cond_resched_stall();
 
 		test_wait(cooldown_secs, cooldown_nsecs);
 
diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 464eeb90d5ad..321fd5d8aef3 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -2672,7 +2672,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Check with a value at zero, no gap */
 	for (i = 1000; i < 2000; i++) {
@@ -2682,7 +2681,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Check with a value at zero and unreasonably large */
 	for (i = big_start; i < big_start + 10; i++) {
@@ -2692,7 +2690,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Small to medium size not starting at zero*/
 	for (i = 200; i < 1000; i++) {
@@ -2702,7 +2699,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Unreasonably large not starting at zero*/
 	for (i = big_start; i < big_start + 10; i++) {
@@ -2710,7 +2706,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		check_dup_gaps(mt, i, false, 5);
 		mtree_destroy(mt);
 		rcu_barrier();
-		cond_resched();
 		mt_cache_shrink();
 	}
 
@@ -2720,7 +2715,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		check_dup_gaps(mt, i, false, 5);
 		mtree_destroy(mt);
 		rcu_barrier();
-		cond_resched();
 		if (i % 2 == 0)
 			mt_cache_shrink();
 	}
@@ -2732,7 +2726,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		check_dup_gaps(mt, i, true, 5);
 		mtree_destroy(mt);
 		rcu_barrier();
-		cond_resched();
 	}
 
 	mt_cache_shrink();
@@ -2743,7 +2736,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		mtree_destroy(mt);
 		rcu_barrier();
 		mt_cache_shrink();
-		cond_resched();
 	}
 }
 
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c20f6cb4bf55..e5d1f272f2c6 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -119,7 +119,6 @@ static int insert_retry(struct rhashtable *ht, struct test_obj *obj,
 
 	do {
 		retries++;
-		cond_resched();
 		err = rhashtable_insert_fast(ht, &obj->node, params);
 		if (err == -ENOMEM && enomem_retry) {
 			enomem_retries++;
@@ -253,8 +252,6 @@ static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
 
 			rhashtable_remove_fast(ht, &obj->node, test_rht_params);
 		}
-
-		cond_resched();
 	}
 
 	end = ktime_get_ns();
@@ -371,8 +368,6 @@ static int __init test_rhltable(unsigned int entries)
 		u32 i = get_random_u32_below(entries);
 		u32 prand = get_random_u32_below(4);
 
-		cond_resched();
-
 		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
 		if (test_bit(i, obj_in_table)) {
 			clear_bit(i, obj_in_table);
@@ -412,7 +407,6 @@ static int __init test_rhltable(unsigned int entries)
 	}
 
 	for (i = 0; i < entries; i++) {
-		cond_resched();
 		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
 		if (test_bit(i, obj_in_table)) {
 			if (WARN(err, "cannot remove element at slot %d", i))
@@ -607,8 +601,6 @@ static int thread_lookup_test(struct thread_data *tdata)
 			       obj->value.tid, obj->value.id, key.tid, key.id);
 			err++;
 		}
-
-		cond_resched();
 	}
 	return err;
 }
@@ -660,8 +652,6 @@ static int threadfunc(void *data)
 				goto out;
 			}
 			tdata->objs[i].value.id = TEST_INSERT_FAIL;
-
-			cond_resched();
 		}
 		err = thread_lookup_test(tdata);
 		if (err) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 72/86] treewide: crypto: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (13 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 71/86] treewide: lib: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 73/86] treewide: security: " Ankur Arora
                     ` (15 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Herbert Xu, David S. Miller, linux-crypto

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the cond_resched() calls are from set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 crypto/internal.h |  2 +-
 crypto/tcrypt.c   |  5 -----
 crypto/testmgr.c  | 10 ----------
 3 files changed, 1 insertion(+), 16 deletions(-)

diff --git a/crypto/internal.h b/crypto/internal.h
index 63e59240d5fb..930f8f5fad39 100644
--- a/crypto/internal.h
+++ b/crypto/internal.h
@@ -203,7 +203,7 @@ static inline void crypto_notify(unsigned long val, void *v)
 static inline void crypto_yield(u32 flags)
 {
 	if (flags & CRYPTO_TFM_REQ_MAY_SLEEP)
-		cond_resched();
+		cond_resched_stall();
 }
 
 static inline int crypto_is_test_larval(struct crypto_larval *larval)
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 202ca1a3105d..9f33b9724a2e 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -414,7 +414,6 @@ static void test_mb_aead_speed(const char *algo, int enc, int secs,
 			if (secs) {
 				ret = test_mb_aead_jiffies(data, enc, bs,
 							   secs, num_mb);
-				cond_resched();
 			} else {
 				ret = test_mb_aead_cycles(data, enc, bs,
 							  num_mb);
@@ -667,7 +666,6 @@ static void test_aead_speed(const char *algo, int enc, unsigned int secs,
 			if (secs) {
 				ret = test_aead_jiffies(req, enc, bs,
 							secs);
-				cond_resched();
 			} else {
 				ret = test_aead_cycles(req, enc, bs);
 			}
@@ -923,7 +921,6 @@ static void test_ahash_speed_common(const char *algo, unsigned int secs,
 		if (secs) {
 			ret = test_ahash_jiffies(req, speed[i].blen,
 						 speed[i].plen, output, secs);
-			cond_resched();
 		} else {
 			ret = test_ahash_cycles(req, speed[i].blen,
 						speed[i].plen, output);
@@ -1182,7 +1179,6 @@ static void test_mb_skcipher_speed(const char *algo, int enc, int secs,
 				ret = test_mb_acipher_jiffies(data, enc,
 							      bs, secs,
 							      num_mb);
-				cond_resched();
 			} else {
 				ret = test_mb_acipher_cycles(data, enc,
 							     bs, num_mb);
@@ -1397,7 +1393,6 @@ static void test_skcipher_speed(const char *algo, int enc, unsigned int secs,
 			if (secs) {
 				ret = test_acipher_jiffies(req, enc,
 							   bs, secs);
-				cond_resched();
 			} else {
 				ret = test_acipher_cycles(req, enc,
 							  bs);
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 216878c8bc3d..2909c5aa4b8b 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1676,7 +1676,6 @@ static int test_hash_vec(const struct hash_testvec *vec, unsigned int vec_num,
 						req, desc, tsgl, hashstate);
 			if (err)
 				return err;
-			cond_resched();
 		}
 	}
 #endif
@@ -1837,7 +1836,6 @@ static int test_hash_vs_generic_impl(const char *generic_driver,
 					req, desc, tsgl, hashstate);
 		if (err)
 			goto out;
-		cond_resched();
 	}
 	err = 0;
 out:
@@ -1966,7 +1964,6 @@ static int __alg_test_hash(const struct hash_testvec *vecs,
 		err = test_hash_vec(&vecs[i], i, req, desc, tsgl, hashstate);
 		if (err)
 			goto out;
-		cond_resched();
 	}
 	err = test_hash_vs_generic_impl(generic_driver, maxkeysize, req,
 					desc, tsgl, hashstate);
@@ -2246,7 +2243,6 @@ static int test_aead_vec(int enc, const struct aead_testvec *vec,
 						&cfg, req, tsgls);
 			if (err)
 				return err;
-			cond_resched();
 		}
 	}
 #endif
@@ -2476,7 +2472,6 @@ static int test_aead_inauthentic_inputs(struct aead_extra_tests_ctx *ctx)
 			if (err)
 				return err;
 		}
-		cond_resched();
 	}
 	return 0;
 }
@@ -2580,7 +2575,6 @@ static int test_aead_vs_generic_impl(struct aead_extra_tests_ctx *ctx)
 			if (err)
 				goto out;
 		}
-		cond_resched();
 	}
 	err = 0;
 out:
@@ -2659,7 +2653,6 @@ static int test_aead(int enc, const struct aead_test_suite *suite,
 		err = test_aead_vec(enc, &suite->vecs[i], i, req, tsgls);
 		if (err)
 			return err;
-		cond_resched();
 	}
 	return 0;
 }
@@ -3006,7 +2999,6 @@ static int test_skcipher_vec(int enc, const struct cipher_testvec *vec,
 						    &cfg, req, tsgls);
 			if (err)
 				return err;
-			cond_resched();
 		}
 	}
 #endif
@@ -3203,7 +3195,6 @@ static int test_skcipher_vs_generic_impl(const char *generic_driver,
 					    cfg, req, tsgls);
 		if (err)
 			goto out;
-		cond_resched();
 	}
 	err = 0;
 out:
@@ -3236,7 +3227,6 @@ static int test_skcipher(int enc, const struct cipher_test_suite *suite,
 		err = test_skcipher_vec(enc, &suite->vecs[i], i, req, tsgls);
 		if (err)
 			return err;
-		cond_resched();
 	}
 	return 0;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 73/86] treewide: security: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (14 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 72/86] treewide: crypto: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 74/86] treewide: fs: " Ankur Arora
                     ` (14 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	David Howells, Jarkko Sakkinen

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the cond_resched() calls are to avoid monopolizing the CPU while
executing in long loops (set-1 or set-2).

Remove them.
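
For reference, a sketch of what a set-2 style loop looks like after the
removal (not code from this patch; more_entries() and process_one_entry()
are stand-ins): the unlock/re-lock pair is the scheduling point, since
with preempt_count always maintained a pending reschedule can be acted
on when spin_unlock() re-enables preemption.

  #include <linux/sched.h>
  #include <linux/spinlock.h>

  bool more_entries(void);		/* stand-ins for the real scan */
  void process_one_entry(void);

  static void scan_with_lock_breaks(spinlock_t *lock)
  {
  	spin_lock(lock);
  	while (more_entries()) {
  		process_one_entry();
  		if (need_resched()) {
  			/* Preemption can happen on this unlock. */
  			spin_unlock(lock);
  			spin_lock(lock);
  		}
  	}
  	spin_unlock(lock);
  }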

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: David Howells <dhowells@redhat.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 security/keys/gc.c             | 1 -
 security/landlock/fs.c         | 1 -
 security/selinux/ss/hashtab.h  | 2 --
 security/selinux/ss/policydb.c | 6 ------
 security/selinux/ss/services.c | 1 -
 security/selinux/ss/sidtab.c   | 1 -
 6 files changed, 12 deletions(-)

diff --git a/security/keys/gc.c b/security/keys/gc.c
index 3c90807476eb..edb886df2d82 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -265,7 +265,6 @@ static void key_garbage_collector(struct work_struct *work)
 
 maybe_resched:
 	if (cursor) {
-		cond_resched();
 		spin_lock(&key_serial_lock);
 		goto continue_scanning;
 	}
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 1c0c198f6fdb..e7ecd8cca418 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1013,7 +1013,6 @@ static void hook_sb_delete(struct super_block *const sb)
 			 * previous loop walk, which is not needed anymore.
 			 */
 			iput(prev_inode);
-			cond_resched();
 			spin_lock(&sb->s_inode_list_lock);
 		}
 		prev_inode = inode;
diff --git a/security/selinux/ss/hashtab.h b/security/selinux/ss/hashtab.h
index f9713b56d3d0..1e297dd83b3e 100644
--- a/security/selinux/ss/hashtab.h
+++ b/security/selinux/ss/hashtab.h
@@ -64,8 +64,6 @@ static inline int hashtab_insert(struct hashtab *h, void *key, void *datum,
 	u32 hvalue;
 	struct hashtab_node *prev, *cur;
 
-	cond_resched();
-
 	if (!h->size || h->nel == HASHTAB_MAX_NODES)
 		return -EINVAL;
 
diff --git a/security/selinux/ss/policydb.c b/security/selinux/ss/policydb.c
index 2d528f699a22..2737b753d9da 100644
--- a/security/selinux/ss/policydb.c
+++ b/security/selinux/ss/policydb.c
@@ -336,7 +336,6 @@ static int filenametr_destroy(void *key, void *datum, void *p)
 		kfree(d);
 		d = next;
 	} while (unlikely(d));
-	cond_resched();
 	return 0;
 }
 
@@ -348,7 +347,6 @@ static int range_tr_destroy(void *key, void *datum, void *p)
 	ebitmap_destroy(&rt->level[0].cat);
 	ebitmap_destroy(&rt->level[1].cat);
 	kfree(datum);
-	cond_resched();
 	return 0;
 }
 
@@ -786,7 +784,6 @@ void policydb_destroy(struct policydb *p)
 	struct role_allow *ra, *lra = NULL;
 
 	for (i = 0; i < SYM_NUM; i++) {
-		cond_resched();
 		hashtab_map(&p->symtab[i].table, destroy_f[i], NULL);
 		hashtab_destroy(&p->symtab[i].table);
 	}
@@ -802,7 +799,6 @@ void policydb_destroy(struct policydb *p)
 	avtab_destroy(&p->te_avtab);
 
 	for (i = 0; i < OCON_NUM; i++) {
-		cond_resched();
 		c = p->ocontexts[i];
 		while (c) {
 			ctmp = c;
@@ -814,7 +810,6 @@ void policydb_destroy(struct policydb *p)
 
 	g = p->genfs;
 	while (g) {
-		cond_resched();
 		kfree(g->fstype);
 		c = g->head;
 		while (c) {
@@ -834,7 +829,6 @@ void policydb_destroy(struct policydb *p)
 	hashtab_destroy(&p->role_tr);
 
 	for (ra = p->role_allow; ra; ra = ra->next) {
-		cond_resched();
 		kfree(lra);
 		lra = ra;
 	}
diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 1eeffc66ea7d..0cb652456256 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -2790,7 +2790,6 @@ int security_get_user_sids(u32 fromsid,
 					  &dummy_avd);
 		if (!rc)
 			mysids2[j++] = mysids[i];
-		cond_resched();
 	}
 	kfree(mysids);
 	*sids = mysids2;
diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
index d8ead463b8df..c5537cecb755 100644
--- a/security/selinux/ss/sidtab.c
+++ b/security/selinux/ss/sidtab.c
@@ -415,7 +415,6 @@ static int sidtab_convert_tree(union sidtab_entry_inner *edst,
 			(*pos)++;
 			i++;
 		}
-		cond_resched();
 	}
 	return 0;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 74/86] treewide: fs: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (15 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 73/86] treewide: security: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 75/86] treewide: virt: " Ankur Arora
                     ` (13 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Chris Mason, Josef Bacik, David Sterba, Alexander Viro,
	Christian Brauner, Gao Xiang, Chao Yu, Theodore Ts'o,
	Andreas Dilger, Jaegeuk Kim, OGAWA Hirofumi, Mikulas Patocka,
	Mike Kravetz, Muchun Song, Trond Myklebust, Anna Schumaker

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most uses here are from set-1 or ones that can be converted to set-2.
In a few retry loops we replace cond_resched() with cpu_relax() or
cond_resched_stall().
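
The set-2 conversions follow the pattern below (a sketch mirroring the
cachefiles change further down; process_one_object() is a stand-in):
the open-coded unlock/cond_resched()/lock sequence becomes
cond_resched_lock(), which drops and re-takes the lock only when a
reschedule is pending or the lock is contended.

  #include <linux/sched.h>
  #include <linux/spinlock.h>

  void process_one_object(void);	/* stand-in for the per-object work */

  static void withdraw_objects_sketch(spinlock_t *lock, unsigned int nr)
  {
  	unsigned int count;

  	spin_lock(lock);
  	for (count = 1; count <= nr; count++) {
  		process_one_object();
  		/* Drop and re-take the lock only if needed. */
  		if ((count & 63) == 0)
  			cond_resched_lock(lock);
  	}
  	spin_unlock(lock);
  }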

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Chris Mason <clm@fb.com> 
Cc: Josef Bacik <josef@toxicpanda.com> 
Cc: David Sterba <dsterba@suse.com> 
Cc: Alexander Viro <viro@zeniv.linux.org.uk> 
Cc: Christian Brauner <brauner@kernel.org> 
Cc: Gao Xiang <xiang@kernel.org> 
Cc: Chao Yu <chao@kernel.org> 
Cc: "Theodore Ts'o" <tytso@mit.edu> 
Cc: Andreas Dilger <adilger.kernel@dilger.ca> 
Cc: Jaegeuk Kim <jaegeuk@kernel.org> 
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> 
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> 
Cc: Mike Kravetz <mike.kravetz@oracle.com> 
Cc: Muchun Song <muchun.song@linux.dev> 
Cc: Trond Myklebust <trond.myklebust@hammerspace.com> 
Cc: Anna Schumaker <anna@kernel.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 fs/afs/write.c                     |  2 --
 fs/btrfs/backref.c                 |  6 ------
 fs/btrfs/block-group.c             |  3 ---
 fs/btrfs/ctree.c                   |  1 -
 fs/btrfs/defrag.c                  |  1 -
 fs/btrfs/disk-io.c                 |  3 ---
 fs/btrfs/extent-io-tree.c          |  5 -----
 fs/btrfs/extent-tree.c             |  8 --------
 fs/btrfs/extent_io.c               |  9 ---------
 fs/btrfs/file-item.c               |  1 -
 fs/btrfs/file.c                    |  4 ----
 fs/btrfs/free-space-cache.c        |  4 ----
 fs/btrfs/inode.c                   |  9 ---------
 fs/btrfs/ordered-data.c            |  2 --
 fs/btrfs/qgroup.c                  |  1 -
 fs/btrfs/reflink.c                 |  2 --
 fs/btrfs/relocation.c              |  9 ---------
 fs/btrfs/scrub.c                   |  3 ---
 fs/btrfs/send.c                    |  1 -
 fs/btrfs/space-info.c              |  1 -
 fs/btrfs/tests/extent-io-tests.c   |  1 -
 fs/btrfs/transaction.c             |  3 ---
 fs/btrfs/tree-log.c                | 12 ------------
 fs/btrfs/uuid-tree.c               |  1 -
 fs/btrfs/volumes.c                 |  2 --
 fs/buffer.c                        |  1 -
 fs/cachefiles/cache.c              |  4 +---
 fs/cachefiles/namei.c              |  1 -
 fs/cachefiles/volume.c             |  1 -
 fs/ceph/addr.c                     |  1 -
 fs/dax.c                           |  1 -
 fs/dcache.c                        |  2 --
 fs/dlm/ast.c                       |  1 -
 fs/dlm/dir.c                       |  2 --
 fs/dlm/lock.c                      |  3 ---
 fs/dlm/lowcomms.c                  |  3 ---
 fs/dlm/recover.c                   |  1 -
 fs/drop_caches.c                   |  1 -
 fs/erofs/utils.c                   |  1 -
 fs/erofs/zdata.c                   |  8 ++++++--
 fs/eventpoll.c                     |  3 ---
 fs/exec.c                          |  4 ----
 fs/ext4/block_validity.c           |  2 --
 fs/ext4/dir.c                      |  1 -
 fs/ext4/extents.c                  |  1 -
 fs/ext4/ialloc.c                   |  1 -
 fs/ext4/inode.c                    |  1 -
 fs/ext4/mballoc.c                  | 12 ++++--------
 fs/ext4/namei.c                    |  3 ---
 fs/ext4/orphan.c                   |  1 -
 fs/ext4/super.c                    |  2 --
 fs/f2fs/checkpoint.c               | 16 ++++++----------
 fs/f2fs/compress.c                 |  1 -
 fs/f2fs/data.c                     |  3 ---
 fs/f2fs/dir.c                      |  1 -
 fs/f2fs/extent_cache.c             |  1 -
 fs/f2fs/f2fs.h                     |  6 +++++-
 fs/f2fs/file.c                     |  3 ---
 fs/f2fs/node.c                     |  4 ----
 fs/f2fs/super.c                    |  1 -
 fs/fat/fatent.c                    |  2 --
 fs/file.c                          |  7 +------
 fs/fs-writeback.c                  |  3 ---
 fs/gfs2/aops.c                     |  1 -
 fs/gfs2/bmap.c                     |  2 --
 fs/gfs2/glock.c                    |  2 +-
 fs/gfs2/log.c                      |  1 -
 fs/gfs2/ops_fstype.c               |  1 -
 fs/hpfs/buffer.c                   |  8 --------
 fs/hugetlbfs/inode.c               |  3 ---
 fs/inode.c                         |  3 ---
 fs/iomap/buffered-io.c             |  7 +------
 fs/jbd2/checkpoint.c               |  2 --
 fs/jbd2/commit.c                   |  3 ---
 fs/jbd2/recovery.c                 |  2 --
 fs/jffs2/build.c                   |  6 +-----
 fs/jffs2/erase.c                   |  3 ---
 fs/jffs2/gc.c                      |  2 --
 fs/jffs2/nodelist.c                |  1 -
 fs/jffs2/nodemgmt.c                | 11 ++++++++---
 fs/jffs2/readinode.c               |  2 --
 fs/jffs2/scan.c                    |  4 ----
 fs/jffs2/summary.c                 |  2 --
 fs/jfs/jfs_txnmgr.c                | 14 ++++----------
 fs/libfs.c                         |  5 ++---
 fs/mbcache.c                       |  1 -
 fs/namei.c                         |  1 -
 fs/netfs/io.c                      |  1 -
 fs/nfs/delegation.c                |  3 ---
 fs/nfs/pnfs.c                      |  2 --
 fs/nfs/write.c                     |  4 ----
 fs/nilfs2/btree.c                  |  1 -
 fs/nilfs2/inode.c                  |  1 -
 fs/nilfs2/page.c                   |  4 ----
 fs/nilfs2/segment.c                |  4 ----
 fs/notify/fanotify/fanotify_user.c |  1 -
 fs/notify/fsnotify.c               |  1 -
 fs/ntfs/attrib.c                   |  3 ---
 fs/ntfs/file.c                     |  2 --
 fs/ntfs3/file.c                    |  9 ---------
 fs/ntfs3/frecord.c                 |  2 --
 fs/ocfs2/alloc.c                   |  4 +---
 fs/ocfs2/cluster/tcp.c             |  8 ++++++--
 fs/ocfs2/dlm/dlmthread.c           |  7 +++----
 fs/ocfs2/file.c                    | 10 ++++------
 fs/proc/base.c                     |  1 -
 fs/proc/fd.c                       |  1 -
 fs/proc/kcore.c                    |  1 -
 fs/proc/page.c                     |  6 ------
 fs/proc/task_mmu.c                 |  7 -------
 fs/quota/dquot.c                   |  1 -
 fs/reiserfs/journal.c              |  2 --
 fs/select.c                        |  1 -
 fs/smb/client/file.c               |  2 --
 fs/splice.c                        |  1 -
 fs/ubifs/budget.c                  |  1 -
 fs/ubifs/commit.c                  |  1 -
 fs/ubifs/debug.c                   |  5 -----
 fs/ubifs/dir.c                     |  1 -
 fs/ubifs/gc.c                      |  5 -----
 fs/ubifs/io.c                      |  2 --
 fs/ubifs/lprops.c                  |  2 --
 fs/ubifs/lpt_commit.c              |  3 ---
 fs/ubifs/orphan.c                  |  1 -
 fs/ubifs/recovery.c                |  4 ----
 fs/ubifs/replay.c                  |  7 -------
 fs/ubifs/scan.c                    |  2 --
 fs/ubifs/shrinker.c                |  1 -
 fs/ubifs/super.c                   |  2 --
 fs/ubifs/tnc_commit.c              |  2 --
 fs/ubifs/tnc_misc.c                |  1 -
 fs/userfaultfd.c                   |  9 ---------
 fs/verity/enable.c                 |  1 -
 fs/verity/read_metadata.c          |  1 -
 fs/xfs/scrub/common.h              |  7 -------
 fs/xfs/scrub/xfarray.c             |  7 -------
 fs/xfs/xfs_aops.c                  |  1 -
 fs/xfs/xfs_icache.c                |  2 --
 fs/xfs/xfs_iwalk.c                 |  1 -
 139 files changed, 54 insertions(+), 396 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index e1c45341719b..6b2bc1dad8e0 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -568,7 +568,6 @@ static void afs_extend_writeback(struct address_space *mapping,
 		}
 
 		folio_batch_release(&fbatch);
-		cond_resched();
 	} while (!stop);
 
 	*_len = len;
@@ -790,7 +789,6 @@ static int afs_writepages_region(struct address_space *mapping,
 		}
 
 		folio_batch_release(&fbatch);
-		cond_resched();
 	} while (wbc->nr_to_write > 0);
 
 	*_next = start;
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index a4a809efc92f..2adaabd18b6e 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -823,7 +823,6 @@ static int resolve_indirect_refs(struct btrfs_backref_walk_ctx *ctx,
 		prelim_ref_insert(ctx->fs_info, &preftrees->direct, ref, NULL);
 
 		ulist_reinit(parents);
-		cond_resched();
 	}
 out:
 	/*
@@ -879,7 +878,6 @@ static int add_missing_keys(struct btrfs_fs_info *fs_info,
 			btrfs_tree_read_unlock(eb);
 		free_extent_buffer(eb);
 		prelim_ref_insert(fs_info, &preftrees->indirect, ref, NULL);
-		cond_resched();
 	}
 	return 0;
 }
@@ -1676,7 +1674,6 @@ static int find_parent_nodes(struct btrfs_backref_walk_ctx *ctx,
 			 */
 			ref->inode_list = NULL;
 		}
-		cond_resched();
 	}
 
 out:
@@ -1784,7 +1781,6 @@ static int btrfs_find_all_roots_safe(struct btrfs_backref_walk_ctx *ctx)
 		if (!node)
 			break;
 		ctx->bytenr = node->val;
-		cond_resched();
 	}
 
 	ulist_free(ctx->refs);
@@ -1993,7 +1989,6 @@ int btrfs_is_data_extent_shared(struct btrfs_inode *inode, u64 bytenr,
 		}
 		shared.share_count = 0;
 		shared.have_delayed_delete_refs = false;
-		cond_resched();
 	}
 
 	/*
@@ -3424,7 +3419,6 @@ int btrfs_backref_add_tree_node(struct btrfs_trans_handle *trans,
 		struct btrfs_key key;
 		int type;
 
-		cond_resched();
 		eb = btrfs_backref_get_eb(iter);
 
 		key.objectid = iter->bytenr;
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index b2e5107b7cec..fe9f0a23dbb2 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -769,7 +769,6 @@ static int load_extent_tree_free(struct btrfs_caching_control *caching_ctl)
 				btrfs_release_path(path);
 				up_read(&fs_info->commit_root_sem);
 				mutex_unlock(&caching_ctl->mutex);
-				cond_resched();
 				mutex_lock(&caching_ctl->mutex);
 				down_read(&fs_info->commit_root_sem);
 				goto next;
@@ -4066,8 +4065,6 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 			wait_for_alloc = false;
 			spin_unlock(&space_info->lock);
 		}
-
-		cond_resched();
 	} while (wait_for_alloc);
 
 	mutex_lock(&fs_info->chunk_mutex);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 617d4827eec2..09b70b271cd2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5052,7 +5052,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 				 */
 				free_extent_buffer(next);
 				btrfs_release_path(path);
-				cond_resched();
 				goto again;
 			}
 			if (!ret)
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index f2ff4cbe8656..2219c3ccb863 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1326,7 +1326,6 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 			ret = 0;
 			break;
 		}
-		cond_resched();
 	}
 
 	if (ra_allocated)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 68f60d50e1fd..e9d1cef7d030 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4561,7 +4561,6 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
 		spin_unlock(&fs_info->ordered_root_lock);
 		btrfs_destroy_ordered_extents(root);
 
-		cond_resched();
 		spin_lock(&fs_info->ordered_root_lock);
 	}
 	spin_unlock(&fs_info->ordered_root_lock);
@@ -4643,7 +4642,6 @@ static void btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 		}
 		btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 		btrfs_put_delayed_ref_head(head);
-		cond_resched();
 		spin_lock(&delayed_refs->lock);
 	}
 	btrfs_qgroup_destroy_extent_records(trans);
@@ -4759,7 +4757,6 @@ static void btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
 		free_extent_state(cached_state);
 		btrfs_error_unpin_extent_range(fs_info, start, end);
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
-		cond_resched();
 	}
 }
 
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index ff8e117a1ace..39aa803cbb13 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -695,8 +695,6 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (gfpflags_allow_blocking(mask))
-		cond_resched();
 	goto again;
 
 out:
@@ -1189,8 +1187,6 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (gfpflags_allow_blocking(mask))
-		cond_resched();
 	goto again;
 
 out:
@@ -1409,7 +1405,6 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	cond_resched();
 	first_iteration = false;
 	goto again;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc313fce5bbd..33be7bb96872 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1996,7 +1996,6 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
 		}
 
 		btrfs_put_delayed_ref(ref);
-		cond_resched();
 
 		spin_lock(&locked_ref->lock);
 		btrfs_merge_delayed_refs(fs_info, delayed_refs, locked_ref);
@@ -2074,7 +2073,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		 */
 
 		locked_ref = NULL;
-		cond_resched();
 	} while ((nr != -1 && count < nr) || locked_ref);
 
 	return 0;
@@ -2183,7 +2181,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		mutex_unlock(&head->mutex);
 
 		btrfs_put_delayed_ref_head(head);
-		cond_resched();
 		goto again;
 	}
 out:
@@ -2805,7 +2802,6 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		unpin_extent_range(fs_info, start, end, true);
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 		free_extent_state(cached_state);
-		cond_resched();
 	}
 
 	if (btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
@@ -4416,7 +4412,6 @@ static noinline int find_free_extent(struct btrfs_root *root,
 			goto have_block_group;
 		}
 		release_block_group(block_group, ffe_ctl, ffe_ctl->delalloc);
-		cond_resched();
 	}
 	up_read(&space_info->groups_sem);
 
@@ -5037,7 +5032,6 @@ static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
 		if (nread >= wc->reada_count)
 			break;
 
-		cond_resched();
 		bytenr = btrfs_node_blockptr(eb, slot);
 		generation = btrfs_node_ptr_generation(eb, slot);
 
@@ -6039,8 +6033,6 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
 			ret = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	return ret;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index caccd0376342..209911d0e873 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -227,7 +227,6 @@ static void __process_pages_contig(struct address_space *mapping,
 					 page_ops, start, end);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -291,7 +290,6 @@ static noinline int lock_delalloc_pages(struct inode *inode,
 			processed_end = page_offset(page) + PAGE_SIZE - 1;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	return 0;
@@ -401,7 +399,6 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 			      &cached_state);
 		__unlock_for_delalloc(inode, locked_page,
 			      delalloc_start, delalloc_end);
-		cond_resched();
 		goto again;
 	}
 	free_extent_state(cached_state);
@@ -1924,7 +1921,6 @@ int btree_write_cache_pages(struct address_space *mapping,
 			nr_to_write_done = wbc->nr_to_write <= 0;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	if (!scanned && !done) {
 		/*
@@ -2116,7 +2112,6 @@ static int extent_write_cache_pages(struct address_space *mapping,
 					    wbc->nr_to_write <= 0);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	if (!scanned && !done) {
 		/*
@@ -2397,8 +2392,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 
 			/* once for us */
 			free_extent_map(em);
-
-			cond_resched(); /* Allow large-extent preemption. */
 		}
 	}
 	return try_release_extent_state(tree, page, mask);
@@ -2698,7 +2691,6 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
 		last_delalloc_end = delalloc_end;
 		cur_offset = delalloc_end + 1;
 		extent_offset += cur_offset - delalloc_start;
-		cond_resched();
 	}
 
 	/*
@@ -2986,7 +2978,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			/* No more file extent items for this inode. */
 			break;
 		}
-		cond_resched();
 	}
 
 check_eof_delalloc:
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1ce5dd154499..12cc0cfde0ff 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -1252,7 +1252,6 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(path->nodes[0]);
 	if (total_bytes < sums->len) {
 		btrfs_release_path(path);
-		cond_resched();
 		goto again;
 	}
 out:
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 361535c71c0f..541b6c87ddf3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1405,8 +1405,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 
 		btrfs_drop_pages(fs_info, pages, num_pages, pos, copied);
 
-		cond_resched();
-
 		pos += copied;
 		num_written += copied;
 	}
@@ -3376,7 +3374,6 @@ bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
 
 		prev_delalloc_end = delalloc_end;
 		cur_offset = delalloc_end + 1;
-		cond_resched();
 	}
 
 	return ret;
@@ -3654,7 +3651,6 @@ static loff_t find_desired_extent(struct file *file, loff_t offset, int whence)
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 	}
 
 	/* We have an implicit hole from the last extent found up to i_size. */
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 27fad70451aa..c9606fcdc310 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3807,8 +3807,6 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 			ret = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	return ret;
@@ -4000,8 +3998,6 @@ static int trim_bitmaps(struct btrfs_block_group *block_group,
 			ret = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	if (offset >= end)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7814b9d654ce..789569e135cf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1021,7 +1021,6 @@ static void compress_file_range(struct btrfs_work *work)
 			 nr_pages, compress_type);
 	if (start + total_in < end) {
 		start += total_in;
-		cond_resched();
 		goto again;
 	}
 	return;
@@ -3376,7 +3375,6 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
 		run_delayed_iput_locked(fs_info, inode);
 		if (need_resched()) {
 			spin_unlock_irq(&fs_info->delayed_iput_lock);
-			cond_resched();
 			spin_lock_irq(&fs_info->delayed_iput_lock);
 		}
 	}
@@ -4423,7 +4421,6 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
 			 * cache when its usage count hits zero.
 			 */
 			iput(inode);
-			cond_resched();
 			spin_lock(&root->inode_lock);
 			goto again;
 		}
@@ -5135,7 +5132,6 @@ static void evict_inode_truncate_pages(struct inode *inode)
 				 EXTENT_CLEAR_ALL_BITS | EXTENT_DO_ACCOUNTING,
 				 &cached_state);
 
-		cond_resched();
 		spin_lock(&io_tree->lock);
 	}
 	spin_unlock(&io_tree->lock);
@@ -7209,8 +7205,6 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 
 		if (ret)
 			break;
-
-		cond_resched();
 	}
 
 	return ret;
@@ -9269,7 +9263,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 			if (ret || wbc->nr_to_write <= 0)
 				goto out;
 		}
-		cond_resched();
 		spin_lock(&root->delalloc_lock);
 	}
 	spin_unlock(&root->delalloc_lock);
@@ -10065,7 +10058,6 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			break;
 		btrfs_put_ordered_extent(ordered);
 		unlock_extent(io_tree, start, lockend, &cached_state);
-		cond_resched();
 	}
 
 	em = btrfs_get_extent(inode, NULL, 0, start, lockend - start + 1);
@@ -10306,7 +10298,6 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 		if (ordered)
 			btrfs_put_ordered_extent(ordered);
 		unlock_extent(io_tree, start, end, &cached_state);
-		cond_resched();
 	}
 
 	/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 345c449d588c..58463c479c91 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -715,7 +715,6 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 		list_add_tail(&ordered->work_list, &works);
 		btrfs_queue_work(fs_info->flush_workers, &ordered->flush_work);
 
-		cond_resched();
 		spin_lock(&root->ordered_extent_lock);
 		if (nr != U64_MAX)
 			nr--;
@@ -729,7 +728,6 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 		list_del_init(&ordered->work_list);
 		wait_for_completion(&ordered->completion);
 		btrfs_put_ordered_extent(ordered);
-		cond_resched();
 	}
 	mutex_unlock(&root->ordered_extent_mutex);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b99230db3c82..c483648be366 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1926,7 +1926,6 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
 		if (ret)
 			return ret;
 	}
-	cond_resched();
 	return 0;
 }
 
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 65d2bd6910f2..6f599c275dc7 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -569,8 +569,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 			ret = -EINTR;
 			goto out;
 		}
-
-		cond_resched();
 	}
 	ret = 0;
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index c6d4bb8cbe29..7e16a6d953d9 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1094,7 +1094,6 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 	for (i = 0; i < nritems; i++) {
 		struct btrfs_ref ref = { 0 };
 
-		cond_resched();
 		btrfs_item_key_to_cpu(leaf, &key, i);
 		if (key.type != BTRFS_EXTENT_DATA_KEY)
 			continue;
@@ -1531,7 +1530,6 @@ static int invalidate_extent_cache(struct btrfs_root *root,
 	while (1) {
 		struct extent_state *cached_state = NULL;
 
-		cond_resched();
 		iput(inode);
 
 		if (objectid > max_key->objectid)
@@ -2163,7 +2161,6 @@ struct btrfs_root *select_reloc_root(struct btrfs_trans_handle *trans,
 
 	next = node;
 	while (1) {
-		cond_resched();
 		next = walk_up_backref(next, edges, &index);
 		root = next->root;
 
@@ -2286,7 +2283,6 @@ struct btrfs_root *select_one_root(struct btrfs_backref_node *node)
 
 	next = node;
 	while (1) {
-		cond_resched();
 		next = walk_up_backref(next, edges, &index);
 		root = next->root;
 
@@ -2331,7 +2327,6 @@ u64 calcu_metadata_size(struct reloc_control *rc,
 	BUG_ON(reserve && node->processed);
 
 	while (next) {
-		cond_resched();
 		while (1) {
 			if (next->processed && (reserve || next != node))
 				break;
@@ -2426,8 +2421,6 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 	list_for_each_entry(edge, &node->upper, list[LOWER]) {
 		struct btrfs_ref ref = { 0 };
 
-		cond_resched();
-
 		upper = edge->node[UPPER];
 		root = select_reloc_root(trans, rc, upper, edges);
 		if (IS_ERR(root)) {
@@ -2609,7 +2602,6 @@ static void update_processed_blocks(struct reloc_control *rc,
 	int index = 0;
 
 	while (next) {
-		cond_resched();
 		while (1) {
 			if (next->processed)
 				break;
@@ -3508,7 +3500,6 @@ int find_next_extent(struct reloc_control *rc, struct btrfs_path *path,
 	while (1) {
 		bool block_found;
 
-		cond_resched();
 		if (rc->search_start >= last) {
 			ret = 1;
 			break;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b877203f1dc5..4dba0e3b6887 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2046,9 +2046,6 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 			break;
 
 		cur_logical = found_logical + BTRFS_STRIPE_LEN;
-
-		/* Don't hold CPU for too long time */
-		cond_resched();
 	}
 	return ret;
 }
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 3a566150c531..503782af0b35 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -7778,7 +7778,6 @@ static int btrfs_compare_trees(struct btrfs_root *left_root,
 		if (need_resched() ||
 		    rwsem_is_contended(&fs_info->commit_root_sem)) {
 			up_read(&fs_info->commit_root_sem);
-			cond_resched();
 			down_read(&fs_info->commit_root_sem);
 		}
 
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d7e8cd4f140c..e597c5365c71 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1211,7 +1211,6 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 		if (!to_reclaim)
 			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
 		flush_space(fs_info, space_info, to_reclaim, flush, true);
-		cond_resched();
 		spin_lock(&space_info->lock);
 	}
 
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index 1cc86af97dc6..7021025d8535 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -45,7 +45,6 @@ static noinline int process_page_range(struct inode *inode, u64 start, u64 end,
 				folio_put(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 		loops++;
 		if (loops > 100000) {
 			printk(KERN_ERR
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index c780d3729463..ce5cbc12e041 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1115,7 +1115,6 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
 			werr = filemap_fdatawait_range(mapping, start, end);
 		free_extent_state(cached_state);
 		cached_state = NULL;
-		cond_resched();
 		start = end + 1;
 	}
 	return werr;
@@ -1157,7 +1156,6 @@ static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
 			werr = err;
 		free_extent_state(cached_state);
 		cached_state = NULL;
-		cond_resched();
 		start = end + 1;
 	}
 	if (err)
@@ -1507,7 +1505,6 @@ int btrfs_defrag_root(struct btrfs_root *root)
 
 		btrfs_end_transaction(trans);
 		btrfs_btree_balance_dirty(info);
-		cond_resched();
 
 		if (btrfs_fs_closing(info) || ret != -EAGAIN)
 			break;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cbb17b542131..3c215762a07f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2657,11 +2657,9 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 		path->nodes[*level-1] = next;
 		*level = btrfs_header_level(next);
 		path->slots[*level] = 0;
-		cond_resched();
 	}
 	path->slots[*level] = btrfs_header_nritems(path->nodes[*level]);
 
-	cond_resched();
 	return 0;
 }
 
@@ -3898,7 +3896,6 @@ static noinline int log_dir_items(struct btrfs_trans_handle *trans,
 		}
 		if (need_resched()) {
 			btrfs_release_path(path);
-			cond_resched();
 			goto search;
 		}
 	}
@@ -5037,7 +5034,6 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
 		ins_nr++;
 		path->slots[0]++;
 		found_xattrs = true;
-		cond_resched();
 	}
 	if (ins_nr > 0) {
 		ret = copy_items(trans, inode, dst_path, path,
@@ -5135,7 +5131,6 @@ static int btrfs_log_holes(struct btrfs_trans_handle *trans,
 
 		prev_extent_end = btrfs_file_extent_end(path);
 		path->slots[0]++;
-		cond_resched();
 	}
 
 	if (prev_extent_end < i_size) {
@@ -5919,13 +5914,6 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 		} else {
 			break;
 		}
-
-		/*
-		 * We may process many leaves full of items for our inode, so
-		 * avoid monopolizing a cpu for too long by rescheduling while
-		 * not holding locks on any tree.
-		 */
-		cond_resched();
 	}
 	if (ins_nr) {
 		ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
diff --git a/fs/btrfs/uuid-tree.c b/fs/btrfs/uuid-tree.c
index 7c7001f42b14..98890e0d7b24 100644
--- a/fs/btrfs/uuid-tree.c
+++ b/fs/btrfs/uuid-tree.c
@@ -324,7 +324,6 @@ int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info)
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 		leaf = path->nodes[0];
 		slot = path->slots[0];
 		btrfs_item_key_to_cpu(leaf, &key, slot);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b9ef6f54635c..ceda63fcc721 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1689,7 +1689,6 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 			search_start = extent_end;
 next:
 		path->slots[0]++;
-		cond_resched();
 	}
 
 	/*
@@ -4756,7 +4755,6 @@ int btrfs_uuid_scan_kthread(void *data)
 		} else {
 			break;
 		}
-		cond_resched();
 	}
 
 out:
diff --git a/fs/buffer.c b/fs/buffer.c
index 12e9a71c693d..a362b42bc63d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1743,7 +1743,6 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
 			folio_unlock(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 		/* End of range already reached? */
 		if (index > end || !index)
 			break;
diff --git a/fs/cachefiles/cache.c b/fs/cachefiles/cache.c
index 7077f72e6f47..7f078244cc0a 100644
--- a/fs/cachefiles/cache.c
+++ b/fs/cachefiles/cache.c
@@ -299,9 +299,7 @@ static void cachefiles_withdraw_objects(struct cachefiles_cache *cache)
 		fscache_withdraw_cookie(object->cookie);
 		count++;
 		if ((count & 63) == 0) {
-			spin_unlock(&cache->object_list_lock);
-			cond_resched();
-			spin_lock(&cache->object_list_lock);
+			cond_resched_lock(&cache->object_list_lock);
 		}
 	}
 
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 7bf7a5fcc045..3fa8a2ecb299 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -353,7 +353,6 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 		unlock_rename(cache->graveyard, dir);
 		dput(grave);
 		grave = NULL;
-		cond_resched();
 		goto try_again;
 	}
 
diff --git a/fs/cachefiles/volume.c b/fs/cachefiles/volume.c
index 89df0ba8ba5e..6a4d9d87c68c 100644
--- a/fs/cachefiles/volume.c
+++ b/fs/cachefiles/volume.c
@@ -62,7 +62,6 @@ void cachefiles_acquire_volume(struct fscache_volume *vcookie)
 			cachefiles_bury_object(cache, NULL, cache->store, vdentry,
 					       FSCACHE_VOLUME_IS_WEIRD);
 			cachefiles_put_directory(volume->dentry);
-			cond_resched();
 			goto retry;
 		}
 	}
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index f4863078f7fe..f2be2adf5d41 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1375,7 +1375,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 					wait_on_page_writeback(page);
 				}
 				folio_batch_release(&fbatch);
-				cond_resched();
 			}
 		}
 
diff --git a/fs/dax.c b/fs/dax.c
index 93cf6e8d8990..f68e026e6ec4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -986,7 +986,6 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
 		pfn_mkclean_range(pfn, count, index, vma);
-		cond_resched();
 	}
 	i_mmap_unlock_read(mapping);
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 25ac74d30bff..3f5b4adba111 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -619,7 +619,6 @@ static void __dentry_kill(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 	if (likely(can_free))
 		dentry_free(dentry);
-	cond_resched();
 }
 
 static struct dentry *__lock_parent(struct dentry *dentry)
@@ -1629,7 +1628,6 @@ void shrink_dcache_parent(struct dentry *parent)
 			continue;
 		}
 
-		cond_resched();
 		if (!data.found)
 			break;
 		data.victim = NULL;
diff --git a/fs/dlm/ast.c b/fs/dlm/ast.c
index 1f2f70a1b824..d6f36527814f 100644
--- a/fs/dlm/ast.c
+++ b/fs/dlm/ast.c
@@ -261,7 +261,6 @@ void dlm_callback_resume(struct dlm_ls *ls)
 	sum += count;
 	if (!empty) {
 		count = 0;
-		cond_resched();
 		goto more;
 	}
 
diff --git a/fs/dlm/dir.c b/fs/dlm/dir.c
index f6acba4310a7..d8b24f9bb744 100644
--- a/fs/dlm/dir.c
+++ b/fs/dlm/dir.c
@@ -94,8 +94,6 @@ int dlm_recover_directory(struct dlm_ls *ls, uint64_t seq)
 			if (error)
 				goto out_free;
 
-			cond_resched();
-
 			/*
 			 * pick namelen/name pairs out of received buffer
 			 */
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 652c51fbbf76..6bf02cbc5550 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -1713,7 +1713,6 @@ void dlm_scan_rsbs(struct dlm_ls *ls)
 		shrink_bucket(ls, i);
 		if (dlm_locking_stopped(ls))
 			break;
-		cond_resched();
 	}
 }
 
@@ -5227,7 +5226,6 @@ void dlm_recover_purge(struct dlm_ls *ls)
 		}
 		unlock_rsb(r);
 		unhold_rsb(r);
-		cond_resched();
 	}
 	up_write(&ls->ls_root_sem);
 
@@ -5302,7 +5300,6 @@ void dlm_recover_grant(struct dlm_ls *ls)
 		confirm_master(r, 0);
 		unlock_rsb(r);
 		put_rsb(r);
-		cond_resched();
 	}
 
 	if (lkb_count)
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index f7bc22e74db2..494ede3678d6 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -562,7 +562,6 @@ int dlm_lowcomms_connect_node(int nodeid)
 	up_read(&con->sock_lock);
 	srcu_read_unlock(&connections_srcu, idx);
 
-	cond_resched();
 	return 0;
 }
 
@@ -1504,7 +1503,6 @@ static void process_recv_sockets(struct work_struct *work)
 		/* CF_RECV_PENDING cleared */
 		break;
 	case DLM_IO_RESCHED:
-		cond_resched();
 		queue_work(io_workqueue, &con->rwork);
 		/* CF_RECV_PENDING not cleared */
 		break;
@@ -1650,7 +1648,6 @@ static void process_send_sockets(struct work_struct *work)
 		break;
 	case DLM_IO_RESCHED:
 		/* CF_SEND_PENDING not cleared */
-		cond_resched();
 		queue_work(io_workqueue, &con->swork);
 		break;
 	default:
diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index 53917c0aa3c0..6d9b074631ff 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -545,7 +545,6 @@ int dlm_recover_masters(struct dlm_ls *ls, uint64_t seq)
 		else
 			error = recover_master(r, &count, seq);
 		unlock_rsb(r);
-		cond_resched();
 		total++;
 
 		if (error) {
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index b9575957a7c2..3409677acfae 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -41,7 +41,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		iput(toput_inode);
 		toput_inode = inode;
 
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/erofs/utils.c b/fs/erofs/utils.c
index cc6fb9e98899..f32ff29392d1 100644
--- a/fs/erofs/utils.c
+++ b/fs/erofs/utils.c
@@ -93,7 +93,6 @@ struct erofs_workgroup *erofs_insert_workgroup(struct super_block *sb,
 		} else if (!erofs_workgroup_get(pre)) {
 			/* try to legitimize the current in-tree one */
 			xa_unlock(&sbi->managed_pslots);
-			cond_resched();
 			goto repeat;
 		}
 		lockref_put_return(&grp->lockref);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 036f610e044b..20ae6af8a9d6 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -697,8 +697,13 @@ static void z_erofs_cache_invalidate_folio(struct folio *folio,
 	DBG_BUGON(stop > folio_size(folio) || stop < length);
 
 	if (offset == 0 && stop == folio_size(folio))
+		/*
+		 * This looks like a tight loop, but preemption can still
+		 * happen inside z_erofs_cache_release_folio() via its
+		 * spin_unlock() call whenever a reschedule is needed.
+		 */
 		while (!z_erofs_cache_release_folio(folio, GFP_NOFS))
-			cond_resched();
+			;
 }
 
 static const struct address_space_operations z_erofs_cache_aops = {
@@ -1527,7 +1532,6 @@ static struct page *pickup_page_for_submission(struct z_erofs_pcluster *pcl,
 	if (oldpage != cmpxchg(&pcl->compressed_bvecs[nr].page,
 			       oldpage, page)) {
 		erofs_pagepool_add(pagepool, page);
-		cond_resched();
 		goto repeat;
 	}
 out_tocache:
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1d9a71a0c4c1..45794a9da768 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -801,7 +801,6 @@ static void ep_clear_and_put(struct eventpoll *ep)
 		epi = rb_entry(rbp, struct epitem, rbn);
 
 		ep_unregister_pollwait(ep, epi);
-		cond_resched();
 	}
 
 	/*
@@ -816,7 +815,6 @@ static void ep_clear_and_put(struct eventpoll *ep)
 		next = rb_next(rbp);
 		epi = rb_entry(rbp, struct epitem, rbn);
 		ep_remove_safe(ep, epi);
-		cond_resched();
 	}
 
 	dispose = ep_refcount_dec_and_test(ep);
@@ -1039,7 +1037,6 @@ static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long t
 			else
 				toff--;
 		}
-		cond_resched();
 	}
 
 	return NULL;
diff --git a/fs/exec.c b/fs/exec.c
index 6518e33ea813..ca3b25054e3f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -451,7 +451,6 @@ static int count(struct user_arg_ptr argv, int max)
 
 			if (fatal_signal_pending(current))
 				return -ERESTARTNOHAND;
-			cond_resched();
 		}
 	}
 	return i;
@@ -469,7 +468,6 @@ static int count_strings_kernel(const char *const *argv)
 			return -E2BIG;
 		if (fatal_signal_pending(current))
 			return -ERESTARTNOHAND;
-		cond_resched();
 	}
 	return i;
 }
@@ -562,7 +560,6 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
 				ret = -ERESTARTNOHAND;
 				goto out;
 			}
-			cond_resched();
 
 			offset = pos % PAGE_SIZE;
 			if (offset == 0)
@@ -661,7 +658,6 @@ static int copy_strings_kernel(int argc, const char *const *argv,
 			return ret;
 		if (fatal_signal_pending(current))
 			return -ERESTARTNOHAND;
-		cond_resched();
 	}
 	return 0;
 }
diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 6fe3c941b565..1a7baca041cf 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -162,7 +162,6 @@ static int ext4_protect_reserved_inode(struct super_block *sb,
 		return PTR_ERR(inode);
 	num = (inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
 	while (i < num) {
-		cond_resched();
 		map.m_lblk = i;
 		map.m_len = num - i;
 		n = ext4_map_blocks(NULL, inode, &map, 0);
@@ -224,7 +223,6 @@ int ext4_setup_system_zone(struct super_block *sb)
 	for (i=0; i < ngroups; i++) {
 		unsigned int meta_blks = ext4_num_base_meta_blocks(sb, i);
 
-		cond_resched();
 		if (meta_blks != 0) {
 			ret = add_system_zone(system_blks,
 					ext4_group_first_block_no(sb, i),
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 3985f8c33f95..cb7d2427be8b 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -174,7 +174,6 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
 			err = -ERESTARTSYS;
 			goto errout;
 		}
-		cond_resched();
 		offset = ctx->pos & (sb->s_blocksize - 1);
 		map.m_lblk = ctx->pos >> EXT4_BLOCK_SIZE_BITS(sb);
 		map.m_len = 1;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 202c76996b62..79851e582c7d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3001,7 +3001,6 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			}
 			/* Yield here to deal with large extent trees.
 			 * Should be a no-op if we did IO above. */
-			cond_resched();
 			if (WARN_ON(i + 1 > depth)) {
 				err = -EFSCORRUPTED;
 				break;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b65058d972f9..25d78953eec9 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1482,7 +1482,6 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
 		if (!gdp)
 			continue;
 		desc_count += ext4_free_inodes_count(sb, gdp);
-		cond_resched();
 	}
 	return desc_count;
 #endif
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4ce35f1c8b0a..1c3af3a8fe2e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2491,7 +2491,6 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			}
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	mpd->scanned_until_end = 1;
 	if (handle)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1e599305d85f..074b5cdea363 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2843,7 +2843,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
 			int ret = 0;
 
-			cond_resched();
 			if (new_cr != cr) {
 				cr = new_cr;
 				goto repeat;
@@ -3387,7 +3386,6 @@ static int ext4_mb_init_backend(struct super_block *sb)
 	sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
 	EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
 	for (i = 0; i < ngroups; i++) {
-		cond_resched();
 		desc = ext4_get_group_desc(sb, i, NULL);
 		if (desc == NULL) {
 			ext4_msg(sb, KERN_ERR, "can't read descriptor %u", i);
@@ -3746,7 +3744,6 @@ int ext4_mb_release(struct super_block *sb)
 
 	if (sbi->s_group_info) {
 		for (i = 0; i < ngroups; i++) {
-			cond_resched();
 			grinfo = ext4_get_group_info(sb, i);
 			if (!grinfo)
 				continue;
@@ -6034,7 +6031,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 		ret = ext4_mb_discard_group_preallocations(sb, i, &busy);
 		freed += ret;
 		needed -= ret;
-		cond_resched();
 	}
 
 	if (needed > 0 && busy && ++retry < 3) {
@@ -6173,8 +6169,6 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 		while (ar->len &&
 			ext4_claim_free_clusters(sbi, ar->len, ar->flags)) {
 
-			/* let others to free the space */
-			cond_resched();
 			ar->len = ar->len >> 1;
 		}
 		if (!ar->len) {
@@ -6720,7 +6714,6 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 		int is_metadata = flags & EXT4_FREE_BLOCKS_METADATA;
 
 		for (i = 0; i < count; i++) {
-			cond_resched();
 			if (is_metadata)
 				bh = sb_find_get_block(inode->i_sb, block + i);
 			ext4_forget(handle, is_metadata, inode, bh, block + i);
@@ -6959,8 +6952,11 @@ __releases(ext4_group_lock_ptr(sb, e4b->bd_group))
 			return count;
 
 		if (need_resched()) {
+			/*
+			 * A reschedule, if needed, happens implicitly
+			 * when ext4_unlock_group() drops the group lock.
+			 */
 			ext4_unlock_group(sb, e4b->bd_group);
-			cond_resched();
 			ext4_lock_group(sb, e4b->bd_group);
 		}
 
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index bbda587f76b8..2ab27008c4dd 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1255,7 +1255,6 @@ int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
 			err = -ERESTARTSYS;
 			goto errout;
 		}
-		cond_resched();
 		block = dx_get_block(frame->at);
 		ret = htree_dirblock_to_tree(dir_file, dir, block, &hinfo,
 					     start_hash, start_minor_hash);
@@ -1341,7 +1340,6 @@ static int dx_make_map(struct inode *dir, struct buffer_head *bh,
 			map_tail->size = ext4_rec_len_from_disk(de->rec_len,
 								blocksize);
 			count++;
-			cond_resched();
 		}
 		de = ext4_next_entry(de, blocksize);
 	}
@@ -1658,7 +1656,6 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
 		/*
 		 * We deal with the read-ahead logic here.
 		 */
-		cond_resched();
 		if (ra_ptr >= ra_max) {
 			/* Refill the readahead buffer */
 			ra_ptr = 0;
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index e5b47dda3317..fb04e8bccd3c 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -67,7 +67,6 @@ static int ext4_orphan_file_add(handle_t *handle, struct inode *inode)
 				atomic_inc(&oi->of_binfo[i].ob_free_entries);
 				return -ENOSPC;
 			}
-			cond_resched();
 		}
 		while (bdata[j]) {
 			if (++j >= inodes_per_ob) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dbebd8b3127e..170c75323300 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3861,7 +3861,6 @@ static int ext4_lazyinit_thread(void *arg)
 		cur = jiffies;
 		if ((time_after_eq(cur, next_wakeup)) ||
 		    (MAX_JIFFY_OFFSET == next_wakeup)) {
-			cond_resched();
 			continue;
 		}
 
@@ -4226,7 +4225,6 @@ int ext4_calculate_overhead(struct super_block *sb)
 		overhead += blks;
 		if (blks)
 			memset(buf, 0, PAGE_SIZE);
-		cond_resched();
 	}
 
 	/*
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index b0597a539fc5..20ea41b5814c 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -45,7 +45,6 @@ struct page *f2fs_grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
 repeat:
 	page = f2fs_grab_cache_page(mapping, index, false);
 	if (!page) {
-		cond_resched();
 		goto repeat;
 	}
 	f2fs_wait_on_page_writeback(page, META, true, true);
@@ -76,7 +75,6 @@ static struct page *__get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index,
 repeat:
 	page = f2fs_grab_cache_page(mapping, index, false);
 	if (!page) {
-		cond_resched();
 		goto repeat;
 	}
 	if (PageUptodate(page))
@@ -463,7 +461,6 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
 				break;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 stop:
 	if (nwritten)
@@ -1111,9 +1108,13 @@ int f2fs_sync_dirty_inodes(struct f2fs_sb_info *sbi, enum inode_type type,
 			F2FS_I(inode)->cp_task = NULL;
 
 		iput(inode);
-		/* We need to give cpu to another writers. */
+		/*
+		 * We need to give the CPU to other writers, but
+		 * cond_resched_stall() does not guarantee that. Perhaps we
+		 * should explicitly wait on an event or a timeout?
+		 */
 		if (ino == cur_ino)
-			cond_resched();
+			cond_resched_stall();
 		else
 			ino = cur_ino;
 	} else {
@@ -1122,7 +1123,6 @@ int f2fs_sync_dirty_inodes(struct f2fs_sb_info *sbi, enum inode_type type,
 		 * writebacking dentry pages in the freeing inode.
 		 */
 		f2fs_submit_merged_write(sbi, DATA);
-		cond_resched();
 	}
 	goto retry;
 }
@@ -1229,7 +1229,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 		f2fs_quota_sync(sbi->sb, -1);
 		if (locked)
 			up_read(&sbi->sb->s_umount);
-		cond_resched();
 		goto retry_flush_quotas;
 	}
 
@@ -1240,7 +1239,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 		err = f2fs_sync_dirty_inodes(sbi, DIR_INODE, true);
 		if (err)
 			return err;
-		cond_resched();
 		goto retry_flush_quotas;
 	}
 
@@ -1256,7 +1254,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 		err = f2fs_sync_inode_meta(sbi);
 		if (err)
 			return err;
-		cond_resched();
 		goto retry_flush_quotas;
 	}
 
@@ -1273,7 +1270,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 			f2fs_unlock_all(sbi);
 			return err;
 		}
-		cond_resched();
 		goto retry_flush_nodes;
 	}
 
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index d820801f473e..39a2a974e087 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1941,7 +1941,6 @@ void f2fs_invalidate_compress_pages(struct f2fs_sb_info *sbi, nid_t ino)
 			folio_unlock(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	} while (index < end);
 }
 
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 916e317ac925..dfde82cab326 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2105,7 +2105,6 @@ int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	}
 
 prep_next:
-	cond_resched();
 	if (fatal_signal_pending(current))
 		ret = -EINTR;
 	else
@@ -3250,7 +3249,6 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 				goto readd;
 		}
 		release_pages(pages, nr_pages);
-		cond_resched();
 	}
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 	/* flush remained pages in compress cluster */
@@ -3981,7 +3979,6 @@ static int check_swap_activate(struct swap_info_struct *sis,
 	while (cur_lblock < last_lblock && cur_lblock < sis->max) {
 		struct f2fs_map_blocks map;
 retry:
-		cond_resched();
 
 		memset(&map, 0, sizeof(map));
 		map.m_lblk = cur_lblock;
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 8aa29fe2e87b..fc15a05fa807 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -1090,7 +1090,6 @@ static int f2fs_readdir(struct file *file, struct dir_context *ctx)
 			err = -ERESTARTSYS;
 			goto out_free;
 		}
-		cond_resched();
 
 		/* readahead for multi pages of dir */
 		if (npages - n > 1 && !ra_has_index(ra, n))
diff --git a/fs/f2fs/extent_cache.c b/fs/f2fs/extent_cache.c
index 0e2d49140c07..b87946f33a5f 100644
--- a/fs/f2fs/extent_cache.c
+++ b/fs/f2fs/extent_cache.c
@@ -936,7 +936,6 @@ static unsigned int __shrink_extent_tree(struct f2fs_sb_info *sbi, int nr_shrink
 
 		if (node_cnt + tree_cnt >= nr_shrink)
 			goto unlock_out;
-		cond_resched();
 	}
 	mutex_unlock(&eti->extent_tree_lock);
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 6d688e42d89c..073e6fd1986d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2849,8 +2849,12 @@ static inline bool is_idle(struct f2fs_sb_info *sbi, int type)
 static inline void f2fs_radix_tree_insert(struct radix_tree_root *root,
 				unsigned long index, void *item)
 {
+	/*
+	 * Retry the insert in a tight loop. The scheduler
+	 * will preempt us when necessary.
+	 */
 	while (radix_tree_insert(root, index, item))
-		cond_resched();
+		;
 }
 
 #define RAW_IS_INODE(p)	((p)->footer.nid == (p)->footer.ino)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ca5904129b16..0ac3dc5dafee 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -3922,7 +3922,6 @@ static int f2fs_sec_trim_file(struct file *filp, unsigned long arg)
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 	}
 
 	if (len)
@@ -4110,7 +4109,6 @@ static int f2fs_ioc_decompress_file(struct file *filp)
 		count -= cluster_size;
 		page_idx += cluster_size;
 
-		cond_resched();
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
 			break;
@@ -4188,7 +4186,6 @@ static int f2fs_ioc_compress_file(struct file *filp)
 		count -= cluster_size;
 		page_idx += cluster_size;
 
-		cond_resched();
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
 			break;
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index ee2e1dd64f25..8187b6ad119a 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1579,7 +1579,6 @@ static struct page *last_fsync_dnode(struct f2fs_sb_info *sbi, nid_t ino)
 			unlock_page(page);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	return last_page;
 }
@@ -1841,7 +1840,6 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 			}
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 
 		if (ret || marked)
 			break;
@@ -1944,7 +1942,6 @@ void f2fs_flush_inline_data(struct f2fs_sb_info *sbi)
 			unlock_page(page);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -2046,7 +2043,6 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
 				break;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 
 		if (wbc->nr_to_write == 0) {
 			step = 2;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index a8c8232852bb..09667bd8ecf7 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2705,7 +2705,6 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
 		towrite -= tocopy;
 		off += tocopy;
 		data += tocopy;
-		cond_resched();
 	}
 
 	if (len == towrite)
diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index 1db348f8f887..96d9f1632f2a 100644
--- a/fs/fat/fatent.c
+++ b/fs/fat/fatent.c
@@ -741,7 +741,6 @@ int fat_count_free_clusters(struct super_block *sb)
 			if (ops->ent_get(&fatent) == FAT_ENT_FREE)
 				free++;
 		} while (fat_ent_next(sbi, &fatent));
-		cond_resched();
 	}
 	sbi->free_clusters = free;
 	sbi->free_clus_valid = 1;
@@ -822,7 +821,6 @@ int fat_trim_fs(struct inode *inode, struct fstrim_range *range)
 		if (need_resched()) {
 			fatent_brelse(&fatent);
 			unlock_fat(sbi);
-			cond_resched();
 			lock_fat(sbi);
 		}
 	}
diff --git a/fs/file.c b/fs/file.c
index 3e4a4dfa38fc..8ae2cec580a9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -428,10 +428,8 @@ static struct fdtable *close_files(struct files_struct * files)
 		while (set) {
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
-				if (file) {
+				if (file)
 					filp_close(file, files);
-					cond_resched();
-				}
 			}
 			i++;
 			set >>= 1;
@@ -708,11 +706,9 @@ static inline void __range_close(struct files_struct *files, unsigned int fd,
 		if (file) {
 			spin_unlock(&files->file_lock);
 			filp_close(file, files);
-			cond_resched();
 			spin_lock(&files->file_lock);
 		} else if (need_resched()) {
 			spin_unlock(&files->file_lock);
-			cond_resched();
 			spin_lock(&files->file_lock);
 		}
 	}
@@ -845,7 +841,6 @@ void do_close_on_exec(struct files_struct *files)
 			__put_unused_fd(files, fd);
 			spin_unlock(&files->file_lock);
 			filp_close(file, files);
-			cond_resched();
 			spin_lock(&files->file_lock);
 		}
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c1af01b2c42d..bf311aeb058b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1914,7 +1914,6 @@ static long writeback_sb_inodes(struct super_block *sb,
 			 * give up the CPU.
 			 */
 			blk_flush_plug(current->plug, false);
-			cond_resched();
 		}
 
 		/*
@@ -2621,8 +2620,6 @@ static void wait_sb_inodes(struct super_block *sb)
 		 */
 		filemap_fdatawait_keep_errors(mapping);
 
-		cond_resched();
-
 		iput(inode);
 
 		rcu_read_lock();
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index c26d48355cc2..4d5bc99b6301 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -357,7 +357,6 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
 		if (ret > 0)
 			ret = 0;
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	if (!cycled && !done) {
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index ef7017fb6951..2eb057461023 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1592,7 +1592,6 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 			buf_in_tr = false;
 		}
 		gfs2_glock_dq_uninit(rd_gh);
-		cond_resched();
 		goto more_rgrps;
 	}
 out:
@@ -1962,7 +1961,6 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
 	if (current->journal_info) {
 		up_write(&ip->i_rw_mutex);
 		gfs2_trans_end(sdp);
-		cond_resched();
 	}
 	gfs2_quota_unhold(ip);
 out_metapath:
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4a280be229a6..a1eca3d9857c 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -2073,7 +2073,7 @@ static void glock_hash_walk(glock_examiner examiner, const struct gfs2_sbd *sdp)
 		}
 
 		rhashtable_walk_stop(&iter);
-	} while (cond_resched(), gl == ERR_PTR(-EAGAIN));
+	} while (gl == ERR_PTR(-EAGAIN));
 
 	rhashtable_walk_exit(&iter);
 }
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index e5271ae87d1c..7567a29eeb21 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -143,7 +143,6 @@ __acquires(&sdp->sd_ail_lock)
 		ret = write_cache_pages(mapping, wbc, __gfs2_writepage, mapping);
 		if (need_resched()) {
 			blk_finish_plug(plug);
-			cond_resched();
 			blk_start_plug(plug);
 		}
 		spin_lock(&sdp->sd_ail_lock);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 33ca04733e93..8ae07f0871b1 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1774,7 +1774,6 @@ static void gfs2_evict_inodes(struct super_block *sb)
 		iput(toput_inode);
 		toput_inode = inode;
 
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
index d39246865c51..88459fea4548 100644
--- a/fs/hpfs/buffer.c
+++ b/fs/hpfs/buffer.c
@@ -77,8 +77,6 @@ void *hpfs_map_sector(struct super_block *s, unsigned secno, struct buffer_head
 
 	hpfs_prefetch_sectors(s, secno, ahead);
 
-	cond_resched();
-
 	*bhp = bh = sb_bread(s, hpfs_search_hotfix_map(s, secno));
 	if (bh != NULL)
 		return bh->b_data;
@@ -97,8 +95,6 @@ void *hpfs_get_sector(struct super_block *s, unsigned secno, struct buffer_head
 
 	hpfs_lock_assert(s);
 
-	cond_resched();
-
 	if ((*bhp = bh = sb_getblk(s, hpfs_search_hotfix_map(s, secno))) != NULL) {
 		if (!buffer_uptodate(bh)) wait_on_buffer(bh);
 		set_buffer_uptodate(bh);
@@ -118,8 +114,6 @@ void *hpfs_map_4sectors(struct super_block *s, unsigned secno, struct quad_buffe
 
 	hpfs_lock_assert(s);
 
-	cond_resched();
-
 	if (secno & 3) {
 		pr_err("%s(): unaligned read\n", __func__);
 		return NULL;
@@ -168,8 +162,6 @@ void *hpfs_map_4sectors(struct super_block *s, unsigned secno, struct quad_buffe
 void *hpfs_get_4sectors(struct super_block *s, unsigned secno,
                           struct quad_buffer_head *qbh)
 {
-	cond_resched();
-
 	hpfs_lock_assert(s);
 
 	if (secno & 3) {
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 316c4cebd3f3..21da053bdaaa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -689,7 +689,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	if (truncate_op)
@@ -867,8 +866,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		struct folio *folio;
 		unsigned long addr;
 
-		cond_resched();
-
 		/*
 		 * fallocate(2) manpage permits EINTR; we may have been
 		 * interrupted because we are using up too much memory.
diff --git a/fs/inode.c b/fs/inode.c
index 84bc3c76e5cc..f2898988bf40 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -695,7 +695,6 @@ static void dispose_list(struct list_head *head)
 		list_del_init(&inode->i_lru);
 
 		evict(inode);
-		cond_resched();
 	}
 }
 
@@ -737,7 +736,6 @@ void evict_inodes(struct super_block *sb)
 		 */
 		if (need_resched()) {
 			spin_unlock(&sb->s_inode_list_lock);
-			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
@@ -778,7 +776,6 @@ void invalidate_inodes(struct super_block *sb)
 		list_add(&inode->i_lru, &dispose);
 		if (need_resched()) {
 			spin_unlock(&sb->s_inode_list_lock);
-			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2bc0aa23fde3..a76faf26b06e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -927,7 +927,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		if (unlikely(copied != status))
 			iov_iter_revert(i, copied - status);
 
-		cond_resched();
 		if (unlikely(status == 0)) {
 			/*
 			 * A short copy made iomap_write_end() reject the
@@ -1296,8 +1295,6 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
-		cond_resched();
-
 		pos += bytes;
 		written += bytes;
 		length -= bytes;
@@ -1533,10 +1530,8 @@ iomap_finish_ioends(struct iomap_ioend *ioend, int error)
 	completions = iomap_finish_ioend(ioend, error);
 
 	while (!list_empty(&tmp)) {
-		if (completions > IOEND_BATCH_SIZE * 8) {
-			cond_resched();
+		if (completions > IOEND_BATCH_SIZE * 8)
 			completions = 0;
-		}
 		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
 		list_del_init(&ioend->io_list);
 		completions += iomap_finish_ioend(ioend, error);
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 118699fff2f9..1f3c0813d0be 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -457,7 +457,6 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
 	}
 
 	spin_unlock(&journal->j_list_lock);
-	cond_resched();
 
 	if (*nr_to_scan && next_tid)
 		goto again;
@@ -529,7 +528,6 @@ void jbd2_journal_destroy_checkpoint(journal_t *journal)
 		}
 		__jbd2_journal_clean_checkpoint_list(journal, true);
 		spin_unlock(&journal->j_list_lock);
-		cond_resched();
 	}
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8d6f934c3d95..db7052ee0c62 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -729,7 +729,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				bh->b_end_io = journal_end_buffer_io_sync;
 				submit_bh(REQ_OP_WRITE | REQ_SYNC, bh);
 			}
-			cond_resched();
 
 			/* Force a new descriptor to be generated next
                            time round the loop. */
@@ -811,7 +810,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 						    b_assoc_buffers);
 
 		wait_on_buffer(bh);
-		cond_resched();
 
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
@@ -854,7 +852,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 		bh = list_entry(log_bufs.prev, struct buffer_head, b_assoc_buffers);
 		wait_on_buffer(bh);
-		cond_resched();
 
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index c269a7d29a46..fbc419d36cd0 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -509,8 +509,6 @@ static int do_one_pass(journal_t *journal,
 		struct buffer_head *	obh;
 		struct buffer_head *	nbh;
 
-		cond_resched();
-
 		/* If we already know where to stop the log traversal,
 		 * check right now that we haven't gone past the end of
 		 * the log. */
diff --git a/fs/jffs2/build.c b/fs/jffs2/build.c
index 6ae9d6fefb86..4f9539211306 100644
--- a/fs/jffs2/build.c
+++ b/fs/jffs2/build.c
@@ -121,10 +121,8 @@ static int jffs2_build_filesystem(struct jffs2_sb_info *c)
 	c->flags |= JFFS2_SB_FLAG_BUILDING;
 	/* Now scan the directory tree, increasing nlink according to every dirent found. */
 	for_each_inode(i, c, ic) {
-		if (ic->scan_dents) {
+		if (ic->scan_dents)
 			jffs2_build_inode_pass1(c, ic, &dir_hardlinks);
-			cond_resched();
-		}
 	}
 
 	dbg_fsbuild("pass 1 complete\n");
@@ -141,7 +139,6 @@ static int jffs2_build_filesystem(struct jffs2_sb_info *c)
 			continue;
 
 		jffs2_build_remove_unlinked_inode(c, ic, &dead_fds);
-		cond_resched();
 	}
 
 	dbg_fsbuild("pass 2a starting\n");
@@ -209,7 +206,6 @@ static int jffs2_build_filesystem(struct jffs2_sb_info *c)
 			jffs2_free_full_dirent(fd);
 		}
 		ic->scan_dents = NULL;
-		cond_resched();
 	}
 	ret = jffs2_build_xattr_subsystem(c);
 	if (ret)
diff --git a/fs/jffs2/erase.c b/fs/jffs2/erase.c
index acd32f05b519..a2706246a68e 100644
--- a/fs/jffs2/erase.c
+++ b/fs/jffs2/erase.c
@@ -143,8 +143,6 @@ int jffs2_erase_pending_blocks(struct jffs2_sb_info *c, int count)
 			BUG();
 		}
 
-		/* Be nice */
-		cond_resched();
 		mutex_lock(&c->erase_free_sem);
 		spin_lock(&c->erase_completion_lock);
 	}
@@ -387,7 +385,6 @@ static int jffs2_block_check_erase(struct jffs2_sb_info *c, struct jffs2_erasebl
 			}
 		}
 		ofs += readlen;
-		cond_resched();
 	}
 	ret = 0;
 fail:
diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c
index 5c6602f3c189..3ba9054ac63c 100644
--- a/fs/jffs2/gc.c
+++ b/fs/jffs2/gc.c
@@ -923,8 +923,6 @@ static int jffs2_garbage_collect_deletion_dirent(struct jffs2_sb_info *c, struct
 
 		for (raw = f->inocache->nodes; raw != (void *)f->inocache; raw = raw->next_in_ino) {
 
-			cond_resched();
-
 			/* We only care about obsolete ones */
 			if (!(ref_obsolete(raw)))
 				continue;
diff --git a/fs/jffs2/nodelist.c b/fs/jffs2/nodelist.c
index b86c78d178c6..7a56a5fb1637 100644
--- a/fs/jffs2/nodelist.c
+++ b/fs/jffs2/nodelist.c
@@ -578,7 +578,6 @@ void jffs2_kill_fragtree(struct rb_root *root, struct jffs2_sb_info *c)
 		}
 
 		jffs2_free_node_frag(frag);
-		cond_resched();
 	}
 }
 
diff --git a/fs/jffs2/nodemgmt.c b/fs/jffs2/nodemgmt.c
index a7bbe879cfc3..5f9ab75540f4 100644
--- a/fs/jffs2/nodemgmt.c
+++ b/fs/jffs2/nodemgmt.c
@@ -185,8 +185,6 @@ int jffs2_reserve_space(struct jffs2_sb_info *c, uint32_t minsize,
 			} else if (ret)
 				return ret;
 
-			cond_resched();
-
 			if (signal_pending(current))
 				return -EINTR;
 
@@ -227,7 +225,14 @@ int jffs2_reserve_space_gc(struct jffs2_sb_info *c, uint32_t minsize,
 		spin_unlock(&c->erase_completion_lock);
 
 		if (ret == -EAGAIN)
-			cond_resched();
+			/*
+			 * The spin_unlock() above will implicitly reschedule
+			 * if a reschedule is needed.
+			 *
+			 * If we did not reschedule, take a breather here
+			 * before retrying.
+			 */
+			cpu_relax();
 		else
 			break;
 	}
diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c
index 03b4f99614be..f9fc1f6451f8 100644
--- a/fs/jffs2/readinode.c
+++ b/fs/jffs2/readinode.c
@@ -1013,8 +1013,6 @@ static int jffs2_get_inode_nodes(struct jffs2_sb_info *c, struct jffs2_inode_inf
 		valid_ref = jffs2_first_valid_node(ref->next_in_ino);
 		spin_unlock(&c->erase_completion_lock);
 
-		cond_resched();
-
 		/*
 		 * At this point we don't know the type of the node we're going
 		 * to read, so we do not know the size of its header. In order
diff --git a/fs/jffs2/scan.c b/fs/jffs2/scan.c
index 29671e33a171..aaf6b33ba200 100644
--- a/fs/jffs2/scan.c
+++ b/fs/jffs2/scan.c
@@ -143,8 +143,6 @@ int jffs2_scan_medium(struct jffs2_sb_info *c)
 	for (i=0; i<c->nr_blocks; i++) {
 		struct jffs2_eraseblock *jeb = &c->blocks[i];
 
-		cond_resched();
-
 		/* reset summary info for next eraseblock scan */
 		jffs2_sum_reset_collected(s);
 
@@ -621,8 +619,6 @@ static int jffs2_scan_eraseblock (struct jffs2_sb_info *c, struct jffs2_eraseblo
 		if (err)
 			return err;
 
-		cond_resched();
-
 		if (ofs & 3) {
 			pr_warn("Eep. ofs 0x%08x not word-aligned!\n", ofs);
 			ofs = PAD(ofs);
diff --git a/fs/jffs2/summary.c b/fs/jffs2/summary.c
index 4fe64519870f..5a4a6438a966 100644
--- a/fs/jffs2/summary.c
+++ b/fs/jffs2/summary.c
@@ -397,8 +397,6 @@ static int jffs2_sum_process_sum_data(struct jffs2_sb_info *c, struct jffs2_eras
 	for (i=0; i<je32_to_cpu(summary->sum_num); i++) {
 		dbg_summary("processing summary index %d\n", i);
 
-		cond_resched();
-
 		/* Make sure there's a spare ref for dirty space */
 		err = jffs2_prealloc_raw_node_refs(c, jeb, 2);
 		if (err)
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index ce4b4760fcb1..d30011f3e935 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -2833,12 +2833,11 @@ void txQuiesce(struct super_block *sb)
 		mutex_lock(&jfs_ip->commit_mutex);
 		txCommit(tid, 1, &ip, 0);
 		txEnd(tid);
+		/*
+		 * The mutex_unlock() reschedules if needed.
+		 */
 		mutex_unlock(&jfs_ip->commit_mutex);
-		/*
-		 * Just to be safe.  I don't know how
-		 * long we can run without blocking
-		 */
-		cond_resched();
+
 		TXN_LOCK();
 	}
 
@@ -2912,11 +2911,6 @@ int jfs_sync(void *arg)
 				mutex_unlock(&jfs_ip->commit_mutex);
 
 				iput(ip);
-				/*
-				 * Just to be safe.  I don't know how
-				 * long we can run without blocking
-				 */
-				cond_resched();
 				TXN_LOCK();
 			} else {
 				/* We can't get the commit mutex.  It may
diff --git a/fs/libfs.c b/fs/libfs.c
index 37f2d34ee090..c74cecca8557 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -125,9 +125,8 @@ static struct dentry *scan_positives(struct dentry *cursor,
 		if (need_resched()) {
 			list_move(&cursor->d_child, p);
 			p = &cursor->d_child;
-			spin_unlock(&dentry->d_lock);
-			cond_resched();
-			spin_lock(&dentry->d_lock);
+
+			cond_resched_lock(&dentry->d_lock);
 		}
 	}
 	spin_unlock(&dentry->d_lock);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 2a4b8b549e93..451d554d3f55 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -322,7 +322,6 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 		spin_unlock(&cache->c_list_lock);
 		__mb_cache_entry_free(cache, entry);
 		shrunk++;
-		cond_resched();
 		spin_lock(&cache->c_list_lock);
 	}
 	spin_unlock(&cache->c_list_lock);
diff --git a/fs/namei.c b/fs/namei.c
index 94565bd7e73f..e911d7f15dad 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1781,7 +1781,6 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 
 	if (!(nd->flags & LOOKUP_RCU)) {
 		touch_atime(&last->link);
-		cond_resched();
 	} else if (atime_needs_update(&last->link, inode)) {
 		if (!try_to_unlazy(nd))
 			return ERR_PTR(-ECHILD);
diff --git a/fs/netfs/io.c b/fs/netfs/io.c
index 7f753380e047..fe9487237b5d 100644
--- a/fs/netfs/io.c
+++ b/fs/netfs/io.c
@@ -641,7 +641,6 @@ int netfs_begin_read(struct netfs_io_request *rreq, bool sync)
 			netfs_rreq_assess(rreq, false);
 			if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags))
 				break;
-			cond_resched();
 		}
 
 		ret = rreq->error;
diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index cf7365581031..6b5b060b3658 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -650,7 +650,6 @@ static int nfs_server_return_marked_delegations(struct nfs_server *server,
 
 		err = nfs_end_delegation_return(inode, delegation, 0);
 		iput(inode);
-		cond_resched();
 		if (!err)
 			goto restart;
 		set_bit(NFS4CLNT_DELEGRETURN, &server->nfs_client->cl_state);
@@ -1186,7 +1185,6 @@ static int nfs_server_reap_unclaimed_delegations(struct nfs_server *server,
 			nfs_put_delegation(delegation);
 		}
 		iput(inode);
-		cond_resched();
 		goto restart;
 	}
 	rcu_read_unlock();
@@ -1318,7 +1316,6 @@ static int nfs_server_reap_expired_delegations(struct nfs_server *server,
 		put_cred(cred);
 		if (!nfs4_server_rebooted(server->nfs_client)) {
 			iput(inode);
-			cond_resched();
 			goto restart;
 		}
 		nfs_inode_mark_test_expired_delegation(server,inode);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 84343aefbbd6..10db43e1833a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -2665,14 +2665,12 @@ static int pnfs_layout_return_unused_byserver(struct nfs_server *server,
 			spin_unlock(&inode->i_lock);
 			rcu_read_unlock();
 			pnfs_put_layout_hdr(lo);
-			cond_resched();
 			goto restart;
 		}
 		spin_unlock(&inode->i_lock);
 		rcu_read_unlock();
 		pnfs_send_layoutreturn(lo, &stateid, &cred, iomode, false);
 		pnfs_put_layout_hdr(lo);
-		cond_resched();
 		goto restart;
 	}
 	rcu_read_unlock();
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 9d82d50ce0b1..eec3d641998b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1053,7 +1053,6 @@ nfs_scan_commit_list(struct list_head *src, struct list_head *dst,
 		ret++;
 		if ((ret == max) && !cinfo->dreq)
 			break;
-		cond_resched();
 	}
 	return ret;
 }
@@ -1890,8 +1889,6 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 		atomic_long_inc(&NFS_I(data->inode)->redirtied_pages);
 	next:
 		nfs_unlock_and_release_request(req);
-		/* Latency breaker */
-		cond_resched();
 	}
 	nfss = NFS_SERVER(data->inode);
 	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
@@ -1958,7 +1955,6 @@ static int __nfs_commit_inode(struct inode *inode, int how,
 		}
 		if (nscan < INT_MAX)
 			break;
-		cond_resched();
 	}
 	nfs_commit_end(cinfo.mds);
 	if (ret || !may_wait)
diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
index 13592e82eaf6..4ed6d5d23ade 100644
--- a/fs/nilfs2/btree.c
+++ b/fs/nilfs2/btree.c
@@ -2173,7 +2173,6 @@ static void nilfs_btree_lookup_dirty_buffers(struct nilfs_bmap *btree,
 			} while ((bh = bh->b_this_page) != head);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	for (level = NILFS_BTREE_LEVEL_NODE_MIN;
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 1a8bd5993476..a5780f54ac6d 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -1280,7 +1280,6 @@ int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			}
 			blkoff += n;
 		}
-		cond_resched();
 	} while (true);
 
 	/* If ret is 1 then we just hit the end of the extent array */
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index b4e54d079b7d..71c5b6792e5f 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -277,7 +277,6 @@ int nilfs_copy_dirty_pages(struct address_space *dmap,
 		folio_unlock(folio);
 	}
 	folio_batch_release(&fbatch);
-	cond_resched();
 
 	if (likely(!err))
 		goto repeat;
@@ -346,7 +345,6 @@ void nilfs_copy_back_pages(struct address_space *dmap,
 		folio_unlock(folio);
 	}
 	folio_batch_release(&fbatch);
-	cond_resched();
 
 	goto repeat;
 }
@@ -382,7 +380,6 @@ void nilfs_clear_dirty_pages(struct address_space *mapping, bool silent)
 			folio_unlock(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -539,7 +536,6 @@ unsigned long nilfs_find_uncommitted_extent(struct inode *inode,
 	} while (++i < nr_folios);
 
 	folio_batch_release(&fbatch);
-	cond_resched();
 	goto repeat;
 
 out_locked:
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 7ec16879756e..45c65b450119 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -361,7 +361,6 @@ static void nilfs_transaction_lock(struct super_block *sb,
 		nilfs_segctor_do_immediate_flush(sci);
 
 		up_write(&nilfs->ns_segctor_sem);
-		cond_resched();
 	}
 	if (gcflag)
 		ti->ti_flags |= NILFS_TI_GC;
@@ -746,13 +745,11 @@ static size_t nilfs_lookup_dirty_data_buffers(struct inode *inode,
 			ndirties++;
 			if (unlikely(ndirties >= nlimit)) {
 				folio_batch_release(&fbatch);
-				cond_resched();
 				return ndirties;
 			}
 		} while (bh = bh->b_this_page, bh != head);
 	}
 	folio_batch_release(&fbatch);
-	cond_resched();
 	goto repeat;
 }
 
@@ -785,7 +782,6 @@ static void nilfs_lookup_dirty_node_buffers(struct inode *inode,
 			} while (bh != head);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 62fe0b679e58..64a66e1aeac4 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -805,7 +805,6 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
 		 * User can supply arbitrarily large buffer. Avoid softlockups
 		 * in case there are lots of available events.
 		 */
-		cond_resched();
 		event = get_one_event(group, count);
 		if (IS_ERR(event)) {
 			ret = PTR_ERR(event);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..a6aff29204f6 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -79,7 +79,6 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index f79408f9127a..173f6fcfef54 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -2556,7 +2556,6 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		set_page_dirty(page);
 		put_page(page);
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 		if (idx == end)
 			goto done;
 		idx++;
@@ -2597,7 +2596,6 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		unlock_page(page);
 		put_page(page);
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	}
 	/* If there is a last partial page, need to do it the slow way. */
 	if (end_ofs) {
@@ -2614,7 +2612,6 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		set_page_dirty(page);
 		put_page(page);
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	}
 done:
 	ntfs_debug("Done.");
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index cbc545999cfe..a03ad2d7faf7 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -259,7 +259,6 @@ static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size)
 		 * files.
 		 */
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	} while (++index < end_index);
 	read_lock_irqsave(&ni->size_lock, flags);
 	BUG_ON(ni->initialized_size != new_init_size);
@@ -1868,7 +1867,6 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
 			iov_iter_revert(i, copied);
 			break;
 		}
-		cond_resched();
 		if (unlikely(copied < bytes)) {
 			iov_iter_revert(i, copied);
 			if (copied)
diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index 1f7a194983c5..cfb09f47a588 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -158,7 +158,6 @@ static int ntfs_extend_initialized_size(struct file *file,
 			break;
 
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	}
 
 	return 0;
@@ -241,7 +240,6 @@ static int ntfs_zero_range(struct inode *inode, u64 vbo, u64 vbo_to)
 
 		unlock_page(page);
 		put_page(page);
-		cond_resched();
 	}
 out:
 	mark_inode_dirty(inode);
@@ -1005,13 +1003,6 @@ static ssize_t ntfs_compress_write(struct kiocb *iocb, struct iov_iter *from)
 		if (err)
 			goto out;
 
-		/*
-		 * We can loop for a long time in here. Be nice and allow
-		 * us to schedule out to avoid softlocking if preempt
-		 * is disabled.
-		 */
-		cond_resched();
-
 		pos += copied;
 		written += copied;
 
diff --git a/fs/ntfs3/frecord.c b/fs/ntfs3/frecord.c
index dad976a68985..8fa4bb50b0b1 100644
--- a/fs/ntfs3/frecord.c
+++ b/fs/ntfs3/frecord.c
@@ -2265,8 +2265,6 @@ int ni_decompress_file(struct ntfs_inode *ni)
 
 		if (err)
 			goto out;
-
-		cond_resched();
 	}
 
 remove_wof:
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index aef58f1395c8..2fccabc7aa51 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -7637,10 +7637,8 @@ int ocfs2_trim_mainbm(struct super_block *sb, struct fstrim_range *range)
 	 * main_bm related locks for avoiding the current IO starve, then go to
 	 * trim the next group
 	 */
-	if (ret >= 0 && group <= last_group) {
-		cond_resched();
+	if (ret >= 0 && group <= last_group)
 		goto next_group;
-	}
 out:
 	range->len = trimmed * sb->s_blocksize;
 	return ret;
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 960080753d3b..7bf6f46bd429 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -951,7 +951,12 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
 		if (ret == (ssize_t)-EAGAIN) {
 			mlog(0, "sendpage of size %zu to " SC_NODEF_FMT
 			     " returned EAGAIN\n", size, SC_NODEF_ARGS(sc));
-			cond_resched();
+
+			/*
+			 * Take a breather before retrying. Perhaps this
+			 * should instead wait on an event or a timeout?
+			 */
+			cpu_relax();
 			continue;
 		}
 		mlog(ML_ERROR, "sendpage of size %zu to " SC_NODEF_FMT
@@ -1929,7 +1934,6 @@ static void o2net_accept_many(struct work_struct *work)
 		o2net_accept_one(sock, &more);
 		if (!more)
 			break;
-		cond_resched();
 	}
 }
 
diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index eedf07ca23ca..271e0f7405e5 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -792,11 +792,10 @@ static int dlm_thread(void *data)
 		spin_unlock(&dlm->spinlock);
 		dlm_flush_asts(dlm);
 
-		/* yield and continue right away if there is more work to do */
-		if (!n) {
-			cond_resched();
+		/* The unlock above would have yielded if a reschedule was
+		 * needed. Continue right away if there is more work to do. */
+		if (!n)
 			continue;
-		}
 
 		wait_event_interruptible_timeout(dlm->dlm_thread_wq,
 						 !dlm_dirty_list_empty(dlm) ||
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index c45596c25c66..f977337a33db 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -940,6 +940,10 @@ static int ocfs2_zero_extend_range(struct inode *inode, u64 range_start,
 	BUG_ON(range_start >= range_end);
 
 	while (zero_pos < range_end) {
+		/*
+		 * If this is a very long extent, then we might be here
+		 * awhile. We should expect the scheduler to preempt us.
+		 */
 		next_pos = (zero_pos & PAGE_MASK) + PAGE_SIZE;
 		if (next_pos > range_end)
 			next_pos = range_end;
@@ -949,12 +953,6 @@ static int ocfs2_zero_extend_range(struct inode *inode, u64 range_start,
 			break;
 		}
 		zero_pos = next_pos;
-
-		/*
-		 * Very large extends have the potential to lock up
-		 * the cpu for extended periods of time.
-		 */
-		cond_resched();
 	}
 
 	return rc;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..fec3dc6a887d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3532,7 +3532,6 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		char name[10 + 1];
 		unsigned int len;
 
-		cond_resched();
 		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
 			continue;
 
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 6276b3938842..b014c44b96e9 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -272,7 +272,6 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 				     name, len, instantiate, p,
 				     &data))
 			goto out;
-		cond_resched();
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 23fc24d16b31..4625dea20bc6 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -491,7 +491,6 @@ static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
 			page_offline_thaw();
-			cond_resched();
 			page_offline_freeze();
 		}
 
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..14fd181baf57 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -80,8 +80,6 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
 		pfn++;
 		out++;
 		count -= KPMSIZE;
-
-		cond_resched();
 	}
 
 	*ppos += (char __user *)out - buf;
@@ -258,8 +256,6 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
 		pfn++;
 		out++;
 		count -= KPMSIZE;
-
-		cond_resched();
 	}
 
 	*ppos += (char __user *)out - buf;
@@ -313,8 +309,6 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
 		pfn++;
 		out++;
 		count -= KPMSIZE;
-
-		cond_resched();
 	}
 
 	*ppos += (char __user *)out - buf;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3dd5be96691b..49c2ebcb5fd9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -629,7 +629,6 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
 out:
-	cond_resched();
 	return 0;
 }
 
@@ -1210,7 +1209,6 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
 	return 0;
 }
 
@@ -1554,8 +1552,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-	cond_resched();
-
 	return err;
 }
 
@@ -1605,8 +1601,6 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 			frame++;
 	}
 
-	cond_resched();
-
 	return err;
 }
 #else
@@ -1899,7 +1893,6 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap_unlock(orig_pte, ptl);
-	cond_resched();
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 31e897ad5e6a..994d69edf349 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1068,7 +1068,6 @@ static int add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 015bfe4e4524..74b503a46884 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -814,7 +814,6 @@ static int write_ordered_buffers(spinlock_t * lock,
 			if (chunk.nr)
 				write_ordered_chunk(&chunk);
 			wait_on_buffer(bh);
-			cond_resched();
 			spin_lock(lock);
 			goto loop_next;
 		}
@@ -1671,7 +1670,6 @@ static int write_one_transaction(struct super_block *s,
 		}
 next:
 		cn = cn->next;
-		cond_resched();
 	}
 	return ret;
 }
diff --git a/fs/select.c b/fs/select.c
index 0ee55af1a55c..1d05de51c543 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -573,7 +573,6 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
 				*routp = res_out;
 			if (res_ex)
 				*rexp = res_ex;
-			cond_resched();
 		}
 		wait->_qproc = NULL;
 		if (retval || timed_out || signal_pending(current))
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 2108b3b40ce9..da3b31b02b45 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2713,7 +2713,6 @@ static void cifs_extend_writeback(struct address_space *mapping,
 		}
 
 		folio_batch_release(&batch);
-		cond_resched();
 	} while (!stop);
 
 	*_len = len;
@@ -2951,7 +2950,6 @@ static int cifs_writepages_region(struct address_space *mapping,
 		}
 
 		folio_batch_release(&fbatch);		
-		cond_resched();
 	} while (wbc->nr_to_write > 0);
 
 	*_next = start;
diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..0b43bedbf36f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -604,7 +604,6 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 
 	splice_from_pipe_begin(sd);
 	do {
-		cond_resched();
 		ret = splice_from_pipe_next(pipe, sd);
 		if (ret > 0)
 			ret = splice_from_pipe_feed(pipe, sd, actor);
diff --git a/fs/ubifs/budget.c b/fs/ubifs/budget.c
index d76eb7b39f56..b9100c713964 100644
--- a/fs/ubifs/budget.c
+++ b/fs/ubifs/budget.c
@@ -477,7 +477,6 @@ int ubifs_budget_space(struct ubifs_info *c, struct ubifs_budget_req *req)
 	}
 
 	err = make_free_space(c);
-	cond_resched();
 	if (err == -EAGAIN) {
 		dbg_budg("try again");
 		goto again;
diff --git a/fs/ubifs/commit.c b/fs/ubifs/commit.c
index c4fc1047fc07..2fd6aef59b7d 100644
--- a/fs/ubifs/commit.c
+++ b/fs/ubifs/commit.c
@@ -309,7 +309,6 @@ int ubifs_bg_thread(void *info)
 			ubifs_ro_mode(c, err);
 
 		run_bg_commit(c);
-		cond_resched();
 	}
 
 	ubifs_msg(c, "background thread \"%s\" stops", c->bgt_name);
diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index eef9e527d9ff..add4b72fd52f 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -852,7 +852,6 @@ void ubifs_dump_leb(const struct ubifs_info *c, int lnum)
 	       sleb->nodes_cnt, sleb->endpt);
 
 	list_for_each_entry(snod, &sleb->nodes, list) {
-		cond_resched();
 		pr_err("Dumping node at LEB %d:%d len %d\n", lnum,
 		       snod->offs, snod->len);
 		ubifs_dump_node(c, snod->node, c->leb_size - snod->offs);
@@ -1622,8 +1621,6 @@ int dbg_walk_index(struct ubifs_info *c, dbg_leaf_callback leaf_cb,
 	while (1) {
 		int idx;
 
-		cond_resched();
-
 		if (znode_cb) {
 			err = znode_cb(c, znode, priv);
 			if (err) {
@@ -2329,7 +2326,6 @@ int dbg_check_data_nodes_order(struct ubifs_info *c, struct list_head *head)
 		ino_t inuma, inumb;
 		uint32_t blka, blkb;
 
-		cond_resched();
 		sa = container_of(cur, struct ubifs_scan_node, list);
 		sb = container_of(cur->next, struct ubifs_scan_node, list);
 
@@ -2396,7 +2392,6 @@ int dbg_check_nondata_nodes_order(struct ubifs_info *c, struct list_head *head)
 		ino_t inuma, inumb;
 		uint32_t hasha, hashb;
 
-		cond_resched();
 		sa = container_of(cur, struct ubifs_scan_node, list);
 		sb = container_of(cur->next, struct ubifs_scan_node, list);
 
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 2f48c58d47cd..7baa86efa471 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -683,7 +683,6 @@ static int ubifs_readdir(struct file *file, struct dir_context *ctx)
 		kfree(file->private_data);
 		ctx->pos = key_hash_flash(c, &dent->key);
 		file->private_data = dent;
-		cond_resched();
 	}
 
 out:
diff --git a/fs/ubifs/gc.c b/fs/ubifs/gc.c
index 3134d070fcc0..d85bcb64e9a8 100644
--- a/fs/ubifs/gc.c
+++ b/fs/ubifs/gc.c
@@ -109,7 +109,6 @@ static int data_nodes_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct ubifs_scan_node *sa, *sb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -153,7 +152,6 @@ static int nondata_nodes_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct ubifs_scan_node *sa, *sb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -305,7 +303,6 @@ static int move_node(struct ubifs_info *c, struct ubifs_scan_leb *sleb,
 {
 	int err, new_lnum = wbuf->lnum, new_offs = wbuf->offs + wbuf->used;
 
-	cond_resched();
 	err = ubifs_wbuf_write_nolock(wbuf, snod->node, snod->len);
 	if (err)
 		return err;
@@ -695,8 +692,6 @@ int ubifs_garbage_collect(struct ubifs_info *c, int anyway)
 		/* Maybe continue after find and break before find */
 		lp.lnum = -1;
 
-		cond_resched();
-
 		/* Give the commit an opportunity to run */
 		if (ubifs_gc_should_commit(c)) {
 			ret = -EAGAIN;
diff --git a/fs/ubifs/io.c b/fs/ubifs/io.c
index 01d8eb170382..4915ab97f7ce 100644
--- a/fs/ubifs/io.c
+++ b/fs/ubifs/io.c
@@ -683,8 +683,6 @@ int ubifs_bg_wbufs_sync(struct ubifs_info *c)
 	for (i = 0; i < c->jhead_cnt; i++) {
 		struct ubifs_wbuf *wbuf = &c->jheads[i].wbuf;
 
-		cond_resched();
-
 		/*
 		 * If the mutex is locked then wbuf is being changed, so
 		 * synchronization is not necessary.
diff --git a/fs/ubifs/lprops.c b/fs/ubifs/lprops.c
index 6d6cd85c2b4c..57e4d001125a 100644
--- a/fs/ubifs/lprops.c
+++ b/fs/ubifs/lprops.c
@@ -1113,8 +1113,6 @@ static int scan_check_cb(struct ubifs_info *c,
 	list_for_each_entry(snod, &sleb->nodes, list) {
 		int found, level = 0;
 
-		cond_resched();
-
 		if (is_idx == -1)
 			is_idx = (snod->type == UBIFS_IDX_NODE) ? 1 : 0;
 
diff --git a/fs/ubifs/lpt_commit.c b/fs/ubifs/lpt_commit.c
index c4d079328b92..0cadd08f6304 100644
--- a/fs/ubifs/lpt_commit.c
+++ b/fs/ubifs/lpt_commit.c
@@ -1483,7 +1483,6 @@ static int dbg_is_nnode_dirty(struct ubifs_info *c, int lnum, int offs)
 	for (; nnode; nnode = next_nnode(c, nnode, &hght)) {
 		struct ubifs_nbranch *branch;
 
-		cond_resched();
 		if (nnode->parent) {
 			branch = &nnode->parent->nbranch[nnode->iip];
 			if (branch->lnum != lnum || branch->offs != offs)
@@ -1517,7 +1516,6 @@ static int dbg_is_pnode_dirty(struct ubifs_info *c, int lnum, int offs)
 		struct ubifs_pnode *pnode;
 		struct ubifs_nbranch *branch;
 
-		cond_resched();
 		pnode = ubifs_pnode_lookup(c, i);
 		if (IS_ERR(pnode))
 			return PTR_ERR(pnode);
@@ -1673,7 +1671,6 @@ int dbg_check_ltab(struct ubifs_info *c)
 		pnode = ubifs_pnode_lookup(c, i);
 		if (IS_ERR(pnode))
 			return PTR_ERR(pnode);
-		cond_resched();
 	}
 
 	/* Check nodes */
diff --git a/fs/ubifs/orphan.c b/fs/ubifs/orphan.c
index 4909321d84cf..23572f418a8b 100644
--- a/fs/ubifs/orphan.c
+++ b/fs/ubifs/orphan.c
@@ -957,7 +957,6 @@ static int dbg_read_orphans(struct check_info *ci, struct ubifs_scan_leb *sleb)
 	int i, n, err;
 
 	list_for_each_entry(snod, &sleb->nodes, list) {
-		cond_resched();
 		if (snod->type != UBIFS_ORPH_NODE)
 			continue;
 		orph = snod->node;
diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c
index f0d51dd21c9e..6b1bf684ec14 100644
--- a/fs/ubifs/recovery.c
+++ b/fs/ubifs/recovery.c
@@ -638,8 +638,6 @@ struct ubifs_scan_leb *ubifs_recover_leb(struct ubifs_info *c, int lnum,
 		dbg_scan("look at LEB %d:%d (%d bytes left)",
 			 lnum, offs, len);
 
-		cond_resched();
-
 		/*
 		 * Scan quietly until there is an error from which we cannot
 		 * recover
@@ -999,8 +997,6 @@ static int clean_an_unclean_leb(struct ubifs_info *c,
 	while (len >= 8) {
 		int ret;
 
-		cond_resched();
-
 		/* Scan quietly until there is an error */
 		ret = ubifs_scan_a_node(c, buf, len, lnum, offs, quiet);
 
diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
index 4211e4456b1e..9a361d8f998e 100644
--- a/fs/ubifs/replay.c
+++ b/fs/ubifs/replay.c
@@ -305,7 +305,6 @@ static int replay_entries_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct replay_entry *ra, *rb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -332,8 +331,6 @@ static int apply_replay_list(struct ubifs_info *c)
 	list_sort(c, &c->replay_list, &replay_entries_cmp);
 
 	list_for_each_entry(r, &c->replay_list, list) {
-		cond_resched();
-
 		err = apply_replay_entry(c, r);
 		if (err)
 			return err;
@@ -722,8 +719,6 @@ static int replay_bud(struct ubifs_info *c, struct bud_entry *b)
 		u8 hash[UBIFS_HASH_ARR_SZ];
 		int deletion = 0;
 
-		cond_resched();
-
 		if (snod->sqnum >= SQNUM_WATERMARK) {
 			ubifs_err(c, "file system's life ended");
 			goto out_dump;
@@ -1060,8 +1055,6 @@ static int replay_log_leb(struct ubifs_info *c, int lnum, int offs, void *sbuf)
 	}
 
 	list_for_each_entry(snod, &sleb->nodes, list) {
-		cond_resched();
-
 		if (snod->sqnum >= SQNUM_WATERMARK) {
 			ubifs_err(c, "file system's life ended");
 			goto out_dump;
diff --git a/fs/ubifs/scan.c b/fs/ubifs/scan.c
index 84a9157dcc32..db3fc3297d1a 100644
--- a/fs/ubifs/scan.c
+++ b/fs/ubifs/scan.c
@@ -269,8 +269,6 @@ struct ubifs_scan_leb *ubifs_scan(const struct ubifs_info *c, int lnum,
 		dbg_scan("look at LEB %d:%d (%d bytes left)",
 			 lnum, offs, len);
 
-		cond_resched();
-
 		ret = ubifs_scan_a_node(c, buf, len, lnum, offs, quiet);
 		if (ret > 0) {
 			/* Padding bytes or a valid padding node */
diff --git a/fs/ubifs/shrinker.c b/fs/ubifs/shrinker.c
index d00a6f20ac7b..f381f844c321 100644
--- a/fs/ubifs/shrinker.c
+++ b/fs/ubifs/shrinker.c
@@ -125,7 +125,6 @@ static int shrink_tnc(struct ubifs_info *c, int nr, int age, int *contention)
 
 		zprev = znode;
 		znode = ubifs_tnc_levelorder_next(c, c->zroot.znode, znode);
-		cond_resched();
 	}
 
 	return total_freed;
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index b08fb28d16b5..0307d12d29d2 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -949,8 +949,6 @@ static int check_volume_empty(struct ubifs_info *c)
 			c->empty = 0;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	return 0;
diff --git a/fs/ubifs/tnc_commit.c b/fs/ubifs/tnc_commit.c
index a55e04822d16..97218e7d380d 100644
--- a/fs/ubifs/tnc_commit.c
+++ b/fs/ubifs/tnc_commit.c
@@ -857,8 +857,6 @@ static int write_index(struct ubifs_info *c)
 	while (1) {
 		u8 hash[UBIFS_HASH_ARR_SZ];
 
-		cond_resched();
-
 		znode = cnext;
 		idx = c->cbuf + used;
 
diff --git a/fs/ubifs/tnc_misc.c b/fs/ubifs/tnc_misc.c
index 4d686e34e64d..b92d2ca00a0b 100644
--- a/fs/ubifs/tnc_misc.c
+++ b/fs/ubifs/tnc_misc.c
@@ -235,7 +235,6 @@ long ubifs_destroy_tnc_subtree(const struct ubifs_info *c,
 			    !ubifs_zn_dirty(zn->zbranch[n].znode))
 				clean_freed += 1;
 
-			cond_resched();
 			kfree(zn->zbranch[n].znode);
 		}
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 56eaae9dac1a..ad8500e831ba 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -914,7 +914,6 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	mmap_write_lock(mm);
 	prev = NULL;
 	for_each_vma(vmi, vma) {
-		cond_resched();
 		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
 		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
 		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
@@ -1277,7 +1276,6 @@ static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 		seq = read_seqcount_begin(&ctx->refile_seq);
 		need_wakeup = waitqueue_active(&ctx->fault_pending_wqh) ||
 			waitqueue_active(&ctx->fault_wqh);
-		cond_resched();
 	} while (read_seqcount_retry(&ctx->refile_seq, seq));
 	if (need_wakeup)
 		__wake_userfault(ctx, range);
@@ -1392,8 +1390,6 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	basic_ioctls = false;
 	cur = vma;
 	do {
-		cond_resched();
-
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
 		       !!(cur->vm_flags & __VM_UFFD_FLAGS));
 
@@ -1458,7 +1454,6 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 	ret = 0;
 	for_each_vma_range(vmi, vma, end) {
-		cond_resched();
 
 		BUG_ON(!vma_can_userfault(vma, vm_flags));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
@@ -1603,8 +1598,6 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	found = false;
 	cur = vma;
 	do {
-		cond_resched();
-
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
 		       !!(cur->vm_flags & __VM_UFFD_FLAGS));
 
@@ -1629,8 +1622,6 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 
 	ret = 0;
 	for_each_vma_range(vmi, vma, end) {
-		cond_resched();
-
 		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
 
 		/*
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index c284f46d1b53..a13623717dd6 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -152,7 +152,6 @@ static int build_merkle_tree(struct file *filp,
 			err = -EINTR;
 			goto out;
 		}
-		cond_resched();
 	}
 	/* Finish all nonempty pending tree blocks. */
 	for (level = 0; level < num_levels; level++) {
diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
index f58432772d9e..1b0102faae6c 100644
--- a/fs/verity/read_metadata.c
+++ b/fs/verity/read_metadata.c
@@ -71,7 +71,6 @@ static int fsverity_read_merkle_tree(struct inode *inode,
 			err = -EINTR;
 			break;
 		}
-		cond_resched();
 		offs_in_page = 0;
 	}
 	return retval ? retval : err;
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index cabdc0e16838..97022145e888 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -16,13 +16,6 @@ xchk_should_terminate(
 	struct xfs_scrub	*sc,
 	int			*error)
 {
-	/*
-	 * If preemption is disabled, we need to yield to the scheduler every
-	 * few seconds so that we don't run afoul of the soft lockup watchdog
-	 * or RCU stall detector.
-	 */
-	cond_resched();
-
 	if (fatal_signal_pending(current)) {
 		if (*error == 0)
 			*error = -EINTR;
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index f0f532c10a5a..59deed2fae80 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -498,13 +498,6 @@ xfarray_sort_terminated(
 	struct xfarray_sortinfo	*si,
 	int			*error)
 {
-	/*
-	 * If preemption is disabled, we need to yield to the scheduler every
-	 * few seconds so that we don't run afoul of the soft lockup watchdog
-	 * or RCU stall detector.
-	 */
-	cond_resched();
-
 	if ((si->flags & XFARRAY_SORT_KILLABLE) &&
 	    fatal_signal_pending(current)) {
 		if (*error == 0)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 465d7630bb21..cba03bff03ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -171,7 +171,6 @@ xfs_end_io(
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp);
 		xfs_end_ioend(ioend);
-		cond_resched();
 	}
 }
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3c210ac83713..d0ffbf581355 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1716,8 +1716,6 @@ xfs_icwalk_ag(
 		if (error == -EFSCORRUPTED)
 			break;
 
-		cond_resched();
-
 		if (icw && (icw->icw_flags & XFS_ICWALK_FLAG_SCAN_LIMIT)) {
 			icw->icw_scan_limit -= XFS_LOOKUP_BATCH;
 			if (icw->icw_scan_limit <= 0)
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index b3275e8d47b6..908881df15ed 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -420,7 +420,6 @@ xfs_iwalk_ag(
 		struct xfs_inobt_rec_incore	*irec;
 		xfs_ino_t			rec_fsino;
 
-		cond_resched();
 		if (xfs_pwork_want_abort(&iwag->pwork))
 			goto out;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 75/86] treewide: virt: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (16 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 74/86] treewide: fs: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 76/86] treewide: block: " Ankur Arora
                     ` (12 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Stefano Stabellini, Oleksandr Tyshchenko, Paolo Bonzini

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
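
As a rough illustration (a sketch of the behaviour described above
only, not the implementation added earlier in the series),
cond_resched_stall() acts approximately like:

	static inline void cond_resched_stall_sketch(void)
	{
		/* Schedule if we are in a context that allows it and a
		 * reschedule is due; otherwise just relax the CPU.
		 */
		if (!preempt_count() && need_resched())
			schedule();
		else
			cpu_relax();
	}

so a caller spinning on a condition neither burns the CPU flat out nor
depends on a call that reduces to a NOP under full preemption.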

All the cond_resched() calls here are from set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/xen/balloon.c      | 2 --
 drivers/xen/gntdev.c       | 2 --
 drivers/xen/xen-scsiback.c | 9 +++++----
 virt/kvm/pfncache.c        | 2 --
 4 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 586a1673459e..a57e516b36f5 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -550,8 +550,6 @@ static int balloon_thread(void *unused)
 		update_schedule();
 
 		mutex_unlock(&balloon_mutex);
-
-		cond_resched();
 	}
 }
 
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 61faea1f0663..cbf74a2b6a06 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -974,8 +974,6 @@ static long gntdev_ioctl_grant_copy(struct gntdev_priv *priv, void __user *u)
 		ret = gntdev_grant_copy_seg(&batch, &seg, &copy.segments[i].status);
 		if (ret < 0)
 			goto out;
-
-		cond_resched();
 	}
 	if (batch.nr_ops)
 		ret = gntdev_copy(&batch);
diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index 8b77e4c06e43..1ab88ba93166 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -814,9 +814,6 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info,
 			transport_generic_free_cmd(&pending_req->se_cmd, 0);
 			break;
 		}
-
-		/* Yield point for this unbounded loop. */
-		cond_resched();
 	}
 
 	gnttab_page_cache_shrink(&info->free_pages, scsiback_max_buffer_pages);
@@ -831,8 +828,12 @@ static irqreturn_t scsiback_irq_fn(int irq, void *dev_id)
 	int rc;
 	unsigned int eoi_flags = XEN_EOI_FLAG_SPURIOUS;
 
+	/*
+	 * Process cmds in a tight loop.  The scheduler can preempt when
+	 * it needs to.
+	 */
 	while ((rc = scsiback_do_cmd_fn(info, &eoi_flags)) > 0)
-		cond_resched();
+		;
 
 	/* In case of a ring error we keep the event channel masked. */
 	if (!rc)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..cc757d5b4acc 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -178,8 +178,6 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 				gpc_unmap_khva(new_pfn, new_khva);
 
 			kvm_release_pfn_clean(new_pfn);
-
-			cond_resched();
 		}
 
 		/* We always request a writeable mapping */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 76/86] treewide: block: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (17 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 75/86] treewide: virt: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 77/86] treewide: netfilter: " Ankur Arora
                     ` (11 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Tejun Heo, Josef Bacik, Jens Axboe, cgroups, linux-block

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.
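
To make set-2 concrete, a typical open-coded call site looks roughly
like this (placeholder names, not code touched by this patch):

	static void scan_all(struct scan_ctx *c)
	{
		int i;

		spin_lock(&c->lock);
		for (i = 0; i < c->nr; i++) {
			scan_one(c, i);

			/* open-coded variant of cond_resched_lock() */
			spin_unlock(&c->lock);
			cond_resched();
			spin_lock(&c->lock);
		}
		spin_unlock(&c->lock);
	}

The cond_resched_lock(&c->lock) helper expresses the same intent, but
only drops the lock when a reschedule is pending or the lock is
contended.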

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given it's NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the uses here are in set-1 (some right after we give up the
lock, causing an explicit preemption check.)

We can remove all of them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: cgroups@vger.kernel.org
Cc: linux-block@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 block/blk-cgroup.c |  2 --
 block/blk-lib.c    | 11 -----------
 block/blk-mq.c     |  3 ---
 block/blk-zoned.c  |  6 ------
 4 files changed, 22 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4a42ea2972ad..145c378367ec 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -597,7 +597,6 @@ static void blkg_destroy_all(struct gendisk *disk)
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
 			spin_unlock_irq(&q->queue_lock);
-			cond_resched();
 			goto restart;
 		}
 	}
@@ -1234,7 +1233,6 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 			 * need to rescheduling to avoid softlockup.
 			 */
 			spin_unlock_irq(&blkcg->lock);
-			cond_resched();
 			spin_lock_irq(&blkcg->lock);
 			continue;
 		}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..0bb118e9748b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -69,14 +69,6 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		bio->bi_iter.bi_size = req_sects << 9;
 		sector += req_sects;
 		nr_sects -= req_sects;
-
-		/*
-		 * We can loop for a long time in here, if someone does
-		 * full device discards (like mkfs). Be nice and allow
-		 * us to schedule out to avoid softlocking if preempt
-		 * is disabled.
-		 */
-		cond_resched();
 	}
 
 	*biop = bio;
@@ -145,7 +137,6 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 			bio->bi_iter.bi_size = nr_sects << 9;
 			nr_sects = 0;
 		}
-		cond_resched();
 	}
 
 	*biop = bio;
@@ -189,7 +180,6 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 			if (bi_size < sz)
 				break;
 		}
-		cond_resched();
 	}
 
 	*biop = bio;
@@ -336,7 +326,6 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 			bio_put(bio);
 			break;
 		}
-		cond_resched();
 	}
 	blk_finish_plug(&plug);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fafd54dce3c..f45ee6a69700 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1372,7 +1372,6 @@ static void blk_rq_poll_completion(struct request *rq, struct completion *wait)
 {
 	do {
 		blk_hctx_poll(rq->q, rq->mq_hctx, NULL, 0);
-		cond_resched();
 	} while (!completion_done(wait));
 }
 
@@ -4310,7 +4309,6 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		if (!__blk_mq_alloc_map_and_rqs(set, i))
 			goto out_unwind;
-		cond_resched();
 	}
 
 	return 0;
@@ -4425,7 +4423,6 @@ static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
 				__blk_mq_free_map_and_rqs(set, i);
 			return -ENOMEM;
 		}
-		cond_resched();
 	}
 
 done:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 619ee41a51cc..8005f55e22e5 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -208,9 +208,6 @@ static int blkdev_zone_reset_all_emulated(struct block_device *bdev,
 				   gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		sector += zone_sectors;
-
-		/* This may take a while, so be nice to others */
-		cond_resched();
 	}
 
 	if (bio) {
@@ -293,9 +290,6 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
 		bio = blk_next_bio(bio, bdev, 0, op | REQ_SYNC, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		sector += zone_sectors;
-
-		/* This may take a while, so be nice to others */
-		cond_resched();
 	}
 
 	ret = submit_bio_wait(bio);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 77/86] treewide: netfilter: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (18 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 76/86] treewide: block: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 78/86] treewide: net: " Ankur Arora
                     ` (10 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Florian Westphal, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Julian Anastasov, David S. Miller,
	Pablo Neira Ayuso, Jozsef Kadlecsik

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check.)
We can remove all of them.

There's one case where we do "cond_resched(); cpu_relax()" while
spinning on a seqcount. Replace with cond_resched_stall().
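
That is, the wait on the per-cpu seqcount in xt_replace_table() (full
hunk below) goes from

	do {
		cond_resched();
		cpu_relax();
	} while (seq == raw_read_seqcount(s));

to

	do {
		cond_resched_stall();
	} while (seq == raw_read_seqcount(s));

keeping the polite spin without relying on cond_resched() doing
anything under full preemption.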

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 net/netfilter/ipset/ip_set_core.c   | 1 -
 net/netfilter/ipvs/ip_vs_est.c      | 3 ---
 net/netfilter/nf_conncount.c        | 2 --
 net/netfilter/nf_conntrack_core.c   | 3 ---
 net/netfilter/nf_conntrack_ecache.c | 3 ---
 net/netfilter/nf_tables_api.c       | 2 --
 net/netfilter/nft_set_rbtree.c      | 2 --
 net/netfilter/x_tables.c            | 3 +--
 net/netfilter/xt_hashlimit.c        | 1 -
 9 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 35d2f9c9ada0..f584c5e756ae 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1703,7 +1703,6 @@ call_ad(struct net *net, struct sock *ctnl, struct sk_buff *skb,
 		if (retried) {
 			__ip_set_get_netlink(set);
 			nfnl_unlock(NFNL_SUBSYS_IPSET);
-			cond_resched();
 			nfnl_lock(NFNL_SUBSYS_IPSET);
 			__ip_set_put_netlink(set);
 		}
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index c5970ba416ae..5543efeeb3f7 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -622,7 +622,6 @@ static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs)
 			goto unlock;
 		}
 		mutex_unlock(&__ip_vs_mutex);
-		cond_resched();
 	}
 
 unlock:
@@ -681,7 +680,6 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
 
 		if (!ipvs->enable || kthread_should_stop())
 			goto stop;
-		cond_resched();
 
 		diff = ktime_to_ns(ktime_sub(t2, t1));
 		if (diff <= 1 * NSEC_PER_USEC) {
@@ -815,7 +813,6 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 		 * and deleted (releasing kthread contexts)
 		 */
 		mutex_unlock(&__ip_vs_mutex);
-		cond_resched();
 		mutex_lock(&__ip_vs_mutex);
 
 		/* Current kt released ? */
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 5d8ed6c90b7e..e7bc39ca204d 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -473,8 +473,6 @@ static void tree_gc_worker(struct work_struct *work)
 	rcu_read_unlock();
 	local_bh_enable();
 
-	cond_resched();
-
 	spin_lock_bh(&nf_conncount_locks[tree]);
 	if (gc_count < ARRAY_SIZE(gc_nodes))
 		goto next; /* do not bother */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9f6f2e643575..d2f38870bbab 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1563,7 +1563,6 @@ static void gc_worker(struct work_struct *work)
 		 * we will just continue with next hash slot.
 		 */
 		rcu_read_unlock();
-		cond_resched();
 		i++;
 
 		delta_time = nfct_time_stamp - end_time;
@@ -2393,7 +2392,6 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
 		}
 		spin_unlock(lockp);
 		local_bh_enable();
-		cond_resched();
 	}
 
 	return NULL;
@@ -2418,7 +2416,6 @@ static void nf_ct_iterate_cleanup(int (*iter)(struct nf_conn *i, void *data),
 
 		nf_ct_delete(ct, iter_data->portid, iter_data->report);
 		nf_ct_put(ct);
-		cond_resched();
 	}
 	mutex_unlock(&nf_conntrack_mutex);
 }
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index 69948e1d6974..b568e329bf22 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -84,7 +84,6 @@ static enum retry_state ecache_work_evict_list(struct nf_conntrack_net *cnet)
 
 		if (sent++ > 16) {
 			spin_unlock_bh(&cnet->ecache.dying_lock);
-			cond_resched();
 			goto next;
 		}
 	}
@@ -96,8 +95,6 @@ static enum retry_state ecache_work_evict_list(struct nf_conntrack_net *cnet)
 
 		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode);
 		nf_ct_put(ct);
-
-		cond_resched();
 	}
 
 	return ret;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 29c651804cb2..6ff5515d9b17 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3742,8 +3742,6 @@ static int nft_table_validate(struct net *net, const struct nft_table *table)
 		err = nft_chain_validate(&ctx, chain);
 		if (err < 0)
 			return err;
-
-		cond_resched();
 	}
 
 	return 0;
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index e34662f4a71e..9bdf7c0e0831 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -495,8 +495,6 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 		if (fatal_signal_pending(current))
 			return -EINTR;
 
-		cond_resched();
-
 		write_lock_bh(&priv->lock);
 		write_seqcount_begin(&priv->count);
 		err = __nft_rbtree_insert(net, set, rbe, ext);
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 21624d68314f..ab53adf6393d 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1433,8 +1433,7 @@ xt_replace_table(struct xt_table *table,
 
 		if (seq & 1) {
 			do {
-				cond_resched();
-				cpu_relax();
+				cond_resched_stall();
 			} while (seq == raw_read_seqcount(s));
 		}
 	}
diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 0859b8f76764..47a11d49231a 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -372,7 +372,6 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht, bool select
 				dsthash_free(ht, dh);
 		}
 		spin_unlock_bh(&ht->lock);
-		cond_resched();
 	}
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 78/86] treewide: net: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (19 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 77/86] treewide: netfilter: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 79/86] " Ankur Arora
                     ` (9 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Willem de Bruijn, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check.)

We can remove all of them.
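
For example, the common shape of the call sites removed below is
roughly (placeholder names, not a hunk from this patch):

	for (slot = 0; slot <= mask; slot++) {
		spin_lock_bh(&hslot[slot].lock);
		scan_slot(&hslot[slot]);
		spin_unlock_bh(&hslot[slot].lock);

		cond_resched();
	}

The unlock already ends the non-preemptible section, so the scheduler
can preempt a task running beyond its time quantum right there; the
explicit cond_resched() that follows adds nothing.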

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: David Ahern <dsahern@kernel.org> 
Cc: Pablo Neira Ayuso <pablo@netfilter.org> 
Cc: Jozsef Kadlecsik <kadlec@netfilter.org> 
Cc: Florian Westphal <fw@strlen.de> 
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> 
Cc: Jamal Hadi Salim <jhs@mojatatu.com> 
Cc: Cong Wang <xiyou.wangcong@gmail.com> 
Cc: Jiri Pirko <jiri@resnulli.us> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 net/core/dev.c                  | 4 ----
 net/core/neighbour.c            | 1 -
 net/core/net_namespace.c        | 1 -
 net/core/netclassid_cgroup.c    | 1 -
 net/core/rtnetlink.c            | 1 -
 net/core/sock.c                 | 2 --
 net/ipv4/inet_connection_sock.c | 3 ---
 net/ipv4/inet_diag.c            | 1 -
 net/ipv4/inet_hashtables.c      | 1 -
 net/ipv4/inet_timewait_sock.c   | 1 -
 net/ipv4/inetpeer.c             | 1 -
 net/ipv4/netfilter/arp_tables.c | 2 --
 net/ipv4/netfilter/ip_tables.c  | 3 ---
 net/ipv4/nexthop.c              | 1 -
 net/ipv4/tcp_ipv4.c             | 2 --
 net/ipv4/udp.c                  | 2 --
 net/netlink/af_netlink.c        | 1 -
 net/sched/sch_api.c             | 3 ---
 net/socket.c                    | 2 --
 19 files changed, 33 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9f3f8930c691..467715278307 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6304,7 +6304,6 @@ void napi_busy_loop(unsigned int napi_id,
 			if (!IS_ENABLED(CONFIG_PREEMPT_RT))
 				preempt_enable();
 			rcu_read_unlock();
-			cond_resched();
 			if (loop_end(loop_end_arg, start_time))
 				return;
 			goto restart;
@@ -6709,8 +6708,6 @@ static int napi_threaded_poll(void *data)
 
 			if (!repoll)
 				break;
-
-			cond_resched();
 		}
 	}
 	return 0;
@@ -11478,7 +11475,6 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list)
 	rtnl_lock();
 	list_for_each_entry(net, net_list, exit_list) {
 		default_device_exit_net(net);
-		cond_resched();
 	}
 
 	list_for_each_entry(net, net_list, exit_list) {
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index df81c1f0a570..86584a2ace2f 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1008,7 +1008,6 @@ static void neigh_periodic_work(struct work_struct *work)
 		 * grows while we are preempted.
 		 */
 		write_unlock_bh(&tbl->lock);
-		cond_resched();
 		write_lock_bh(&tbl->lock);
 		nht = rcu_dereference_protected(tbl->nht,
 						lockdep_is_held(&tbl->lock));
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f4183c4c1ec8..5533e8268b30 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -168,7 +168,6 @@ static void ops_exit_list(const struct pernet_operations *ops,
 	if (ops->exit) {
 		list_for_each_entry(net, net_exit_list, exit_list) {
 			ops->exit(net);
-			cond_resched();
 		}
 	}
 	if (ops->exit_batch)
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index d6a70aeaa503..7162c3d30f1b 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -92,7 +92,6 @@ static void update_classid_task(struct task_struct *p, u32 classid)
 		task_lock(p);
 		fd = iterate_fd(p->files, fd, update_classid_sock, &ctx);
 		task_unlock(p);
-		cond_resched();
 	} while (fd);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 53c377d054f0..c4ff7b21f906 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -140,7 +140,6 @@ void __rtnl_unlock(void)
 		struct sk_buff *next = head->next;
 
 		kfree_skb(head);
-		cond_resched();
 		head = next;
 	}
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 16584e2dd648..c91f9fc687ba 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2982,8 +2982,6 @@ void __release_sock(struct sock *sk)
 			skb_mark_not_on_list(skb);
 			sk_backlog_rcv(sk, skb);
 
-			cond_resched();
-
 			skb = next;
 		} while (skb != NULL);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 394a498c2823..49b90cf913a0 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -389,7 +389,6 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 		goto success;
 next_port:
 		spin_unlock_bh(&head->lock);
-		cond_resched();
 	}
 
 	offset--;
@@ -1420,8 +1419,6 @@ void inet_csk_listen_stop(struct sock *sk)
 		bh_unlock_sock(child);
 		local_bh_enable();
 		sock_put(child);
-
-		cond_resched();
 	}
 	if (queue->fastopenq.rskq_rst_head) {
 		/* Free all the reqs queued in rskq_rst_head. */
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index e13a84433413..45d3c9027355 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -1147,7 +1147,6 @@ void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb,
 		}
 		if (res < 0)
 			break;
-		cond_resched();
 		if (accum == SKARR_SZ) {
 			s_num = num + 1;
 			goto next_chunk;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 598c1b114d2c..47f86ce00704 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1080,7 +1080,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		goto ok;
 next_port:
 		spin_unlock_bh(&head->lock);
-		cond_resched();
 	}
 
 	offset++;
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index dd37a5bf6881..519c77bc15ec 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -288,7 +288,6 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
 	for (slot = 0; slot <= hashinfo->ehash_mask; slot++) {
 		struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 restart_rcu:
-		cond_resched();
 		rcu_read_lock();
 restart:
 		sk_nulls_for_each_rcu(sk, node, &head->chain) {
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index e9fed83e9b3c..d32a70c27cbe 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -300,7 +300,6 @@ void inetpeer_invalidate_tree(struct inet_peer_base *base)
 		p = rb_next(p);
 		rb_erase(&peer->rb_node, &base->rb_root);
 		inet_putpeer(peer);
-		cond_resched();
 	}
 
 	base->total = 0;
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 2407066b0fec..3f8c9c4f3ce0 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -622,7 +622,6 @@ static void get_counters(const struct xt_table_info *t,
 
 			ADD_COUNTER(counters[i], bcnt, pcnt);
 			++i;
-			cond_resched();
 		}
 	}
 }
@@ -642,7 +641,6 @@ static void get_old_counters(const struct xt_table_info *t,
 			ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
 			++i;
 		}
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 7da1df4997d0..f8b7ae5106be 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -761,7 +761,6 @@ get_counters(const struct xt_table_info *t,
 
 			ADD_COUNTER(counters[i], bcnt, pcnt);
 			++i; /* macro does multi eval of i */
-			cond_resched();
 		}
 	}
 }
@@ -781,8 +780,6 @@ static void get_old_counters(const struct xt_table_info *t,
 			ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
 			++i; /* macro does multi eval of i */
 		}
-
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index bbff68b5b5d4..d0f009aea17e 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2424,7 +2424,6 @@ static void flush_all_nexthops(struct net *net)
 	while ((node = rb_first(root))) {
 		nh = rb_entry(node, struct nexthop, rb_node);
 		remove_nexthop(net, nh, NULL);
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4167e8a48b60..d2542780447c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2449,8 +2449,6 @@ static void *established_get_first(struct seq_file *seq)
 		struct hlist_nulls_node *node;
 		spinlock_t *lock = inet_ehash_lockp(hinfo, st->bucket);
 
-		cond_resched();
-
 		/* Lockless fast path for the common case of empty buckets */
 		if (empty_bucket(hinfo, st))
 			continue;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f39b9c844580..e01eca44559b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -281,7 +281,6 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 				snum += rand;
 			} while (snum != first);
 			spin_unlock_bh(&hslot->lock);
-			cond_resched();
 		} while (++first != last);
 		goto fail;
 	} else {
@@ -1890,7 +1889,6 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	kfree_skb(skb);
 
 	/* starting over for a new packet, but check if we need to yield */
-	cond_resched();
 	msg->msg_flags &= ~MSG_TRUNC;
 	goto try_again;
 }
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index eb086b06d60d..4e2ed0c5cf6e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -843,7 +843,6 @@ static int netlink_autobind(struct socket *sock)
 	bool ok;
 
 retry:
-	cond_resched();
 	rcu_read_lock();
 	ok = !__netlink_lookup(table, portid, net);
 	rcu_read_unlock();
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index e9eaf637220e..06ec50c52ea8 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -772,7 +772,6 @@ static u32 qdisc_alloc_handle(struct net_device *dev)
 			autohandle = TC_H_MAKE(0x80000000U, 0);
 		if (!qdisc_lookup(dev, autohandle))
 			return autohandle;
-		cond_resched();
 	} while	(--i > 0);
 
 	return 0;
@@ -923,7 +922,6 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 	u32 block_index;
 	__u32 qlen;
 
-	cond_resched();
 	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*tcm), flags);
 	if (!nlh)
 		goto out_nlmsg_trim;
@@ -1888,7 +1886,6 @@ static int tc_fill_tclass(struct sk_buff *skb, struct Qdisc *q,
 	struct gnet_dump d;
 	const struct Qdisc_class_ops *cl_ops = q->ops->cl_ops;
 
-	cond_resched();
 	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*tcm), flags);
 	if (!nlh)
 		goto out_nlmsg_trim;
diff --git a/net/socket.c b/net/socket.c
index c4a6f5532955..d6499c7c7869 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2709,7 +2709,6 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		++datagrams;
 		if (msg_data_left(&msg_sys))
 			break;
-		cond_resched();
 	}
 
 	fput_light(sock->file, fput_needed);
@@ -2944,7 +2943,6 @@ static int do_recvmmsg(int fd, struct mmsghdr __user *mmsg,
 		/* Out of band data, return right away */
 		if (msg_sys.msg_flags & MSG_OOB)
 			break;
-		cond_resched();
 	}
 
 	if (err == 0)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 79/86] treewide: net: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (20 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 78/86] treewide: net: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-08 12:16     ` Eric Dumazet
  2023-11-07 23:08   ` [RFC PATCH 80/86] treewide: sound: " Ankur Arora
                     ` (8 subsequent siblings)
  30 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Marek Lindner, Simon Wunderlich, Antonio Quartulli,
	Sven Eckelmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, Nikolay Aleksandrov, David Ahern,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Marcelo Ricardo Leitner, Xin Long, Trond Myklebust,
	Anna Schumaker, Jon Maloy, Ying Xue, Martin Schiller

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a
lock or enable bottom-halves, causing an explicit preemption check.)

The set-1 uses we can simply remove; the remaining wait and retry
loops are converted instead.
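
Those conversions (the rds_tcp_accept_worker(), tipc_exit_net() and
similar hunks below) end up with roughly this shape, where done()
stands in for the real exit condition:

	while (!done())
		cond_resched_stall();

that is, a wait that schedules when it can and cpu_relax()es when it
cannot, rather than a cond_resched() that may be a NOP.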

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Marek Lindner <mareklindner@neomailbox.ch> 
Cc: Simon Wunderlich <sw@simonwunderlich.de> 
Cc: Antonio Quartulli <a@unstable.cc> 
Cc: Sven Eckelmann <sven@narfation.org> 
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: Roopa Prabhu <roopa@nvidia.com> 
Cc: Nikolay Aleksandrov <razor@blackwall.org> 
Cc: David Ahern <dsahern@kernel.org> 
Cc: Pablo Neira Ayuso <pablo@netfilter.org> 
Cc: Jozsef Kadlecsik <kadlec@netfilter.org> 
Cc: Florian Westphal <fw@strlen.de> 
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> 
Cc: Matthieu Baerts <matttbe@kernel.org> 
Cc: Mat Martineau <martineau@kernel.org> 
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> 
Cc: Xin Long <lucien.xin@gmail.com> 
Cc: Trond Myklebust <trond.myklebust@hammerspace.com> 
Cc: Anna Schumaker <anna@kernel.org> 
Cc: Jon Maloy <jmaloy@redhat.com> 
Cc: Ying Xue <ying.xue@windriver.com> 
Cc: Martin Schiller <ms@dev.tdt.de> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 net/batman-adv/tp_meter.c       |  2 --
 net/bpf/test_run.c              |  1 -
 net/bridge/br_netlink.c         |  1 -
 net/ipv6/fib6_rules.c           |  1 -
 net/ipv6/netfilter/ip6_tables.c |  2 --
 net/ipv6/udp.c                  |  2 --
 net/mptcp/mptcp_diag.c          |  2 --
 net/mptcp/pm_netlink.c          |  5 -----
 net/mptcp/protocol.c            |  1 -
 net/rds/ib_recv.c               |  2 --
 net/rds/tcp.c                   |  2 +-
 net/rds/threads.c               |  1 -
 net/rxrpc/call_object.c         |  2 +-
 net/sctp/socket.c               |  1 -
 net/sunrpc/cache.c              | 11 +++++++++--
 net/sunrpc/sched.c              |  2 +-
 net/sunrpc/svc_xprt.c           |  1 -
 net/sunrpc/xprtsock.c           |  2 --
 net/tipc/core.c                 |  2 +-
 net/tipc/topsrv.c               |  3 ---
 net/unix/af_unix.c              |  5 ++---
 net/x25/af_x25.c                |  1 -
 22 files changed, 15 insertions(+), 37 deletions(-)

diff --git a/net/batman-adv/tp_meter.c b/net/batman-adv/tp_meter.c
index 7f3dd3c393e0..a0b160088c33 100644
--- a/net/batman-adv/tp_meter.c
+++ b/net/batman-adv/tp_meter.c
@@ -877,8 +877,6 @@ static int batadv_tp_send(void *arg)
 		/* right-shift the TWND */
 		if (!err)
 			tp_vars->last_sent += payload_len;
-
-		cond_resched();
 	}
 
 out:
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0841f8d82419..f4558fdfdf74 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -81,7 +81,6 @@ static bool bpf_test_timer_continue(struct bpf_test_timer *t, int iterations,
 		/* During iteration: we need to reschedule between runs. */
 		t->time_spent += ktime_get_ns() - t->time_start;
 		bpf_test_timer_leave(t);
-		cond_resched();
 		bpf_test_timer_enter(t);
 	}
 
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 10f0d33d8ccf..f326b034245f 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -780,7 +780,6 @@ int br_process_vlan_info(struct net_bridge *br,
 					       v - 1, rtm_cmd);
 				v_change_start = 0;
 			}
-			cond_resched();
 		}
 		/* v_change_start is set only if the last/whole range changed */
 		if (v_change_start)
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 7c2003833010..528e6a582c21 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -500,7 +500,6 @@ static void __net_exit fib6_rules_net_exit_batch(struct list_head *net_list)
 	rtnl_lock();
 	list_for_each_entry(net, net_list, exit_list) {
 		fib_rules_unregister(net->ipv6.fib6_rules_ops);
-		cond_resched();
 	}
 	rtnl_unlock();
 }
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index fd9f049d6d41..704f14c4146f 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -778,7 +778,6 @@ get_counters(const struct xt_table_info *t,
 
 			ADD_COUNTER(counters[i], bcnt, pcnt);
 			++i;
-			cond_resched();
 		}
 	}
 }
@@ -798,7 +797,6 @@ static void get_old_counters(const struct xt_table_info *t,
 			ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
 			++i;
 		}
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 86b5d509a468..032d4f7e6ed3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -443,8 +443,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}
 	kfree_skb(skb);
 
-	/* starting over for a new packet, but check if we need to yield */
-	cond_resched();
 	msg->msg_flags &= ~MSG_TRUNC;
 	goto try_again;
 }
diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c
index 8df1bdb647e2..82bf16511476 100644
--- a/net/mptcp/mptcp_diag.c
+++ b/net/mptcp/mptcp_diag.c
@@ -141,7 +141,6 @@ static void mptcp_diag_dump_listeners(struct sk_buff *skb, struct netlink_callba
 		spin_unlock(&ilb->lock);
 		rcu_read_unlock();
 
-		cond_resched();
 		diag_ctx->l_num = 0;
 	}
 
@@ -190,7 +189,6 @@ static void mptcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			diag_ctx->s_num--;
 			break;
 		}
-		cond_resched();
 	}
 
 	if ((r->idiag_states & TCPF_LISTEN) && r->id.idiag_dport == 0)
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 9661f3812682..b48d2636ce8d 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1297,7 +1297,6 @@ static int mptcp_nl_add_subflow_or_signal_addr(struct net *net)
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1443,7 +1442,6 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1478,7 +1476,6 @@ static int mptcp_nl_remove_id_zero_address(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1594,7 +1591,6 @@ static void mptcp_nl_remove_addrs_list(struct net *net,
 		}
 
 		sock_put(sk);
-		cond_resched();
 	}
 }
 
@@ -1878,7 +1874,6 @@ static int mptcp_nl_set_flags(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return ret;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 886ab689a8ae..8c4a51903b23 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3383,7 +3383,6 @@ static void mptcp_release_cb(struct sock *sk)
 		if (flags & BIT(MPTCP_RETRANSMIT))
 			__mptcp_retrans(sk);
 
-		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
 	}
 
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index e53b7f266bd7..d2111e895a10 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -459,8 +459,6 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
 	    rds_ib_ring_empty(&ic->i_recv_ring))) {
 		queue_delayed_work(rds_wq, &conn->c_recv_w, 1);
 	}
-	if (can_wait)
-		cond_resched();
 }
 
 /*
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2dba7505b414..9b4d07235904 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -530,7 +530,7 @@ static void rds_tcp_accept_worker(struct work_struct *work)
 					       rds_tcp_accept_w);
 
 	while (rds_tcp_accept_one(rtn->rds_tcp_listen_sock) == 0)
-		cond_resched();
+		cond_resched_stall();
 }
 
 void rds_tcp_accept_work(struct sock *sk)
diff --git a/net/rds/threads.c b/net/rds/threads.c
index 1f424cbfcbb4..2a75b48769e8 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -198,7 +198,6 @@ void rds_send_worker(struct work_struct *work)
 	if (rds_conn_path_state(cp) == RDS_CONN_UP) {
 		clear_bit(RDS_LL_SEND_FULL, &cp->cp_flags);
 		ret = rds_send_xmit(cp);
-		cond_resched();
 		rdsdebug("conn %p ret %d\n", cp->cp_conn, ret);
 		switch (ret) {
 		case -EAGAIN:
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 773eecd1e979..d2704a492a3c 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -755,7 +755,7 @@ void rxrpc_destroy_all_calls(struct rxrpc_net *rxnet)
 			       call->flags, call->events);
 
 			spin_unlock(&rxnet->call_lock);
-			cond_resched();
+			cpu_relax();
 			spin_lock(&rxnet->call_lock);
 		}
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 7f89e43154c0..448112919848 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -8364,7 +8364,6 @@ static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 			break;
 		next:
 			spin_unlock_bh(&head->lock);
-			cond_resched();
 		} while (--remaining > 0);
 
 		/* Exhausted local port range during search? */
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 95ff74706104..3bcacfbbf35f 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -521,10 +521,17 @@ static void do_cache_clean(struct work_struct *work)
  */
 void cache_flush(void)
 {
+	/*
+	 * We call cache_clean() in what is seemingly a tight loop. But,
+	 * the scheduler can always preempt us when we give up the spinlock
+	 * in cache_clean().
+	 */
+
 	while (cache_clean() != -1)
-		cond_resched();
+		;
+
 	while (cache_clean() != -1)
-		cond_resched();
+		;
 }
 EXPORT_SYMBOL_GPL(cache_flush);
 
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 6debf4fd42d4..5b7a3c8a271f 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -950,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
 		 * Lockless check for whether task is sleeping or not.
 		 */
 		if (!RPC_IS_QUEUED(task)) {
-			cond_resched();
+			cond_resched_stall();
 			continue;
 		}
 
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 4cfe9640df48..d2486645d725 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -851,7 +851,6 @@ void svc_recv(struct svc_rqst *rqstp)
 		goto out;
 
 	try_to_freeze();
-	cond_resched();
 	if (kthread_should_stop())
 		goto out;
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index a15bf2ede89b..50c1f2556b3e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -776,7 +776,6 @@ static void xs_stream_data_receive(struct sock_xprt *transport)
 		if (ret < 0)
 			break;
 		read += ret;
-		cond_resched();
 	}
 	if (ret == -ESHUTDOWN)
 		kernel_sock_shutdown(transport->sock, SHUT_RDWR);
@@ -1412,7 +1411,6 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
 			break;
 		xs_udp_data_read_skb(&transport->xprt, sk, skb);
 		consume_skb(skb);
-		cond_resched();
 	}
 	xs_poll_check_readable(transport);
 out:
diff --git a/net/tipc/core.c b/net/tipc/core.c
index 434e70eabe08..ed4cd5faa387 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -119,7 +119,7 @@ static void __net_exit tipc_exit_net(struct net *net)
 	tipc_crypto_stop(&tipc_net(net)->crypto_tx);
 #endif
 	while (atomic_read(&tn->wq_count))
-		cond_resched();
+		cond_resched_stall();
 }
 
 static void __net_exit tipc_pernet_pre_exit(struct net *net)
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 8ee0c07d00e9..13cd3816fb52 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -277,7 +277,6 @@ static void tipc_conn_send_to_sock(struct tipc_conn *con)
 			ret = kernel_sendmsg(con->sock, &msg, &iov,
 					     1, sizeof(*evt));
 			if (ret == -EWOULDBLOCK || ret == 0) {
-				cond_resched();
 				return;
 			} else if (ret < 0) {
 				return tipc_conn_close(con);
@@ -288,7 +287,6 @@ static void tipc_conn_send_to_sock(struct tipc_conn *con)
 
 		/* Don't starve users filling buffers */
 		if (++count >= MAX_SEND_MSG_COUNT) {
-			cond_resched();
 			count = 0;
 		}
 		spin_lock_bh(&con->outqueue_lock);
@@ -426,7 +424,6 @@ static void tipc_conn_recv_work(struct work_struct *work)
 
 		/* Don't flood Rx machine */
 		if (++count >= MAX_RECV_MSG_COUNT) {
-			cond_resched();
 			count = 0;
 		}
 	}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..bb1367f93db2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1184,10 +1184,9 @@ static int unix_autobind(struct sock *sk)
 		unix_table_double_unlock(net, old_hash, new_hash);
 
 		/* __unix_find_socket_byname() may take long time if many names
-		 * are already in use.
+		 * are already in use. The unlock above would have allowed the
+		 * scheduler to preempt if preemption was needed.
 		 */
-		cond_resched();
-
 		if (ordernum == lastnum) {
 			/* Give up if all names seems to be in use. */
 			err = -ENOSPC;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 0fb5143bec7a..2a6b05bcb53d 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -343,7 +343,6 @@ static unsigned int x25_new_lci(struct x25_neigh *nb)
 			lci = 0;
 			break;
 		}
-		cond_resched();
 	}
 
 	return lci;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 80/86] treewide: sound: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (21 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 79/86] " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 81/86] treewide: md: " Ankur Arora
                     ` (7 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Jaroslav Kysela, Takashi Iwai

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched() (a sketch of this pattern follows the list).

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.
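
To make set-2 concrete, such an open-coded variant typically looks
something like the sketch below (illustrative only; foo_lock, foo_list
and process_one() are placeholders, not code touched by this series):

  /* foo_lock, foo_list and process_one() are placeholders */
  static void foo_drain(void)
  {
  	spin_lock(&foo_lock);
  	while (!list_empty(&foo_list)) {
  		process_one(&foo_list);

  		/* open-coded cond_resched_lock(&foo_lock) */
  		if (need_resched() || spin_needbreak(&foo_lock)) {
  			spin_unlock(&foo_lock);
  			cond_resched();
  			spin_lock(&foo_lock);
  		}
  	}
  	spin_unlock(&foo_lock);
  }

cond_resched_lock(&foo_lock) folds that unlock/resched/relock dance
into a single call.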

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
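
As a rough illustration of those semantics (this is only a sketch; the
actual helper is introduced earlier in this series and may differ):

  static __always_inline void cond_resched_stall(void)
  {
  	/* reschedule when we safely can, otherwise just relax the CPU */
  	if (preemptible() && need_resched())
  		schedule();
  	else
  		cpu_relax();
  }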

Most uses here are from set-1, sprinkled through extended loops.
Remove them.

In addition, there are a few set-3 cases in the neighbourhood of
HW register access. Replace those instances with cond_resched_stall().

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 sound/arm/aaci.c                    | 2 +-
 sound/core/seq/seq_virmidi.c        | 2 --
 sound/hda/hdac_controller.c         | 1 -
 sound/isa/sb/emu8000_patch.c        | 5 -----
 sound/isa/sb/emu8000_pcm.c          | 2 +-
 sound/isa/wss/wss_lib.c             | 1 -
 sound/pci/echoaudio/echoaudio_dsp.c | 2 --
 sound/pci/ens1370.c                 | 1 -
 sound/pci/es1968.c                  | 2 +-
 sound/pci/lola/lola.c               | 1 -
 sound/pci/mixart/mixart_hwdep.c     | 2 +-
 sound/pci/pcxhr/pcxhr_core.c        | 5 -----
 sound/pci/vx222/vx222_ops.c         | 2 --
 sound/x86/intel_hdmi_audio.c        | 1 -
 14 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/sound/arm/aaci.c b/sound/arm/aaci.c
index 0817ad21af74..d216f4859e61 100644
--- a/sound/arm/aaci.c
+++ b/sound/arm/aaci.c
@@ -145,7 +145,7 @@ static unsigned short aaci_ac97_read(struct snd_ac97 *ac97, unsigned short reg)
 	timeout = FRAME_PERIOD_US * 8;
 	do {
 		udelay(1);
-		cond_resched();
+		cond_resched_stall();
 		v = readl(aaci->base + AACI_SLFR) & (SLFR_1RXV|SLFR_2RXV);
 	} while ((v != (SLFR_1RXV|SLFR_2RXV)) && --timeout);
 
diff --git a/sound/core/seq/seq_virmidi.c b/sound/core/seq/seq_virmidi.c
index 1b9260108e48..99226da86d3c 100644
--- a/sound/core/seq/seq_virmidi.c
+++ b/sound/core/seq/seq_virmidi.c
@@ -154,8 +154,6 @@ static void snd_vmidi_output_work(struct work_struct *work)
 			if (ret < 0)
 				break;
 		}
-		/* rawmidi input might be huge, allow to have a break */
-		cond_resched();
 	}
 }
 
diff --git a/sound/hda/hdac_controller.c b/sound/hda/hdac_controller.c
index 7f3a000fab0c..9b6df2f541ca 100644
--- a/sound/hda/hdac_controller.c
+++ b/sound/hda/hdac_controller.c
@@ -284,7 +284,6 @@ int snd_hdac_bus_get_response(struct hdac_bus *bus, unsigned int addr,
 			msleep(2); /* temporary workaround */
 		} else {
 			udelay(10);
-			cond_resched();
 		}
 	}
 
diff --git a/sound/isa/sb/emu8000_patch.c b/sound/isa/sb/emu8000_patch.c
index 8c1e7f2bfc34..d808c461be35 100644
--- a/sound/isa/sb/emu8000_patch.c
+++ b/sound/isa/sb/emu8000_patch.c
@@ -218,11 +218,6 @@ snd_emu8000_sample_new(struct snd_emux *rec, struct snd_sf_sample *sp,
 		offset++;
 		write_word(emu, &dram_offset, s);
 
-		/* we may take too long time in this loop.
-		 * so give controls back to kernel if needed.
-		 */
-		cond_resched();
-
 		if (i == sp->v.loopend &&
 		    (sp->v.mode_flags & (SNDRV_SFNT_SAMPLE_BIDIR_LOOP|SNDRV_SFNT_SAMPLE_REVERSE_LOOP)))
 		{
diff --git a/sound/isa/sb/emu8000_pcm.c b/sound/isa/sb/emu8000_pcm.c
index 9234d4fe8ada..fd18c7cf1812 100644
--- a/sound/isa/sb/emu8000_pcm.c
+++ b/sound/isa/sb/emu8000_pcm.c
@@ -404,7 +404,7 @@ static int emu8k_pcm_trigger(struct snd_pcm_substream *subs, int cmd)
  */
 #define CHECK_SCHEDULER() \
 do { \
-	cond_resched();\
+	cond_resched_stall();\
 	if (signal_pending(current))\
 		return -EAGAIN;\
 } while (0)
diff --git a/sound/isa/wss/wss_lib.c b/sound/isa/wss/wss_lib.c
index 026061b55ee9..97c74e8c26ee 100644
--- a/sound/isa/wss/wss_lib.c
+++ b/sound/isa/wss/wss_lib.c
@@ -1159,7 +1159,6 @@ static int snd_ad1848_probe(struct snd_wss *chip)
 	while (wss_inb(chip, CS4231P(REGSEL)) & CS4231_INIT) {
 		if (time_after(jiffies, timeout))
 			return -ENODEV;
-		cond_resched();
 	}
 	spin_lock_irqsave(&chip->reg_lock, flags);
 
diff --git a/sound/pci/echoaudio/echoaudio_dsp.c b/sound/pci/echoaudio/echoaudio_dsp.c
index 2a40091d472c..085b229c83b5 100644
--- a/sound/pci/echoaudio/echoaudio_dsp.c
+++ b/sound/pci/echoaudio/echoaudio_dsp.c
@@ -100,7 +100,6 @@ static int write_dsp(struct echoaudio *chip, u32 data)
 			return 0;
 		}
 		udelay(1);
-		cond_resched();
 	}
 
 	chip->bad_board = true;		/* Set true until DSP re-loaded */
@@ -123,7 +122,6 @@ static int read_dsp(struct echoaudio *chip, u32 *data)
 			return 0;
 		}
 		udelay(1);
-		cond_resched();
 	}
 
 	chip->bad_board = true;		/* Set true until DSP re-loaded */
diff --git a/sound/pci/ens1370.c b/sound/pci/ens1370.c
index 89210b2c7342..4948ae411a94 100644
--- a/sound/pci/ens1370.c
+++ b/sound/pci/ens1370.c
@@ -501,7 +501,6 @@ static unsigned int snd_es1371_wait_src_ready(struct ensoniq * ensoniq)
 		r = inl(ES_REG(ensoniq, 1371_SMPRATE));
 		if ((r & ES_1371_SRC_RAM_BUSY) == 0)
 			return r;
-		cond_resched();
 	}
 	dev_err(ensoniq->card->dev, "wait src ready timeout 0x%lx [0x%x]\n",
 		   ES_REG(ensoniq, 1371_SMPRATE), r);
diff --git a/sound/pci/es1968.c b/sound/pci/es1968.c
index 4bc0f53c223b..1598880cfeea 100644
--- a/sound/pci/es1968.c
+++ b/sound/pci/es1968.c
@@ -612,7 +612,7 @@ static int snd_es1968_ac97_wait(struct es1968 *chip)
 	while (timeout-- > 0) {
 		if (!(inb(chip->io_port + ESM_AC97_INDEX) & 1))
 			return 0;
-		cond_resched();
+		cond_resched_stall();
 	}
 	dev_dbg(chip->card->dev, "ac97 timeout\n");
 	return 1; /* timeout */
diff --git a/sound/pci/lola/lola.c b/sound/pci/lola/lola.c
index 1aa30e90b86a..3c18b5543512 100644
--- a/sound/pci/lola/lola.c
+++ b/sound/pci/lola/lola.c
@@ -166,7 +166,6 @@ static int rirb_get_response(struct lola *chip, unsigned int *val,
 		if (time_after(jiffies, timeout))
 			break;
 		udelay(20);
-		cond_resched();
 	}
 	dev_warn(chip->card->dev, "RIRB response error\n");
 	if (!chip->polling_mode) {
diff --git a/sound/pci/mixart/mixart_hwdep.c b/sound/pci/mixart/mixart_hwdep.c
index 689c0f995a9c..1906cb861002 100644
--- a/sound/pci/mixart/mixart_hwdep.c
+++ b/sound/pci/mixart/mixart_hwdep.c
@@ -41,7 +41,7 @@ static int mixart_wait_nice_for_register_value(struct mixart_mgr *mgr,
 	do {	/* we may take too long time in this loop.
 		 * so give controls back to kernel if needed.
 		 */
-		cond_resched();
+		cond_resched_stall();
 
 		read = readl_be( MIXART_MEM( mgr, offset ));
 		if(is_egal) {
diff --git a/sound/pci/pcxhr/pcxhr_core.c b/sound/pci/pcxhr/pcxhr_core.c
index 23f253effb4f..221eb6570c5e 100644
--- a/sound/pci/pcxhr/pcxhr_core.c
+++ b/sound/pci/pcxhr/pcxhr_core.c
@@ -304,8 +304,6 @@ int pcxhr_load_xilinx_binary(struct pcxhr_mgr *mgr,
 			PCXHR_OUTPL(mgr, PCXHR_PLX_CHIPSC, chipsc);
 			mask >>= 1;
 		}
-		/* don't take too much time in this loop... */
-		cond_resched();
 	}
 	chipsc &= ~(PCXHR_CHIPSC_DATA_CLK | PCXHR_CHIPSC_DATA_IN);
 	PCXHR_OUTPL(mgr, PCXHR_PLX_CHIPSC, chipsc);
@@ -356,9 +354,6 @@ static int pcxhr_download_dsp(struct pcxhr_mgr *mgr, const struct firmware *dsp)
 		PCXHR_OUTPB(mgr, PCXHR_DSP_TXH, data[0]);
 		PCXHR_OUTPB(mgr, PCXHR_DSP_TXM, data[1]);
 		PCXHR_OUTPB(mgr, PCXHR_DSP_TXL, data[2]);
-
-		/* don't take too much time in this loop... */
-		cond_resched();
 	}
 	/* give some time to boot the DSP */
 	msleep(PCXHR_WAIT_DEFAULT);
diff --git a/sound/pci/vx222/vx222_ops.c b/sound/pci/vx222/vx222_ops.c
index 3e7e928b24f8..84a59566b036 100644
--- a/sound/pci/vx222/vx222_ops.c
+++ b/sound/pci/vx222/vx222_ops.c
@@ -376,8 +376,6 @@ static int vx2_load_xilinx_binary(struct vx_core *chip, const struct firmware *x
 	for (i = 0; i < xilinx->size; i++, image++) {
 		if (put_xilinx_data(chip, port, 8, *image) < 0)
 			return -EINVAL;
-		/* don't take too much time in this loop... */
-		cond_resched();
 	}
 	put_xilinx_data(chip, port, 4, 0xff); /* end signature */
 
diff --git a/sound/x86/intel_hdmi_audio.c b/sound/x86/intel_hdmi_audio.c
index ab95fb34a635..e734d2f5f711 100644
--- a/sound/x86/intel_hdmi_audio.c
+++ b/sound/x86/intel_hdmi_audio.c
@@ -1020,7 +1020,6 @@ static void wait_clear_underrun_bit(struct snd_intelhad *intelhaddata)
 		if (!(val & AUD_HDMI_STATUS_MASK_UNDERRUN))
 			return;
 		udelay(100);
-		cond_resched();
 		had_write_register(intelhaddata, AUD_HDMI_STATUS, val);
 	}
 	dev_err(intelhaddata->dev, "Unable to clear UNDERRUN bits\n");
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 81/86] treewide: md: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (22 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 80/86] treewide: sound: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 82/86] treewide: mtd: " Ankur Arora
                     ` (6 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Coly Li, Kent Overstreet, Alasdair Kergon, Mike Snitzer

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Coly Li <colyli@suse.de> 
Cc: Kent Overstreet <kent.overstreet@gmail.com> 
Cc: Alasdair Kergon <agk@redhat.com> 
Cc: Mike Snitzer <snitzer@kernel.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/md/bcache/btree.c     |  5 -----
 drivers/md/bcache/journal.c   |  2 --
 drivers/md/bcache/sysfs.c     |  1 -
 drivers/md/bcache/writeback.c |  2 --
 drivers/md/dm-bufio.c         | 14 --------------
 drivers/md/dm-cache-target.c  |  4 ----
 drivers/md/dm-crypt.c         |  3 ---
 drivers/md/dm-integrity.c     |  3 ---
 drivers/md/dm-kcopyd.c        |  2 --
 drivers/md/dm-snap.c          |  1 -
 drivers/md/dm-stats.c         |  8 --------
 drivers/md/dm-thin.c          |  2 --
 drivers/md/dm-writecache.c    | 11 -----------
 drivers/md/dm.c               |  4 ----
 drivers/md/md.c               |  1 -
 drivers/md/raid1.c            |  2 --
 drivers/md/raid10.c           |  3 ---
 drivers/md/raid5.c            |  2 --
 18 files changed, 70 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index fd121a61f17c..b9389d3c39d7 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1826,7 +1826,6 @@ static void bch_btree_gc(struct cache_set *c)
 	do {
 		ret = bcache_btree_root(gc_root, c, &op, &writes, &stats);
 		closure_sync(&writes);
-		cond_resched();
 
 		if (ret == -EAGAIN)
 			schedule_timeout_interruptible(msecs_to_jiffies
@@ -1981,7 +1980,6 @@ static int bch_btree_check_thread(void *arg)
 				goto out;
 			}
 			skip_nr--;
-			cond_resched();
 		}
 
 		if (p) {
@@ -2005,7 +2003,6 @@ static int bch_btree_check_thread(void *arg)
 		}
 		p = NULL;
 		prev_idx = cur_idx;
-		cond_resched();
 	}
 
 out:
@@ -2670,8 +2667,6 @@ void bch_refill_keybuf(struct cache_set *c, struct keybuf *buf,
 	struct bkey start = buf->last_scanned;
 	struct refill refill;
 
-	cond_resched();
-
 	bch_btree_op_init(&refill.op, -1);
 	refill.nr_found	= 0;
 	refill.buf	= buf;
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index c182c21de2e8..5e06a665d082 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -384,8 +384,6 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 
 			BUG_ON(!bch_keylist_empty(&keylist));
 			keys++;
-
-			cond_resched();
 		}
 
 		if (i->pin)
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 0e2c1880f60b..d7e248b54abd 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -1030,7 +1030,6 @@ KTYPE(bch_cache_set_internal);
 
 static int __bch_cache_cmp(const void *l, const void *r)
 {
-	cond_resched();
 	return *((uint16_t *)r) - *((uint16_t *)l);
 }
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 24c049067f61..7da09bba3067 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -863,8 +863,6 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
 					     KEY_START(k), KEY_SIZE(k));
 
 	op->count++;
-	if (!(op->count % INIT_KEYS_EACH_TIME))
-		cond_resched();
 
 	return MAP_CONTINUE;
 }
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index bc309e41d074..0b8f3341fa79 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -294,8 +294,6 @@ static struct lru_entry *lru_evict(struct lru *lru, le_predicate pred, void *con
 		}
 
 		h = h->next;
-
-		cond_resched();
 	}
 
 	return NULL;
@@ -762,7 +760,6 @@ static void __cache_iterate(struct dm_buffer_cache *bc, int list_mode,
 		case IT_COMPLETE:
 			return;
 		}
-		cond_resched();
 
 		le = to_le(le->list.next);
 	} while (le != first);
@@ -890,8 +887,6 @@ static void __remove_range(struct dm_buffer_cache *bc,
 	struct dm_buffer *b;
 
 	while (true) {
-		cond_resched();
-
 		b = __find_next(root, begin);
 		if (!b || (b->block >= end))
 			break;
@@ -1435,7 +1430,6 @@ static void __flush_write_list(struct list_head *write_list)
 			list_entry(write_list->next, struct dm_buffer, write_list);
 		list_del(&b->write_list);
 		submit_io(b, REQ_OP_WRITE, write_endio);
-		cond_resched();
 	}
 	blk_finish_plug(&plug);
 }
@@ -1953,8 +1947,6 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 				submit_io(b, REQ_OP_READ, read_endio);
 			dm_bufio_release(b);
 
-			cond_resched();
-
 			if (!n_blocks)
 				goto flush_plug;
 			dm_bufio_lock(c);
@@ -2093,8 +2085,6 @@ int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c)
 			cache_mark(&c->cache, b, LIST_CLEAN);
 
 		cache_put_and_wake(c, b);
-
-		cond_resched();
 	}
 	lru_iter_end(&it);
 
@@ -2350,7 +2340,6 @@ static void __scan(struct dm_bufio_client *c)
 
 			atomic_long_dec(&c->need_shrink);
 			freed++;
-			cond_resched();
 		}
 	}
 }
@@ -2659,8 +2648,6 @@ static unsigned long __evict_many(struct dm_bufio_client *c,
 
 		__make_buffer_clean(b);
 		__free_buffer_wake(b);
-
-		cond_resched();
 	}
 
 	return count;
@@ -2802,7 +2789,6 @@ static void evict_old(void)
 	while (dm_bufio_current_allocated > threshold) {
 		if (!__evict_a_few(64))
 			break;
-		cond_resched();
 	}
 	mutex_unlock(&dm_bufio_clients_lock);
 }
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 911f73f7ebba..df136b29471a 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -1829,7 +1829,6 @@ static void process_deferred_bios(struct work_struct *ws)
 
 		else
 			commit_needed = process_bio(cache, bio) || commit_needed;
-		cond_resched();
 	}
 
 	if (commit_needed)
@@ -1853,7 +1852,6 @@ static void requeue_deferred_bios(struct cache *cache)
 	while ((bio = bio_list_pop(&bios))) {
 		bio->bi_status = BLK_STS_DM_REQUEUE;
 		bio_endio(bio);
-		cond_resched();
 	}
 }
 
@@ -1894,8 +1892,6 @@ static void check_migrations(struct work_struct *ws)
 		r = mg_start(cache, op, NULL);
 		if (r)
 			break;
-
-		cond_resched();
 	}
 }
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 5315fd261c23..70a24ade34af 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1629,8 +1629,6 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 			atomic_dec(&ctx->cc_pending);
 			ctx->cc_sector += sector_step;
 			tag_offset++;
-			if (!atomic)
-				cond_resched();
 			continue;
 		/*
 		 * There was a data integrity error.
@@ -1965,7 +1963,6 @@ static int dmcrypt_write(void *data)
 			io = crypt_io_from_node(rb_first(&write_tree));
 			rb_erase(&io->rb_node, &write_tree);
 			kcryptd_io_write(io);
-			cond_resched();
 		} while (!RB_EMPTY_ROOT(&write_tree));
 		blk_finish_plug(&plug);
 	}
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 97a8d5fc9ebb..63c88f23b585 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -2717,12 +2717,10 @@ static void integrity_recalc(struct work_struct *w)
 				       ic->sectors_per_block, BITMAP_OP_TEST_ALL_CLEAR)) {
 			logical_sector += ic->sectors_per_block;
 			n_sectors -= ic->sectors_per_block;
-			cond_resched();
 		}
 		while (block_bitmap_op(ic, ic->recalc_bitmap, logical_sector + n_sectors - ic->sectors_per_block,
 				       ic->sectors_per_block, BITMAP_OP_TEST_ALL_CLEAR)) {
 			n_sectors -= ic->sectors_per_block;
-			cond_resched();
 		}
 		get_area_and_offset(ic, logical_sector, &area, &offset);
 	}
@@ -2782,7 +2780,6 @@ static void integrity_recalc(struct work_struct *w)
 	}
 
 advance_and_next:
-	cond_resched();
 
 	spin_lock_irq(&ic->endio_wait.lock);
 	remove_range_unlocked(ic, &range);
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index d01807c50f20..8a91e83188e7 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -512,8 +512,6 @@ static int run_complete_job(struct kcopyd_job *job)
 	if (atomic_dec_and_test(&kc->nr_jobs))
 		wake_up(&kc->destroyq);
 
-	cond_resched();
-
 	return 0;
 }
 
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index bf7a574499a3..cd8891c12cca 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1762,7 +1762,6 @@ static void copy_callback(int read_err, unsigned long write_err, void *context)
 			s->exception_complete_sequence++;
 			rb_erase(&pe->out_of_order_node, &s->out_of_order_tree);
 			complete_exception(pe);
-			cond_resched();
 		}
 	} else {
 		struct rb_node *parent = NULL;
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
index db2d997a6c18..d6878cb7b0ef 100644
--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -230,7 +230,6 @@ void dm_stats_cleanup(struct dm_stats *stats)
 				       atomic_read(&shared->in_flight[READ]),
 				       atomic_read(&shared->in_flight[WRITE]));
 			}
-			cond_resched();
 		}
 		dm_stat_free(&s->rcu_head);
 	}
@@ -336,7 +335,6 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
 	for (ni = 0; ni < n_entries; ni++) {
 		atomic_set(&s->stat_shared[ni].in_flight[READ], 0);
 		atomic_set(&s->stat_shared[ni].in_flight[WRITE], 0);
-		cond_resched();
 	}
 
 	if (s->n_histogram_entries) {
@@ -350,7 +348,6 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
 		for (ni = 0; ni < n_entries; ni++) {
 			s->stat_shared[ni].tmp.histogram = hi;
 			hi += s->n_histogram_entries + 1;
-			cond_resched();
 		}
 	}
 
@@ -372,7 +369,6 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
 			for (ni = 0; ni < n_entries; ni++) {
 				p[ni].histogram = hi;
 				hi += s->n_histogram_entries + 1;
-				cond_resched();
 			}
 		}
 	}
@@ -512,7 +508,6 @@ static int dm_stats_list(struct dm_stats *stats, const char *program,
 			}
 			DMEMIT("\n");
 		}
-		cond_resched();
 	}
 	mutex_unlock(&stats->mutex);
 
@@ -794,7 +789,6 @@ static void __dm_stat_clear(struct dm_stat *s, size_t idx_start, size_t idx_end,
 				local_irq_enable();
 			}
 		}
-		cond_resched();
 	}
 }
 
@@ -910,8 +904,6 @@ static int dm_stats_print(struct dm_stats *stats, int id,
 
 		if (unlikely(sz + 1 >= maxlen))
 			goto buffer_overflow;
-
-		cond_resched();
 	}
 
 	if (clear)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 07c7f9795b10..52e4a7dc6923 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2234,7 +2234,6 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 			throttle_work_update(&pool->throttle);
 			dm_pool_issue_prefetches(pool->pmd);
 		}
-		cond_resched();
 	}
 	blk_finish_plug(&plug);
 }
@@ -2317,7 +2316,6 @@ static void process_thin_deferred_cells(struct thin_c *tc)
 			else
 				pool->process_cell(tc, cell);
 		}
-		cond_resched();
 	} while (!list_empty(&cells));
 }
 
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 074cb785eafc..75ecc26915a1 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -321,8 +321,6 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 			while (daa-- && i < p) {
 				pages[i++] = pfn_t_to_page(pfn);
 				pfn.val++;
-				if (!(i & 15))
-					cond_resched();
 			}
 		} while (i < p);
 		wc->memory_map = vmap(pages, p, VM_MAP, PAGE_KERNEL);
@@ -819,7 +817,6 @@ static void writecache_flush(struct dm_writecache *wc)
 		if (writecache_entry_is_committed(wc, e2))
 			break;
 		e = e2;
-		cond_resched();
 	}
 	writecache_commit_flushed(wc, true);
 
@@ -848,7 +845,6 @@ static void writecache_flush(struct dm_writecache *wc)
 		if (unlikely(e->lru.prev == &wc->lru))
 			break;
 		e = container_of(e->lru.prev, struct wc_entry, lru);
-		cond_resched();
 	}
 
 	if (need_flush_after_free)
@@ -970,7 +966,6 @@ static int writecache_alloc_entries(struct dm_writecache *wc)
 
 		e->index = b;
 		e->write_in_progress = false;
-		cond_resched();
 	}
 
 	return 0;
@@ -1058,7 +1053,6 @@ static void writecache_resume(struct dm_target *ti)
 			e->original_sector = le64_to_cpu(wme.original_sector);
 			e->seq_count = le64_to_cpu(wme.seq_count);
 		}
-		cond_resched();
 	}
 #endif
 	for (b = 0; b < wc->n_blocks; b++) {
@@ -1093,7 +1087,6 @@ static void writecache_resume(struct dm_target *ti)
 				}
 			}
 		}
-		cond_resched();
 	}
 
 	if (need_flush) {
@@ -1824,7 +1817,6 @@ static void __writeback_throttle(struct dm_writecache *wc, struct writeback_list
 			wc_unlock(wc);
 		}
 	}
-	cond_resched();
 }
 
 static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeback_list *wbl)
@@ -2024,7 +2016,6 @@ static void writecache_writeback(struct work_struct *work)
 				     read_original_sector(wc, e))) {
 				BUG_ON(!f->write_in_progress);
 				list_move(&e->lru, &skipped);
-				cond_resched();
 				continue;
 			}
 		}
@@ -2079,7 +2070,6 @@ static void writecache_writeback(struct work_struct *work)
 				break;
 			}
 		}
-		cond_resched();
 	}
 
 	if (!list_empty(&skipped)) {
@@ -2168,7 +2158,6 @@ static int init_memory(struct dm_writecache *wc)
 
 	for (b = 0; b < wc->n_blocks; b++) {
 		write_original_sector_seq_count(wc, &wc->entries[b], -1, -1);
-		cond_resched();
 	}
 
 	writecache_flush_all_metadata(wc);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 64a1f306c96c..ac0aff4de190 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -996,7 +996,6 @@ static void dm_wq_requeue_work(struct work_struct *work)
 		io->next = NULL;
 		__dm_io_complete(io, false);
 		io = next;
-		cond_resched();
 	}
 }
 
@@ -1379,12 +1378,10 @@ static noinline void __set_swap_bios_limit(struct mapped_device *md, int latch)
 {
 	mutex_lock(&md->swap_bios_lock);
 	while (latch < md->swap_bios) {
-		cond_resched();
 		down(&md->swap_bios_semaphore);
 		md->swap_bios--;
 	}
 	while (latch > md->swap_bios) {
-		cond_resched();
 		up(&md->swap_bios_semaphore);
 		md->swap_bios++;
 	}
@@ -2583,7 +2580,6 @@ static void dm_wq_work(struct work_struct *work)
 			break;
 
 		submit_bio_noacct(bio);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a104a025084d..88e8148be28f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9048,7 +9048,6 @@ void md_do_sync(struct md_thread *thread)
 		 * about not overloading the IO subsystem. (things like an
 		 * e2fsck being done on the RAID array should execute fast)
 		 */
-		cond_resched();
 
 		recovery_done = io_sectors - atomic_read(&mddev->recovery_active);
 		currspeed = ((unsigned long)(recovery_done - mddev->resync_mark_cnt))/2
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2aabac773fe7..71bd8d8d1d1c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -807,7 +807,6 @@ static void flush_bio_list(struct r1conf *conf, struct bio *bio)
 
 		raid1_submit_write(bio);
 		bio = next;
-		cond_resched();
 	}
 }
 
@@ -2613,7 +2612,6 @@ static void raid1d(struct md_thread *thread)
 		else
 			WARN_ON_ONCE(1);
 
-		cond_resched();
 		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 023413120851..d41f856ebcf4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -916,7 +916,6 @@ static void flush_pending_writes(struct r10conf *conf)
 
 			raid1_submit_write(bio);
 			bio = next;
-			cond_resched();
 		}
 		blk_finish_plug(&plug);
 	} else
@@ -1132,7 +1131,6 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 		raid1_submit_write(bio);
 		bio = next;
-		cond_resched();
 	}
 	kfree(plug);
 }
@@ -3167,7 +3165,6 @@ static void raid10d(struct md_thread *thread)
 		else
 			WARN_ON_ONCE(1);
 
-		cond_resched();
 		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 284cd71bcc68..47b995c97363 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6727,8 +6727,6 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 		handle_stripe(batch[i]);
 	log_write_stripe_run(conf);
 
-	cond_resched();
-
 	spin_lock_irq(&conf->device_lock);
 	for (i = 0; i < batch_size; i++) {
 		hash = batch[i]->hash_lock_index;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (23 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 81/86] treewide: md: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-08 16:28     ` Miquel Raynal
  2023-11-07 23:08   ` [RFC PATCH 83/86] treewide: drm: " Ankur Arora
                     ` (5 subsequent siblings)
  30 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Miquel Raynal, Vignesh Raghavendra, Kyungmin Park, Tudor Ambarus,
	Pratyush Yadav

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check).

There are a few cases from set-3. Replace them with
cond_resched_stall(). Some of those places, however, have wait-times
in the milliseconds, so maybe we should just have an msleep() there?
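
As a purely hypothetical sketch of what such a conversion could look
like, using the existing iopoll helpers (the chip, register and bit
names below are made up and not from any driver touched here):

  static int foo_wait_ready(struct foo_chip *chip)
  {
  	u32 status;

  	/* poll every 10us, give up after 40ms; this sleeps, so it is
  	 * only usable in non-atomic context */
  	return readl_poll_timeout(chip->regs + FOO_STATUS, status,
  			!(status & FOO_STATUS_BUSY), 10, 40 * USEC_PER_MSEC);
  }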

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/mtd/chips/cfi_cmdset_0001.c        |  6 ------
 drivers/mtd/chips/cfi_cmdset_0002.c        |  1 -
 drivers/mtd/chips/cfi_util.c               |  2 +-
 drivers/mtd/devices/spear_smi.c            |  2 +-
 drivers/mtd/devices/sst25l.c               |  3 +--
 drivers/mtd/devices/st_spi_fsm.c           |  4 ----
 drivers/mtd/inftlcore.c                    |  5 -----
 drivers/mtd/lpddr/lpddr_cmds.c             |  6 +-----
 drivers/mtd/mtd_blkdevs.c                  |  1 -
 drivers/mtd/nand/onenand/onenand_base.c    | 18 +-----------------
 drivers/mtd/nand/onenand/onenand_samsung.c |  8 +++++++-
 drivers/mtd/nand/raw/diskonchip.c          |  4 ++--
 drivers/mtd/nand/raw/fsmc_nand.c           |  3 +--
 drivers/mtd/nand/raw/hisi504_nand.c        |  2 +-
 drivers/mtd/nand/raw/nand_base.c           |  3 +--
 drivers/mtd/nand/raw/nand_legacy.c         | 17 +++++++++++++++--
 drivers/mtd/spi-nor/core.c                 |  8 +++++++-
 drivers/mtd/tests/mtd_test.c               |  2 --
 drivers/mtd/tests/mtd_test.h               |  2 +-
 drivers/mtd/tests/pagetest.c               |  1 -
 drivers/mtd/tests/readtest.c               |  2 --
 drivers/mtd/tests/torturetest.c            |  1 -
 drivers/mtd/ubi/attach.c                   | 10 ----------
 drivers/mtd/ubi/build.c                    |  2 --
 drivers/mtd/ubi/cdev.c                     |  4 ----
 drivers/mtd/ubi/eba.c                      |  8 --------
 drivers/mtd/ubi/misc.c                     |  2 --
 drivers/mtd/ubi/vtbl.c                     |  6 ------
 drivers/mtd/ubi/wl.c                       | 13 -------------
 29 files changed, 40 insertions(+), 106 deletions(-)

diff --git a/drivers/mtd/chips/cfi_cmdset_0001.c b/drivers/mtd/chips/cfi_cmdset_0001.c
index 11b06fefaa0e..c6abed74e4df 100644
--- a/drivers/mtd/chips/cfi_cmdset_0001.c
+++ b/drivers/mtd/chips/cfi_cmdset_0001.c
@@ -1208,7 +1208,6 @@ static int __xipram xip_wait_for_operation(
 			local_irq_enable();
 			mutex_unlock(&chip->mutex);
 			xip_iprefetch();
-			cond_resched();
 
 			/*
 			 * We're back.  However someone else might have
@@ -1337,7 +1336,6 @@ static int inval_cache_and_wait_for_operation(
 			sleep_time = 1000000/HZ;
 		} else {
 			udelay(1);
-			cond_resched();
 			timeo--;
 		}
 		mutex_lock(&chip->mutex);
@@ -1913,10 +1911,6 @@ static int cfi_intelext_writev (struct mtd_info *mtd, const struct kvec *vecs,
 				return 0;
 		}
 
-		/* Be nice and reschedule with the chip in a usable state for other
-		   processes. */
-		cond_resched();
-
 	} while (len);
 
 	return 0;
diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index df589d9b4d70..f6d8f8ccbe3f 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -1105,7 +1105,6 @@ static void __xipram xip_udelay(struct map_info *map, struct flchip *chip,
 			local_irq_enable();
 			mutex_unlock(&chip->mutex);
 			xip_iprefetch();
-			cond_resched();
 
 			/*
 			 * We're back.  However someone else might have
diff --git a/drivers/mtd/chips/cfi_util.c b/drivers/mtd/chips/cfi_util.c
index 140c69a67e82..c178dae31a59 100644
--- a/drivers/mtd/chips/cfi_util.c
+++ b/drivers/mtd/chips/cfi_util.c
@@ -28,7 +28,7 @@ void cfi_udelay(int us)
 		msleep(DIV_ROUND_UP(us, 1000));
 	} else {
 		udelay(us);
-		cond_resched();
+		cond_resched_stall();
 	}
 }
 EXPORT_SYMBOL(cfi_udelay);
diff --git a/drivers/mtd/devices/spear_smi.c b/drivers/mtd/devices/spear_smi.c
index 0a35e5236ae5..9b4d226633a9 100644
--- a/drivers/mtd/devices/spear_smi.c
+++ b/drivers/mtd/devices/spear_smi.c
@@ -278,7 +278,7 @@ static int spear_smi_wait_till_ready(struct spear_smi *dev, u32 bank,
 			return 0;
 		}
 
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, finish));
 
 	dev_err(&dev->pdev->dev, "smi controller is busy, timeout\n");
diff --git a/drivers/mtd/devices/sst25l.c b/drivers/mtd/devices/sst25l.c
index 8813994ce9f4..ff16147d9bdd 100644
--- a/drivers/mtd/devices/sst25l.c
+++ b/drivers/mtd/devices/sst25l.c
@@ -132,8 +132,7 @@ static int sst25l_wait_till_ready(struct sst25l_flash *flash)
 			return err;
 		if (!(status & SST25L_STATUS_BUSY))
 			return 0;
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, deadline));
 
 	return -ETIMEDOUT;
diff --git a/drivers/mtd/devices/st_spi_fsm.c b/drivers/mtd/devices/st_spi_fsm.c
index 95530cbbb1e0..a0f5874c1941 100644
--- a/drivers/mtd/devices/st_spi_fsm.c
+++ b/drivers/mtd/devices/st_spi_fsm.c
@@ -738,8 +738,6 @@ static void stfsm_wait_seq(struct stfsm *fsm)
 
 		if (stfsm_is_idle(fsm))
 			return;
-
-		cond_resched();
 	}
 
 	dev_err(fsm->dev, "timeout on sequence completion\n");
@@ -901,8 +899,6 @@ static uint8_t stfsm_wait_busy(struct stfsm *fsm)
 		if (!timeout)
 			/* Restart */
 			writel(seq->seq_cfg, fsm->base + SPI_FAST_SEQ_CFG);
-
-		cond_resched();
 	}
 
 	dev_err(fsm->dev, "timeout on wait_busy\n");
diff --git a/drivers/mtd/inftlcore.c b/drivers/mtd/inftlcore.c
index 9739387cff8c..c757b8a25748 100644
--- a/drivers/mtd/inftlcore.c
+++ b/drivers/mtd/inftlcore.c
@@ -732,11 +732,6 @@ static void INFTL_trydeletechain(struct INFTLrecord *inftl, unsigned thisVUC)
 
 		/* Now sort out whatever was pointing to it... */
 		*prevEUN = BLOCK_NIL;
-
-		/* Ideally we'd actually be responsive to new
-		   requests while we're doing this -- if there's
-		   free space why should others be made to wait? */
-		cond_resched();
 	}
 
 	inftl->VUtable[thisVUC] = BLOCK_NIL;
diff --git a/drivers/mtd/lpddr/lpddr_cmds.c b/drivers/mtd/lpddr/lpddr_cmds.c
index 3c3939bc2dad..ad8992d24082 100644
--- a/drivers/mtd/lpddr/lpddr_cmds.c
+++ b/drivers/mtd/lpddr/lpddr_cmds.c
@@ -161,7 +161,7 @@ static int wait_for_ready(struct map_info *map, struct flchip *chip,
 			sleep_time = 1000000/HZ;
 		} else {
 			udelay(1);
-			cond_resched();
+			cond_resched_stall();
 			timeo--;
 		}
 		mutex_lock(&chip->mutex);
@@ -677,10 +677,6 @@ static int lpddr_writev(struct mtd_info *mtd, const struct kvec *vecs,
 		(*retlen) += size;
 		len -= size;
 
-		/* Be nice and reschedule with the chip in a usable
-		 * state for other processes */
-		cond_resched();
-
 	} while (len);
 
 	return 0;
diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c
index ff18636e0889..96bff5627a31 100644
--- a/drivers/mtd/mtd_blkdevs.c
+++ b/drivers/mtd/mtd_blkdevs.c
@@ -158,7 +158,6 @@ static void mtd_blktrans_work(struct mtd_blktrans_dev *dev)
 		}
 
 		background_done = 0;
-		cond_resched();
 		spin_lock_irq(&dev->queue_lock);
 	}
 }
diff --git a/drivers/mtd/nand/onenand/onenand_base.c b/drivers/mtd/nand/onenand/onenand_base.c
index f66385faf631..97d07e4cc150 100644
--- a/drivers/mtd/nand/onenand/onenand_base.c
+++ b/drivers/mtd/nand/onenand/onenand_base.c
@@ -567,7 +567,7 @@ static int onenand_wait(struct mtd_info *mtd, int state)
 			break;
 
 		if (state != FL_READING && state != FL_PREPARING_ERASE)
-			cond_resched();
+			cond_resched_stall();
 	}
 	/* To get correct interrupt status in timeout case */
 	interrupt = this->read_word(this->base + ONENAND_REG_INTERRUPT);
@@ -1143,8 +1143,6 @@ static int onenand_mlc_read_ops_nolock(struct mtd_info *mtd, loff_t from,
 	stats = mtd->ecc_stats;
 
 	while (read < len) {
-		cond_resched();
-
 		thislen = min_t(int, writesize, len - read);
 
 		column = from & (writesize - 1);
@@ -1307,7 +1305,6 @@ static int onenand_read_ops_nolock(struct mtd_info *mtd, loff_t from,
 		buf += thislen;
 		thislen = min_t(int, writesize, len - read);
 		column = 0;
-		cond_resched();
 		/* Now wait for load */
 		ret = this->wait(mtd, FL_READING);
 		onenand_update_bufferram(mtd, from, !ret);
@@ -1378,8 +1375,6 @@ static int onenand_read_oob_nolock(struct mtd_info *mtd, loff_t from,
 	readcmd = ONENAND_IS_4KB_PAGE(this) ? ONENAND_CMD_READ : ONENAND_CMD_READOOB;
 
 	while (read < len) {
-		cond_resched();
-
 		thislen = oobsize - column;
 		thislen = min_t(int, thislen, len);
 
@@ -1565,8 +1560,6 @@ int onenand_bbt_read_oob(struct mtd_info *mtd, loff_t from,
 	readcmd = ONENAND_IS_4KB_PAGE(this) ? ONENAND_CMD_READ : ONENAND_CMD_READOOB;
 
 	while (read < len) {
-		cond_resched();
-
 		thislen = mtd->oobsize - column;
 		thislen = min_t(int, thislen, len);
 
@@ -1838,8 +1831,6 @@ static int onenand_write_ops_nolock(struct mtd_info *mtd, loff_t to,
 			thislen = min_t(int, mtd->writesize - column, len - written);
 			thisooblen = min_t(int, oobsize - oobcolumn, ooblen - oobwritten);
 
-			cond_resched();
-
 			this->command(mtd, ONENAND_CMD_BUFFERRAM, to, thislen);
 
 			/* Partial page write */
@@ -2022,8 +2013,6 @@ static int onenand_write_oob_nolock(struct mtd_info *mtd, loff_t to,
 	while (written < len) {
 		int thislen = min_t(int, oobsize, len - written);
 
-		cond_resched();
-
 		this->command(mtd, ONENAND_CMD_BUFFERRAM, to, mtd->oobsize);
 
 		/* We send data to spare ram with oobsize
@@ -2232,7 +2221,6 @@ static int onenand_multiblock_erase(struct mtd_info *mtd,
 		}
 
 		/* last block of 64-eb series */
-		cond_resched();
 		this->command(mtd, ONENAND_CMD_ERASE, addr, block_size);
 		onenand_invalidate_bufferram(mtd, addr, block_size);
 
@@ -2288,8 +2276,6 @@ static int onenand_block_by_block_erase(struct mtd_info *mtd,
 
 	/* Loop through the blocks */
 	while (len) {
-		cond_resched();
-
 		/* Check if we have a bad block, we do not erase bad blocks */
 		if (onenand_block_isbad_nolock(mtd, addr, 0)) {
 			printk(KERN_WARNING "%s: attempt to erase a bad block "
@@ -2799,8 +2785,6 @@ static int onenand_otp_write_oob_nolock(struct mtd_info *mtd, loff_t to,
 	while (written < len) {
 		int thislen = min_t(int, oobsize, len - written);
 
-		cond_resched();
-
 		block = (int) (to >> this->erase_shift);
 		/*
 		 * Write 'DFS, FBA' of Flash
diff --git a/drivers/mtd/nand/onenand/onenand_samsung.c b/drivers/mtd/nand/onenand/onenand_samsung.c
index fd6890a03d55..2e0c8f50d77d 100644
--- a/drivers/mtd/nand/onenand/onenand_samsung.c
+++ b/drivers/mtd/nand/onenand/onenand_samsung.c
@@ -338,8 +338,14 @@ static int s3c_onenand_wait(struct mtd_info *mtd, int state)
 		if (stat & flags)
 			break;
 
+		/*
+		 * Use a cond_resched_stall() to avoid spinning in
+		 * a tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should timeout or event wait?
+		 */
 		if (state != FL_READING)
-			cond_resched();
+			cond_resched_stall();
 	}
 	/* To get correct interrupt status in timeout case */
 	stat = s3c_read_reg(INT_ERR_STAT_OFFSET);
diff --git a/drivers/mtd/nand/raw/diskonchip.c b/drivers/mtd/nand/raw/diskonchip.c
index 5d2ddb037a9a..930b4fdf75e0 100644
--- a/drivers/mtd/nand/raw/diskonchip.c
+++ b/drivers/mtd/nand/raw/diskonchip.c
@@ -248,7 +248,7 @@ static int _DoC_WaitReady(struct doc_priv *doc)
 				return -EIO;
 			}
 			udelay(1);
-			cond_resched();
+			cond_resched_stall();
 		}
 	} else {
 		while (!(ReadDOC(docptr, CDSNControl) & CDSN_CTRL_FR_B)) {
@@ -257,7 +257,7 @@ static int _DoC_WaitReady(struct doc_priv *doc)
 				return -EIO;
 			}
 			udelay(1);
-			cond_resched();
+			cond_resched_stall();
 		}
 	}
 
diff --git a/drivers/mtd/nand/raw/fsmc_nand.c b/drivers/mtd/nand/raw/fsmc_nand.c
index 811982da3557..20e88e98e517 100644
--- a/drivers/mtd/nand/raw/fsmc_nand.c
+++ b/drivers/mtd/nand/raw/fsmc_nand.c
@@ -398,8 +398,7 @@ static int fsmc_read_hwecc_ecc4(struct nand_chip *chip, const u8 *data,
 	do {
 		if (readl_relaxed(host->regs_va + STS) & FSMC_CODE_RDY)
 			break;
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, deadline));
 
 	if (time_after_eq(jiffies, deadline)) {
diff --git a/drivers/mtd/nand/raw/hisi504_nand.c b/drivers/mtd/nand/raw/hisi504_nand.c
index fe291a2e5c77..bf669b1750f8 100644
--- a/drivers/mtd/nand/raw/hisi504_nand.c
+++ b/drivers/mtd/nand/raw/hisi504_nand.c
@@ -819,7 +819,7 @@ static int hisi_nfc_suspend(struct device *dev)
 		if (((hinfc_read(host, HINFC504_STATUS) & 0x1) == 0x0) &&
 		    (hinfc_read(host, HINFC504_DMA_CTRL) &
 		     HINFC504_DMA_CTRL_DMA_START)) {
-			cond_resched();
+			cond_resched_stall();
 			return 0;
 		}
 	}
diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
index 1fcac403cee6..656126b05f09 100644
--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -730,8 +730,7 @@ int nand_gpio_waitrdy(struct nand_chip *chip, struct gpio_desc *gpiod,
 	do {
 		if (gpiod_get_value_cansleep(gpiod))
 			return 0;
-
-		cond_resched();
+		cond_resched_stall();
 	} while	(time_before(jiffies, timeout_ms));
 
 	return gpiod_get_value_cansleep(gpiod) ? 0 : -ETIMEDOUT;
diff --git a/drivers/mtd/nand/raw/nand_legacy.c b/drivers/mtd/nand/raw/nand_legacy.c
index 743792edf98d..aaef537b46c3 100644
--- a/drivers/mtd/nand/raw/nand_legacy.c
+++ b/drivers/mtd/nand/raw/nand_legacy.c
@@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
 	do {
 		if (chip->legacy.dev_ready(chip))
 			return;
-		cond_resched();
+		/*
+		 * Use a cond_resched_stall() to avoid spinning in
+		 * a tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should timeout or event wait?
+		 */
+		cond_resched_stall();
 	} while (time_before(jiffies, timeo));
 
 	if (!chip->legacy.dev_ready(chip))
@@ -565,7 +571,14 @@ static int nand_wait(struct nand_chip *chip)
 				if (status & NAND_STATUS_READY)
 					break;
 			}
-			cond_resched();
+
+			/*
+			 * Use a cond_resched_stall() to avoid spinning in
+			 * a tight loop.
+			 * Though, given that the timeout is in milliseconds,
+			 * maybe this should timeout or event wait?
+			 */
+			cond_resched_stall();
 		} while (time_before(jiffies, timeo));
 	}
 
diff --git a/drivers/mtd/spi-nor/core.c b/drivers/mtd/spi-nor/core.c
index 1b0c6770c14e..e32e6eebb0e2 100644
--- a/drivers/mtd/spi-nor/core.c
+++ b/drivers/mtd/spi-nor/core.c
@@ -730,7 +730,13 @@ static int spi_nor_wait_till_ready_with_timeout(struct spi_nor *nor,
 		if (ret)
 			return 0;
 
-		cond_resched();
+		/*
+		 * Use a cond_resched_stall() to avoid spinning in
+		 * a tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should timeout or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	dev_dbg(nor->dev, "flash operation timed out\n");
diff --git a/drivers/mtd/tests/mtd_test.c b/drivers/mtd/tests/mtd_test.c
index c84250beffdc..5bb0c6ef7df9 100644
--- a/drivers/mtd/tests/mtd_test.c
+++ b/drivers/mtd/tests/mtd_test.c
@@ -51,7 +51,6 @@ int mtdtest_scan_for_bad_eraseblocks(struct mtd_info *mtd, unsigned char *bbt,
 		bbt[i] = is_block_bad(mtd, eb + i) ? 1 : 0;
 		if (bbt[i])
 			bad += 1;
-		cond_resched();
 	}
 	pr_info("scanned %d eraseblocks, %d are bad\n", i, bad);
 
@@ -70,7 +69,6 @@ int mtdtest_erase_good_eraseblocks(struct mtd_info *mtd, unsigned char *bbt,
 		err = mtdtest_erase_eraseblock(mtd, eb + i);
 		if (err)
 			return err;
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/mtd/tests/mtd_test.h b/drivers/mtd/tests/mtd_test.h
index 5a6e3bbe0474..4742f53c6f25 100644
--- a/drivers/mtd/tests/mtd_test.h
+++ b/drivers/mtd/tests/mtd_test.h
@@ -4,7 +4,7 @@
 
 static inline int mtdtest_relax(void)
 {
-	cond_resched();
+	cond_resched_stall();
 	if (signal_pending(current)) {
 		pr_info("aborting test due to pending signal!\n");
 		return -EINTR;
diff --git a/drivers/mtd/tests/pagetest.c b/drivers/mtd/tests/pagetest.c
index 8eb40b6e6dfa..79330c0ccd85 100644
--- a/drivers/mtd/tests/pagetest.c
+++ b/drivers/mtd/tests/pagetest.c
@@ -43,7 +43,6 @@ static int write_eraseblock(int ebnum)
 	loff_t addr = (loff_t)ebnum * mtd->erasesize;
 
 	prandom_bytes_state(&rnd_state, writebuf, mtd->erasesize);
-	cond_resched();
 	return mtdtest_write(mtd, addr, mtd->erasesize, writebuf);
 }
 
diff --git a/drivers/mtd/tests/readtest.c b/drivers/mtd/tests/readtest.c
index 99670ef91f2b..c862d9a6dc1d 100644
--- a/drivers/mtd/tests/readtest.c
+++ b/drivers/mtd/tests/readtest.c
@@ -91,7 +91,6 @@ static void dump_eraseblock(int ebnum)
 		for (j = 0; j < 32 && i < n; j++, i++)
 			p += sprintf(p, "%02x", (unsigned int)iobuf[i]);
 		printk(KERN_CRIT "%s\n", line);
-		cond_resched();
 	}
 	if (!mtd->oobsize)
 		return;
@@ -106,7 +105,6 @@ static void dump_eraseblock(int ebnum)
 				p += sprintf(p, "%02x",
 					     (unsigned int)iobuf1[i]);
 			printk(KERN_CRIT "%s\n", line);
-			cond_resched();
 		}
 }
 
diff --git a/drivers/mtd/tests/torturetest.c b/drivers/mtd/tests/torturetest.c
index 841689b4d86d..94cf4f6c6c4c 100644
--- a/drivers/mtd/tests/torturetest.c
+++ b/drivers/mtd/tests/torturetest.c
@@ -390,7 +390,6 @@ static void report_corrupt(unsigned char *read, unsigned char *written)
 	       " what was read from flash and what was expected\n");
 
 	for (i = 0; i < check_len; i += pgsize) {
-		cond_resched();
 		bytes = bits = 0;
 		first = countdiffs(written, read, i, pgsize, &bytes,
 				   &bits);
diff --git a/drivers/mtd/ubi/attach.c b/drivers/mtd/ubi/attach.c
index ae5abe492b52..0994d2d8edf0 100644
--- a/drivers/mtd/ubi/attach.c
+++ b/drivers/mtd/ubi/attach.c
@@ -1390,8 +1390,6 @@ static int scan_all(struct ubi_device *ubi, struct ubi_attach_info *ai,
 		goto out_ech;
 
 	for (pnum = start; pnum < ubi->peb_count; pnum++) {
-		cond_resched();
-
 		dbg_gen("process PEB %d", pnum);
 		err = scan_peb(ubi, ai, pnum, false);
 		if (err < 0)
@@ -1504,8 +1502,6 @@ static int scan_fast(struct ubi_device *ubi, struct ubi_attach_info **ai)
 		goto out_ech;
 
 	for (pnum = 0; pnum < UBI_FM_MAX_START; pnum++) {
-		cond_resched();
-
 		dbg_gen("process PEB %d", pnum);
 		err = scan_peb(ubi, scan_ai, pnum, true);
 		if (err < 0)
@@ -1674,8 +1670,6 @@ static int self_check_ai(struct ubi_device *ubi, struct ubi_attach_info *ai)
 	ubi_rb_for_each_entry(rb1, av, &ai->volumes, rb) {
 		int leb_count = 0;
 
-		cond_resched();
-
 		vols_found += 1;
 
 		if (ai->is_empty) {
@@ -1715,8 +1709,6 @@ static int self_check_ai(struct ubi_device *ubi, struct ubi_attach_info *ai)
 
 		last_aeb = NULL;
 		ubi_rb_for_each_entry(rb2, aeb, &av->root, u.rb) {
-			cond_resched();
-
 			last_aeb = aeb;
 			leb_count += 1;
 
@@ -1790,8 +1782,6 @@ static int self_check_ai(struct ubi_device *ubi, struct ubi_attach_info *ai)
 		ubi_rb_for_each_entry(rb2, aeb, &av->root, u.rb) {
 			int vol_type;
 
-			cond_resched();
-
 			last_aeb = aeb;
 
 			err = ubi_io_read_vid_hdr(ubi, aeb->pnum, vidb, 1);
diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
index 8ee51e49fced..52740f461259 100644
--- a/drivers/mtd/ubi/build.c
+++ b/drivers/mtd/ubi/build.c
@@ -1257,8 +1257,6 @@ static int __init ubi_init(void)
 		struct mtd_dev_param *p = &mtd_dev_param[i];
 		struct mtd_info *mtd;
 
-		cond_resched();
-
 		mtd = open_mtd_device(p->name);
 		if (IS_ERR(mtd)) {
 			err = PTR_ERR(mtd);
diff --git a/drivers/mtd/ubi/cdev.c b/drivers/mtd/ubi/cdev.c
index f43430b9c1e6..e60c0ad0eeb4 100644
--- a/drivers/mtd/ubi/cdev.c
+++ b/drivers/mtd/ubi/cdev.c
@@ -209,8 +209,6 @@ static ssize_t vol_cdev_read(struct file *file, __user char *buf, size_t count,
 	lnum = div_u64_rem(*offp, vol->usable_leb_size, &off);
 
 	do {
-		cond_resched();
-
 		if (off + len >= vol->usable_leb_size)
 			len = vol->usable_leb_size - off;
 
@@ -289,8 +287,6 @@ static ssize_t vol_cdev_direct_write(struct file *file, const char __user *buf,
 	len = count > tbuf_size ? tbuf_size : count;
 
 	while (count) {
-		cond_resched();
-
 		if (off + len >= vol->usable_leb_size)
 			len = vol->usable_leb_size - off;
 
diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
index 655ff41863e2..f1e097503826 100644
--- a/drivers/mtd/ubi/eba.c
+++ b/drivers/mtd/ubi/eba.c
@@ -1408,9 +1408,7 @@ int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
 		aldata_size = data_size =
 			ubi_calc_data_len(ubi, ubi->peb_buf, data_size);
 
-	cond_resched();
 	crc = crc32(UBI_CRC32_INIT, ubi->peb_buf, data_size);
-	cond_resched();
 
 	/*
 	 * It may turn out to be that the whole @from physical eraseblock
@@ -1432,8 +1430,6 @@ int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
 		goto out_unlock_buf;
 	}
 
-	cond_resched();
-
 	/* Read the VID header back and check if it was written correctly */
 	err = ubi_io_read_vid_hdr(ubi, to, vidb, 1);
 	if (err) {
@@ -1454,8 +1450,6 @@ int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
 				err = MOVE_TARGET_WR_ERR;
 			goto out_unlock_buf;
 		}
-
-		cond_resched();
 	}
 
 	ubi_assert(vol->eba_tbl->entries[lnum].pnum == from);
@@ -1640,8 +1634,6 @@ int ubi_eba_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 		if (!vol)
 			continue;
 
-		cond_resched();
-
 		tbl = ubi_eba_create_table(vol, vol->reserved_pebs);
 		if (IS_ERR(tbl)) {
 			err = PTR_ERR(tbl);
diff --git a/drivers/mtd/ubi/misc.c b/drivers/mtd/ubi/misc.c
index 1794d66b6eb7..8751337a8101 100644
--- a/drivers/mtd/ubi/misc.c
+++ b/drivers/mtd/ubi/misc.c
@@ -61,8 +61,6 @@ int ubi_check_volume(struct ubi_device *ubi, int vol_id)
 	for (i = 0; i < vol->used_ebs; i++) {
 		int size;
 
-		cond_resched();
-
 		if (i == vol->used_ebs - 1)
 			size = vol->last_eb_bytes;
 		else
diff --git a/drivers/mtd/ubi/vtbl.c b/drivers/mtd/ubi/vtbl.c
index f700f0e4f2ec..6e0d8b3109d5 100644
--- a/drivers/mtd/ubi/vtbl.c
+++ b/drivers/mtd/ubi/vtbl.c
@@ -163,8 +163,6 @@ static int vtbl_check(const struct ubi_device *ubi,
 	const char *name;
 
 	for (i = 0; i < ubi->vtbl_slots; i++) {
-		cond_resched();
-
 		reserved_pebs = be32_to_cpu(vtbl[i].reserved_pebs);
 		alignment = be32_to_cpu(vtbl[i].alignment);
 		data_pad = be32_to_cpu(vtbl[i].data_pad);
@@ -526,8 +524,6 @@ static int init_volumes(struct ubi_device *ubi,
 	struct ubi_volume *vol;
 
 	for (i = 0; i < ubi->vtbl_slots; i++) {
-		cond_resched();
-
 		if (be32_to_cpu(vtbl[i].reserved_pebs) == 0)
 			continue; /* Empty record */
 
@@ -736,8 +732,6 @@ static int check_attaching_info(const struct ubi_device *ubi,
 	}
 
 	for (i = 0; i < ubi->vtbl_slots + UBI_INT_VOL_COUNT; i++) {
-		cond_resched();
-
 		av = ubi_find_av(ai, i);
 		vol = ubi->volumes[i];
 		if (!vol) {
diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
index 26a214f016c1..5ff22ac93ba9 100644
--- a/drivers/mtd/ubi/wl.c
+++ b/drivers/mtd/ubi/wl.c
@@ -190,8 +190,6 @@ static int do_work(struct ubi_device *ubi)
 	int err;
 	struct ubi_work *wrk;
 
-	cond_resched();
-
 	/*
 	 * @ubi->work_sem is used to synchronize with the workers. Workers take
 	 * it in read mode, so many of them may be doing works at a time. But
@@ -519,7 +517,6 @@ static void serve_prot_queue(struct ubi_device *ubi)
 			 * too long.
 			 */
 			spin_unlock(&ubi->wl_lock);
-			cond_resched();
 			goto repeat;
 		}
 	}
@@ -1703,8 +1700,6 @@ int ubi_thread(void *u)
 			}
 		} else
 			failures = 0;
-
-		cond_resched();
 	}
 
 	dbg_wl("background thread \"%s\" is killed", ubi->bgt_name);
@@ -1805,8 +1800,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 
 	ubi->free_count = 0;
 	list_for_each_entry_safe(aeb, tmp, &ai->erase, u.list) {
-		cond_resched();
-
 		err = erase_aeb(ubi, aeb, false);
 		if (err)
 			goto out_free;
@@ -1815,8 +1808,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 	}
 
 	list_for_each_entry(aeb, &ai->free, u.list) {
-		cond_resched();
-
 		e = kmem_cache_alloc(ubi_wl_entry_slab, GFP_KERNEL);
 		if (!e) {
 			err = -ENOMEM;
@@ -1837,8 +1828,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 
 	ubi_rb_for_each_entry(rb1, av, &ai->volumes, rb) {
 		ubi_rb_for_each_entry(rb2, aeb, &av->root, u.rb) {
-			cond_resched();
-
 			e = kmem_cache_alloc(ubi_wl_entry_slab, GFP_KERNEL);
 			if (!e) {
 				err = -ENOMEM;
@@ -1864,8 +1853,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 	}
 
 	list_for_each_entry(aeb, &ai->fastmap, u.list) {
-		cond_resched();
-
 		e = ubi_find_fm_block(ubi, aeb->pnum);
 
 		if (e) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 83/86] treewide: drm: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (24 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 82/86] treewide: mtd: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 84/86] treewide: net: " Ankur Arora
                     ` (4 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Inki Dae, Jagan Teki, Marek Szyprowski, Andrzej Hajda,
	Neil Armstrong, Robert Foss, David Airlie, Daniel Vetter,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.
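
As a purely illustrative sketch of the set-2 pattern (not taken from
any file touched here; 'lock' is a hypothetical spinlock), the open
coded variant and the helper it duplicates look roughly like:

	/* open coded variant (set-2) */
	if (need_resched() || spin_needbreak(&lock)) {
		spin_unlock(&lock);
		cond_resched();
		spin_lock(&lock);
	}

	/* equivalent helper */
	cond_resched_lock(&lock);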

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
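
For reference, a minimal sketch of those semantics (the real helper is
introduced earlier in the series; this is illustrative only and assumes
preemptible process context):

	static inline int cond_resched_stall(void)
	{
		/* Schedule if it is both needed and safe to do so ... */
		if (preemptible() && need_resched()) {
			schedule();
			return 1;
		}
		/* ... otherwise at least be polite while spinning. */
		cpu_relax();
		return 0;
	}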

Most of the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check.)

There are a few cases from set-3. Replace them with
cond_resched_stall().

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Inki Dae <inki.dae@samsung.com> 
Cc: Jagan Teki <jagan@amarulasolutions.com> 
Cc: Marek Szyprowski <m.szyprowski@samsung.com> 
Cc: Andrzej Hajda <andrzej.hajda@intel.com> 
Cc: Neil Armstrong <neil.armstrong@linaro.org> 
Cc: Robert Foss <rfoss@kernel.org> 
Cc: David Airlie <airlied@gmail.com> 
Cc: Daniel Vetter <daniel@ffwll.ch> 
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> 
Cc: Maxime Ripard <mripard@kernel.org> 
Cc: Thomas Zimmermann <tzimmermann@suse.de> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/gpu/drm/bridge/samsung-dsim.c         |  2 +-
 drivers/gpu/drm/drm_buddy.c                   |  1 -
 drivers/gpu/drm/drm_gem.c                     |  1 -
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c    |  1 -
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c     |  2 --
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |  6 ----
 .../drm/i915/gem/selftests/i915_gem_mman.c    |  5 ----
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c   |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c            |  2 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c       |  4 ---
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  4 ---
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |  2 --
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  2 --
 drivers/gpu/drm/i915/gt/selftest_migrate.c    |  2 --
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |  4 ---
 drivers/gpu/drm/i915/i915_active.c            |  2 +-
 drivers/gpu/drm/i915/i915_gem_evict.c         |  2 --
 drivers/gpu/drm/i915/i915_gpu_error.c         | 18 ++++--------
 drivers/gpu/drm/i915/intel_uncore.c           |  1 -
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |  2 --
 drivers/gpu/drm/i915/selftests/i915_request.c |  2 --
 .../gpu/drm/i915/selftests/i915_selftest.c    |  3 --
 drivers/gpu/drm/i915/selftests/i915_vma.c     |  9 ------
 .../gpu/drm/i915/selftests/igt_flush_test.c   |  2 --
 .../drm/i915/selftests/intel_memory_region.c  |  4 ---
 drivers/gpu/drm/tests/drm_buddy_test.c        |  5 ----
 drivers/gpu/drm/tests/drm_mm_test.c           | 29 -------------------
 28 files changed, 11 insertions(+), 110 deletions(-)

diff --git a/drivers/gpu/drm/bridge/samsung-dsim.c b/drivers/gpu/drm/bridge/samsung-dsim.c
index cf777bdb25d2..ae537b9bf8df 100644
--- a/drivers/gpu/drm/bridge/samsung-dsim.c
+++ b/drivers/gpu/drm/bridge/samsung-dsim.c
@@ -1013,7 +1013,7 @@ static int samsung_dsim_wait_for_hdr_fifo(struct samsung_dsim *dsi)
 		if (reg & DSIM_SFR_HEADER_EMPTY)
 			return 0;
 
-		if (!cond_resched())
+		if (!cond_resched_stall())
 			usleep_range(950, 1050);
 	} while (--timeout);
 
diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index e6f5ba5f4baf..fe401d18bf4d 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -311,7 +311,6 @@ void drm_buddy_free_list(struct drm_buddy *mm, struct list_head *objects)
 
 	list_for_each_entry_safe(block, on, objects, link) {
 		drm_buddy_free_block(mm, block);
-		cond_resched();
 	}
 	INIT_LIST_HEAD(objects);
 }
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 44a948b80ee1..881caa4b48a9 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -506,7 +506,6 @@ static void drm_gem_check_release_batch(struct folio_batch *fbatch)
 {
 	check_move_unevictable_folios(fbatch);
 	__folio_batch_release(fbatch);
-	cond_resched();
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 5a687a3686bd..0b16689423b4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1812,7 +1812,7 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
 		err = eb_copy_relocations(eb);
 		have_copy = err == 0;
 	} else {
-		cond_resched();
+		cond_resched_stall();
 		err = 0;
 	}
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index ef9346ed6d0f..172eee1e8889 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -414,7 +414,6 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 
 		/* But keep the pointer alive for RCU-protected lookups */
 		call_rcu(&obj->rcu, __i915_gem_free_object_rcu);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 73a4a4eb29e0..38ea2fc206e0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -26,7 +26,6 @@ static void check_release_folio_batch(struct folio_batch *fbatch)
 {
 	check_move_unevictable_folios(fbatch);
 	__folio_batch_release(fbatch);
-	cond_resched();
 }
 
 void shmem_sg_free_table(struct sg_table *st, struct address_space *mapping,
@@ -108,7 +107,6 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
 		gfp_t gfp = noreclaim;
 
 		do {
-			cond_resched();
 			folio = shmem_read_folio_gfp(mapping, i, gfp);
 			if (!IS_ERR(folio))
 				break;
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index 6b9f6cf50bf6..fae0fa993404 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -1447,8 +1447,6 @@ static int igt_ppgtt_smoke_huge(void *arg)
 
 		if (err)
 			break;
-
-		cond_resched();
 	}
 
 	return err;
@@ -1538,8 +1536,6 @@ static int igt_ppgtt_sanity_check(void *arg)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1738,8 +1734,6 @@ static int igt_ppgtt_mixed(void *arg)
 			break;
 
 		addr += obj->base.size;
-
-		cond_resched();
 	}
 
 	i915_gem_context_unlock_engines(ctx);
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
index 72957a36a36b..c994071532cf 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
@@ -221,7 +221,6 @@ static int check_partial_mappings(struct drm_i915_gem_object *obj,
 		u32 *cpu;
 
 		GEM_BUG_ON(view.partial.size > nreal);
-		cond_resched();
 
 		vma = i915_gem_object_ggtt_pin(obj, &view, 0, 0, PIN_MAPPABLE);
 		if (IS_ERR(vma)) {
@@ -1026,8 +1025,6 @@ static void igt_close_objects(struct drm_i915_private *i915,
 		i915_gem_object_put(obj);
 	}
 
-	cond_resched();
-
 	i915_gem_drain_freed_objects(i915);
 }
 
@@ -1041,8 +1038,6 @@ static void igt_make_evictable(struct list_head *objects)
 			i915_gem_object_unpin_pages(obj);
 		i915_gem_object_unlock(obj);
 	}
-
-	cond_resched();
 }
 
 static int igt_fill_mappable(struct intel_memory_region *mr,
diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index ecc990ec1b95..e016f1203f7c 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -315,7 +315,7 @@ void __intel_breadcrumbs_park(struct intel_breadcrumbs *b)
 		local_irq_disable();
 		signal_irq_work(&b->irq_work);
 		local_irq_enable();
-		cond_resched();
+		cond_resched_stall();
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 449f0b7fc843..40cfdf4f5fff 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -664,7 +664,7 @@ int intel_gt_wait_for_idle(struct intel_gt *gt, long timeout)
 
 	while ((timeout = intel_gt_retire_requests_timeout(gt, timeout,
 							   &remaining_timeout)) > 0) {
-		cond_resched();
+		cond_resched_stall();
 		if (signal_pending(current))
 			return -EINTR;
 	}
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 576e5ef0289b..cc3f62d5c28f 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -906,8 +906,6 @@ intel_context_migrate_copy(struct intel_context *ce,
 			err = -EINVAL;
 			break;
 		}
-
-		cond_resched();
 	} while (1);
 
 out_ce:
@@ -1067,8 +1065,6 @@ intel_context_migrate_clear(struct intel_context *ce,
 		i915_request_add(rq);
 		if (err || !it.sg || !sg_dma_len(it.sg))
 			break;
-
-		cond_resched();
 	} while (1);
 
 out_ce:
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index 4202df5b8c12..52c8fa3e5cad 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -60,8 +60,6 @@ static int wait_for_submit(struct intel_engine_cs *engine,
 
 		if (done)
 			return -ETIME;
-
-		cond_resched();
 	} while (1);
 }
 
@@ -72,7 +70,6 @@ static int wait_for_reset(struct intel_engine_cs *engine,
 	timeout += jiffies;
 
 	do {
-		cond_resched();
 		intel_engine_flush_submission(engine);
 
 		if (READ_ONCE(engine->execlists.pending[0]))
@@ -1373,7 +1370,6 @@ static int live_timeslice_queue(void *arg)
 
 		/* Wait until we ack the release_queue and start timeslicing */
 		do {
-			cond_resched();
 			intel_engine_flush_submission(engine);
 		} while (READ_ONCE(engine->execlists.pending[0]));
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 0dd4d00ee894..e751ed2cf8b2 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -939,8 +939,6 @@ static void active_engine(struct kthread_work *work)
 			pr_err("[%s] Request put failed: %d!\n", engine->name, err);
 			break;
 		}
-
-		cond_resched();
 	}
 
 	for (count = 0; count < ARRAY_SIZE(rq); count++) {
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 5f826b6dcf5d..83a42492f0d0 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -70,8 +70,6 @@ static int wait_for_submit(struct intel_engine_cs *engine,
 
 		if (done)
 			return -ETIME;
-
-		cond_resched();
 	} while (1);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_migrate.c b/drivers/gpu/drm/i915/gt/selftest_migrate.c
index 3def5ca72dec..9dfa70699df9 100644
--- a/drivers/gpu/drm/i915/gt/selftest_migrate.c
+++ b/drivers/gpu/drm/i915/gt/selftest_migrate.c
@@ -210,8 +210,6 @@ static int intel_context_copy_ccs(struct intel_context *ce,
 		i915_request_add(rq);
 		if (err || !it.sg || !sg_dma_len(it.sg))
 			break;
-
-		cond_resched();
 	} while (1);
 
 out_ce:
diff --git a/drivers/gpu/drm/i915/gt/selftest_timeline.c b/drivers/gpu/drm/i915/gt/selftest_timeline.c
index fa36cf920bde..15b8fd41ad90 100644
--- a/drivers/gpu/drm/i915/gt/selftest_timeline.c
+++ b/drivers/gpu/drm/i915/gt/selftest_timeline.c
@@ -352,7 +352,6 @@ static int bench_sync(void *arg)
 		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
 
 	mock_timeline_fini(&tl);
-	cond_resched();
 
 	mock_timeline_init(&tl, 0);
 
@@ -382,7 +381,6 @@ static int bench_sync(void *arg)
 		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
 
 	mock_timeline_fini(&tl);
-	cond_resched();
 
 	mock_timeline_init(&tl, 0);
 
@@ -405,7 +403,6 @@ static int bench_sync(void *arg)
 	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
 		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
 	mock_timeline_fini(&tl);
-	cond_resched();
 
 	/* Benchmark searching for a known context id and changing the seqno */
 	for (last_order = 1, order = 1; order < 32;
@@ -434,7 +431,6 @@ static int bench_sync(void *arg)
 			__func__, count, order,
 			(long long)div64_ul(ktime_to_ns(kt), count));
 		mock_timeline_fini(&tl);
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 5ec293011d99..810251c33495 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -865,7 +865,7 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
 
 	/* Wait until the previous preallocation is completed */
 	while (!llist_empty(&ref->preallocated_barriers))
-		cond_resched();
+		cond_resched_stall();
 
 	/*
 	 * Preallocate a node for each physical engine supporting the target
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index c02ebd6900ae..1a600f42a3ad 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -267,8 +267,6 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	if (ret)
 		return ret;
 
-	cond_resched();
-
 	flags |= PIN_NONBLOCK;
 	goto search_again;
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 4008bb09fdb5..410072145d4d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -320,8 +320,6 @@ static int compress_page(struct i915_vma_compress *c,
 
 		if (zlib_deflate(zstream, Z_NO_FLUSH) != Z_OK)
 			return -EIO;
-
-		cond_resched();
 	} while (zstream->avail_in);
 
 	/* Fallback to uncompressed if we increase size? */
@@ -408,7 +406,6 @@ static int compress_page(struct i915_vma_compress *c,
 	if (!(wc && i915_memcpy_from_wc(ptr, src, PAGE_SIZE)))
 		memcpy(ptr, src, PAGE_SIZE);
 	list_add_tail(&virt_to_page(ptr)->lru, &dst->page_list);
-	cond_resched();
 
 	return 0;
 }
@@ -2325,13 +2322,6 @@ void intel_klog_error_capture(struct intel_gt *gt,
 						 l_count, line++, ptr2);
 					ptr[pos] = chr;
 					ptr2 = ptr + pos;
-
-					/*
-					 * If spewing large amounts of data via a serial console,
-					 * this can be a very slow process. So be friendly and try
-					 * not to cause 'softlockup on CPU' problems.
-					 */
-					cond_resched();
 				}
 
 				if (ptr2 < (ptr + count))
@@ -2352,8 +2342,12 @@ void intel_klog_error_capture(struct intel_gt *gt,
 				got--;
 			}
 
-			/* As above. */
-			cond_resched();
+			/*
+			 * If spewing large amounts of data via a serial console,
+			 * this can be a very slow process. So be friendly and try
+			 * not to cause 'softlockup on CPU' problems.
+			 */
+			cond_resched_stall();
 		}
 
 		if (got)
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index dfefad5a5fec..d2e74cfb1aac 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -487,7 +487,6 @@ intel_uncore_forcewake_reset(struct intel_uncore *uncore)
 		}
 
 		spin_unlock_irqrestore(&uncore->lock, irqflags);
-		cond_resched();
 	}
 
 	drm_WARN_ON(&uncore->i915->drm, active_domains);
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
index 5c397a2df70e..4b497e969a33 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
@@ -201,7 +201,6 @@ static int igt_ppgtt_alloc(void *arg)
 		}
 
 		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash, 0, size);
-		cond_resched();
 
 		ppgtt->vm.clear_range(&ppgtt->vm, 0, size);
 
@@ -224,7 +223,6 @@ static int igt_ppgtt_alloc(void *arg)
 
 		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash,
 					    last, size - last);
-		cond_resched();
 
 		i915_vm_free_pt_stash(&ppgtt->vm, &stash);
 	}
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index a9b79888c193..43bb54fc8c78 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -438,8 +438,6 @@ static void __igt_breadcrumbs_smoketest(struct kthread_work *work)
 
 		num_fences += count;
 		num_waits++;
-
-		cond_resched();
 	}
 
 	atomic_long_add(num_fences, &t->num_fences);
diff --git a/drivers/gpu/drm/i915/selftests/i915_selftest.c b/drivers/gpu/drm/i915/selftests/i915_selftest.c
index ee79e0809a6d..17e6bbc3c87e 100644
--- a/drivers/gpu/drm/i915/selftests/i915_selftest.c
+++ b/drivers/gpu/drm/i915/selftests/i915_selftest.c
@@ -179,7 +179,6 @@ static int __run_selftests(const char *name,
 		if (!st->enabled)
 			continue;
 
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 
@@ -381,7 +380,6 @@ int __i915_subtests(const char *caller,
 	int err;
 
 	for (; count--; st++) {
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 
@@ -414,7 +412,6 @@ bool __igt_timeout(unsigned long timeout, const char *fmt, ...)
 	va_list va;
 
 	if (!signal_pending(current)) {
-		cond_resched();
 		if (time_before(jiffies, timeout))
 			return false;
 	}
diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
index 71b52d5efef4..1bacdcd77c5b 100644
--- a/drivers/gpu/drm/i915/selftests/i915_vma.c
+++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
@@ -197,8 +197,6 @@ static int igt_vma_create(void *arg)
 			list_del_init(&ctx->link);
 			mock_context_close(ctx);
 		}
-
-		cond_resched();
 	}
 
 end:
@@ -347,8 +345,6 @@ static int igt_vma_pin1(void *arg)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 	err = 0;
@@ -697,7 +693,6 @@ static int igt_vma_rotate_remap(void *arg)
 						pr_err("Unbinding returned %i\n", err);
 						goto out_object;
 					}
-					cond_resched();
 				}
 			}
 		}
@@ -858,8 +853,6 @@ static int igt_vma_partial(void *arg)
 					pr_err("Unbinding returned %i\n", err);
 					goto out_object;
 				}
-
-				cond_resched();
 			}
 		}
 
@@ -1085,8 +1078,6 @@ static int igt_vma_remapped_gtt(void *arg)
 				}
 			}
 			i915_vma_unpin_iomap(vma);
-
-			cond_resched();
 		}
 	}
 
diff --git a/drivers/gpu/drm/i915/selftests/igt_flush_test.c b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
index 29110abb4fe0..fbc1b606df29 100644
--- a/drivers/gpu/drm/i915/selftests/igt_flush_test.c
+++ b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
@@ -22,8 +22,6 @@ int igt_flush_test(struct drm_i915_private *i915)
 		if (intel_gt_is_wedged(gt))
 			ret = -EIO;
 
-		cond_resched();
-
 		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
 			pr_err("%pS timed out, cancelling all further testing.\n",
 			       __builtin_return_address(0));
diff --git a/drivers/gpu/drm/i915/selftests/intel_memory_region.c b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
index d985d9bae2e8..3fce433284bd 100644
--- a/drivers/gpu/drm/i915/selftests/intel_memory_region.c
+++ b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
@@ -46,8 +46,6 @@ static void close_objects(struct intel_memory_region *mem,
 		i915_gem_object_put(obj);
 	}
 
-	cond_resched();
-
 	i915_gem_drain_freed_objects(i915);
 }
 
@@ -1290,8 +1288,6 @@ static int _perf_memcpy(struct intel_memory_region *src_mr,
 			div64_u64(mul_u32_u32(4 * size,
 					      1000 * 1000 * 1000),
 				  t[1] + 2 * t[2] + t[3]) >> 20);
-
-		cond_resched();
 	}
 
 	i915_gem_object_unpin_map(dst);
diff --git a/drivers/gpu/drm/tests/drm_buddy_test.c b/drivers/gpu/drm/tests/drm_buddy_test.c
index 09ee6f6af896..7ee65bad4bb7 100644
--- a/drivers/gpu/drm/tests/drm_buddy_test.c
+++ b/drivers/gpu/drm/tests/drm_buddy_test.c
@@ -29,7 +29,6 @@ static bool __timeout(unsigned long timeout, const char *fmt, ...)
 	va_list va;
 
 	if (!signal_pending(current)) {
-		cond_resched();
 		if (time_before(jiffies, timeout))
 			return false;
 	}
@@ -485,8 +484,6 @@ static void drm_test_buddy_alloc_smoke(struct kunit *test)
 
 		if (err || timeout)
 			break;
-
-		cond_resched();
 	}
 
 	kfree(order);
@@ -681,8 +678,6 @@ static void drm_test_buddy_alloc_range(struct kunit *test)
 		rem -= size;
 		if (!rem)
 			break;
-
-		cond_resched();
 	}
 
 	drm_buddy_free_list(&mm, &blocks);
diff --git a/drivers/gpu/drm/tests/drm_mm_test.c b/drivers/gpu/drm/tests/drm_mm_test.c
index 05d5e7af6d25..7d11740ef599 100644
--- a/drivers/gpu/drm/tests/drm_mm_test.c
+++ b/drivers/gpu/drm/tests/drm_mm_test.c
@@ -474,8 +474,6 @@ static void drm_test_mm_reserve(struct kunit *test)
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_reserve(test, count, size - 1));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_reserve(test, count, size));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_reserve(test, count, size + 1));
-
-		cond_resched();
 	}
 }
 
@@ -645,8 +643,6 @@ static int __drm_test_mm_insert(struct kunit *test, unsigned int count, u64 size
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-
-		cond_resched();
 	}
 
 	ret = 0;
@@ -671,8 +667,6 @@ static void drm_test_mm_insert(struct kunit *test)
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size - 1, false));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size, false));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size + 1, false));
-
-		cond_resched();
 	}
 }
 
@@ -693,8 +687,6 @@ static void drm_test_mm_replace(struct kunit *test)
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size - 1, true));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size, true));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size + 1, true));
-
-		cond_resched();
 	}
 }
 
@@ -882,8 +874,6 @@ static int __drm_test_mm_insert_range(struct kunit *test, unsigned int count, u6
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-
-		cond_resched();
 	}
 
 	ret = 0;
@@ -942,8 +932,6 @@ static void drm_test_mm_insert_range(struct kunit *test)
 								    max / 2, max));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert_range(test, count, size,
 								    max / 4 + 1, 3 * max / 4 - 1));
-
-		cond_resched();
 	}
 }
 
@@ -1086,8 +1074,6 @@ static void drm_test_mm_align(struct kunit *test)
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-
-		cond_resched();
 	}
 
 out:
@@ -1122,8 +1108,6 @@ static void drm_test_mm_align_pot(struct kunit *test, int max)
 			KUNIT_FAIL(test, "insert failed with alignment=%llx [%d]", align, bit);
 			goto out;
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1465,8 +1449,6 @@ static void drm_test_mm_evict(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1547,8 +1529,6 @@ static void drm_test_mm_evict_range(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1658,7 +1638,6 @@ static void drm_test_mm_topdown(struct kunit *test)
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-		cond_resched();
 	}
 
 out:
@@ -1750,7 +1729,6 @@ static void drm_test_mm_bottomup(struct kunit *test)
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-		cond_resched();
 	}
 
 out:
@@ -1968,8 +1946,6 @@ static void drm_test_mm_color(struct kunit *test)
 			drm_mm_remove_node(node);
 			kfree(node);
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -2038,7 +2014,6 @@ static int evict_color(struct kunit *test, struct drm_mm *mm, u64 range_start,
 		}
 	}
 
-	cond_resched();
 	return 0;
 }
 
@@ -2110,8 +2085,6 @@ static void drm_test_mm_color_evict(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -2196,8 +2169,6 @@ static void drm_test_mm_color_evict_range(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 84/86] treewide: net: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (25 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 83/86] treewide: drm: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:08   ` [RFC PATCH 85/86] treewide: drivers: " Ankur Arora
                     ` (3 subsequent siblings)
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Michael S. Tsirkin, Jason Wang, Jason A. Donenfeld, Kalle Valo,
	Larry Finger, Ryder Lee, Loic Poulain, Sergey Ryazanov

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a lock,
causing an explicit preemption check.)

There are some uses from set-3 where we busy wait: ex. mlx4/mlx5
drivers, mtk_mdio_busy_wait() and similar. These are replaced with
cond_resched_stall(). Some of those places, however, have wait-times
in milliseconds, so maybe they should just use a timed wait?
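
For instance, for the mlx4_comm_cmd_poll() wait below, a timed wait
might look something like this (illustrative only; assumes the caller
may sleep, which it could before since it called cond_resched()):

	unsigned long end = jiffies + msecs_to_jiffies(timeout);

	while (comm_pending(dev) && time_before(jiffies, end))
		usleep_range(100, 200);	/* sleep between polls instead of spinning */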

Note: there are also a few other cases where I've replaced
cond_resched() with cond_resched_stall() (ex. mhi_net_rx_refill_work()
or broadcom/b43::lo_measure_feedthrough()) where it doesn't seem
like the right thing.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: Felix Fietkau <nbd@nbd.name> 
Cc: John Crispin <john@phrozen.org> 
Cc: Sean Wang <sean.wang@mediatek.com> 
Cc: Mark Lee <Mark-MC.Lee@mediatek.com> 
Cc: Lorenzo Bianconi <lorenzo@kernel.org> 
Cc: Matthias Brugger <matthias.bgg@gmail.com> 
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> 
Cc: "Michael S. Tsirkin" <mst@redhat.com> 
Cc: Jason Wang <jasowang@redhat.com> 
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> 
Cc: Kalle Valo <kvalo@kernel.org> 
Cc: Larry Finger <Larry.Finger@lwfinger.net> 
Cc: Ryder Lee <ryder.lee@mediatek.com> 
Cc: Loic Poulain <loic.poulain@linaro.org> 
Cc: Sergey Ryazanov <ryazanov.s.a@gmail.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/net/dummy.c                                 |  1 -
 drivers/net/ethernet/broadcom/tg3.c                 |  2 +-
 drivers/net/ethernet/intel/e1000/e1000_hw.c         |  3 ---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c         |  2 +-
 drivers/net/ethernet/mellanox/mlx4/catas.c          |  2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c            | 13 ++++++-------
 .../net/ethernet/mellanox/mlx4/resource_tracker.c   |  9 ++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c       |  4 +---
 drivers/net/ethernet/mellanox/mlx5/core/fw.c        |  3 +--
 drivers/net/ethernet/mellanox/mlxsw/i2c.c           |  5 -----
 drivers/net/ethernet/mellanox/mlxsw/pci.c           |  2 --
 drivers/net/ethernet/pasemi/pasemi_mac.c            |  3 ---
 .../net/ethernet/qlogic/netxen/netxen_nic_init.c    |  2 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c   |  1 -
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c    |  1 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_minidump.c    |  2 --
 drivers/net/ethernet/sfc/falcon/falcon.c            |  6 ------
 drivers/net/ifb.c                                   |  1 -
 drivers/net/ipvlan/ipvlan_core.c                    |  1 -
 drivers/net/macvlan.c                               |  2 --
 drivers/net/mhi_net.c                               |  4 ++--
 drivers/net/netdevsim/fib.c                         |  1 -
 drivers/net/virtio_net.c                            |  2 --
 drivers/net/wireguard/ratelimiter.c                 |  2 --
 drivers/net/wireguard/receive.c                     |  3 ---
 drivers/net/wireguard/send.c                        |  4 ----
 drivers/net/wireless/broadcom/b43/lo.c              |  6 +++---
 drivers/net/wireless/broadcom/b43/pio.c             |  1 -
 drivers/net/wireless/broadcom/b43legacy/phy.c       |  5 -----
 .../wireless/broadcom/brcm80211/brcmfmac/cfg80211.c |  1 -
 drivers/net/wireless/cisco/airo.c                   |  2 --
 drivers/net/wireless/intel/iwlwifi/pcie/trans.c     |  2 --
 drivers/net/wireless/marvell/mwl8k.c                |  2 --
 drivers/net/wireless/mediatek/mt76/util.c           |  1 -
 drivers/net/wwan/mhi_wwan_mbim.c                    |  2 +-
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c          |  3 ---
 drivers/net/xen-netback/netback.c                   |  1 -
 drivers/net/xen-netback/rx.c                        |  2 --
 38 files changed, 25 insertions(+), 84 deletions(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index c4b1b0aa438a..dfebf6387d8a 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -182,7 +182,6 @@ static int __init dummy_init_module(void)
 
 	for (i = 0; i < numdummies && !err; i++) {
 		err = dummy_init_one();
-		cond_resched();
 	}
 	if (err < 0)
 		__rtnl_link_unregister(&dummy_link_ops);
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..ad511d721db3 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -12040,7 +12040,7 @@ static int tg3_get_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 				ret = -EINTR;
 				goto eeprom_done;
 			}
-			cond_resched();
+			cond_resched_stall();
 		}
 	}
 	eeprom->len += i;
diff --git a/drivers/net/ethernet/intel/e1000/e1000_hw.c b/drivers/net/ethernet/intel/e1000/e1000_hw.c
index 4542e2bc28e8..22a419bdc6b7 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_hw.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_hw.c
@@ -3937,7 +3937,6 @@ static s32 e1000_do_read_eeprom(struct e1000_hw *hw, u16 offset, u16 words,
 			 */
 			data[i] = e1000_shift_in_ee_bits(hw, 16);
 			e1000_standby_eeprom(hw);
-			cond_resched();
 		}
 	}
 
@@ -4088,7 +4087,6 @@ static s32 e1000_write_eeprom_spi(struct e1000_hw *hw, u16 offset, u16 words,
 			return -E1000_ERR_EEPROM;
 
 		e1000_standby_eeprom(hw);
-		cond_resched();
 
 		/*  Send the WRITE ENABLE command (8 bit opcode )  */
 		e1000_shift_out_ee_bits(hw, EEPROM_WREN_OPCODE_SPI,
@@ -4198,7 +4196,6 @@ static s32 e1000_write_eeprom_microwire(struct e1000_hw *hw, u16 offset,
 
 		/* Recover from write */
 		e1000_standby_eeprom(hw);
-		cond_resched();
 
 		words_written++;
 	}
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 20afe79f380a..26a9f293ed32 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -309,7 +309,7 @@ static int mtk_mdio_busy_wait(struct mtk_eth *eth)
 			return 0;
 		if (time_after(jiffies, t_start + PHY_IAC_TIMEOUT))
 			break;
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	dev_err(eth->dev, "mdio: MDIO timeout\n");
diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
index 0d8a362c2673..f013eb3fa6f8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/catas.c
+++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
@@ -148,7 +148,7 @@ static int mlx4_reset_slave(struct mlx4_dev *dev)
 			mlx4_warn(dev, "VF Reset succeed\n");
 			return 0;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 	mlx4_err(dev, "Fail to send reset over the communication channel\n");
 	return -ETIMEDOUT;
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index f5b1f8c7834f..259918642b50 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -312,7 +312,8 @@ static int mlx4_comm_cmd_poll(struct mlx4_dev *dev, u8 cmd, u16 param,
 
 	end = msecs_to_jiffies(timeout) + jiffies;
 	while (comm_pending(dev) && time_before(jiffies, end))
-		cond_resched();
+		cond_resched_stall();
+
 	ret_from_pending = comm_pending(dev);
 	if (ret_from_pending) {
 		/* check if the slave is trying to boot in the middle of
@@ -387,7 +388,7 @@ static int mlx4_comm_cmd_wait(struct mlx4_dev *dev, u8 vhcr_cmd,
 	if (!(dev->persist->state & MLX4_DEVICE_STATE_INTERNAL_ERROR)) {
 		end = msecs_to_jiffies(timeout) + jiffies;
 		while (comm_pending(dev) && time_before(jiffies, end))
-			cond_resched();
+			cond_resched_stall();
 	}
 	goto out;
 
@@ -470,7 +471,7 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
 			mlx4_err(dev, "%s:cmd_pending failed\n", __func__);
 			goto out;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/*
@@ -621,8 +622,7 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 			err = mlx4_internal_err_ret_value(dev, op, op_modifier);
 			goto out;
 		}
-
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	if (cmd_pending(dev)) {
@@ -2324,8 +2324,7 @@ static int sync_toggles(struct mlx4_dev *dev)
 			priv->cmd.comm_toggle = rd_toggle >> 31;
 			return 0;
 		}
-
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/*
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 771b92019af1..c8127acea986 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -4649,7 +4649,14 @@ static int move_all_busy(struct mlx4_dev *dev, int slave,
 		if (time_after(jiffies, begin + 5 * HZ))
 			break;
 		if (busy)
-			cond_resched();
+			/*
+			 * Giving up the spinlock in _move_all_busy() will
+			 * reschedule if needed.
+			 * Add a cpu_relax() here to ensure that we give
+			 * others a chance to acquire the lock.
+			 */
+			cpu_relax();
+
 	} while (busy);
 
 	if (busy)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index c22b0ad0c870..3c5bfa8eda00 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -285,7 +285,7 @@ static void poll_timeout(struct mlx5_cmd_work_ent *ent)
 			ent->ret = 0;
 			return;
 		}
-		cond_resched();
+		cond_resched_stall();
 	} while (time_before(jiffies, poll_end));
 
 	ent->ret = -ETIMEDOUT;
@@ -1773,13 +1773,11 @@ void mlx5_cmd_flush(struct mlx5_core_dev *dev)
 	for (i = 0; i < cmd->vars.max_reg_cmds; i++) {
 		while (down_trylock(&cmd->vars.sem)) {
 			mlx5_cmd_trigger_completions(dev);
-			cond_resched();
 		}
 	}
 
 	while (down_trylock(&cmd->vars.pages_sem)) {
 		mlx5_cmd_trigger_completions(dev);
-		cond_resched();
 	}
 
 	/* Unlock cmdif */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index 58f4c0d0fafa..a08ca20ceeda 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -373,8 +373,7 @@ int mlx5_cmd_fast_teardown_hca(struct mlx5_core_dev *dev)
 	do {
 		if (mlx5_get_nic_state(dev) == MLX5_NIC_IFC_DISABLED)
 			break;
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after(jiffies, end));
 
 	if (mlx5_get_nic_state(dev) != MLX5_NIC_IFC_DISABLED) {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/i2c.c b/drivers/net/ethernet/mellanox/mlxsw/i2c.c
index d23f293e285c..1a11f8cd6bb9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/i2c.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/i2c.c
@@ -180,7 +180,6 @@ static int mlxsw_i2c_wait_go_bit(struct i2c_client *client,
 				break;
 			}
 		}
-		cond_resched();
 	} while ((time_before(jiffies, end)) || (i++ < MLXSW_I2C_RETRY));
 
 	if (wait_done) {
@@ -361,8 +360,6 @@ mlxsw_i2c_write(struct device *dev, size_t in_mbox_size, u8 *in_mbox, int num,
 			err = i2c_transfer(client->adapter, &write_tran, 1);
 			if (err == 1)
 				break;
-
-			cond_resched();
 		} while ((time_before(jiffies, end)) ||
 			 (j++ < MLXSW_I2C_RETRY));
 
@@ -473,8 +470,6 @@ mlxsw_i2c_cmd(struct device *dev, u16 opcode, u32 in_mod, size_t in_mbox_size,
 					   ARRAY_SIZE(read_tran));
 			if (err == ARRAY_SIZE(read_tran))
 				break;
-
-			cond_resched();
 		} while ((time_before(jiffies, end)) ||
 			 (j++ < MLXSW_I2C_RETRY));
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 51eea1f0529c..8124b27d0eaa 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -1455,7 +1455,6 @@ static int mlxsw_pci_sys_ready_wait(struct mlxsw_pci *mlxsw_pci,
 		val = mlxsw_pci_read32(mlxsw_pci, FW_READY);
 		if ((val & MLXSW_PCI_FW_READY_MASK) == MLXSW_PCI_FW_READY_MAGIC)
 			return 0;
-		cond_resched();
 	} while (time_before(jiffies, end));
 
 	*p_sys_status = val & MLXSW_PCI_FW_READY_MASK;
@@ -1824,7 +1823,6 @@ static int mlxsw_pci_cmd_exec(void *bus_priv, u16 opcode, u8 opcode_mod,
 				*p_status = ctrl >> MLXSW_PCI_CIR_CTRL_STATUS_SHIFT;
 				break;
 			}
-			cond_resched();
 		} while (time_before(jiffies, end));
 	} else {
 		wait_event_timeout(mlxsw_pci->cmd.wait, *p_wait_done, timeout);
diff --git a/drivers/net/ethernet/pasemi/pasemi_mac.c b/drivers/net/ethernet/pasemi/pasemi_mac.c
index ed7dd0a04235..3ec6ac758878 100644
--- a/drivers/net/ethernet/pasemi/pasemi_mac.c
+++ b/drivers/net/ethernet/pasemi/pasemi_mac.c
@@ -1225,7 +1225,6 @@ static void pasemi_mac_pause_txchan(struct pasemi_mac *mac)
 		sta = read_dma_reg(PAS_DMA_TXCHAN_TCMDSTA(txch));
 		if (!(sta & PAS_DMA_TXCHAN_TCMDSTA_ACT))
 			break;
-		cond_resched();
 	}
 
 	if (sta & PAS_DMA_TXCHAN_TCMDSTA_ACT)
@@ -1246,7 +1245,6 @@ static void pasemi_mac_pause_rxchan(struct pasemi_mac *mac)
 		sta = read_dma_reg(PAS_DMA_RXCHAN_CCMDSTA(rxch));
 		if (!(sta & PAS_DMA_RXCHAN_CCMDSTA_ACT))
 			break;
-		cond_resched();
 	}
 
 	if (sta & PAS_DMA_RXCHAN_CCMDSTA_ACT)
@@ -1265,7 +1263,6 @@ static void pasemi_mac_pause_rxint(struct pasemi_mac *mac)
 		sta = read_dma_reg(PAS_DMA_RXINT_RCMDSTA(mac->dma_if));
 		if (!(sta & PAS_DMA_RXINT_RCMDSTA_ACT))
 			break;
-		cond_resched();
 	}
 
 	if (sta & PAS_DMA_RXINT_RCMDSTA_ACT)
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
index 35ec9aab3dc7..c26c43a7a83c 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
@@ -326,8 +326,6 @@ static int netxen_wait_rom_done(struct netxen_adapter *adapter)
 	long timeout = 0;
 	long done = 0;
 
-	cond_resched();
-
 	while (done == 0) {
 		done = NXRD32(adapter, NETXEN_ROMUSB_GLB_STATUS);
 		done &= 2;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c
index c95d56e56c59..359db1fa500f 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c
@@ -2023,7 +2023,6 @@ static void qlcnic_83xx_exec_template_cmd(struct qlcnic_adapter *p_dev,
 			break;
 		}
 		entry += p_hdr->size;
-		cond_resched();
 	}
 	p_dev->ahw->reset.seq_index = index;
 }
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
index 09f20c794754..110b1ea921e5 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
@@ -295,7 +295,6 @@ static int qlcnic_wait_rom_done(struct qlcnic_adapter *adapter)
 	long done = 0;
 	int err = 0;
 
-	cond_resched();
 	while (done == 0) {
 		done = QLCRD32(adapter, QLCNIC_ROMUSB_GLB_STATUS, &err);
 		done &= 2;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
index 7ecb3dfe30bd..38b4f56fc464 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
@@ -702,7 +702,6 @@ static u32 qlcnic_read_memory_test_agent(struct qlcnic_adapter *adapter,
 		addr += 16;
 		reg_read -= 16;
 		ret += 16;
-		cond_resched();
 	}
 out:
 	mutex_unlock(&adapter->ahw->mem_lock);
@@ -1383,7 +1382,6 @@ int qlcnic_dump_fw(struct qlcnic_adapter *adapter)
 		buf_offset += entry->hdr.cap_size;
 		entry_offset += entry->hdr.offset;
 		buffer = fw_dump->data + buf_offset;
-		cond_resched();
 	}
 
 	fw_dump->clr = 1;
diff --git a/drivers/net/ethernet/sfc/falcon/falcon.c b/drivers/net/ethernet/sfc/falcon/falcon.c
index 7a1c9337081b..44cc6e1bef57 100644
--- a/drivers/net/ethernet/sfc/falcon/falcon.c
+++ b/drivers/net/ethernet/sfc/falcon/falcon.c
@@ -630,8 +630,6 @@ falcon_spi_read(struct ef4_nic *efx, const struct falcon_spi_device *spi,
 			break;
 		pos += block_len;
 
-		/* Avoid locking up the system */
-		cond_resched();
 		if (signal_pending(current)) {
 			rc = -EINTR;
 			break;
@@ -723,8 +721,6 @@ falcon_spi_write(struct ef4_nic *efx, const struct falcon_spi_device *spi,
 
 		pos += block_len;
 
-		/* Avoid locking up the system */
-		cond_resched();
 		if (signal_pending(current)) {
 			rc = -EINTR;
 			break;
@@ -839,8 +835,6 @@ falcon_spi_erase(struct falcon_mtd_partition *part, loff_t start, size_t len)
 		if (memcmp(empty, buffer, block_len))
 			return -EIO;
 
-		/* Avoid locking up the system */
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 	}
diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 78253ad57b2e..ffd23d862967 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -434,7 +434,6 @@ static int __init ifb_init_module(void)
 
 	for (i = 0; i < numifbs && !err; i++) {
 		err = ifb_init_one(i);
-		cond_resched();
 	}
 	if (err)
 		__rtnl_link_unregister(&ifb_link_ops);
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index c0c49f181367..91a4d1bda8a0 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -292,7 +292,6 @@ void ipvlan_process_multicast(struct work_struct *work)
 				kfree_skb(skb);
 		}
 		dev_put(dev);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 02bd201bc7e5..120af3235f4d 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -341,8 +341,6 @@ static void macvlan_process_broadcast(struct work_struct *w)
 		if (src)
 			dev_put(src->dev);
 		consume_skb(skb);
-
-		cond_resched();
 	}
 }
 
diff --git a/drivers/net/mhi_net.c b/drivers/net/mhi_net.c
index ae169929a9d8..cbb59a94b083 100644
--- a/drivers/net/mhi_net.c
+++ b/drivers/net/mhi_net.c
@@ -291,9 +291,9 @@ static void mhi_net_rx_refill_work(struct work_struct *work)
 		}
 
 		/* Do not hog the CPU if rx buffers are consumed faster than
 		 * queued (unlikely).
 		 */
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/* If we're still starved of rx buffers, reschedule later */
diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index a1f91ff8ec56..7b7a37b247d1 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -1492,7 +1492,6 @@ static void nsim_fib_event_work(struct work_struct *work)
 		nsim_fib_event(fib_event);
 		list_del(&fib_event->list);
 		kfree(fib_event);
-		cond_resched();
 	}
 	mutex_unlock(&data->fib_lock);
 }
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d67f742fbd4c..d0d7cd077a85 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4015,7 +4015,6 @@ static void free_unused_bufs(struct virtnet_info *vi)
 		struct virtqueue *vq = vi->sq[i].vq;
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
 			virtnet_sq_free_unused_buf(vq, buf);
-		cond_resched();
 	}
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
@@ -4023,7 +4022,6 @@ static void free_unused_bufs(struct virtnet_info *vi)
 
 		while ((buf = virtnet_rq_detach_unused_buf(rq)) != NULL)
 			virtnet_rq_free_unused_buf(rq->vq, buf);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/net/wireguard/ratelimiter.c b/drivers/net/wireguard/ratelimiter.c
index dd55e5c26f46..c9c411ec377a 100644
--- a/drivers/net/wireguard/ratelimiter.c
+++ b/drivers/net/wireguard/ratelimiter.c
@@ -74,8 +74,6 @@ static void wg_ratelimiter_gc_entries(struct work_struct *work)
 		}
 #endif
 		spin_unlock(&table_lock);
-		if (likely(work))
-			cond_resched();
 	}
 	if (likely(work))
 		queue_delayed_work(system_power_efficient_wq, &gc_work, HZ);
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 0b3f0c843550..8468b041e786 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -213,7 +213,6 @@ void wg_packet_handshake_receive_worker(struct work_struct *work)
 		wg_receive_handshake_packet(wg, skb);
 		dev_kfree_skb(skb);
 		atomic_dec(&wg->handshake_queue_len);
-		cond_resched();
 	}
 }
 
@@ -501,8 +500,6 @@ void wg_packet_decrypt_worker(struct work_struct *work)
 			likely(decrypt_packet(skb, PACKET_CB(skb)->keypair)) ?
 				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
 		wg_queue_enqueue_per_peer_rx(skb, state);
-		if (need_resched())
-			cond_resched();
 	}
 }
 
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index 95c853b59e1d..aa122729d802 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -279,8 +279,6 @@ void wg_packet_tx_worker(struct work_struct *work)
 
 		wg_noise_keypair_put(keypair, false);
 		wg_peer_put(peer);
-		if (need_resched())
-			cond_resched();
 	}
 }
 
@@ -303,8 +301,6 @@ void wg_packet_encrypt_worker(struct work_struct *work)
 			}
 		}
 		wg_queue_enqueue_per_peer_tx(first, state);
-		if (need_resched())
-			cond_resched();
 	}
 }
 
diff --git a/drivers/net/wireless/broadcom/b43/lo.c b/drivers/net/wireless/broadcom/b43/lo.c
index 338b6545a1e7..0fc018a706f3 100644
--- a/drivers/net/wireless/broadcom/b43/lo.c
+++ b/drivers/net/wireless/broadcom/b43/lo.c
@@ -112,10 +112,10 @@ static u16 lo_measure_feedthrough(struct b43_wldev *dev,
 	udelay(21);
 	feedthrough = b43_phy_read(dev, B43_PHY_LO_LEAKAGE);
 
 	/* This is a good place to check if we need to relax a bit,
 	 * as this is the main function called regularly
 	 * in the LO calibration. */
-	cond_resched();
+	cond_resched_stall();
 
 	return feedthrough;
 }
diff --git a/drivers/net/wireless/broadcom/b43/pio.c b/drivers/net/wireless/broadcom/b43/pio.c
index 8c28a9250cd1..44f5920ab6ff 100644
--- a/drivers/net/wireless/broadcom/b43/pio.c
+++ b/drivers/net/wireless/broadcom/b43/pio.c
@@ -768,7 +768,6 @@ void b43_pio_rx(struct b43_pio_rxqueue *q)
 		stop = !pio_rx_frame(q);
 		if (stop)
 			break;
-		cond_resched();
 		if (WARN_ON_ONCE(++count > 10000))
 			break;
 	}
diff --git a/drivers/net/wireless/broadcom/b43legacy/phy.c b/drivers/net/wireless/broadcom/b43legacy/phy.c
index c1395e622759..d6d2cf2a38fe 100644
--- a/drivers/net/wireless/broadcom/b43legacy/phy.c
+++ b/drivers/net/wireless/broadcom/b43legacy/phy.c
@@ -1113,7 +1113,6 @@ static u16 b43legacy_phy_lo_b_r15_loop(struct b43legacy_wldev *dev)
 		ret += b43legacy_phy_read(dev, 0x002C);
 	}
 	local_irq_restore(flags);
-	cond_resched();
 
 	return ret;
 }
@@ -1242,7 +1241,6 @@ u16 b43legacy_phy_lo_g_deviation_subval(struct b43legacy_wldev *dev,
 	}
 	ret = b43legacy_phy_read(dev, 0x002D);
 	local_irq_restore(flags);
-	cond_resched();
 
 	return ret;
 }
@@ -1580,7 +1578,6 @@ void b43legacy_phy_lo_g_measure(struct b43legacy_wldev *dev)
 			b43legacy_radio_write16(dev, 0x43, i);
 			b43legacy_radio_write16(dev, 0x52, phy->txctl2);
 			udelay(10);
-			cond_resched();
 
 			b43legacy_phy_set_baseband_attenuation(dev, j * 2);
 
@@ -1631,7 +1628,6 @@ void b43legacy_phy_lo_g_measure(struct b43legacy_wldev *dev)
 					      phy->txctl2
 					      | (3/*txctl1*/ << 4));
 			udelay(10);
-			cond_resched();
 
 			b43legacy_phy_set_baseband_attenuation(dev, j * 2);
 
@@ -1654,7 +1650,6 @@ void b43legacy_phy_lo_g_measure(struct b43legacy_wldev *dev)
 		b43legacy_phy_write(dev, 0x0812, (r27 << 8) | 0xA2);
 		udelay(2);
 		b43legacy_phy_write(dev, 0x0812, (r27 << 8) | 0xA3);
-		cond_resched();
 	} else
 		b43legacy_phy_write(dev, 0x0015, r27 | 0xEFA0);
 	b43legacy_phy_lo_adjust(dev, is_initializing);
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index 2a90bb24ba77..3cc5476c529d 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -3979,7 +3979,6 @@ static int brcmf_cfg80211_sched_scan_stop(struct wiphy *wiphy,
 static __always_inline void brcmf_delay(u32 ms)
 {
 	if (ms < 1000 / HZ) {
-		cond_resched();
 		mdelay(ms);
 	} else {
 		msleep(ms);
diff --git a/drivers/net/wireless/cisco/airo.c b/drivers/net/wireless/cisco/airo.c
index dbd13f7aa3e6..f15a55138dd9 100644
--- a/drivers/net/wireless/cisco/airo.c
+++ b/drivers/net/wireless/cisco/airo.c
@@ -3988,8 +3988,6 @@ static u16 issuecommand(struct airo_info *ai, Cmd *pCmd, Resp *pRsp,
 		if ((IN4500(ai, COMMAND)) == pCmd->cmd)
 			// PC4500 didn't notice command, try again
 			OUT4500(ai, COMMAND, pCmd->cmd);
-		if (may_sleep && (max_tries & 255) == 0)
-			cond_resched();
 	}
 
 	if (max_tries == -1) {
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
index 198933f853c5..9ab63ff0b6aa 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
@@ -2309,8 +2309,6 @@ static int iwl_trans_pcie_read_mem(struct iwl_trans *trans, u32 addr,
 			}
 			iwl_trans_release_nic_access(trans);
 
-			if (resched)
-				cond_resched();
 		} else {
 			return -EBUSY;
 		}
diff --git a/drivers/net/wireless/marvell/mwl8k.c b/drivers/net/wireless/marvell/mwl8k.c
index 13bcb123d122..9b4341da3163 100644
--- a/drivers/net/wireless/marvell/mwl8k.c
+++ b/drivers/net/wireless/marvell/mwl8k.c
@@ -632,7 +632,6 @@ mwl8k_send_fw_load_cmd(struct mwl8k_priv *priv, void *data, int length)
 				break;
 			}
 		}
-		cond_resched();
 		udelay(1);
 	} while (--loops);
 
@@ -795,7 +794,6 @@ static int mwl8k_load_firmware(struct ieee80211_hw *hw)
 			break;
 		}
 
-		cond_resched();
 		udelay(1);
 	} while (--loops);
 
diff --git a/drivers/net/wireless/mediatek/mt76/util.c b/drivers/net/wireless/mediatek/mt76/util.c
index fc76c66ff1a5..54ffe67d1365 100644
--- a/drivers/net/wireless/mediatek/mt76/util.c
+++ b/drivers/net/wireless/mediatek/mt76/util.c
@@ -130,7 +130,6 @@ int __mt76_worker_fn(void *ptr)
 		set_bit(MT76_WORKER_RUNNING, &w->state);
 		set_current_state(TASK_RUNNING);
 		w->fn(w);
-		cond_resched();
 		clear_bit(MT76_WORKER_RUNNING, &w->state);
 	}
 
diff --git a/drivers/net/wwan/mhi_wwan_mbim.c b/drivers/net/wwan/mhi_wwan_mbim.c
index 3f72ae943b29..d8aaf476f25d 100644
--- a/drivers/net/wwan/mhi_wwan_mbim.c
+++ b/drivers/net/wwan/mhi_wwan_mbim.c
@@ -400,7 +400,7 @@ static void mhi_net_rx_refill_work(struct work_struct *work)
 		/* Do not hog the CPU if rx buffers are consumed faster than
 		 * queued (unlikely).
 		 */
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/* If we're still starved of rx buffers, reschedule later */
diff --git a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
index 8dab025a088a..52420b1f3669 100644
--- a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
+++ b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
@@ -423,7 +423,6 @@ static void t7xx_do_tx_hw_push(struct dpmaif_ctrl *dpmaif_ctrl)
 		drb_send_cnt = t7xx_txq_burst_send_skb(txq);
 		if (drb_send_cnt <= 0) {
 			usleep_range(10, 20);
-			cond_resched();
 			continue;
 		}
 
@@ -437,8 +436,6 @@ static void t7xx_do_tx_hw_push(struct dpmaif_ctrl *dpmaif_ctrl)
 
 		t7xx_dpmaif_ul_update_hw_drb_cnt(&dpmaif_ctrl->hw_info, txq->index,
 						 drb_send_cnt * DPMAIF_UL_DRB_SIZE_WORD);
-
-		cond_resched();
 	} while (!t7xx_tx_lists_are_all_empty(dpmaif_ctrl) && !kthread_should_stop() &&
 		 (dpmaif_ctrl->state == DPMAIF_STATE_PWRON));
 }
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 88f760a7cbc3..a540e95ba58f 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1571,7 +1571,6 @@ int xenvif_dealloc_kthread(void *data)
 			break;
 
 		xenvif_tx_dealloc_action(queue);
-		cond_resched();
 	}
 
 	/* Unmap anything remaining*/
diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
index 0ba754ebc5ba..bccefaec5312 100644
--- a/drivers/net/xen-netback/rx.c
+++ b/drivers/net/xen-netback/rx.c
@@ -669,8 +669,6 @@ int xenvif_kthread_guest_rx(void *data)
 		 * slots.
 		 */
 		xenvif_rx_queue_drop_expired(queue);
-
-		cond_resched();
 	}
 
 	/* Bin any remaining skbs */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (26 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 84/86] treewide: net: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-08  0:48     ` Chris Packham
  2023-11-09 23:25     ` Dmitry Torokhov
  2023-11-07 23:08   ` [RFC PATCH 86/86] sched: " Ankur Arora
                     ` (2 subsequent siblings)
  30 siblings, 2 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora,
	Oded Gabbay, Miguel Ojeda, Jens Axboe, Minchan Kim,
	Sergey Senozhatsky, Sudip Mukherjee, Theodore Ts'o,
	Jason A. Donenfeld, Amit Shah, Gonglei, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Davidlohr Bueso, Jonathan Cameron,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Sumit Semwal, Christian König, Andi Shyti,
	Ray Jui, Scott Branden, Chris Packham, Shawn Guo, Sascha Hauer,
	Junxian Huang, Dmitry Torokhov, Will Deacon, Joerg Roedel,
	Mauro Carvalho Chehab, Srinivas Pandruvada, Hans de Goede,
	Ilpo Järvinen, Mark Gross, Finn Thain, Michael Schmitz,
	James E.J. Bottomley, Martin K. Petersen, Kashyap Desai,
	Sumit Saxena, Shivasharan S, Mark Brown, Neil Armstrong,
	Jens Wiklander, Alex Williamson, Helge Deller, David Hildenbrand

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched() (a sketch follows this list).

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight loop.
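
To make set-2 concrete, here is a sketch of the open-coded pattern.
All names are stand-ins invented for illustration, not taken from any
driver; the cond_resched() in the middle of the lock break is the
part this series removes:

  /* Assumes <linux/spinlock.h>; my_dev and scrub_one_page() are
   * illustrative only.
   */
  static void scrub_all_pages(struct my_dev *dev)
  {
  	unsigned int i;

  	spin_lock_irq(&dev->lock);
  	for (i = 0; i < dev->nr_pages; i++) {
  		scrub_one_page(dev, i);	/* potentially expensive */

  		/* Lock break: let other lock waiters in. The
  		 * cond_resched() below is what gets removed; with
  		 * this series the scheduler can preempt at any
  		 * preemptible point once the lock is dropped.
  		 */
  		spin_unlock_irq(&dev->lock);
  		cond_resched();
  		spin_lock_irq(&dev->lock);
  	}
  	spin_unlock_irq(&dev->lock);
  }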

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an inexhaustible supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
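
For reference, a minimal sketch of those semantics (this only
illustrates the behaviour described above; it is not the helper's
actual definition, which is added earlier in this series):

  static inline void cond_resched_stall_sketch(void)
  {
  	/* Reschedule if the scheduler wants the CPU and we are in a
  	 * context where that is allowed; otherwise just back off the
  	 * pipeline while the polled condition makes progress.
  	 */
  	if (need_resched() && preemptible())
  		schedule();
  	else
  		cpu_relax();
  }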

The cond_resched() calls here are of all kinds. Those from set-1
or set-2 are quite straightforward to handle.

There are quite a few from set-3, where, as noted above, we
use cond_resched() as if it were an amulet. Which I suppose
it is, in that it wards off softlockup or RCU splats.

Those are now cond_resched_stall(), but in most cases, given
that the timeouts are in milliseconds, they could easily be
timed waits.
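
As an illustration (hypothetical, not part of this patch), a
hardware-ready poll of the kind seen in several hunks below could use
the existing readl_poll_timeout() helper from <linux/iopoll.h>, which
sleeps between status reads instead of spinning. MY_STATUS and
MY_READY are made-up names:

  #include <linux/bits.h>
  #include <linux/iopoll.h>

  #define MY_STATUS	0x04		/* stand-in register offset */
  #define MY_READY	BIT(0)		/* stand-in "ready" bit */

  /* Sleep up to ~1ms between status reads, give up after 100ms. */
  static int my_wait_ready(void __iomem *base)
  {
  	u32 sts;

  	return readl_poll_timeout(base + MY_STATUS, sts,
  				  sts & MY_READY,
  				  1000, 100 * USEC_PER_MSEC);
  }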

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Oded Gabbay <ogabbay@kernel.org> 
Cc: Miguel Ojeda <ojeda@kernel.org> 
Cc: Jens Axboe <axboe@kernel.dk> 
Cc: Minchan Kim <minchan@kernel.org> 
Cc: Sergey Senozhatsky <senozhatsky@chromium.org> 
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> 
Cc: "Theodore Ts'o" <tytso@mit.edu> 
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> 
Cc: Amit Shah <amit@kernel.org> 
Cc: Gonglei <arei.gonglei@huawei.com> 
Cc: "Michael S. Tsirkin" <mst@redhat.com> 
Cc: Jason Wang <jasowang@redhat.com> 
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Davidlohr Bueso <dave@stgolabs.net> 
Cc: Jonathan Cameron <jonathan.cameron@huawei.com> 
Cc: Dave Jiang <dave.jiang@intel.com> 
Cc: Alison Schofield <alison.schofield@intel.com> 
Cc: Vishal Verma <vishal.l.verma@intel.com> 
Cc: Ira Weiny <ira.weiny@intel.com> 
Cc: Dan Williams <dan.j.williams@intel.com> 
Cc: Sumit Semwal <sumit.semwal@linaro.org> 
Cc: "Christian König" <christian.koenig@amd.com> 
Cc: Andi Shyti <andi.shyti@kernel.org> 
Cc: Ray Jui <rjui@broadcom.com> 
Cc: Scott Branden <sbranden@broadcom.com> 
Cc: Chris Packham <chris.packham@alliedtelesis.co.nz> 
Cc: Shawn Guo <shawnguo@kernel.org> 
Cc: Sascha Hauer <s.hauer@pengutronix.de> 
Cc: Junxian Huang <huangjunxian6@hisilicon.com> 
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> 
Cc: Will Deacon <will@kernel.org> 
Cc: Joerg Roedel <joro@8bytes.org> 
Cc: Mauro Carvalho Chehab <mchehab@kernel.org> 
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> 
Cc: Hans de Goede <hdegoede@redhat.com> 
Cc: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com> 
Cc: Mark Gross <markgross@kernel.org> 
Cc: Finn Thain <fthain@linux-m68k.org> 
Cc: Michael Schmitz <schmitzmic@gmail.com> 
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> 
Cc: "Martin K. Petersen" <martin.petersen@oracle.com> 
Cc: Kashyap Desai <kashyap.desai@broadcom.com> 
Cc: Sumit Saxena <sumit.saxena@broadcom.com> 
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> 
Cc: Mark Brown <broonie@kernel.org> 
Cc: Neil Armstrong <neil.armstrong@linaro.org> 
Cc: Jens Wiklander <jens.wiklander@linaro.org> 
Cc: Alex Williamson <alex.williamson@redhat.com> 
Cc: Helge Deller <deller@gmx.de> 
Cc: David Hildenbrand <david@redhat.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/accel/ivpu/ivpu_drv.c                      |  2 --
 drivers/accel/ivpu/ivpu_gem.c                      |  1 -
 drivers/accel/ivpu/ivpu_pm.c                       |  8 ++++++--
 drivers/accel/qaic/qaic_data.c                     |  2 --
 drivers/auxdisplay/charlcd.c                       | 11 -----------
 drivers/base/power/domain.c                        |  1 -
 drivers/block/aoe/aoecmd.c                         |  3 +--
 drivers/block/brd.c                                |  1 -
 drivers/block/drbd/drbd_bitmap.c                   |  4 ----
 drivers/block/drbd/drbd_debugfs.c                  |  1 -
 drivers/block/loop.c                               |  3 ---
 drivers/block/xen-blkback/blkback.c                |  3 ---
 drivers/block/zram/zram_drv.c                      |  2 --
 drivers/bluetooth/virtio_bt.c                      |  1 -
 drivers/char/hw_random/arm_smccc_trng.c            |  1 -
 drivers/char/lp.c                                  |  2 --
 drivers/char/mem.c                                 |  4 ----
 drivers/char/mwave/3780i.c                         |  4 +---
 drivers/char/ppdev.c                               |  4 ----
 drivers/char/random.c                              |  2 --
 drivers/char/virtio_console.c                      |  1 -
 drivers/crypto/virtio/virtio_crypto_core.c         |  1 -
 drivers/cxl/pci.c                                  |  1 -
 drivers/dma-buf/selftest.c                         |  1 -
 drivers/dma-buf/st-dma-fence-chain.c               |  1 -
 drivers/fsi/fsi-sbefifo.c                          | 14 ++++++++++++--
 drivers/i2c/busses/i2c-bcm-iproc.c                 |  9 +++++++--
 drivers/i2c/busses/i2c-highlander.c                |  9 +++++++--
 drivers/i2c/busses/i2c-ibm_iic.c                   | 11 +++++++----
 drivers/i2c/busses/i2c-mpc.c                       |  2 +-
 drivers/i2c/busses/i2c-mxs.c                       |  9 ++++++++-
 drivers/i2c/busses/scx200_acb.c                    |  9 +++++++--
 drivers/infiniband/core/umem.c                     |  1 -
 drivers/infiniband/hw/hfi1/driver.c                |  1 -
 drivers/infiniband/hw/hfi1/firmware.c              |  2 +-
 drivers/infiniband/hw/hfi1/init.c                  |  1 -
 drivers/infiniband/hw/hfi1/ruc.c                   |  1 -
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c         |  5 ++++-
 drivers/infiniband/hw/qib/qib_init.c               |  1 -
 drivers/infiniband/sw/rxe/rxe_qp.c                 |  3 +--
 drivers/infiniband/sw/rxe/rxe_task.c               |  4 ++--
 drivers/input/evdev.c                              |  1 -
 drivers/input/keyboard/clps711x-keypad.c           |  2 +-
 drivers/input/misc/uinput.c                        |  1 -
 drivers/input/mousedev.c                           |  1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c        |  2 --
 drivers/media/i2c/vpx3220.c                        |  3 ---
 drivers/media/pci/cobalt/cobalt-i2c.c              |  4 ++--
 drivers/misc/bcm-vk/bcm_vk_dev.c                   |  3 +--
 drivers/misc/bcm-vk/bcm_vk_msg.c                   |  3 +--
 drivers/misc/genwqe/card_base.c                    |  3 +--
 drivers/misc/genwqe/card_ddcb.c                    |  6 ------
 drivers/misc/genwqe/card_dev.c                     |  2 --
 drivers/misc/vmw_balloon.c                         |  4 ----
 drivers/mmc/host/mmc_spi.c                         |  3 ---
 drivers/nvdimm/btt.c                               |  2 --
 drivers/nvme/target/zns.c                          |  2 --
 drivers/parport/parport_ip32.c                     |  1 -
 drivers/parport/parport_pc.c                       |  4 ----
 drivers/pci/pci-sysfs.c                            |  1 -
 drivers/pci/proc.c                                 |  1 -
 .../x86/intel/speed_select_if/isst_if_mbox_pci.c   |  4 ++--
 drivers/s390/cio/css.c                             |  8 --------
 drivers/scsi/NCR5380.c                             |  2 --
 drivers/scsi/megaraid.c                            |  1 -
 drivers/scsi/qedi/qedi_main.c                      |  1 -
 drivers/scsi/qla2xxx/qla_nx.c                      |  2 --
 drivers/scsi/qla2xxx/qla_sup.c                     |  5 -----
 drivers/scsi/qla4xxx/ql4_nx.c                      |  1 -
 drivers/scsi/xen-scsifront.c                       |  2 +-
 drivers/spi/spi-lantiq-ssc.c                       |  3 +--
 drivers/spi/spi-meson-spifc.c                      |  2 +-
 drivers/spi/spi.c                                  |  2 +-
 drivers/staging/rtl8723bs/core/rtw_mlme_ext.c      |  2 +-
 drivers/staging/rtl8723bs/core/rtw_pwrctrl.c       |  2 --
 drivers/tee/optee/ffa_abi.c                        |  1 -
 drivers/tee/optee/smc_abi.c                        |  1 -
 drivers/tty/hvc/hvc_console.c                      |  6 ++----
 drivers/tty/tty_buffer.c                           |  3 ---
 drivers/tty/tty_io.c                               |  1 -
 drivers/usb/gadget/udc/max3420_udc.c               |  1 -
 drivers/usb/host/max3421-hcd.c                     |  2 +-
 drivers/usb/host/xen-hcd.c                         |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c                |  2 --
 drivers/vfio/vfio_iommu_type1.c                    |  7 -------
 drivers/vhost/vhost.c                              |  1 -
 drivers/video/console/vgacon.c                     |  4 ----
 drivers/virtio/virtio_mem.c                        |  8 --------
 88 files changed, 82 insertions(+), 190 deletions(-)

diff --git a/drivers/accel/ivpu/ivpu_drv.c b/drivers/accel/ivpu/ivpu_drv.c
index 7e9359611d69..479801a1d961 100644
--- a/drivers/accel/ivpu/ivpu_drv.c
+++ b/drivers/accel/ivpu/ivpu_drv.c
@@ -314,8 +314,6 @@ static int ivpu_wait_for_ready(struct ivpu_device *vdev)
 		ret = ivpu_ipc_receive(vdev, &cons, &ipc_hdr, NULL, 0);
 		if (ret != -ETIMEDOUT || time_after_eq(jiffies, timeout))
 			break;
-
-		cond_resched();
 	}
 
 	ivpu_ipc_consumer_del(vdev, &cons);
diff --git a/drivers/accel/ivpu/ivpu_gem.c b/drivers/accel/ivpu/ivpu_gem.c
index d09f13b35902..06e4c1eceae8 100644
--- a/drivers/accel/ivpu/ivpu_gem.c
+++ b/drivers/accel/ivpu/ivpu_gem.c
@@ -156,7 +156,6 @@ static int __must_check internal_alloc_pages_locked(struct ivpu_bo *bo)
 			ret = -ENOMEM;
 			goto err_free_pages;
 		}
-		cond_resched();
 	}
 
 	bo->pages = pages;
diff --git a/drivers/accel/ivpu/ivpu_pm.c b/drivers/accel/ivpu/ivpu_pm.c
index ffff2496e8e8..aa9cc4a1903c 100644
--- a/drivers/accel/ivpu/ivpu_pm.c
+++ b/drivers/accel/ivpu/ivpu_pm.c
@@ -105,7 +105,7 @@ static void ivpu_pm_recovery_work(struct work_struct *work)
 retry:
 	ret = pci_try_reset_function(to_pci_dev(vdev->drm.dev));
 	if (ret == -EAGAIN && !drm_dev_is_unplugged(&vdev->drm)) {
-		cond_resched();
+		cond_resched_stall();
 		goto retry;
 	}
 
@@ -146,7 +146,11 @@ int ivpu_pm_suspend_cb(struct device *dev)
 
 	timeout = jiffies + msecs_to_jiffies(vdev->timeout.tdr);
 	while (!ivpu_hw_is_idle(vdev)) {
-		cond_resched();
+
+		/* The timeout is in thousands of msecs. Maybe this should be a
+		 * timed wait instead?
+		 */
+		cond_resched_stall();
 		if (time_after_eq(jiffies, timeout)) {
 			ivpu_err(vdev, "Failed to enter idle on system suspend\n");
 			return -EBUSY;
diff --git a/drivers/accel/qaic/qaic_data.c b/drivers/accel/qaic/qaic_data.c
index f4b06792c6f1..d06fd9d765f2 100644
--- a/drivers/accel/qaic/qaic_data.c
+++ b/drivers/accel/qaic/qaic_data.c
@@ -1516,7 +1516,6 @@ void irq_polling_work(struct work_struct *work)
 			return;
 		}
 
-		cond_resched();
 		usleep_range(datapath_poll_interval_us, 2 * datapath_poll_interval_us);
 	}
 }
@@ -1547,7 +1546,6 @@ irqreturn_t dbc_irq_threaded_fn(int irq, void *data)
 
 	if (!event_count) {
 		event_count = NUM_EVENTS;
-		cond_resched();
 	}
 
 	/*
diff --git a/drivers/auxdisplay/charlcd.c b/drivers/auxdisplay/charlcd.c
index 6d309e4971b6..cb1213e292f4 100644
--- a/drivers/auxdisplay/charlcd.c
+++ b/drivers/auxdisplay/charlcd.c
@@ -470,14 +470,6 @@ static ssize_t charlcd_write(struct file *file, const char __user *buf,
 	char c;
 
 	for (; count-- > 0; (*ppos)++, tmp++) {
-		if (((count + 1) & 0x1f) == 0) {
-			/*
-			 * charlcd_write() is invoked as a VFS->write() callback
-			 * and as such it is always invoked from preemptible
-			 * context and may sleep.
-			 */
-			cond_resched();
-		}
 
 		if (get_user(c, tmp))
 			return -EFAULT;
@@ -539,9 +531,6 @@ static void charlcd_puts(struct charlcd *lcd, const char *s)
 	int count = strlen(s);
 
 	for (; count-- > 0; tmp++) {
-		if (((count + 1) & 0x1f) == 0)
-			cond_resched();
-
 		charlcd_write_char(lcd, *tmp);
 	}
 }
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 5cb2023581d4..6b77bdfe1de9 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -2696,7 +2696,6 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)
 			break;
 
 		mdelay(i);
-		cond_resched();
 	}
 
 	if (ret < 0) {
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index d7317425be51..d212b0df661f 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -1235,8 +1235,7 @@ kthread(void *vp)
 		if (!more) {
 			schedule();
 			remove_wait_queue(k->waitq, &wait);
-		} else
-			cond_resched();
+		}
 	} while (!kthread_should_stop());
 	complete(&k->rendez);	/* tell spawner we're stopping */
 	return 0;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 970bd6ff38c4..be1577cd4d4b 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -111,7 +111,6 @@ static void brd_free_pages(struct brd_device *brd)
 
 	xa_for_each(&brd->brd_pages, idx, page) {
 		__free_page(page);
-		cond_resched();
 	}
 
 	xa_destroy(&brd->brd_pages);
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 85ca000a0564..f12de044c540 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -563,7 +563,6 @@ static unsigned long bm_count_bits(struct drbd_bitmap *b)
 		p_addr = __bm_map_pidx(b, idx);
 		bits += bitmap_weight(p_addr, BITS_PER_PAGE);
 		__bm_unmap(p_addr);
-		cond_resched();
 	}
 	/* last (or only) page */
 	last_word = ((b->bm_bits - 1) & BITS_PER_PAGE_MASK) >> LN2_BPL;
@@ -1118,7 +1117,6 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
 			atomic_inc(&ctx->in_flight);
 			bm_page_io_async(ctx, i);
 			++count;
-			cond_resched();
 		}
 	} else if (flags & BM_AIO_WRITE_HINTED) {
 		/* ASSERT: BM_AIO_WRITE_ALL_PAGES is not set. */
@@ -1158,7 +1156,6 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
 			atomic_inc(&ctx->in_flight);
 			bm_page_io_async(ctx, i);
 			++count;
-			cond_resched();
 		}
 	}
 
@@ -1545,7 +1542,6 @@ void _drbd_bm_set_bits(struct drbd_device *device, const unsigned long s, const
 	for (page_nr = first_page; page_nr < last_page; page_nr++) {
 		bm_set_full_words_within_one_page(device->bitmap, page_nr, first_word, last_word);
 		spin_unlock_irq(&b->bm_lock);
-		cond_resched();
 		first_word = 0;
 		spin_lock_irq(&b->bm_lock);
 	}
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 12460b584bcb..48a85882dfc4 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -318,7 +318,6 @@ static void seq_print_resource_transfer_log_summary(struct seq_file *m,
 			struct drbd_request *req_next;
 			kref_get(&req->kref);
 			spin_unlock_irq(&resource->req_lock);
-			cond_resched();
 			spin_lock_irq(&resource->req_lock);
 			req_next = list_next_entry(req, tl_requests);
 			if (kref_put(&req->kref, drbd_req_destroy))
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 9f2d412fc560..0ea0d37b2f28 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -271,7 +271,6 @@ static int lo_write_simple(struct loop_device *lo, struct request *rq,
 		ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos);
 		if (ret < 0)
 			break;
-		cond_resched();
 	}
 
 	return ret;
@@ -300,7 +299,6 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
 				zero_fill_bio(bio);
 			break;
 		}
-		cond_resched();
 	}
 
 	return 0;
@@ -1948,7 +1946,6 @@ static void loop_process_work(struct loop_worker *worker,
 		spin_unlock_irq(&lo->lo_work_lock);
 
 		loop_handle_cmd(cmd);
-		cond_resched();
 
 		spin_lock_irq(&lo->lo_work_lock);
 	}
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index c362f4ad80ab..9bcef880df30 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1259,9 +1259,6 @@ __do_block_io_op(struct xen_blkif_ring *ring, unsigned int *eoi_flags)
 				goto done;
 			break;
 		}
-
-		/* Yield point for this unbounded loop. */
-		cond_resched();
 	}
 done:
 	return more_to_do;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 06673c6ca255..b1f9312e7905 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1819,8 +1819,6 @@ static ssize_t recompress_store(struct device *dev,
 			ret = err;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	__free_page(page);
diff --git a/drivers/bluetooth/virtio_bt.c b/drivers/bluetooth/virtio_bt.c
index 2ac70b560c46..c570c45d1480 100644
--- a/drivers/bluetooth/virtio_bt.c
+++ b/drivers/bluetooth/virtio_bt.c
@@ -79,7 +79,6 @@ static int virtbt_close_vdev(struct virtio_bluetooth *vbt)
 
 		while ((skb = virtqueue_detach_unused_buf(vq)))
 			kfree_skb(skb);
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/char/hw_random/arm_smccc_trng.c b/drivers/char/hw_random/arm_smccc_trng.c
index 7e954341b09f..f60d101920e4 100644
--- a/drivers/char/hw_random/arm_smccc_trng.c
+++ b/drivers/char/hw_random/arm_smccc_trng.c
@@ -84,7 +84,6 @@ static int smccc_trng_read(struct hwrng *rng, void *data, size_t max, bool wait)
 			tries++;
 			if (tries >= SMCCC_TRNG_MAX_TRIES)
 				return copied;
-			cond_resched();
 			break;
 		default:
 			return -EIO;
diff --git a/drivers/char/lp.c b/drivers/char/lp.c
index 2f171d14b9b5..1d58105112b5 100644
--- a/drivers/char/lp.c
+++ b/drivers/char/lp.c
@@ -478,8 +478,6 @@ static ssize_t lp_read(struct file *file, char __user *buf,
 			retval = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 	parport_negotiate(lp_table[minor].dev->port, IEEE1284_MODE_COMPAT);
  out:
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 1052b0f2d4cf..6f97ab7004d9 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -92,8 +92,6 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
 
 static inline bool should_stop_iteration(void)
 {
-	if (need_resched())
-		cond_resched();
 	return signal_pending(current);
 }
 
@@ -497,7 +495,6 @@ static ssize_t read_iter_zero(struct kiocb *iocb, struct iov_iter *iter)
 			continue;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			return written ? written : -EAGAIN;
-		cond_resched();
 	}
 	return written;
 }
@@ -523,7 +520,6 @@ static ssize_t read_zero(struct file *file, char __user *buf,
 
 		if (signal_pending(current))
 			break;
-		cond_resched();
 	}
 
 	return cleared;
diff --git a/drivers/char/mwave/3780i.c b/drivers/char/mwave/3780i.c
index 4a8937f80570..927a1cca1168 100644
--- a/drivers/char/mwave/3780i.c
+++ b/drivers/char/mwave/3780i.c
@@ -51,7 +51,7 @@
 #include <linux/delay.h>
 #include <linux/ioport.h>
 #include <linux/bitops.h>
-#include <linux/sched.h>	/* cond_resched() */
+#include <linux/sched.h>
 
 #include <asm/io.h>
 #include <linux/uaccess.h>
@@ -64,9 +64,7 @@ static DEFINE_SPINLOCK(dsp_lock);
 
 static void PaceMsaAccess(unsigned short usDspBaseIO)
 {
-	cond_resched();
 	udelay(100);
-	cond_resched();
 }
 
 unsigned short dsp3780I_ReadMsaCfg(unsigned short usDspBaseIO,
diff --git a/drivers/char/ppdev.c b/drivers/char/ppdev.c
index 4c188e9e477c..7463228ba9bf 100644
--- a/drivers/char/ppdev.c
+++ b/drivers/char/ppdev.c
@@ -176,8 +176,6 @@ static ssize_t pp_read(struct file *file, char __user *buf, size_t count,
 			bytes_read = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	parport_set_timeout(pp->pdev, pp->default_inactivity);
@@ -256,8 +254,6 @@ static ssize_t pp_write(struct file *file, const char __user *buf,
 
 		if (signal_pending(current))
 			break;
-
-		cond_resched();
 	}
 
 	parport_set_timeout(pp->pdev, pp->default_inactivity);
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3cb37760dfec..9e25f3a5c83d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -457,7 +457,6 @@ static ssize_t get_random_bytes_user(struct iov_iter *iter)
 		if (ret % PAGE_SIZE == 0) {
 			if (signal_pending(current))
 				break;
-			cond_resched();
 		}
 	}
 
@@ -1417,7 +1416,6 @@ static ssize_t write_pool_user(struct iov_iter *iter)
 		if (ret % PAGE_SIZE == 0) {
 			if (signal_pending(current))
 				break;
-			cond_resched();
 		}
 	}
 
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 680d1ef2a217..1f8da0a71ce9 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1936,7 +1936,6 @@ static void remove_vqs(struct ports_device *portdev)
 		flush_bufs(vq, true);
 		while ((buf = virtqueue_detach_unused_buf(vq)))
 			free_buf(buf, true);
-		cond_resched();
 	}
 	portdev->vdev->config->del_vqs(portdev->vdev);
 	kfree(portdev->in_vqs);
diff --git a/drivers/crypto/virtio/virtio_crypto_core.c b/drivers/crypto/virtio/virtio_crypto_core.c
index 43a0838d31ff..3842915ea743 100644
--- a/drivers/crypto/virtio/virtio_crypto_core.c
+++ b/drivers/crypto/virtio/virtio_crypto_core.c
@@ -490,7 +490,6 @@ static void virtcrypto_free_unused_reqs(struct virtio_crypto *vcrypto)
 			kfree(vc_req->req_data);
 			kfree(vc_req->sgs);
 		}
-		cond_resched();
 	}
 }
 
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 44a21ab7add5..2c7e670d9a91 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -634,7 +634,6 @@ static irqreturn_t cxl_event_thread(int irq, void *id)
 		if (!status)
 			break;
 		cxl_mem_get_event_records(mds, status);
-		cond_resched();
 	} while (status);
 
 	return IRQ_HANDLED;
diff --git a/drivers/dma-buf/selftest.c b/drivers/dma-buf/selftest.c
index c60b6944b4bd..ddf94da3d412 100644
--- a/drivers/dma-buf/selftest.c
+++ b/drivers/dma-buf/selftest.c
@@ -93,7 +93,6 @@ __subtests(const char *caller, const struct subtest *st, int count, void *data)
 	int err;
 
 	for (; count--; st++) {
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c
index c0979c8049b5..cde69fadb4f4 100644
--- a/drivers/dma-buf/st-dma-fence-chain.c
+++ b/drivers/dma-buf/st-dma-fence-chain.c
@@ -431,7 +431,6 @@ static int __find_race(void *arg)
 signal:
 		seqno = get_random_u32_below(data->fc.chain_length - 1);
 		dma_fence_signal(data->fc.fences[seqno]);
-		cond_resched();
 	}
 
 	if (atomic_dec_and_test(&data->children))
diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
index 0a98517f3959..0e58ebae0130 100644
--- a/drivers/fsi/fsi-sbefifo.c
+++ b/drivers/fsi/fsi-sbefifo.c
@@ -372,7 +372,13 @@ static int sbefifo_request_reset(struct sbefifo *sbefifo)
 			return 0;
 		}
 
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 	dev_err(dev, "FIFO reset timed out\n");
 
@@ -462,7 +468,11 @@ static int sbefifo_wait(struct sbefifo *sbefifo, bool up,
 
 	end_time = jiffies + timeout;
 	while (!time_after(jiffies, end_time)) {
-		cond_resched();
+		/*
+		 * As above, maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
+
 		rc = sbefifo_regr(sbefifo, addr, &sts);
 		if (rc < 0) {
 			dev_err(dev, "FSI error %d reading status register\n", rc);
diff --git a/drivers/i2c/busses/i2c-bcm-iproc.c b/drivers/i2c/busses/i2c-bcm-iproc.c
index 51aab662050b..6efe6d18d859 100644
--- a/drivers/i2c/busses/i2c-bcm-iproc.c
+++ b/drivers/i2c/busses/i2c-bcm-iproc.c
@@ -788,8 +788,13 @@ static int bcm_iproc_i2c_xfer_wait(struct bcm_iproc_i2c_dev *iproc_i2c,
 				break;
 			}
 
-			cpu_relax();
-			cond_resched();
+			/*
+			 * Use cond_resched_stall() to avoid spinning in a
+			 * tight loop.
+			 * Though, given that the timeout is in milliseconds,
+			 * maybe this should be a timed or event wait?
+			 */
+			cond_resched_stall();
 		} while (!iproc_i2c->xfer_is_done);
 	}
 
diff --git a/drivers/i2c/busses/i2c-highlander.c b/drivers/i2c/busses/i2c-highlander.c
index 7922bc917c33..06eed7e1c4f3 100644
--- a/drivers/i2c/busses/i2c-highlander.c
+++ b/drivers/i2c/busses/i2c-highlander.c
@@ -187,8 +187,13 @@ static void highlander_i2c_poll(struct highlander_i2c_dev *dev)
 		if (time_after(jiffies, timeout))
 			break;
 
-		cpu_relax();
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	dev_err(dev->dev, "polling timed out\n");
diff --git a/drivers/i2c/busses/i2c-ibm_iic.c b/drivers/i2c/busses/i2c-ibm_iic.c
index 408820319ec4..b486d8b9636b 100644
--- a/drivers/i2c/busses/i2c-ibm_iic.c
+++ b/drivers/i2c/busses/i2c-ibm_iic.c
@@ -207,9 +207,6 @@ static void iic_dev_reset(struct ibm_iic_private* dev)
 			udelay(10);
 			dc ^= DIRCNTL_SCC;
 			out_8(&iic->directcntl, dc);
-
-			/* be nice */
-			cond_resched();
 		}
 	}
 
@@ -231,7 +228,13 @@ static int iic_dc_wait(volatile struct iic_regs __iomem *iic, u8 mask)
 	while ((in_8(&iic->directcntl) & mask) != mask){
 		if (unlikely(time_after(jiffies, x)))
 			return -1;
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 	return 0;
 }
diff --git a/drivers/i2c/busses/i2c-mpc.c b/drivers/i2c/busses/i2c-mpc.c
index e4e4995ab224..82d24523c6a7 100644
--- a/drivers/i2c/busses/i2c-mpc.c
+++ b/drivers/i2c/busses/i2c-mpc.c
@@ -712,7 +712,7 @@ static int mpc_i2c_execute_msg(struct mpc_i2c *i2c)
 			}
 			return -EIO;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	return i2c->rc;
diff --git a/drivers/i2c/busses/i2c-mxs.c b/drivers/i2c/busses/i2c-mxs.c
index 36def0a9c95c..d4d69cd7ef46 100644
--- a/drivers/i2c/busses/i2c-mxs.c
+++ b/drivers/i2c/busses/i2c-mxs.c
@@ -310,7 +310,14 @@ static int mxs_i2c_pio_wait_xfer_end(struct mxs_i2c_dev *i2c)
 			return -ENXIO;
 		if (time_after(jiffies, timeout))
 			return -ETIMEDOUT;
-		cond_resched();
+
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	return 0;
diff --git a/drivers/i2c/busses/scx200_acb.c b/drivers/i2c/busses/scx200_acb.c
index 83c1db610f54..5646130c003f 100644
--- a/drivers/i2c/busses/scx200_acb.c
+++ b/drivers/i2c/busses/scx200_acb.c
@@ -232,8 +232,13 @@ static void scx200_acb_poll(struct scx200_acb_iface *iface)
 		}
 		if (time_after(jiffies, timeout))
 			break;
-		cpu_relax();
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	dev_err(&iface->adapter.dev, "timeout in state %s\n",
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index f9ab671c8eda..6b4d3d3193a2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -215,7 +215,6 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 		gup_flags |= FOLL_WRITE;
 
 	while (npages) {
-		cond_resched();
 		pinned = pin_user_pages_fast(cur_base,
 					  min_t(unsigned long, npages,
 						PAGE_SIZE /
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index f4492fa407e0..b390eb169a60 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -668,7 +668,6 @@ static noinline int max_packet_exceeded(struct hfi1_packet *packet, int thread)
 		if ((packet->numpkt & (MAX_PKT_RECV_THREAD - 1)) == 0)
 			/* allow defered processing */
 			process_rcv_qp_work(packet);
-		cond_resched();
 		return RCV_PKT_OK;
 	} else {
 		this_cpu_inc(*packet->rcd->dd->rcv_limit);
diff --git a/drivers/infiniband/hw/hfi1/firmware.c b/drivers/infiniband/hw/hfi1/firmware.c
index 0c0cef5b1e0e..717ccb0e69b4 100644
--- a/drivers/infiniband/hw/hfi1/firmware.c
+++ b/drivers/infiniband/hw/hfi1/firmware.c
@@ -560,7 +560,7 @@ static void __obtain_firmware(struct hfi1_devdata *dd)
 		 * something that holds for 30 seconds.  If we do that twice
 		 * in a row it triggers task blocked warning.
 		 */
-		cond_resched();
+		cond_resched_stall();
 		if (fw_8051_load)
 			dispose_one_firmware(&fw_8051);
 		if (fw_fabric_serdes_load)
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 6de37c5d7d27..3b5abcd72660 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -1958,7 +1958,6 @@ int hfi1_setup_eagerbufs(struct hfi1_ctxtdata *rcd)
 	for (idx = 0; idx < rcd->egrbufs.alloced; idx++) {
 		hfi1_put_tid(dd, rcd->eager_base + idx, PT_EAGER,
 			     rcd->egrbufs.rcvtids[idx].dma, order);
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/infiniband/hw/hfi1/ruc.c b/drivers/infiniband/hw/hfi1/ruc.c
index b0151b7293f5..35fa25211351 100644
--- a/drivers/infiniband/hw/hfi1/ruc.c
+++ b/drivers/infiniband/hw/hfi1/ruc.c
@@ -459,7 +459,6 @@ bool hfi1_schedule_send_yield(struct rvt_qp *qp, struct hfi1_pkt_state *ps,
 			return true;
 		}
 
-		cond_resched();
 		this_cpu_inc(*ps->ppd->dd->send_schedule);
 		ps->timeout = jiffies + ps->timeout_int;
 	}
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index d82daff2d9bd..c76610422255 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -2985,7 +2985,10 @@ static int v2_wait_mbox_complete(struct hns_roce_dev *hr_dev, u32 timeout,
 			return -ETIMEDOUT;
 		}
 
-		cond_resched();
+		/* The timeout is in hundreds of msecs. Maybe this should be a
+		 * timed wait instead?
+		 */
+		cond_resched_stall();
 		ret = -EBUSY;
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 33667becd52b..0d8e0abb5090 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1674,7 +1674,6 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
 					  RCVHQ_RCV_TYPE_EAGER, pa);
 			pa += egrsize;
 		}
-		cond_resched(); /* don't hog the cpu */
 	}
 
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 28e379c108bc..b0fb5a993bae 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -778,12 +778,11 @@ int rxe_qp_to_attr(struct rxe_qp *qp, struct ib_qp_attr *attr, int mask)
 	rxe_av_to_attr(&qp->alt_av, &attr->alt_ah_attr);
 
 	/* Applications that get this state typically spin on it.
-	 * Yield the processor
+	 * Giving up the spinlock will reschedule if needed.
 	 */
 	spin_lock_irqsave(&qp->state_lock, flags);
 	if (qp->attr.sq_draining) {
 		spin_unlock_irqrestore(&qp->state_lock, flags);
-		cond_resched();
 	} else {
 		spin_unlock_irqrestore(&qp->state_lock, flags);
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 1501120d4f52..692f57fdfdc9 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -227,7 +227,7 @@ void rxe_cleanup_task(struct rxe_task *task)
 	 * for the previously scheduled tasks to finish.
 	 */
 	while (!is_done(task))
-		cond_resched();
+		cond_resched_stall();
 
 	spin_lock_irqsave(&task->lock, flags);
 	task->state = TASK_STATE_INVALID;
@@ -289,7 +289,7 @@ void rxe_disable_task(struct rxe_task *task)
 	spin_unlock_irqrestore(&task->lock, flags);
 
 	while (!is_done(task))
-		cond_resched();
+		cond_resched_stall();
 
 	spin_lock_irqsave(&task->lock, flags);
 	task->state = TASK_STATE_DRAINED;
diff --git a/drivers/input/evdev.c b/drivers/input/evdev.c
index 95f90699d2b1..effbc991be41 100644
--- a/drivers/input/evdev.c
+++ b/drivers/input/evdev.c
@@ -529,7 +529,6 @@ static ssize_t evdev_write(struct file *file, const char __user *buffer,
 
 		input_inject_event(&evdev->handle,
 				   event.type, event.code, event.value);
-		cond_resched();
 	}
 
  out:
diff --git a/drivers/input/keyboard/clps711x-keypad.c b/drivers/input/keyboard/clps711x-keypad.c
index 4c1a3e611edd..e02f6d35ed51 100644
--- a/drivers/input/keyboard/clps711x-keypad.c
+++ b/drivers/input/keyboard/clps711x-keypad.c
@@ -52,7 +52,7 @@ static void clps711x_keypad_poll(struct input_dev *input)
 			/* Read twice for protection against fluctuations */
 			do {
 				state = gpiod_get_value_cansleep(data->desc);
-				cond_resched();
+				cond_resched_stall();
 				state1 = gpiod_get_value_cansleep(data->desc);
 			} while (state != state1);
 
diff --git a/drivers/input/misc/uinput.c b/drivers/input/misc/uinput.c
index d98212d55108..a6c95916ac7e 100644
--- a/drivers/input/misc/uinput.c
+++ b/drivers/input/misc/uinput.c
@@ -624,7 +624,6 @@ static ssize_t uinput_inject_events(struct uinput_device *udev,
 
 		input_event(udev->dev, ev.type, ev.code, ev.value);
 		bytes += input_event_size();
-		cond_resched();
 	}
 
 	return bytes;
diff --git a/drivers/input/mousedev.c b/drivers/input/mousedev.c
index 505c562a5daa..7ce9ffca6d12 100644
--- a/drivers/input/mousedev.c
+++ b/drivers/input/mousedev.c
@@ -704,7 +704,6 @@ static ssize_t mousedev_write(struct file *file, const char __user *buffer,
 		mousedev_generate_response(client, c);
 
 		spin_unlock_irq(&client->packet_lock);
-		cond_resched();
 	}
 
 	kill_fasync(&client->fasync, SIGIO, POLL_IN);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bd0a596f9863..8f517a80a831 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1582,8 +1582,6 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 			for (i = 0; i < ARRAY_SIZE(evt); ++i)
 				dev_info(smmu->dev, "\t0x%016llx\n",
 					 (unsigned long long)evt[i]);
-
-			cond_resched();
 		}
 
 		/*
diff --git a/drivers/media/i2c/vpx3220.c b/drivers/media/i2c/vpx3220.c
index 1eaae886f217..c673dba9a592 100644
--- a/drivers/media/i2c/vpx3220.c
+++ b/drivers/media/i2c/vpx3220.c
@@ -81,9 +81,6 @@ static int vpx3220_fp_status(struct v4l2_subdev *sd)
 			return 0;
 
 		udelay(10);
-
-		if (need_resched())
-			cond_resched();
 	}
 
 	return -1;
diff --git a/drivers/media/pci/cobalt/cobalt-i2c.c b/drivers/media/pci/cobalt/cobalt-i2c.c
index 10c9ee33f73e..2a11dd49559a 100644
--- a/drivers/media/pci/cobalt/cobalt-i2c.c
+++ b/drivers/media/pci/cobalt/cobalt-i2c.c
@@ -140,7 +140,7 @@ static int cobalt_tx_bytes(struct cobalt_i2c_regs __iomem *regs,
 		while (status & M00018_SR_BITMAP_TIP_MSK) {
 			if (time_after(jiffies, start_time + adap->timeout))
 				return -ETIMEDOUT;
-			cond_resched();
+			cond_resched_stall();
 			status = ioread8(&regs->cr_sr);
 		}
 
@@ -199,7 +199,7 @@ static int cobalt_rx_bytes(struct cobalt_i2c_regs __iomem *regs,
 		while (status & M00018_SR_BITMAP_TIP_MSK) {
 			if (time_after(jiffies, start_time + adap->timeout))
 				return -ETIMEDOUT;
-			cond_resched();
+			cond_resched_stall();
 			status = ioread8(&regs->cr_sr);
 		}
 
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index d4a96137728d..d262e4c5b4e3 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -364,8 +364,7 @@ static inline int bcm_vk_wait(struct bcm_vk *vk, enum pci_barno bar,
 		if (time_after(jiffies, timeout))
 			return -ETIMEDOUT;
 
-		cpu_relax();
-		cond_resched();
+		cond_resched_stall();
 	} while ((rd_val & mask) != value);
 
 	return 0;
diff --git a/drivers/misc/bcm-vk/bcm_vk_msg.c b/drivers/misc/bcm-vk/bcm_vk_msg.c
index e17d81231ea6..1b5a71382e76 100644
--- a/drivers/misc/bcm-vk/bcm_vk_msg.c
+++ b/drivers/misc/bcm-vk/bcm_vk_msg.c
@@ -1295,8 +1295,7 @@ int bcm_vk_release(struct inode *inode, struct file *p_file)
 			break;
 		}
 		dma_cnt = atomic_read(&ctx->dma_cnt);
-		cpu_relax();
-		cond_resched();
+		cond_resched_stall();
 	} while (dma_cnt);
 	dev_dbg(dev, "Draining for [fd-%d] pid %d - delay %d ms\n",
 		ctx->idx, pid, jiffies_to_msecs(jiffies - start_time));
diff --git a/drivers/misc/genwqe/card_base.c b/drivers/misc/genwqe/card_base.c
index 224a7e97cbea..03ed8a426d49 100644
--- a/drivers/misc/genwqe/card_base.c
+++ b/drivers/misc/genwqe/card_base.c
@@ -1004,7 +1004,6 @@ static int genwqe_health_thread(void *data)
 		}
 
 		cd->last_gfir = gfir;
-		cond_resched();
 	}
 
 	return 0;
@@ -1041,7 +1040,7 @@ static int genwqe_health_thread(void *data)
 
 	/* genwqe_bus_reset failed(). Now wait for genwqe_remove(). */
 	while (!kthread_should_stop())
-		cond_resched();
+		cond_resched_stall();
 
 	return -EIO;
 }
diff --git a/drivers/misc/genwqe/card_ddcb.c b/drivers/misc/genwqe/card_ddcb.c
index 500b1feaf1f6..793faf4bdc06 100644
--- a/drivers/misc/genwqe/card_ddcb.c
+++ b/drivers/misc/genwqe/card_ddcb.c
@@ -1207,12 +1207,6 @@ static int genwqe_card_thread(void *data)
 		}
 		if (should_stop)
 			break;
-
-		/*
-		 * Avoid soft lockups on heavy loads; we do not want
-		 * to disable our interrupts.
-		 */
-		cond_resched();
 	}
 	return 0;
 }
diff --git a/drivers/misc/genwqe/card_dev.c b/drivers/misc/genwqe/card_dev.c
index 55fc5b80e649..ec1112dc7d5a 100644
--- a/drivers/misc/genwqe/card_dev.c
+++ b/drivers/misc/genwqe/card_dev.c
@@ -1322,7 +1322,6 @@ static int genwqe_inform_and_stop_processes(struct genwqe_dev *cd)
 			     genwqe_open_files(cd); i++) {
 			dev_info(&pci_dev->dev, "  %d sec ...", i);
 
-			cond_resched();
 			msleep(1000);
 		}
 
@@ -1340,7 +1339,6 @@ static int genwqe_inform_and_stop_processes(struct genwqe_dev *cd)
 				     genwqe_open_files(cd); i++) {
 				dev_warn(&pci_dev->dev, "  %d sec ...", i);
 
-				cond_resched();
 				msleep(1000);
 			}
 		}
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 9ce9b9e0e9b6..7cf977e70935 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -1158,8 +1158,6 @@ static void vmballoon_inflate(struct vmballoon *b)
 			vmballoon_split_refused_pages(&ctl);
 			ctl.page_size--;
 		}
-
-		cond_resched();
 	}
 
 	/*
@@ -1282,8 +1280,6 @@ static unsigned long vmballoon_deflate(struct vmballoon *b, uint64_t n_frames,
 				break;
 			ctl.page_size++;
 		}
-
-		cond_resched();
 	}
 
 	return deflated_frames;
diff --git a/drivers/mmc/host/mmc_spi.c b/drivers/mmc/host/mmc_spi.c
index cc333ad67cac..e05d99437547 100644
--- a/drivers/mmc/host/mmc_spi.c
+++ b/drivers/mmc/host/mmc_spi.c
@@ -192,9 +192,6 @@ static int mmc_spi_skip(struct mmc_spi_host *host, unsigned long timeout,
 			if (cp[i] != byte)
 				return cp[i];
 		}
-
-		/* If we need long timeouts, we may release the CPU */
-		cond_resched();
 	} while (time_is_after_jiffies(start + timeout));
 	return -ETIMEDOUT;
 }
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index d5593b0dc700..5e97555db441 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -435,7 +435,6 @@ static int btt_map_init(struct arena_info *arena)
 
 		offset += size;
 		mapsize -= size;
-		cond_resched();
 	}
 
  free:
@@ -479,7 +478,6 @@ static int btt_log_init(struct arena_info *arena)
 
 		offset += size;
 		logsize -= size;
-		cond_resched();
 	}
 
 	for (i = 0; i < arena->nfree; i++) {
diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index 5b5c1e481722..12eee9a87e42 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -432,8 +432,6 @@ static u16 nvmet_bdev_zone_mgmt_emulate_all(struct nvmet_req *req)
 				zsa_req_op(req->cmd->zms.zsa) | REQ_SYNC,
 				GFP_KERNEL);
 			bio->bi_iter.bi_sector = sector;
-			/* This may take a while, so be nice to others */
-			cond_resched();
 		}
 		sector += bdev_zone_sectors(bdev);
 	}
diff --git a/drivers/parport/parport_ip32.c b/drivers/parport/parport_ip32.c
index 0919ed99ba94..8c52008bbb7c 100644
--- a/drivers/parport/parport_ip32.c
+++ b/drivers/parport/parport_ip32.c
@@ -1238,7 +1238,6 @@ static size_t parport_ip32_epp_write_addr(struct parport *p, const void *buf,
 static unsigned int parport_ip32_fifo_wait_break(struct parport *p,
 						 unsigned long expire)
 {
-	cond_resched();
 	if (time_after(jiffies, expire)) {
 		pr_debug1(PPIP32 "%s: FIFO write timed out\n", p->name);
 		return 1;
diff --git a/drivers/parport/parport_pc.c b/drivers/parport/parport_pc.c
index 1f236aaf7867..a482b5b835ec 100644
--- a/drivers/parport/parport_pc.c
+++ b/drivers/parport/parport_pc.c
@@ -663,8 +663,6 @@ static size_t parport_pc_fifo_write_block_dma(struct parport *port,
 		}
 		/* Is serviceIntr set? */
 		if (!(inb(ECONTROL(port)) & (1<<2))) {
-			cond_resched();
-
 			goto false_alarm;
 		}
 
@@ -674,8 +672,6 @@ static size_t parport_pc_fifo_write_block_dma(struct parport *port,
 		count = get_dma_residue(port->dma);
 		release_dma_lock(dmaflag);
 
-		cond_resched(); /* Can't yield the port. */
-
 		/* Anyone else waiting for the port? */
 		if (port->waithead) {
 			printk(KERN_DEBUG "Somebody wants the port\n");
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d9eede2dbc0e..e7bb03c3c148 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -719,7 +719,6 @@ static ssize_t pci_read_config(struct file *filp, struct kobject *kobj,
 		data[off - init_off + 3] = (val >> 24) & 0xff;
 		off += 4;
 		size -= 4;
-		cond_resched();
 	}
 
 	if (size >= 2) {
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index f967709082d6..7d3cd2201e64 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -83,7 +83,6 @@ static ssize_t proc_bus_pci_read(struct file *file, char __user *buf,
 		buf += 4;
 		pos += 4;
 		cnt -= 4;
-		cond_resched();
 	}
 
 	if (cnt >= 2) {
diff --git a/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c b/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c
index df1fc6c719f3..c202ae0d0656 100644
--- a/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c
+++ b/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c
@@ -56,7 +56,7 @@ static int isst_if_mbox_cmd(struct pci_dev *pdev,
 			ret = -EBUSY;
 			tm_delta = ktime_us_delta(ktime_get(), tm);
 			if (tm_delta > OS_MAILBOX_TIMEOUT_AVG_US)
-				cond_resched();
+				cond_resched_stall();
 			continue;
 		}
 		ret = 0;
@@ -95,7 +95,7 @@ static int isst_if_mbox_cmd(struct pci_dev *pdev,
 			ret = -EBUSY;
 			tm_delta = ktime_us_delta(ktime_get(), tm);
 			if (tm_delta > OS_MAILBOX_TIMEOUT_AVG_US)
-				cond_resched();
+				cond_resched_stall();
 			continue;
 		}
 
diff --git a/drivers/s390/cio/css.c b/drivers/s390/cio/css.c
index 3ff46fc694f8..6122a4a057fa 100644
--- a/drivers/s390/cio/css.c
+++ b/drivers/s390/cio/css.c
@@ -659,11 +659,6 @@ static int slow_eval_known_fn(struct subchannel *sch, void *data)
 		rc = css_evaluate_known_subchannel(sch, 1);
 		if (rc == -EAGAIN)
 			css_schedule_eval(sch->schid);
-		/*
-		 * The loop might take long time for platforms with lots of
-		 * known devices. Allow scheduling here.
-		 */
-		cond_resched();
 	}
 	return 0;
 }
@@ -695,9 +690,6 @@ static int slow_eval_unknown_fn(struct subchannel_id schid, void *data)
 		default:
 			rc = 0;
 		}
-		/* Allow scheduling here since the containing loop might
-		 * take a while.  */
-		cond_resched();
 	}
 	return rc;
 }
diff --git a/drivers/scsi/NCR5380.c b/drivers/scsi/NCR5380.c
index cea3a79d538e..40e66afd77cf 100644
--- a/drivers/scsi/NCR5380.c
+++ b/drivers/scsi/NCR5380.c
@@ -738,8 +738,6 @@ static void NCR5380_main(struct work_struct *work)
 			maybe_release_dma_irq(instance);
 		}
 		spin_unlock_irq(&hostdata->lock);
-		if (!done)
-			cond_resched();
 	} while (!done);
 }
 
diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c
index e92f1a73cc9b..675504f8149a 100644
--- a/drivers/scsi/megaraid.c
+++ b/drivers/scsi/megaraid.c
@@ -1696,7 +1696,6 @@ __mega_busywait_mbox (adapter_t *adapter)
 		if (!mbox->m_in.busy)
 			return 0;
 		udelay(100);
-		cond_resched();
 	}
 	return -1;		/* give up after 1 second */
 }
diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
index cd0180b1f5b9..9e2596199458 100644
--- a/drivers/scsi/qedi/qedi_main.c
+++ b/drivers/scsi/qedi/qedi_main.c
@@ -1943,7 +1943,6 @@ static int qedi_percpu_io_thread(void *arg)
 				if (!work->is_solicited)
 					kfree(work);
 			}
-			cond_resched();
 			spin_lock_irqsave(&p->p_work_lock, flags);
 		}
 		set_current_state(TASK_INTERRUPTIBLE);
diff --git a/drivers/scsi/qla2xxx/qla_nx.c b/drivers/scsi/qla2xxx/qla_nx.c
index 6dfb70edb9a6..e1a5c2dbe134 100644
--- a/drivers/scsi/qla2xxx/qla_nx.c
+++ b/drivers/scsi/qla2xxx/qla_nx.c
@@ -972,7 +972,6 @@ qla82xx_flash_wait_write_finish(struct qla_hw_data *ha)
 		if (ret < 0 || (val & 1) == 0)
 			return ret;
 		udelay(10);
-		cond_resched();
 	}
 	ql_log(ql_log_warn, vha, 0xb00d,
 	       "Timeout reached waiting for write finish.\n");
@@ -1037,7 +1036,6 @@ ql82xx_rom_lock_d(struct qla_hw_data *ha)
 
 	while ((qla82xx_rom_lock(ha) != 0) && (loops < 50000)) {
 		udelay(100);
-		cond_resched();
 		loops++;
 	}
 	if (loops >= 50000) {
diff --git a/drivers/scsi/qla2xxx/qla_sup.c b/drivers/scsi/qla2xxx/qla_sup.c
index c092a6b1ced4..40fc521ba89f 100644
--- a/drivers/scsi/qla2xxx/qla_sup.c
+++ b/drivers/scsi/qla2xxx/qla_sup.c
@@ -463,7 +463,6 @@ qla24xx_read_flash_dword(struct qla_hw_data *ha, uint32_t addr, uint32_t *data)
 			return QLA_SUCCESS;
 		}
 		udelay(10);
-		cond_resched();
 	}
 
 	ql_log(ql_log_warn, pci_get_drvdata(ha->pdev), 0x7090,
@@ -505,7 +504,6 @@ qla24xx_write_flash_dword(struct qla_hw_data *ha, uint32_t addr, uint32_t data)
 		if (!(rd_reg_dword(&reg->flash_addr) & FARX_DATA_FLAG))
 			return QLA_SUCCESS;
 		udelay(10);
-		cond_resched();
 	}
 
 	ql_log(ql_log_warn, pci_get_drvdata(ha->pdev), 0x7090,
@@ -2151,7 +2149,6 @@ qla2x00_poll_flash(struct qla_hw_data *ha, uint32_t addr, uint8_t poll_data,
 		}
 		udelay(10);
 		barrier();
-		cond_resched();
 	}
 	return status;
 }
@@ -2301,7 +2298,6 @@ qla2x00_read_flash_data(struct qla_hw_data *ha, uint8_t *tmp_buf,
 		if (saddr % 100)
 			udelay(10);
 		*tmp_buf = data;
-		cond_resched();
 	}
 }
 
@@ -2589,7 +2585,6 @@ qla2x00_write_optrom_data(struct scsi_qla_host *vha, void *buf,
 				rval = QLA_FUNCTION_FAILED;
 				break;
 			}
-			cond_resched();
 		}
 	} while (0);
 	qla2x00_flash_disable(ha);
diff --git a/drivers/scsi/qla4xxx/ql4_nx.c b/drivers/scsi/qla4xxx/ql4_nx.c
index 47adff9f0506..e40a525a2202 100644
--- a/drivers/scsi/qla4xxx/ql4_nx.c
+++ b/drivers/scsi/qla4xxx/ql4_nx.c
@@ -3643,7 +3643,6 @@ qla4_82xx_read_flash_data(struct scsi_qla_host *ha, uint32_t *dwptr,
 	int loops = 0;
 	while ((qla4_82xx_rom_lock(ha) != 0) && (loops < 50000)) {
 		udelay(100);
-		cond_resched();
 		loops++;
 	}
 	if (loops >= 50000) {
diff --git a/drivers/scsi/xen-scsifront.c b/drivers/scsi/xen-scsifront.c
index 9ec55ddc1204..6f8e0c69f832 100644
--- a/drivers/scsi/xen-scsifront.c
+++ b/drivers/scsi/xen-scsifront.c
@@ -442,7 +442,7 @@ static irqreturn_t scsifront_irq_fn(int irq, void *dev_id)
 
 	while (scsifront_cmd_done(info, &eoiflag))
 		/* Yield point for this unbounded loop. */
-		cond_resched();
+		cond_resched_stall();
 
 	xen_irq_lateeoi(irq, eoiflag);
 
diff --git a/drivers/spi/spi-lantiq-ssc.c b/drivers/spi/spi-lantiq-ssc.c
index 938e9e577e4f..151b381fc098 100644
--- a/drivers/spi/spi-lantiq-ssc.c
+++ b/drivers/spi/spi-lantiq-ssc.c
@@ -775,8 +775,7 @@ static void lantiq_ssc_bussy_work(struct work_struct *work)
 			spi_finalize_current_transfer(spi->host);
 			return;
 		}
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, end));
 
 	if (spi->host->cur_msg)
diff --git a/drivers/spi/spi-meson-spifc.c b/drivers/spi/spi-meson-spifc.c
index 06626f406f68..ff3550ebb22b 100644
--- a/drivers/spi/spi-meson-spifc.c
+++ b/drivers/spi/spi-meson-spifc.c
@@ -100,7 +100,7 @@ static int meson_spifc_wait_ready(struct meson_spifc *spifc)
 		regmap_read(spifc->regmap, REG_SLAVE, &data);
 		if (data & SLAVE_TRST_DONE)
 			return 0;
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after(jiffies, deadline));
 
 	return -ETIMEDOUT;
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 8d6304cb061e..3ddbfa9babdc 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1808,7 +1808,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
 
 	/* Prod the scheduler in case transfer_one() was busy waiting */
 	if (!ret)
-		cond_resched();
+		cond_resched_stall();
 	return;
 
 out_unlock:
diff --git a/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c b/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c
index 985683767a40..2a2ebdf12a45 100644
--- a/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c
+++ b/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c
@@ -3775,7 +3775,7 @@ unsigned int send_beacon(struct adapter *padapter)
 		issue_beacon(padapter, 100);
 		issue++;
 		do {
-			cond_resched();
+			cond_resched_stall();
 			rtw_hal_get_hwreg(padapter, HW_VAR_BCN_VALID, (u8 *)(&bxmitok));
 			poll++;
 		} while ((poll%10) != 0 && false == bxmitok && !padapter->bSurpriseRemoved && !padapter->bDriverStopped);
diff --git a/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c b/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
index a392d5b4caf2..c263fbc71201 100644
--- a/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
+++ b/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
@@ -576,8 +576,6 @@ void LPS_Leave_check(struct adapter *padapter)
 	bReady = false;
 	start_time = jiffies;
 
-	cond_resched();
-
 	while (1) {
 		mutex_lock(&pwrpriv->lock);
 
diff --git a/drivers/tee/optee/ffa_abi.c b/drivers/tee/optee/ffa_abi.c
index 0828240f27e6..49f55c051d71 100644
--- a/drivers/tee/optee/ffa_abi.c
+++ b/drivers/tee/optee/ffa_abi.c
@@ -581,7 +581,6 @@ static int optee_ffa_yielding_call(struct tee_context *ctx,
 		 * filled in by ffa_mem_ops->sync_send_receive() returning
 		 * above.
 		 */
-		cond_resched();
 		optee_handle_ffa_rpc(ctx, optee, data->data1, rpc_arg);
 		cmd = OPTEE_FFA_YIELDING_CALL_RESUME;
 		data->data0 = cmd;
diff --git a/drivers/tee/optee/smc_abi.c b/drivers/tee/optee/smc_abi.c
index d5b28fd35d66..86e01454422c 100644
--- a/drivers/tee/optee/smc_abi.c
+++ b/drivers/tee/optee/smc_abi.c
@@ -943,7 +943,6 @@ static int optee_smc_do_call_with_arg(struct tee_context *ctx,
 			 */
 			optee_cq_wait_for_completion(&optee->call_queue, &w);
 		} else if (OPTEE_SMC_RETURN_IS_RPC(res.a0)) {
-			cond_resched();
 			param.a0 = res.a0;
 			param.a1 = res.a1;
 			param.a2 = res.a2;
diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 959fae54ca39..11bb4204b78d 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -538,7 +538,6 @@ static ssize_t hvc_write(struct tty_struct *tty, const u8 *buf, size_t count)
 		if (count) {
 			if (hp->n_outbuf > 0)
 				hvc_flush(hp);
-			cond_resched();
 		}
 	}
 
@@ -653,7 +652,7 @@ static int __hvc_poll(struct hvc_struct *hp, bool may_sleep)
 
 	if (may_sleep) {
 		spin_unlock_irqrestore(&hp->lock, flags);
-		cond_resched();
+
 		spin_lock_irqsave(&hp->lock, flags);
 	}
 
@@ -725,7 +724,7 @@ static int __hvc_poll(struct hvc_struct *hp, bool may_sleep)
 	if (may_sleep) {
 		/* Keep going until the flip is full */
 		spin_unlock_irqrestore(&hp->lock, flags);
-		cond_resched();
+
 		spin_lock_irqsave(&hp->lock, flags);
 		goto read_again;
 	} else if (read_total < HVC_ATOMIC_READ_MAX) {
@@ -802,7 +801,6 @@ static int khvcd(void *unused)
 			mutex_lock(&hvc_structs_mutex);
 			list_for_each_entry(hp, &hvc_structs, next) {
 				poll_mask |= __hvc_poll(hp, true);
-				cond_resched();
 			}
 			mutex_unlock(&hvc_structs_mutex);
 		} else
diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
index 5f6d0cf67571..c70d695ed69d 100644
--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -498,9 +498,6 @@ static void flush_to_ldisc(struct work_struct *work)
 			lookahead_bufs(port, head);
 		if (!rcvd)
 			break;
-
-		if (need_resched())
-			cond_resched();
 	}
 
 	mutex_unlock(&buf->lock);
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 8a94e5a43c6d..0221ff17a4bf 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1032,7 +1032,6 @@ static ssize_t iterate_tty_write(struct tty_ldisc *ld, struct tty_struct *tty,
 		ret = -ERESTARTSYS;
 		if (signal_pending(current))
 			break;
-		cond_resched();
 	}
 	if (written) {
 		tty_update_time(tty, true);
diff --git a/drivers/usb/gadget/udc/max3420_udc.c b/drivers/usb/gadget/udc/max3420_udc.c
index 2d57786d3db7..b9051c341b10 100644
--- a/drivers/usb/gadget/udc/max3420_udc.c
+++ b/drivers/usb/gadget/udc/max3420_udc.c
@@ -451,7 +451,6 @@ static void __max3420_start(struct max3420_udc *udc)
 		val = spi_rd8(udc, MAX3420_REG_USBIRQ);
 		if (val & OSCOKIRQ)
 			break;
-		cond_resched();
 	}
 
 	/* Enable PULL-UP only when Vbus detected */
diff --git a/drivers/usb/host/max3421-hcd.c b/drivers/usb/host/max3421-hcd.c
index d152d72de126..64f12f5113a2 100644
--- a/drivers/usb/host/max3421-hcd.c
+++ b/drivers/usb/host/max3421-hcd.c
@@ -1294,7 +1294,7 @@ max3421_reset_hcd(struct usb_hcd *hcd)
 				"timed out waiting for oscillator OK signal");
 			return 1;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/*
diff --git a/drivers/usb/host/xen-hcd.c b/drivers/usb/host/xen-hcd.c
index 46fdab940092..0b78f371c30a 100644
--- a/drivers/usb/host/xen-hcd.c
+++ b/drivers/usb/host/xen-hcd.c
@@ -1086,7 +1086,7 @@ static irqreturn_t xenhcd_int(int irq, void *dev_id)
 	while (xenhcd_urb_request_done(info, &eoiflag) |
 	       xenhcd_conn_notify(info, &eoiflag))
 		/* Yield point for this unbounded loop. */
-		cond_resched();
+		cond_resched_stall();
 
 	xen_irq_lateeoi(irq, eoiflag);
 	return IRQ_HANDLED;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a94ec6225d31..523c6685818d 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -457,8 +457,6 @@ static int tce_iommu_clear(struct tce_container *container,
 			}
 		}
 
-		cond_resched();
-
 		direction = DMA_NONE;
 		oldhpa = 0;
 		ret = iommu_tce_xchg_no_kill(container->mm, tbl, entry, &oldhpa,
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index eacd6ec04de5..afc9724051ce 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -962,8 +962,6 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
 		kfree(entry);
 	}
 
-	cond_resched();
-
 	return unlocked;
 }
 
@@ -1029,7 +1027,6 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
 						     unmapped >> PAGE_SHIFT,
 						     false);
 		*iova += unmapped;
-		cond_resched();
 	}
 	return unmapped;
 }
@@ -1062,7 +1059,6 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	list_for_each_entry_continue(d, &iommu->domain_list, next) {
 		iommu_unmap(d->domain, dma->iova, dma->size);
-		cond_resched();
 	}
 
 	iommu_iotlb_gather_init(&iotlb_gather);
@@ -1439,8 +1435,6 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 				GFP_KERNEL);
 		if (ret)
 			goto unwind;
-
-		cond_resched();
 	}
 
 	return 0;
@@ -1448,7 +1442,6 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 unwind:
 	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
 		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
-		cond_resched();
 	}
 
 	return ret;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..8939be49c47d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -410,7 +410,6 @@ static bool vhost_worker(void *data)
 			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
-			cond_resched();
 		}
 	}
 
diff --git a/drivers/video/console/vgacon.c b/drivers/video/console/vgacon.c
index 7ad047bcae17..e17e7937e11d 100644
--- a/drivers/video/console/vgacon.c
+++ b/drivers/video/console/vgacon.c
@@ -870,12 +870,10 @@ static int vgacon_do_font_op(struct vgastate *state, char *arg, int set,
 		if (set)
 			for (i = 0; i < cmapsz; i++) {
 				vga_writeb(arg[i], charmap + i);
-				cond_resched();
 			}
 		else
 			for (i = 0; i < cmapsz; i++) {
 				arg[i] = vga_readb(charmap + i);
-				cond_resched();
 			}
 
 		/*
@@ -889,12 +887,10 @@ static int vgacon_do_font_op(struct vgastate *state, char *arg, int set,
 			if (set)
 				for (i = 0; i < cmapsz; i++) {
 					vga_writeb(arg[i], charmap + i);
-					cond_resched();
 				}
 			else
 				for (i = 0; i < cmapsz; i++) {
 					arg[i] = vga_readb(charmap + i);
-					cond_resched();
 				}
 		}
 	}
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index fa5226c198cc..c9c66aac49ca 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1754,7 +1754,6 @@ static int virtio_mem_sbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 			rc = virtio_mem_sbm_plug_any_sb(vm, mb_id, &nb_sb);
 			if (rc || !nb_sb)
 				goto out_unlock;
-			cond_resched();
 		}
 	}
 
@@ -1772,7 +1771,6 @@ static int virtio_mem_sbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 		rc = virtio_mem_sbm_plug_and_add_mb(vm, mb_id, &nb_sb);
 		if (rc || !nb_sb)
 			return rc;
-		cond_resched();
 	}
 
 	/* Try to prepare, plug and add new blocks */
@@ -1786,7 +1784,6 @@ static int virtio_mem_sbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 		rc = virtio_mem_sbm_plug_and_add_mb(vm, mb_id, &nb_sb);
 		if (rc)
 			return rc;
-		cond_resched();
 	}
 
 	return 0;
@@ -1869,7 +1866,6 @@ static int virtio_mem_bbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 			nb_bb--;
 		if (rc || !nb_bb)
 			return rc;
-		cond_resched();
 	}
 
 	/* Try to prepare, plug and add new big blocks */
@@ -1885,7 +1881,6 @@ static int virtio_mem_bbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 			nb_bb--;
 		if (rc)
 			return rc;
-		cond_resched();
 	}
 
 	return 0;
@@ -2107,7 +2102,6 @@ static int virtio_mem_sbm_unplug_request(struct virtio_mem *vm, uint64_t diff)
 			if (rc || !nb_sb)
 				goto out_unlock;
 			mutex_unlock(&vm->hotplug_mutex);
-			cond_resched();
 			mutex_lock(&vm->hotplug_mutex);
 		}
 		if (!unplug_online && i == 1) {
@@ -2250,8 +2244,6 @@ static int virtio_mem_bbm_unplug_request(struct virtio_mem *vm, uint64_t diff)
 	 */
 	for (i = 0; i < 3; i++) {
 		virtio_mem_bbm_for_each_bb_rev(vm, bb_id, VIRTIO_MEM_BBM_BB_ADDED) {
-			cond_resched();
-
 			/*
 			 * As we're holding no locks, these checks are racy,
 			 * but we don't care.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* [RFC PATCH 86/86] sched: remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (27 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 85/86] treewide: drivers: " Ankur Arora
@ 2023-11-07 23:08   ` Ankur Arora
  2023-11-07 23:19   ` [RFC PATCH 57/86] coccinelle: script to " Julia Lawall
  2023-11-21  0:45   ` Paul E. McKenney
  30 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

Now that we don't have any users of cond_resched() in the tree,
we can finally remove it.

Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Juri Lelli <juri.lelli@redhat.com> 
Cc: Vincent Guittot <vincent.guittot@linaro.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched.h | 16 ++++------------
 kernel/sched/core.c   | 13 -------------
 2 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bae6eed534dd..bbb981c1a142 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,19 +2083,11 @@ static inline bool test_tsk_need_resched_any(struct task_struct *tsk)
 }
 
 /*
- * cond_resched() and cond_resched_lock(): latency reduction via
- * explicit rescheduling in places that are safe. The return
- * value indicates whether a reschedule was done in fact.
- * cond_resched_lock() will drop the spinlock before scheduling,
+ * cond_resched_lock(): latency reduction via explicit rescheduling
+ * in places that are safe. The return value indicates whether a
+ * reschedule was done in fact.  cond_resched_lock() will drop the
+ * spinlock before scheduling.
  */
-#ifdef CONFIG_PREEMPTION
-static inline int _cond_resched(void) { return 0; }
-#endif
-
-#define cond_resched() ({			\
-	__might_resched(__FILE__, __LINE__, 0);	\
-	_cond_resched();			\
-})
 
 extern int __cond_resched_lock(spinlock_t *lock);
 extern int __cond_resched_rwlock_read(rwlock_t *lock);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 691b50791e04..6940893e3930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8580,19 +8580,6 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
-#ifndef CONFIG_PREEMPTION
-int __sched _cond_resched(void)
-{
-	if (should_resched(0)) {
-		preempt_schedule_common();
-		return 1;
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL(_cond_resched);
-#endif
-
 /*
  * __cond_resched_lock() - if a reschedule is pending, drop the given lock
  * (implicitly calling schedule), and reacquire the lock.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout"
  2023-11-07 21:56 ` [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout" Ankur Arora
@ 2023-11-07 23:10   ` Steven Rostedt
  2023-11-07 23:23     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:10 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:56:49 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> This reverts commit 089c02ae2771a14af2928c59c56abfb9b885a8d7.

I'd rather not revert this.

If user space can decide between various versions of preemption, then the
trace should reflect that. At least state what the preemption model was when
a trace started, or currently is.

That is, the model may not be "static" per boot. Anyway, the real change here should be:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b4b1fcd6f93..2553c4efca15 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2208,14 +2208,6 @@ static inline void cond_resched_rcu(void)
 #endif
 }
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-extern bool preempt_model_none(void);
-extern bool preempt_model_voluntary(void);
-extern bool preempt_model_full(void);
-
-#else
-
 static inline bool preempt_model_none(void)
 {
 	return IS_ENABLED(CONFIG_PREEMPT_NONE);
@@ -2229,8 +2221,6 @@ static inline bool preempt_model_full(void)
 	return IS_ENABLED(CONFIG_PREEMPT);
 }
 
-#endif
-
 static inline bool preempt_model_rt(void)
 {
 	return IS_ENABLED(CONFIG_PREEMPT_RT);


Then this way we can decide to make it runtime dynamic, and we don't need to
fiddle with the tracing code again.
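
As an illustration (a sketch only -- the function name here is made up, this
is not the actual tracing code), a consumer like the trace header could then
report whatever model is currently in effect through the same accessors:

static const char *preempt_model_str(void)
{
	if (preempt_model_rt())
		return "rt";
	if (preempt_model_full())
		return "full";
	if (preempt_model_voluntary())
		return "voluntary";
	return "none";
}

seq_printf() that string into the header at print time and the output stays
correct even if the model is later switched at runtime.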

-- Steve

^ permalink raw reply related	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 04/86] Revert "preempt/dynamic: Introduce preemption model accessors"
  2023-11-07 21:56 ` [RFC PATCH 04/86] Revert "preempt/dynamic: Introduce preemption model accessors" Ankur Arora
@ 2023-11-07 23:12   ` Steven Rostedt
  2023-11-08  4:59     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:12 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:56:50 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

I know this is an RFC but I'll state it here just so that it is stated. All
reverts need a changelog description of why the revert happened, even if you
are just cutting and pasting the reason for every commit. That's because git
commits need to be stand alone and not depend on information in other git
commit change logs.

-- Steve


> This reverts commit cfe43f478b79ba45573ca22d52d0d8823be068fa.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  include/linux/sched.h | 41 -----------------------------------------
>  kernel/sched/core.c   | 12 ------------
>  2 files changed, 53 deletions(-)
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-07 21:56 ` [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()" Ankur Arora
@ 2023-11-07 23:16   ` Steven Rostedt
  2023-11-08  4:55     ` Ankur Arora
  2023-11-09 17:26     ` Josh Poimboeuf
  0 siblings, 2 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:16 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Josh Poimboeuf, Jiri Kosina, Miroslav Benes,
	Petr Mladek, Joe Lawrence, live-patching

On Tue,  7 Nov 2023 13:56:53 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
> 
> Note that removing this commit reintroduces "live patches failing to
> complete within a reasonable amount of time due to CPU-bound kthreads."
> 
> Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
> existence of cond_resched() so this will need an alternate fix.
> 

Then it would probably be a good idea to Cc the live patching maintainers!

-- Steve

> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  include/linux/livepatch.h       |   1 -
>  include/linux/livepatch_sched.h |  29 ---------
>  include/linux/sched.h           |  20 ++----
>  kernel/livepatch/core.c         |   1 -
>  kernel/livepatch/transition.c   | 107 +++++---------------------------
>  kernel/sched/core.c             |  64 +++----------------
>  6 files changed, 28 insertions(+), 194 deletions(-)
>  delete mode 100644 include/linux/livepatch_sched.h
> 
> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index 9b9b38e89563..293e29960c6e 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -13,7 +13,6 @@
>  #include <linux/ftrace.h>
>  #include <linux/completion.h>
>  #include <linux/list.h>
> -#include <linux/livepatch_sched.h>
>  
>  #if IS_ENABLED(CONFIG_LIVEPATCH)
>  
> diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
> deleted file mode 100644
> index 013794fb5da0..000000000000
> --- a/include/linux/livepatch_sched.h
> +++ /dev/null
> @@ -1,29 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-or-later */
> -#ifndef _LINUX_LIVEPATCH_SCHED_H_
> -#define _LINUX_LIVEPATCH_SCHED_H_
> -
> -#include <linux/jump_label.h>
> -#include <linux/static_call_types.h>
> -
> -#ifdef CONFIG_LIVEPATCH
> -
> -void __klp_sched_try_switch(void);
> -
> -#if !defined(CONFIG_PREEMPT_DYNAMIC) || !defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> -
> -DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
> -
> -static __always_inline void klp_sched_try_switch(void)
> -{
> -	if (static_branch_unlikely(&klp_sched_try_switch_key))
> -		__klp_sched_try_switch();
> -}
> -
> -#endif /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
> -
> -#else /* !CONFIG_LIVEPATCH */
> -static inline void klp_sched_try_switch(void) {}
> -static inline void __klp_sched_try_switch(void) {}
> -#endif /* CONFIG_LIVEPATCH */
> -
> -#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5bdf80136e42..c5b0ef1ecfe4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -36,7 +36,6 @@
>  #include <linux/seqlock.h>
>  #include <linux/kcsan.h>
>  #include <linux/rv.h>
> -#include <linux/livepatch_sched.h>
>  #include <asm/kmap_size.h>
>  
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -2087,9 +2086,6 @@ extern int __cond_resched(void);
>  
>  #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>  
> -void sched_dynamic_klp_enable(void);
> -void sched_dynamic_klp_disable(void);
> -
>  DECLARE_STATIC_CALL(cond_resched, __cond_resched);
>  
>  static __always_inline int _cond_resched(void)
> @@ -2098,7 +2094,6 @@ static __always_inline int _cond_resched(void)
>  }
>  
>  #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> -
>  extern int dynamic_cond_resched(void);
>  
>  static __always_inline int _cond_resched(void)
> @@ -2106,25 +2101,20 @@ static __always_inline int _cond_resched(void)
>  	return dynamic_cond_resched();
>  }
>  
> -#else /* !CONFIG_PREEMPTION */
> +#else
>  
>  static inline int _cond_resched(void)
>  {
> -	klp_sched_try_switch();
>  	return __cond_resched();
>  }
>  
> -#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
> +#endif /* CONFIG_PREEMPT_DYNAMIC */
>  
> -#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
> +#else
>  
> -static inline int _cond_resched(void)
> -{
> -	klp_sched_try_switch();
> -	return 0;
> -}
> +static inline int _cond_resched(void) { return 0; }
>  
> -#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
> +#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
>  
>  #define cond_resched() ({			\
>  	__might_resched(__FILE__, __LINE__, 0);	\
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 61328328c474..fc851455740c 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -33,7 +33,6 @@
>   *
>   * - klp_ftrace_handler()
>   * - klp_update_patch_state()
> - * - __klp_sched_try_switch()
>   */
>  DEFINE_MUTEX(klp_mutex);
>  
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index e54c3d60a904..70bc38f27af7 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -9,7 +9,6 @@
>  
>  #include <linux/cpu.h>
>  #include <linux/stacktrace.h>
> -#include <linux/static_call.h>
>  #include "core.h"
>  #include "patch.h"
>  #include "transition.h"
> @@ -27,25 +26,6 @@ static int klp_target_state = KLP_UNDEFINED;
>  
>  static unsigned int klp_signals_cnt;
>  
> -/*
> - * When a livepatch is in progress, enable klp stack checking in
> - * cond_resched().  This helps CPU-bound kthreads get patched.
> - */
> -#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> -
> -#define klp_cond_resched_enable() sched_dynamic_klp_enable()
> -#define klp_cond_resched_disable() sched_dynamic_klp_disable()
> -
> -#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
> -
> -DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
> -EXPORT_SYMBOL(klp_sched_try_switch_key);
> -
> -#define klp_cond_resched_enable() static_branch_enable(&klp_sched_try_switch_key)
> -#define klp_cond_resched_disable() static_branch_disable(&klp_sched_try_switch_key)
> -
> -#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
> -
>  /*
>   * This work can be performed periodically to finish patching or unpatching any
>   * "straggler" tasks which failed to transition in the first attempt.
> @@ -194,8 +174,8 @@ void klp_update_patch_state(struct task_struct *task)
>  	 * barrier (smp_rmb) for two cases:
>  	 *
>  	 * 1) Enforce the order of the TIF_PATCH_PENDING read and the
> -	 *    klp_target_state read.  The corresponding write barriers are in
> -	 *    klp_init_transition() and klp_reverse_transition().
> +	 *    klp_target_state read.  The corresponding write barrier is in
> +	 *    klp_init_transition().
>  	 *
>  	 * 2) Enforce the order of the TIF_PATCH_PENDING read and a future read
>  	 *    of func->transition, if klp_ftrace_handler() is called later on
> @@ -363,44 +343,6 @@ static bool klp_try_switch_task(struct task_struct *task)
>  	return !ret;
>  }
>  
> -void __klp_sched_try_switch(void)
> -{
> -	if (likely(!klp_patch_pending(current)))
> -		return;
> -
> -	/*
> -	 * This function is called from cond_resched() which is called in many
> -	 * places throughout the kernel.  Using the klp_mutex here might
> -	 * deadlock.
> -	 *
> -	 * Instead, disable preemption to prevent racing with other callers of
> -	 * klp_try_switch_task().  Thanks to task_call_func() they won't be
> -	 * able to switch this task while it's running.
> -	 */
> -	preempt_disable();
> -
> -	/*
> -	 * Make sure current didn't get patched between the above check and
> -	 * preempt_disable().
> -	 */
> -	if (unlikely(!klp_patch_pending(current)))
> -		goto out;
> -
> -	/*
> -	 * Enforce the order of the TIF_PATCH_PENDING read above and the
> -	 * klp_target_state read in klp_try_switch_task().  The corresponding
> -	 * write barriers are in klp_init_transition() and
> -	 * klp_reverse_transition().
> -	 */
> -	smp_rmb();
> -
> -	klp_try_switch_task(current);
> -
> -out:
> -	preempt_enable();
> -}
> -EXPORT_SYMBOL(__klp_sched_try_switch);
> -
>  /*
>   * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
>   * Kthreads with TIF_PATCH_PENDING set are woken up.
> @@ -507,8 +449,7 @@ void klp_try_complete_transition(void)
>  		return;
>  	}
>  
> -	/* Done!  Now cleanup the data structures. */
> -	klp_cond_resched_disable();
> +	/* we're done, now cleanup the data structures */
>  	patch = klp_transition_patch;
>  	klp_complete_transition();
>  
> @@ -560,8 +501,6 @@ void klp_start_transition(void)
>  			set_tsk_thread_flag(task, TIF_PATCH_PENDING);
>  	}
>  
> -	klp_cond_resched_enable();
> -
>  	klp_signals_cnt = 0;
>  }
>  
> @@ -617,9 +556,8 @@ void klp_init_transition(struct klp_patch *patch, int state)
>  	 * see a func in transition with a task->patch_state of KLP_UNDEFINED.
>  	 *
>  	 * Also enforce the order of the klp_target_state write and future
> -	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() and
> -	 * __klp_sched_try_switch() don't set a task->patch_state to
> -	 * KLP_UNDEFINED.
> +	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() doesn't
> +	 * set a task->patch_state to KLP_UNDEFINED.
>  	 */
>  	smp_wmb();
>  
> @@ -655,10 +593,14 @@ void klp_reverse_transition(void)
>  		 klp_target_state == KLP_PATCHED ? "patching to unpatching" :
>  						   "unpatching to patching");
>  
> +	klp_transition_patch->enabled = !klp_transition_patch->enabled;
> +
> +	klp_target_state = !klp_target_state;
> +
>  	/*
>  	 * Clear all TIF_PATCH_PENDING flags to prevent races caused by
> -	 * klp_update_patch_state() or __klp_sched_try_switch() running in
> -	 * parallel with the reverse transition.
> +	 * klp_update_patch_state() running in parallel with
> +	 * klp_start_transition().
>  	 */
>  	read_lock(&tasklist_lock);
>  	for_each_process_thread(g, task)
> @@ -668,28 +610,9 @@ void klp_reverse_transition(void)
>  	for_each_possible_cpu(cpu)
>  		clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
>  
> -	/*
> -	 * Make sure all existing invocations of klp_update_patch_state() and
> -	 * __klp_sched_try_switch() see the cleared TIF_PATCH_PENDING before
> -	 * starting the reverse transition.
> -	 */
> +	/* Let any remaining calls to klp_update_patch_state() complete */
>  	klp_synchronize_transition();
>  
> -	/*
> -	 * All patching has stopped, now re-initialize the global variables to
> -	 * prepare for the reverse transition.
> -	 */
> -	klp_transition_patch->enabled = !klp_transition_patch->enabled;
> -	klp_target_state = !klp_target_state;
> -
> -	/*
> -	 * Enforce the order of the klp_target_state write and the
> -	 * TIF_PATCH_PENDING writes in klp_start_transition() to ensure
> -	 * klp_update_patch_state() and __klp_sched_try_switch() don't set
> -	 * task->patch_state to the wrong value.
> -	 */
> -	smp_wmb();
> -
>  	klp_start_transition();
>  }
>  
> @@ -703,9 +626,9 @@ void klp_copy_process(struct task_struct *child)
>  	 * the task flag up to date with the parent here.
>  	 *
>  	 * The operation is serialized against all klp_*_transition()
> -	 * operations by the tasklist_lock. The only exceptions are
> -	 * klp_update_patch_state(current) and __klp_sched_try_switch(), but we
> -	 * cannot race with them because we are current.
> +	 * operations by the tasklist_lock. The only exception is
> +	 * klp_update_patch_state(current), but we cannot race with
> +	 * that because we are current.
>  	 */
>  	if (test_tsk_thread_flag(current, TIF_PATCH_PENDING))
>  		set_tsk_thread_flag(child, TIF_PATCH_PENDING);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0e8764d63041..b43fda3c5733 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8597,7 +8597,6 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
>  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
>  int __sched dynamic_cond_resched(void)
>  {
> -	klp_sched_try_switch();
>  	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
>  		return 0;
>  	return __cond_resched();
> @@ -8746,17 +8745,13 @@ int sched_dynamic_mode(const char *str)
>  #error "Unsupported PREEMPT_DYNAMIC mechanism"
>  #endif
>  
> -DEFINE_MUTEX(sched_dynamic_mutex);
> -static bool klp_override;
> -
> -static void __sched_dynamic_update(int mode)
> +void sched_dynamic_update(int mode)
>  {
>  	/*
>  	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
>  	 * the ZERO state, which is invalid.
>  	 */
> -	if (!klp_override)
> -		preempt_dynamic_enable(cond_resched);
> +	preempt_dynamic_enable(cond_resched);
>  	preempt_dynamic_enable(might_resched);
>  	preempt_dynamic_enable(preempt_schedule);
>  	preempt_dynamic_enable(preempt_schedule_notrace);
> @@ -8764,79 +8759,36 @@ static void __sched_dynamic_update(int mode)
>  
>  	switch (mode) {
>  	case preempt_dynamic_none:
> -		if (!klp_override)
> -			preempt_dynamic_enable(cond_resched);
> +		preempt_dynamic_enable(cond_resched);
>  		preempt_dynamic_disable(might_resched);
>  		preempt_dynamic_disable(preempt_schedule);
>  		preempt_dynamic_disable(preempt_schedule_notrace);
>  		preempt_dynamic_disable(irqentry_exit_cond_resched);
> -		if (mode != preempt_dynamic_mode)
> -			pr_info("Dynamic Preempt: none\n");
> +		pr_info("Dynamic Preempt: none\n");
>  		break;
>  
>  	case preempt_dynamic_voluntary:
> -		if (!klp_override)
> -			preempt_dynamic_enable(cond_resched);
> +		preempt_dynamic_enable(cond_resched);
>  		preempt_dynamic_enable(might_resched);
>  		preempt_dynamic_disable(preempt_schedule);
>  		preempt_dynamic_disable(preempt_schedule_notrace);
>  		preempt_dynamic_disable(irqentry_exit_cond_resched);
> -		if (mode != preempt_dynamic_mode)
> -			pr_info("Dynamic Preempt: voluntary\n");
> +		pr_info("Dynamic Preempt: voluntary\n");
>  		break;
>  
>  	case preempt_dynamic_full:
> -		if (!klp_override)
> -			preempt_dynamic_disable(cond_resched);
> +		preempt_dynamic_disable(cond_resched);
>  		preempt_dynamic_disable(might_resched);
>  		preempt_dynamic_enable(preempt_schedule);
>  		preempt_dynamic_enable(preempt_schedule_notrace);
>  		preempt_dynamic_enable(irqentry_exit_cond_resched);
> -		if (mode != preempt_dynamic_mode)
> -			pr_info("Dynamic Preempt: full\n");
> +		pr_info("Dynamic Preempt: full\n");
>  		break;
>  	}
>  
>  	preempt_dynamic_mode = mode;
>  }
>  
> -void sched_dynamic_update(int mode)
> -{
> -	mutex_lock(&sched_dynamic_mutex);
> -	__sched_dynamic_update(mode);
> -	mutex_unlock(&sched_dynamic_mutex);
> -}
> -
> -#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
> -
> -static int klp_cond_resched(void)
> -{
> -	__klp_sched_try_switch();
> -	return __cond_resched();
> -}
> -
> -void sched_dynamic_klp_enable(void)
> -{
> -	mutex_lock(&sched_dynamic_mutex);
> -
> -	klp_override = true;
> -	static_call_update(cond_resched, klp_cond_resched);
> -
> -	mutex_unlock(&sched_dynamic_mutex);
> -}
> -
> -void sched_dynamic_klp_disable(void)
> -{
> -	mutex_lock(&sched_dynamic_mutex);
> -
> -	klp_override = false;
> -	__sched_dynamic_update(preempt_dynamic_mode);
> -
> -	mutex_unlock(&sched_dynamic_mutex);
> -}
> -
> -#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
> -
>  static int __init setup_preempt_mode(char *str)
>  {
>  	int mode = sched_dynamic_mode(str);


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC"
  2023-11-07 21:56 ` [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC" Ankur Arora
@ 2023-11-07 23:17   ` Steven Rostedt
  2023-11-08 15:44   ` Mark Rutland
  1 sibling, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:17 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:56:54 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> This reverts commit 1b2d3451ee50a0968cb9933f726e50b368ba5073.
> 

I just realized that the maintainers of these patches are not being Cc'd.
If you want comments, you may want to Cc them. (I didn't do that for this
patch).

-- Steve


> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/arm64/Kconfig               |  1 -
>  arch/arm64/include/asm/preempt.h | 19 ++-----------------
>  arch/arm64/kernel/entry-common.c | 10 +---------
>  3 files changed, 3 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 78f20e632712..856d7be2ee45 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -221,7 +221,6 @@ config ARM64
>  	select HAVE_PERF_EVENTS_NMI if ARM64_PSEUDO_NMI
>  	select HAVE_PERF_REGS
>  	select HAVE_PERF_USER_STACK_DUMP
> -	select HAVE_PREEMPT_DYNAMIC_KEY
>  	select HAVE_REGS_AND_STACK_ACCESS_API
>  	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
>  	select HAVE_FUNCTION_ARG_ACCESS_API
> diff --git a/arch/arm64/include/asm/preempt.h b/arch/arm64/include/asm/preempt.h
> index 0159b625cc7f..e83f0982b99c 100644
> --- a/arch/arm64/include/asm/preempt.h
> +++ b/arch/arm64/include/asm/preempt.h
> @@ -2,7 +2,6 @@
>  #ifndef __ASM_PREEMPT_H
>  #define __ASM_PREEMPT_H
>  
> -#include <linux/jump_label.h>
>  #include <linux/thread_info.h>
>  
>  #define PREEMPT_NEED_RESCHED	BIT(32)
> @@ -81,24 +80,10 @@ static inline bool should_resched(int preempt_offset)
>  }
>  
>  #ifdef CONFIG_PREEMPTION
> -
>  void preempt_schedule(void);
> +#define __preempt_schedule() preempt_schedule()
>  void preempt_schedule_notrace(void);
> -
> -#ifdef CONFIG_PREEMPT_DYNAMIC
> -
> -DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
> -void dynamic_preempt_schedule(void);
> -#define __preempt_schedule()		dynamic_preempt_schedule()
> -void dynamic_preempt_schedule_notrace(void);
> -#define __preempt_schedule_notrace()	dynamic_preempt_schedule_notrace()
> -
> -#else /* CONFIG_PREEMPT_DYNAMIC */
> -
> -#define __preempt_schedule()		preempt_schedule()
> -#define __preempt_schedule_notrace()	preempt_schedule_notrace()
> -
> -#endif /* CONFIG_PREEMPT_DYNAMIC */
> +#define __preempt_schedule_notrace() preempt_schedule_notrace()
>  #endif /* CONFIG_PREEMPTION */
>  
>  #endif /* __ASM_PREEMPT_H */
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 0fc94207e69a..5d9c9951562b 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -225,17 +225,9 @@ static void noinstr arm64_exit_el1_dbg(struct pt_regs *regs)
>  		lockdep_hardirqs_on(CALLER_ADDR0);
>  }
>  
> -#ifdef CONFIG_PREEMPT_DYNAMIC
> -DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
> -#define need_irq_preemption() \
> -	(static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
> -#else
> -#define need_irq_preemption()	(IS_ENABLED(CONFIG_PREEMPTION))
> -#endif
> -
>  static void __sched arm64_preempt_schedule_irq(void)
>  {
> -	if (!need_irq_preemption())
> +	if (!IS_ENABLED(CONFIG_PREEMPTION))
>  		return;
>  
>  	/*


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (28 preceding siblings ...)
  2023-11-07 23:08   ` [RFC PATCH 86/86] sched: " Ankur Arora
@ 2023-11-07 23:19   ` Julia Lawall
  2023-11-08  8:29     ` Ankur Arora
  2023-11-21  0:45   ` Paul E. McKenney
  30 siblings, 1 reply; 250+ messages in thread
From: Julia Lawall @ 2023-11-07 23:19 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Julia Lawall, Nicolas Palix



On Tue, 7 Nov 2023, Ankur Arora wrote:

> Rudimentary script to remove the straight-forward subset of
> cond_resched() and allies:
>
> 1)  if (need_resched())
> 	  cond_resched()
>
> 2)  expression*;
>     cond_resched();  /* or in the reverse order */
>
> 3)  if (expression)
> 	statement
>     cond_resched();  /* or in the reverse order */
>
> The last two patterns depend on the control flow level to ensure
> that the complex cond_resched() patterns (ex. conditioned ones)
> are left alone and we only pick up ones which are minimally
> related to the neighbouring code.
>
> Cc: Julia Lawall <Julia.Lawall@inria.fr>
> Cc: Nicolas Palix <nicolas.palix@imag.fr>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
>
> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
> new file mode 100644
> index 000000000000..bf43768a8f8c
> --- /dev/null
> +++ b/scripts/coccinelle/api/cond_resched.cocci
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/// Remove naked cond_resched() statements
> +///
> +//# Remove cond_resched() statements when:
> +//#   - executing at the same control flow level as the previous or the
> +//#     next statement (this lets us avoid complicated conditionals in
> +//#     the neighbourhood.)
> +//#   - they are of the form "if (need_resched()) cond_resched()" which
> +//#     is always safe.
> +//#
> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
> +//# but might need to handle other comments alluding to rescheduling.
> +//#
> +virtual patch
> +virtual context
> +
> +@ r1 @
> +identifier r;
> +@@
> +
> +(
> + r = cond_resched();
> +|
> +-if (need_resched())
> +-	cond_resched();
> +)

This rule doesn't make sense.  The first branch of the disjunction will
never match a place where the second branch matches.  Anyway, in the
second branch there is no assignment, so I don't see what the first branch
is protecting against.

The disjunction is just useless.  Whether it is there or whether only
the second branch is there doesn't have any impact on the result.

> +
> +@ r2 @
> +expression E;
> +statement S,T;
> +@@
> +(
> + E;
> +|
> + if (E) S

This case is not needed.  It will be matched by the next case.

> +|
> + if (E) S else T
> +|
> +)
> +-cond_resched();
> +
> +@ r3 @
> +expression E;
> +statement S,T;
> +@@
> +-cond_resched();
> +(
> + E;
> +|
> + if (E) S

As above.

> +|
> + if (E) S else T
> +)

I have the impression that you are trying to retain some cond_rescheds.
Could you send an example of one that you are trying to keep?  Overall,
the above rules seem a bit ad hoc.  You may be keeping some cases you
don't want to, or removing some cases that you want to keep.
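
For instance, as far as I can tell the rules are trying to distinguish
between something like the following two (hypothetical, not from the tree)
sites:

	/* naked call at the same control-flow level: r2/r3 remove it */
	for (i = 0; i < nr; i++) {
		process_one(&items[i]);
		cond_resched();
	}

	/* call guarded by its own condition: presumably meant to be kept */
	if (retries++ > MAX_RETRIES)
		cond_resched();
	else
		udelay(10);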

Of course, if you are confident that the job is done with this semantic
patch as it is, then that's fine too.

julia

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 14/86] Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
  2023-11-07 21:57 ` [RFC PATCH 14/86] Revert "preempt/dynamic: Fix setup_preempt_mode() return value" Ankur Arora
@ 2023-11-07 23:20   ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:20 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:57:00 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> This reverts commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86.

Note, it's better to just do a big revert of related code than to have to
revert every individual commit.

You can do one big commit that states:

This reverts commits:

  ....

And list the commits.

That is, for commits that affect a single file, do not cherry-pick commits
to remove, just remove them all in one go.

-- Steve


> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/sched/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f8bbddd729db..50e1133cacc9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7062,11 +7062,11 @@ static int __init setup_preempt_mode(char *str)
>  	int mode = sched_dynamic_mode(str);
>  	if (mode < 0) {
>  		pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
> -		return 0;
> +		return 1;
>  	}
>  
>  	sched_dynamic_update(mode);
> -	return 1;
> +	return 0;
>  }
>  __setup("preempt=", setup_preempt_mode);
>  


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout"
  2023-11-07 23:10   ` Steven Rostedt
@ 2023-11-07 23:23     ` Ankur Arora
  2023-11-07 23:31       ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue,  7 Nov 2023 13:56:49 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> This reverts commit 089c02ae2771a14af2928c59c56abfb9b885a8d7.
>
> I'd rather not revert this.
>
> If user space can decide between various versions of preemption, then the
> trace should reflect that. At least state what the preemption model was when
> a trace started, or currently is.
>
Oh absolutely. As I mention in the cover, at least these three patches
would be back:

       089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
       cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
       5693fa74f98a ("kcsan: Use preemption model accessors")

The intent was (which I didn't do for the RFC) to do the reverts as cleanly
as possible, do the changes for the series and then bring these patches back
with appropriate modifications.

> That is, the model may not be "static" per boot. Anyway, the real change here should be:

Yeah, I intended to do something like that.

Or would you prefer these not be reverted (and reapplied) at all -- just fixed
as you describe here?

> Then this way we can decide to make it runtime dynamic, and we don't need to
> fiddle with the tracing code again.

Yeah, that makes sense.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 31/86] x86/thread_info: add TIF_NEED_RESCHED_LAZY
  2023-11-07 21:57 ` [RFC PATCH 31/86] x86/thread_info: add TIF_NEED_RESCHED_LAZY Ankur Arora
@ 2023-11-07 23:26   ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:57:17 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Add a new flag, TIF_NEED_RESCHED_LAZY which with TIF_NEED_RESCHED
> gives the scheduler two levels of rescheduling priority:
> TIF_NEED_RESCHED means that rescheduling happens at the next
> opportunity; TIF_NEED_RESCHED_LAZY is used to note that a
> reschedule is needed but does not impose any other constraints
> on the scheduler.

Please add:

Link: https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/

For each of the patches that were based off of Thomas's patch.

Thanks!

-- Steve



> 
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/include/asm/thread_info.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index d63b02940747..114d12120051 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -81,8 +81,9 @@ struct thread_info {
>  #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
>  #define TIF_SIGPENDING		2	/* signal pending */
>  #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
> -#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
> -#define TIF_SSBD		5	/* Speculative store bypass disable */
> +#define TIF_NEED_RESCHED_LAZY	4	/* Lazy rescheduling */
> +#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
> +#define TIF_SSBD		6	/* Speculative store bypass disable */
>  #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
>  #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
>  #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
> @@ -104,6 +105,7 @@ struct thread_info {
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>  #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
> +#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
>  #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
>  #define _TIF_SSBD		(1 << TIF_SSBD)
>  #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout"
  2023-11-07 23:23     ` Ankur Arora
@ 2023-11-07 23:31       ` Steven Rostedt
  2023-11-07 23:34         ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:31 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, 07 Nov 2023 15:23:05 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Or would you prefer these not be reverted (and reapplied) at all -- just fixed
> as you describe here?

Yes, exactly that.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout"
  2023-11-07 23:31       ` Steven Rostedt
@ 2023-11-07 23:34         ` Steven Rostedt
  2023-11-08  0:12           ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-07 23:34 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, 7 Nov 2023 18:31:54 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 07 Nov 2023 15:23:05 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> > Or would you prefer these not be reverted (and reapplied) at all -- just fixed
> > as you describe here?  
> 
> Yes, exactly that.
> 

Note, a revert usually means "get rid of something because it's broken"; it
shouldn't be used for "I'm implementing this differently, and need to
remove the old code first"

For the latter case, just remove what you don't need for the reason why
it's being removed. Reverting commits is confusing, because when you see a
revert in a git log, you think that commit was broken and needed to be taken
out.

-- Steve


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 23:01 ` [RFC PATCH 00/86] Make the kernel preemptible Steven Rostedt
@ 2023-11-07 23:43   ` Ankur Arora
  2023-11-08  0:00     ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-07 23:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue,  7 Nov 2023 13:56:46 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Hi,
>
> Hi Ankur,
>
> Thanks for doing this!
>
>>
>> We have two models of preemption: voluntary and full (and RT which is
>> a fuller form of full preemption.) In this series -- which is based
>> on Thomas' PoC (see [1]), we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> I would say there's "NONE" which is really just a "voluntary" but with
> fewer preemption points ;-) But still should be mentioned, otherwise people
> may get confused.
>
>>
>> (Note that this is about preemption when executing in the kernel.
>> Userspace is always preemptible.)
>>
>
>
>> Design
>> ==
>>
>> As Thomas outlines in [1], to unify the preemption models we
>> want to: always have the preempt_count enabled and allow the scheduler
>> to drive preemption policy based on the model in effect.
>>
>> Policies:
>>
>> - preemption=none: run to completion
>> - preemption=voluntary: run to completion, unless a task of higher
>>   sched-class awaits
>> - preemption=full: optimized for low-latency. Preempt whenever a higher
>>   priority task awaits.
>>
>> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
>> scheduler to mark that a reschedule is needed, but is deferred until
>> the task finishes executing in the kernel -- voluntary preemption
>> as it were.
>>
>> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
>> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
>>
>>          ret-to-user    ret-to-kernel    preempt_count()
>> none           Y              N                N
>> voluntary      Y              Y                Y
>> full           Y              Y                Y
>
> Wait. The above is for when RESCHED_LAZY is to preempt, right?
>
> Then, shouldn't voluntary be:
>
>  voluntary      Y              N                N
>
> For LAZY, but
>
>  voluntary      Y              Y                Y
>
> For NEED_RESCHED (without lazy)

Yes. You are, of course, right. I was talking about the TIF_NEED_RESCHED flags
and in the middle switched to talking about how the voluntary model will
get to what it wants.

> That is, the only difference between voluntary and none (as you describe
> above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED,
> but on none, it still sets NEED_RESCHED_LAZY?

Yeah exactly. Just to restate without mucking it up:

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.

                  ret-to-user    ret-to-kernel    preempt_count()
NEED_RESCHED_LAZY    Y              N                N
NEED_RESCHED         Y              Y                Y

Based on how various preemption models set the flag they would cause
preemption at:

                  ret-to-user    ret-to-kernel    preempt_count()
none                 Y              N                N
voluntary            Y              Y                Y
full                 Y              Y                Y
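
Purely as an illustration (a sketch, not the series' actual code; the
_TIF_NEED_RESCHED_ANY name is an assumption of mine), ret-to-user is the one
path that folds both bits into a single test, while ret-to-kernel and the
preempt_count() paths keep looking only at TIF_NEED_RESCHED:

#define _TIF_NEED_RESCHED_ANY	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)

static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
					    unsigned long ti_work)
{
	while (ti_work & EXIT_TO_USER_MODE_WORK) {
		local_irq_enable_exit_to_user(ti_work);

		if (ti_work & _TIF_NEED_RESCHED_ANY)
			schedule();

		/* signal delivery, notify-resume work, etc. elided */

		local_irq_disable_exit_to_user();
		ti_work = read_thread_flags();
	}
	return ti_work;
}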

>>   The max-load numbers (not posted here) also behave similarly.
>
> It would be interesting to run any "latency sensitive" benchmarks.
>
> I wonder how cyclictest would work under each model with and without this
> patch?

Didn't post these numbers because I suspect that code isn't quite right,
but voluntary preemption for instance does what it promises:

# echo NO_FORCE_PREEMPT  > sched/features
# echo NO_PREEMPT_PRIORITY > sched/features    # preempt=none
# stress-ng --cyclic 1  --timeout 10
stress-ng: info:  [1214172] setting to a 10 second run per stressor
stress-ng: info:  [1214172] dispatching hogs: 1 cyclic
stress-ng: info:  [1214174] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info:  [1214174] cyclic:   mean: 9834.56 ns, mode: 3495 ns
stress-ng: info:  [1214174] cyclic:   min: 2413 ns, max: 3145065 ns, std.dev. 77096.98
stress-ng: info:  [1214174] cyclic: latency percentiles:
stress-ng: info:  [1214174] cyclic:   25.00%:       3366 ns
stress-ng: info:  [1214174] cyclic:   50.00%:       3505 ns
stress-ng: info:  [1214174] cyclic:   75.00%:       3776 ns
stress-ng: info:  [1214174] cyclic:   90.00%:       4316 ns
stress-ng: info:  [1214174] cyclic:   95.40%:      10989 ns
stress-ng: info:  [1214174] cyclic:   99.00%:      91181 ns
stress-ng: info:  [1214174] cyclic:   99.50%:     290477 ns
stress-ng: info:  [1214174] cyclic:   99.90%:    1360837 ns
stress-ng: info:  [1214174] cyclic:   99.99%:    3145065 ns
stress-ng: info:  [1214172] successful run completed in 10.00s

# echo PREEMPT_PRIORITY > features    # preempt=voluntary
# stress-ng --cyclic 1  --timeout 10
stress-ng: info:  [916483] setting to a 10 second run per stressor
stress-ng: info:  [916483] dispatching hogs: 1 cyclic
stress-ng: info:  [916484] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info:  [916484] cyclic:   mean: 3682.77 ns, mode: 3185 ns
stress-ng: info:  [916484] cyclic:   min: 2523 ns, max: 150082 ns, std.dev. 2198.07
stress-ng: info:  [916484] cyclic: latency percentiles:
stress-ng: info:  [916484] cyclic:   25.00%:       3185 ns
stress-ng: info:  [916484] cyclic:   50.00%:       3306 ns
stress-ng: info:  [916484] cyclic:   75.00%:       3666 ns
stress-ng: info:  [916484] cyclic:   90.00%:       4778 ns
stress-ng: info:  [916484] cyclic:   95.40%:       5359 ns
stress-ng: info:  [916484] cyclic:   99.00%:       6141 ns
stress-ng: info:  [916484] cyclic:   99.50%:       7824 ns
stress-ng: info:  [916484] cyclic:   99.90%:      29825 ns
stress-ng: info:  [916484] cyclic:   99.99%:     150082 ns
stress-ng: info:  [916483] successful run completed in 10.01s

This is with a background kernbench half-load.

Let me see if I can dig out the numbers without this series.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 23:43   ` Ankur Arora
@ 2023-11-08  0:00     ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08  0:00 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, 07 Nov 2023 15:43:40 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> 
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
> 
>                   ret-to-user    ret-to-kernel    preempt_count()
> NEED_RESCHED_LAZY    Y              N                N
> NEED_RESCHED         Y              Y                Y
> 
> Based on how various preemption models set the flag they would cause
> preemption at:

I would change the above to say "set the NEED_RESCHED flag", as "set the
flag" is still ambiguous. Or am I still misunderstanding the below table?

> 
>                   ret-to-user    ret-to-kernel    preempt_count()
> none                 Y              N                N
> voluntary            Y              Y                Y
> full                 Y              Y                Y
> 

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 45/86] preempt: ARCH_NO_PREEMPT only preempts lazily
  2023-11-07 21:57 ` [RFC PATCH 45/86] preempt: ARCH_NO_PREEMPT only preempts lazily Ankur Arora
@ 2023-11-08  0:07   ` Steven Rostedt
  2023-11-08  8:47     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08  0:07 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue,  7 Nov 2023 13:57:31 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Note: this commit is badly broken. Only here for discussion.
> 
> Configurations with ARCH_NO_PREEMPT support preempt_count, but might
> not have been tested well enough under PREEMPTION and so might not
> be demarcating all the necessary non-preemptible sections.
> 
> One way to handle this is by limiting them to PREEMPT_NONE mode, not
> doing any tick enforcement and limiting preemption to happen only at
> user boundary.
> 
> Unfortunately, this is only a partial solution because eager
> rescheduling could still happen (say, due to RCU wanting an
> expedited quiescent period.) And, because we do not trust the
> preempt_count accounting, this would mean preemption inside an
> unmarked critical section.

Is preempt_count accounting really not trustworthy?

That is, if we preempt at preempt_count() going to zero but nowhere else,
would that work? At least it would create some places that can be resched.
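
(For reference, that is roughly what the generic preempt_enable() fast path
already does when preemption is compiled in; paraphrased from memory here,
just to illustrate the "reschedule only where the count drops back to zero"
idea:)

#define preempt_enable()					\
do {								\
	barrier();						\
	if (unlikely(preempt_count_dec_and_test()))		\
		__preempt_schedule();				\
} while (0)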

What's the broken part of these archs? The assembly? If that's the case, as
long as the generic code has the preempt_count() I would think that would
be trustworthy. I'm also guessing that in_irq() and friends are still
reliable.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout"
  2023-11-07 23:34         ` Steven Rostedt
@ 2023-11-08  0:12           ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  0:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue, 7 Nov 2023 18:31:54 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> On Tue, 07 Nov 2023 15:23:05 -0800
>> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> > Or would you prefer these not be reverted (and reapplied) at all -- just fixed
>> > as you describe here?
>>
>> Yes, exactly that.
>>
>
> Note, a revert usually means, "get rid of something because it's broken", it
> shouldn't be used for "I'm implementing this differently, and need to
> remove the old code first"
>
> For the latter case, just remove what you don't need for the reason why
> it's being removed. Reverting commits is confusing, because when you see a
> revert in a git log, you think that commit was broken and needed to be taken
> out.

Ack that. And, agree, it did feel pretty odd to revert so many good commits.
I guess in that sense it makes sense to minimize the number of reverts.

There are some that I suspect I will have to revert. Will detail specifically
why they are being reverted.

Thanks
--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 46/86] tracing: handle lazy resched
  2023-11-07 21:57 ` [RFC PATCH 46/86] tracing: handle lazy resched Ankur Arora
@ 2023-11-08  0:19   ` Steven Rostedt
  2023-11-08  9:24     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08  0:19 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	linux-alpha, Geert Uytterhoeven, linux-m68k, Dinh Nguyen

On Tue,  7 Nov 2023 13:57:32 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Tracing support.
> 
> Note: this is quite incomplete.

What's not complete? The removal of the IRQS_NOSUPPORT?

Really, that's only for alpha, m68k and nios2. I think setting 'X' is not
needed anymore, and we can use that bit for this, and for those archs, have
0 for interrupts disabled.

-- Steve


> 
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  include/linux/trace_events.h |  6 +++---
>  kernel/trace/trace.c         |  2 ++
>  kernel/trace/trace_output.c  | 16 ++++++++++++++--
>  3 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 21ae37e49319..355d25d5e398 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -178,7 +178,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status);
>  
>  enum trace_flag_type {
>  	TRACE_FLAG_IRQS_OFF		= 0x01,
> -	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
> +	TRACE_FLAG_NEED_RESCHED_LAZY    = 0x02,
>  	TRACE_FLAG_NEED_RESCHED		= 0x04,
>  	TRACE_FLAG_HARDIRQ		= 0x08,
>  	TRACE_FLAG_SOFTIRQ		= 0x10,
> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_ctx(void)
>  
>  static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
>  {
> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> +	return tracing_gen_ctx_irq_test(0);
>  }
>  static inline unsigned int tracing_gen_ctx(void)
>  {
> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> +	return tracing_gen_ctx_irq_test(0);
>  }
>  #endif
>  
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 7f067ad9cf50..0776dba32c2d 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
>  
>  	if (tif_need_resched(RESCHED_eager))
>  		trace_flags |= TRACE_FLAG_NEED_RESCHED;
> +	if (tif_need_resched(RESCHED_lazy))
> +		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
>  	if (test_preempt_need_resched())
>  		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
>  	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index db575094c498..c251a44ad8ac 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
>  		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
>  		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
>  		bh_off ? 'b' :
> -		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
> +		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
>  		'.';
>  
> -	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
> +	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
>  				TRACE_FLAG_PREEMPT_RESCHED)) {
> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> +		need_resched = 'B';
> +		break;
>  	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
>  		need_resched = 'N';
>  		break;
> +	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> +		need_resched = 'L';
> +		break;
> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
> +		need_resched = 'b';
> +		break;
>  	case TRACE_FLAG_NEED_RESCHED:
>  		need_resched = 'n';
>  		break;
> +	case TRACE_FLAG_NEED_RESCHED_LAZY:
> +		need_resched = 'l';
> +		break;
>  	case TRACE_FLAG_PREEMPT_RESCHED:
>  		need_resched = 'p';
>  		break;


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-07 21:57 ` [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT Ankur Arora
@ 2023-11-08  0:27   ` Steven Rostedt
  2023-11-21  0:28     ` Paul E. McKenney
  2023-11-08 12:15   ` Julian Anastasov
  1 sibling, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08  0:27 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue,  7 Nov 2023 13:57:33 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> With PREEMPTION being always-on, some configurations might prefer
> the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> as compared to PREEMPT_RCU=y.
> 
> So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE, and
> PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> 
> Note that the preemption model can be changed at runtime (modulo
> configurations with ARCH_NO_PREEMPT), but the RCU configuration
> is statically compiled.

I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
when PREEMPT=y?

This could allow us to test this without this having to be part of this
series.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 85/86] treewide: drivers: " Ankur Arora
@ 2023-11-08  0:48     ` Chris Packham
  2023-11-09  0:55       ` Ankur Arora
  2023-11-09 23:25     ` Dmitry Torokhov
  1 sibling, 1 reply; 250+ messages in thread
From: Chris Packham @ 2023-11-08  0:48 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Oded Gabbay,
	Miguel Ojeda, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Sudip Mukherjee, Theodore Ts'o, Jason A. Donenfeld,
	Amit Shah, Gonglei, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Sumit Semwal, Christian König, Andi Shyti, Ray Jui,
	Scott Branden, Shawn Guo, Sascha Hauer, Junxian Huang,
	Dmitry Torokhov, Will Deacon, Joerg Roedel,
	Mauro Carvalho Chehab, Srinivas Pandruvada, Hans de Goede,
	Ilpo Järvinen, Mark Gross, Finn Thain, Michael Schmitz,
	James E.J. Bottomley, Martin K. Petersen, Kashyap Desai,
	Sumit Saxena, Shivasharan S, Mark Brown, Neil Armstrong,
	Jens Wiklander, Alex Williamson, Helge Deller, David Hildenbrand


On 8/11/23 12:08, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
>
> 1.  Calls to cond_resched() out of the goodness of our heart,
>      otherwise known as avoiding lockup splats.
>
> 2.  Open coded variants of cond_resched_lock() which call
>      cond_resched().
>
> 3.  Retry or error handling loops, where cond_resched() is used as a
>      quick alternative to spinning in a tight-loop.
>
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
>
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
>
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
>
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
>
> The cond_resched() calls here are of all kinds. Those from set-1
> or set-2 are quite straightforward to handle.
>
> There are quite a few from set-3, where, as noted above, we
> use cond_resched() as if it were an amulet. Which I suppose
> it is, in that it wards off softlockup or RCU splats.
>
> Those are now cond_resched_stall(), but in most cases, given
> that the timeouts are in milliseconds, they could be easily
> timed waits.
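
(As an aside, the cond_resched_stall() semantics described above amount to
roughly the following. This is only a sketch of that description, not the
series' actual implementation; the _sketch suffix marks it as illustrative.)

#include <linux/preempt.h>
#include <linux/sched.h>

static inline void cond_resched_stall_sketch(void)
{
	if (preemptible() && need_resched())
		schedule();	/* safe to schedule, so do it */
	else
		cpu_relax();	/* cannot schedule here, just relax the CPU */
}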

For i2c-mpc.c:

It looks as if the code in question could probably be converted to
readb_poll_timeout(). If I find sufficient round-tuits I might look at
that. Regardless, in the context of the tree-wide change ...
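
Something along these lines, perhaps. This is only a rough sketch, with a
made-up register offset and bit rather than the actual i2c-mpc.c definitions:

#include <linux/iopoll.h>

/* Illustrative only: the register offset and bit below are made up. */
#define MY_I2C_SR	0x0c
#define MY_CSR_MIF	0x80

static int mpc_wait_for_mif(void __iomem *base)
{
	u8 sr;

	/* Poll every 10us, give up after 1ms, instead of a cond_resched() loop. */
	return readb_poll_timeout(base + MY_I2C_SR, sr, sr & MY_CSR_MIF, 10, 1000);
}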

Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 68/86] treewide: mm: remove cond_resched() Ankur Arora
@ 2023-11-08  1:28     ` Sergey Senozhatsky
  2023-11-08  7:49       ` Vlastimil Babka
  0 siblings, 1 reply; 250+ messages in thread
From: Sergey Senozhatsky @ 2023-11-08  1:28 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, SeongJae Park, Mike Kravetz, Muchun Song,
	Andrey Ryabinin, Marco Elver, Catalin Marinas, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Naoya Horiguchi,
	Miaohe Lin, David Hildenbrand, Oscar Salvador, Mike Rapoport,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Hugh Dickins, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Vitaly Wool,
	Minchan Kim, Sergey Senozhatsky, Seth Jennings, Dan Streetman

On (23/11/07 15:08), Ankur Arora wrote:
[..]
> +++ b/mm/zsmalloc.c
> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>  			dst_zspage = NULL;
>  
>  			spin_unlock(&pool->lock);
> -			cond_resched();
>  			spin_lock(&pool->lock);
>  		}
>  	}

I'd personally prefer to have a comment explaining why we do that
spin_unlock/spin_lock sequence, which may look confusing to people.

Maybe it would make sense to put a nice comment in all similar cases.
For instance:

  	rcu_read_unlock();
 -	cond_resched();
  	rcu_read_lock();
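
And for the zsmalloc hunk above, maybe something like this (the wording is
purely illustrative):

			spin_unlock(&pool->lock);
			/*
			 * Lock break: let other pool->lock waiters in, and
			 * give the scheduler a chance to preempt us during
			 * a long compaction pass.
			 */
			spin_lock(&pool->lock);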

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (57 preceding siblings ...)
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
@ 2023-11-08  4:08 ` Christoph Lameter
  2023-11-08  4:33   ` Ankur Arora
  2023-11-08  7:31 ` Juergen Gross
                   ` (3 subsequent siblings)
  62 siblings, 1 reply; 250+ messages in thread
From: Christoph Lameter @ 2023-11-08  4:08 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

The kernel is not preemptible???? What are you smoking?

On Tue, 7 Nov 2023, Ankur Arora wrote:

> In voluntary models, the scheduler's job is to match the demand
> side of preemption points (a task that needs to be scheduled) with
> the supply side (a task which calls cond_resched().)

Voluntary preemption models are important for code optimization because 
the code can rely on the scheduler not changing the cpu we are running on. 
This allows the preempt_enable/disable code to be removed and allows
better code generation. The best performing code is
generated with defined preemption points when we have a guarantee that the 
code is not being rescheduled on a different processor. This is f.e. 
important for consistent access to PER CPU areas.
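
(To make that concrete, a contrived example; the per-CPU variable and its
fields below are made up, not taken from any driver:)

#include <linux/percpu.h>
#include <linux/types.h>

struct pkt_stats {
	u64 rx;
	u64 dropped;
};
static DEFINE_PER_CPU(struct pkt_stats, pkt_stats);

static u64 rx_minus_dropped(void)
{
	u64 rx, dropped;

	/*
	 * With no preemption possible between them, both reads see the same
	 * CPU's counters.  Once the kernel is preemptible, this pair needs
	 * preempt_disable()/preempt_enable() (or migrate_disable()) around
	 * it to keep that guarantee.
	 */
	rx = this_cpu_read(pkt_stats.rx);
	dropped = this_cpu_read(pkt_stats.dropped);

	return rx - dropped;
}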

> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
> scheduler to mark that a reschedule is needed, but is deferred until
> the task finishes executing in the kernel -- voluntary preemption
> as it were.

That is different from the current no preemption model? Seems to be 
the same.

> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spread a long time in the kernel have no way to give
> up the CPU.

These are needed to avoid adding preempt_enable/disable to a lot of 
primitives that are used for synchronization. You cannot remove those 
without changing a lot of synchronization primitives to always have to 
consider being preempted while operating.


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  4:08 ` [RFC PATCH 00/86] Make the kernel preemptible Christoph Lameter
@ 2023-11-08  4:33   ` Ankur Arora
  2023-11-08  4:52     ` Christoph Lameter
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  4:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Christoph Lameter <cl@linux.com> writes:

> The kernel is not preemptible???? What are you smoking?

The title admittedly is a little tongue in cheek but the point was that
a kernel under a voluntary preemption model isn't preemptible. That's
what this series attempts to do. Essentially enable PREEMPT_COUNT and
PREEMPTION for all preemption models.

PREEMPT_COUNT is always enabled with PREEMPT_DYNAMIC as well. There the
approach is to toggle which preemption points are used dynamically.
Here the idea is to not have statically placed preemption points and let
the scheduler decide when preemption is warranted.
And the only way to safely do that is by having PREEMPT_COUNT=y.

>> In voluntary models, the scheduler's job is to match the demand
>> side of preemption points (a task that needs to be scheduled) with
>> the supply side (a task which calls cond_resched().)
>
> Voluntary preemption models are important for code optimization because the code
> can rely on the scheduler not changing the cpu we are running on. This allows
> removing code for preempt_enable/disable to be removed from the code and allows
> better code generation. The best performing code is generated with defined
> preemption points when we have a guarantee that the code is not being
> rescheduled on a different processor. This is f.e. important for consistent
> access to PER CPU areas.

Right. This necessitates preempt_enable/preempt_disable() so you get
consistent access to the CPU.

This came up in an earlier discussion (See
https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
that preempt_enable/_disable() overhead was relatively minimal.

Is your point that always-on preempt_count is far too expensive?

>> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
>> scheduler to mark that a reschedule is needed, but is deferred until
>> the task finishes executing in the kernel -- voluntary preemption
>> as it were.
>
> That is different from the current no preemption model? Seems to be the same.
>> There's just one remaining issue: now that explicit preemption points are
>> gone, processes that spread a long time in the kernel have no way to give
>> up the CPU.
>
> These are needed to avoid adding preempt_enable/disable to a lot of primitives
> that are used for synchronization. You cannot remove those without changing a
> lot of synchronization primitives to always have to consider being preempted
> while operating.

I'm afraid I don't understand why you would need to change any
synchronization primitives. The code that does preempt_enable/_disable()
is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
CONFIG_PREEMPT_COUNT.

The intent here is to always have CONFIG_PREEMPT_COUNT=y.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  4:33   ` Ankur Arora
@ 2023-11-08  4:52     ` Christoph Lameter
  2023-11-08  5:12       ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Christoph Lameter @ 2023-11-08  4:52 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Tue, 7 Nov 2023, Ankur Arora wrote:

> This came up in an earlier discussion (See
> https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
> that preempt_enable/_disable() overhead was relatively minimal.
>
> Is your point that always-on preempt_count is far too expensive?

Yes over the years distros have traditionally delivered their kernels by 
default without preemption because of these issues. If the overhead has 
been minimized then that may have changed. Even if so there is still a lot 
of code being generated that has questionable benefit and just 
bloats the kernel.

>> These are needed to avoid adding preempt_enable/disable to a lot of primitives
>> that are used for synchronization. You cannot remove those without changing a
>> lot of synchronization primitives to always have to consider being preempted
>> while operating.
>
> I'm afraid I don't understand why you would need to change any
> synchronization primitives. The code that does preempt_enable/_disable()
> is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
> CONFIG_PREEMPT_COUNT.

In the trivial cases it is simple like that. But look f.e.
in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is
overhead added to be able to allow the cpu to change under us. There are
likely other examples in the source.
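
(A sketch of the kind of construct being referred to, modeled loosely on the
SLUB fast path; the type and function names here are illustrative, not the
mainline code:)

#include <linux/percpu.h>

struct my_cpu_cache {
	unsigned long tid;	/* bumped by every slow-path operation */
	void *freelist;
};

static struct my_cpu_cache *get_cpu_cache(struct my_cpu_cache __percpu *cache,
					  unsigned long *tid)
{
	struct my_cpu_cache *c;

	/*
	 * The retry loop exists only because, with CONFIG_PREEMPTION, the
	 * task can be preempted and migrated between the two reads.
	 */
	do {
		*tid = this_cpu_read(cache->tid);
		c = raw_cpu_ptr(cache);
	} while (IS_ENABLED(CONFIG_PREEMPTION) &&
		 unlikely(*tid != READ_ONCE(c->tid)));

	return c;
}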

And the whole business of local data 
access via per cpu areas suffers if we cannot rely on two accesses in a 
section being able to see consistent values.

> The intent here is to always have CONFIG_PREEMPT_COUNT=y.

Just for fun? Code is most efficient if it does not have to consider too 
many side conditions like suddenly running on a different processor. This 
introduces needless complexity into the code. It would be better to remove 
PREEMPT_COUNT for good to just rely on voluntary preemption. We could 
probably reduce the complexity of the kernel source significantly.

I have never noticed a need for preemption at every instruction in the 
kernel (if that would be possible at all... Locks etc prevent that ideal 
scenario frequently). Preemption like that is more like a pipe dream.

High performance kernel solutions usually disable overhead like that.



^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-07 23:16   ` Steven Rostedt
@ 2023-11-08  4:55     ` Ankur Arora
  2023-11-09 17:26     ` Josh Poimboeuf
  1 sibling, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  4:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Josh Poimboeuf, Jiri Kosina, Miroslav Benes,
	Petr Mladek, Joe Lawrence, live-patching


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue,  7 Nov 2023 13:56:53 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
>>
>> Note that removing this commit reintroduces "live patches failing to
>> complete within a reasonable amount of time due to CPU-bound kthreads."
>>
>> Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
>> existence of cond_resched() so this will need an alternate fix.
>>
>
> Then it would probably be a good idea to Cc the live patching maintainers!

Indeed. Could have sworn that I had. But clearly not.

Apologies and thanks for adding them.

>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  include/linux/livepatch.h       |   1 -
>>  include/linux/livepatch_sched.h |  29 ---------
>>  include/linux/sched.h           |  20 ++----
>>  kernel/livepatch/core.c         |   1 -
>>  kernel/livepatch/transition.c   | 107 +++++---------------------------
>>  kernel/sched/core.c             |  64 +++----------------
>>  6 files changed, 28 insertions(+), 194 deletions(-)
>>  delete mode 100644 include/linux/livepatch_sched.h
>>
>> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
>> index 9b9b38e89563..293e29960c6e 100644
>> --- a/include/linux/livepatch.h
>> +++ b/include/linux/livepatch.h
>> @@ -13,7 +13,6 @@
>>  #include <linux/ftrace.h>
>>  #include <linux/completion.h>
>>  #include <linux/list.h>
>> -#include <linux/livepatch_sched.h>
>>
>>  #if IS_ENABLED(CONFIG_LIVEPATCH)
>>
>> diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
>> deleted file mode 100644
>> index 013794fb5da0..000000000000
>> --- a/include/linux/livepatch_sched.h
>> +++ /dev/null
>> @@ -1,29 +0,0 @@
>> -/* SPDX-License-Identifier: GPL-2.0-or-later */
>> -#ifndef _LINUX_LIVEPATCH_SCHED_H_
>> -#define _LINUX_LIVEPATCH_SCHED_H_
>> -
>> -#include <linux/jump_label.h>
>> -#include <linux/static_call_types.h>
>> -
>> -#ifdef CONFIG_LIVEPATCH
>> -
>> -void __klp_sched_try_switch(void);
>> -
>> -#if !defined(CONFIG_PREEMPT_DYNAMIC) || !defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> -
>> -DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
>> -
>> -static __always_inline void klp_sched_try_switch(void)
>> -{
>> -	if (static_branch_unlikely(&klp_sched_try_switch_key))
>> -		__klp_sched_try_switch();
>> -}
>> -
>> -#endif /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>> -
>> -#else /* !CONFIG_LIVEPATCH */
>> -static inline void klp_sched_try_switch(void) {}
>> -static inline void __klp_sched_try_switch(void) {}
>> -#endif /* CONFIG_LIVEPATCH */
>> -
>> -#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 5bdf80136e42..c5b0ef1ecfe4 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -36,7 +36,6 @@
>>  #include <linux/seqlock.h>
>>  #include <linux/kcsan.h>
>>  #include <linux/rv.h>
>> -#include <linux/livepatch_sched.h>
>>  #include <asm/kmap_size.h>
>>
>>  /* task_struct member predeclarations (sorted alphabetically): */
>> @@ -2087,9 +2086,6 @@ extern int __cond_resched(void);
>>
>>  #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>>
>> -void sched_dynamic_klp_enable(void);
>> -void sched_dynamic_klp_disable(void);
>> -
>>  DECLARE_STATIC_CALL(cond_resched, __cond_resched);
>>
>>  static __always_inline int _cond_resched(void)
>> @@ -2098,7 +2094,6 @@ static __always_inline int _cond_resched(void)
>>  }
>>
>>  #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
>> -
>>  extern int dynamic_cond_resched(void);
>>
>>  static __always_inline int _cond_resched(void)
>> @@ -2106,25 +2101,20 @@ static __always_inline int _cond_resched(void)
>>  	return dynamic_cond_resched();
>>  }
>>
>> -#else /* !CONFIG_PREEMPTION */
>> +#else
>>
>>  static inline int _cond_resched(void)
>>  {
>> -	klp_sched_try_switch();
>>  	return __cond_resched();
>>  }
>>
>> -#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>> +#endif /* CONFIG_PREEMPT_DYNAMIC */
>>
>> -#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
>> +#else
>>
>> -static inline int _cond_resched(void)
>> -{
>> -	klp_sched_try_switch();
>> -	return 0;
>> -}
>> +static inline int _cond_resched(void) { return 0; }
>>
>> -#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
>> +#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
>>
>>  #define cond_resched() ({			\
>>  	__might_resched(__FILE__, __LINE__, 0);	\
>> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
>> index 61328328c474..fc851455740c 100644
>> --- a/kernel/livepatch/core.c
>> +++ b/kernel/livepatch/core.c
>> @@ -33,7 +33,6 @@
>>   *
>>   * - klp_ftrace_handler()
>>   * - klp_update_patch_state()
>> - * - __klp_sched_try_switch()
>>   */
>>  DEFINE_MUTEX(klp_mutex);
>>
>> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
>> index e54c3d60a904..70bc38f27af7 100644
>> --- a/kernel/livepatch/transition.c
>> +++ b/kernel/livepatch/transition.c
>> @@ -9,7 +9,6 @@
>>
>>  #include <linux/cpu.h>
>>  #include <linux/stacktrace.h>
>> -#include <linux/static_call.h>
>>  #include "core.h"
>>  #include "patch.h"
>>  #include "transition.h"
>> @@ -27,25 +26,6 @@ static int klp_target_state = KLP_UNDEFINED;
>>
>>  static unsigned int klp_signals_cnt;
>>
>> -/*
>> - * When a livepatch is in progress, enable klp stack checking in
>> - * cond_resched().  This helps CPU-bound kthreads get patched.
>> - */
>> -#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> -
>> -#define klp_cond_resched_enable() sched_dynamic_klp_enable()
>> -#define klp_cond_resched_disable() sched_dynamic_klp_disable()
>> -
>> -#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>> -
>> -DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
>> -EXPORT_SYMBOL(klp_sched_try_switch_key);
>> -
>> -#define klp_cond_resched_enable() static_branch_enable(&klp_sched_try_switch_key)
>> -#define klp_cond_resched_disable() static_branch_disable(&klp_sched_try_switch_key)
>> -
>> -#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>> -
>>  /*
>>   * This work can be performed periodically to finish patching or unpatching any
>>   * "straggler" tasks which failed to transition in the first attempt.
>> @@ -194,8 +174,8 @@ void klp_update_patch_state(struct task_struct *task)
>>  	 * barrier (smp_rmb) for two cases:
>>  	 *
>>  	 * 1) Enforce the order of the TIF_PATCH_PENDING read and the
>> -	 *    klp_target_state read.  The corresponding write barriers are in
>> -	 *    klp_init_transition() and klp_reverse_transition().
>> +	 *    klp_target_state read.  The corresponding write barrier is in
>> +	 *    klp_init_transition().
>>  	 *
>>  	 * 2) Enforce the order of the TIF_PATCH_PENDING read and a future read
>>  	 *    of func->transition, if klp_ftrace_handler() is called later on
>> @@ -363,44 +343,6 @@ static bool klp_try_switch_task(struct task_struct *task)
>>  	return !ret;
>>  }
>>
>> -void __klp_sched_try_switch(void)
>> -{
>> -	if (likely(!klp_patch_pending(current)))
>> -		return;
>> -
>> -	/*
>> -	 * This function is called from cond_resched() which is called in many
>> -	 * places throughout the kernel.  Using the klp_mutex here might
>> -	 * deadlock.
>> -	 *
>> -	 * Instead, disable preemption to prevent racing with other callers of
>> -	 * klp_try_switch_task().  Thanks to task_call_func() they won't be
>> -	 * able to switch this task while it's running.
>> -	 */
>> -	preempt_disable();
>> -
>> -	/*
>> -	 * Make sure current didn't get patched between the above check and
>> -	 * preempt_disable().
>> -	 */
>> -	if (unlikely(!klp_patch_pending(current)))
>> -		goto out;
>> -
>> -	/*
>> -	 * Enforce the order of the TIF_PATCH_PENDING read above and the
>> -	 * klp_target_state read in klp_try_switch_task().  The corresponding
>> -	 * write barriers are in klp_init_transition() and
>> -	 * klp_reverse_transition().
>> -	 */
>> -	smp_rmb();
>> -
>> -	klp_try_switch_task(current);
>> -
>> -out:
>> -	preempt_enable();
>> -}
>> -EXPORT_SYMBOL(__klp_sched_try_switch);
>> -
>>  /*
>>   * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
>>   * Kthreads with TIF_PATCH_PENDING set are woken up.
>> @@ -507,8 +449,7 @@ void klp_try_complete_transition(void)
>>  		return;
>>  	}
>>
>> -	/* Done!  Now cleanup the data structures. */
>> -	klp_cond_resched_disable();
>> +	/* we're done, now cleanup the data structures */
>>  	patch = klp_transition_patch;
>>  	klp_complete_transition();
>>
>> @@ -560,8 +501,6 @@ void klp_start_transition(void)
>>  			set_tsk_thread_flag(task, TIF_PATCH_PENDING);
>>  	}
>>
>> -	klp_cond_resched_enable();
>> -
>>  	klp_signals_cnt = 0;
>>  }
>>
>> @@ -617,9 +556,8 @@ void klp_init_transition(struct klp_patch *patch, int state)
>>  	 * see a func in transition with a task->patch_state of KLP_UNDEFINED.
>>  	 *
>>  	 * Also enforce the order of the klp_target_state write and future
>> -	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() and
>> -	 * __klp_sched_try_switch() don't set a task->patch_state to
>> -	 * KLP_UNDEFINED.
>> +	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() doesn't
>> +	 * set a task->patch_state to KLP_UNDEFINED.
>>  	 */
>>  	smp_wmb();
>>
>> @@ -655,10 +593,14 @@ void klp_reverse_transition(void)
>>  		 klp_target_state == KLP_PATCHED ? "patching to unpatching" :
>>  						   "unpatching to patching");
>>
>> +	klp_transition_patch->enabled = !klp_transition_patch->enabled;
>> +
>> +	klp_target_state = !klp_target_state;
>> +
>>  	/*
>>  	 * Clear all TIF_PATCH_PENDING flags to prevent races caused by
>> -	 * klp_update_patch_state() or __klp_sched_try_switch() running in
>> -	 * parallel with the reverse transition.
>> +	 * klp_update_patch_state() running in parallel with
>> +	 * klp_start_transition().
>>  	 */
>>  	read_lock(&tasklist_lock);
>>  	for_each_process_thread(g, task)
>> @@ -668,28 +610,9 @@ void klp_reverse_transition(void)
>>  	for_each_possible_cpu(cpu)
>>  		clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
>>
>> -	/*
>> -	 * Make sure all existing invocations of klp_update_patch_state() and
>> -	 * __klp_sched_try_switch() see the cleared TIF_PATCH_PENDING before
>> -	 * starting the reverse transition.
>> -	 */
>> +	/* Let any remaining calls to klp_update_patch_state() complete */
>>  	klp_synchronize_transition();
>>
>> -	/*
>> -	 * All patching has stopped, now re-initialize the global variables to
>> -	 * prepare for the reverse transition.
>> -	 */
>> -	klp_transition_patch->enabled = !klp_transition_patch->enabled;
>> -	klp_target_state = !klp_target_state;
>> -
>> -	/*
>> -	 * Enforce the order of the klp_target_state write and the
>> -	 * TIF_PATCH_PENDING writes in klp_start_transition() to ensure
>> -	 * klp_update_patch_state() and __klp_sched_try_switch() don't set
>> -	 * task->patch_state to the wrong value.
>> -	 */
>> -	smp_wmb();
>> -
>>  	klp_start_transition();
>>  }
>>
>> @@ -703,9 +626,9 @@ void klp_copy_process(struct task_struct *child)
>>  	 * the task flag up to date with the parent here.
>>  	 *
>>  	 * The operation is serialized against all klp_*_transition()
>> -	 * operations by the tasklist_lock. The only exceptions are
>> -	 * klp_update_patch_state(current) and __klp_sched_try_switch(), but we
>> -	 * cannot race with them because we are current.
>> +	 * operations by the tasklist_lock. The only exception is
>> +	 * klp_update_patch_state(current), but we cannot race with
>> +	 * that because we are current.
>>  	 */
>>  	if (test_tsk_thread_flag(current, TIF_PATCH_PENDING))
>>  		set_tsk_thread_flag(child, TIF_PATCH_PENDING);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0e8764d63041..b43fda3c5733 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8597,7 +8597,6 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
>>  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
>>  int __sched dynamic_cond_resched(void)
>>  {
>> -	klp_sched_try_switch();
>>  	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
>>  		return 0;
>>  	return __cond_resched();
>> @@ -8746,17 +8745,13 @@ int sched_dynamic_mode(const char *str)
>>  #error "Unsupported PREEMPT_DYNAMIC mechanism"
>>  #endif
>>
>> -DEFINE_MUTEX(sched_dynamic_mutex);
>> -static bool klp_override;
>> -
>> -static void __sched_dynamic_update(int mode)
>> +void sched_dynamic_update(int mode)
>>  {
>>  	/*
>>  	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
>>  	 * the ZERO state, which is invalid.
>>  	 */
>> -	if (!klp_override)
>> -		preempt_dynamic_enable(cond_resched);
>> +	preempt_dynamic_enable(cond_resched);
>>  	preempt_dynamic_enable(might_resched);
>>  	preempt_dynamic_enable(preempt_schedule);
>>  	preempt_dynamic_enable(preempt_schedule_notrace);
>> @@ -8764,79 +8759,36 @@ static void __sched_dynamic_update(int mode)
>>
>>  	switch (mode) {
>>  	case preempt_dynamic_none:
>> -		if (!klp_override)
>> -			preempt_dynamic_enable(cond_resched);
>> +		preempt_dynamic_enable(cond_resched);
>>  		preempt_dynamic_disable(might_resched);
>>  		preempt_dynamic_disable(preempt_schedule);
>>  		preempt_dynamic_disable(preempt_schedule_notrace);
>>  		preempt_dynamic_disable(irqentry_exit_cond_resched);
>> -		if (mode != preempt_dynamic_mode)
>> -			pr_info("Dynamic Preempt: none\n");
>> +		pr_info("Dynamic Preempt: none\n");
>>  		break;
>>
>>  	case preempt_dynamic_voluntary:
>> -		if (!klp_override)
>> -			preempt_dynamic_enable(cond_resched);
>> +		preempt_dynamic_enable(cond_resched);
>>  		preempt_dynamic_enable(might_resched);
>>  		preempt_dynamic_disable(preempt_schedule);
>>  		preempt_dynamic_disable(preempt_schedule_notrace);
>>  		preempt_dynamic_disable(irqentry_exit_cond_resched);
>> -		if (mode != preempt_dynamic_mode)
>> -			pr_info("Dynamic Preempt: voluntary\n");
>> +		pr_info("Dynamic Preempt: voluntary\n");
>>  		break;
>>
>>  	case preempt_dynamic_full:
>> -		if (!klp_override)
>> -			preempt_dynamic_disable(cond_resched);
>> +		preempt_dynamic_disable(cond_resched);
>>  		preempt_dynamic_disable(might_resched);
>>  		preempt_dynamic_enable(preempt_schedule);
>>  		preempt_dynamic_enable(preempt_schedule_notrace);
>>  		preempt_dynamic_enable(irqentry_exit_cond_resched);
>> -		if (mode != preempt_dynamic_mode)
>> -			pr_info("Dynamic Preempt: full\n");
>> +		pr_info("Dynamic Preempt: full\n");
>>  		break;
>>  	}
>>
>>  	preempt_dynamic_mode = mode;
>>  }
>>
>> -void sched_dynamic_update(int mode)
>> -{
>> -	mutex_lock(&sched_dynamic_mutex);
>> -	__sched_dynamic_update(mode);
>> -	mutex_unlock(&sched_dynamic_mutex);
>> -}
>> -
>> -#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
>> -
>> -static int klp_cond_resched(void)
>> -{
>> -	__klp_sched_try_switch();
>> -	return __cond_resched();
>> -}
>> -
>> -void sched_dynamic_klp_enable(void)
>> -{
>> -	mutex_lock(&sched_dynamic_mutex);
>> -
>> -	klp_override = true;
>> -	static_call_update(cond_resched, klp_cond_resched);
>> -
>> -	mutex_unlock(&sched_dynamic_mutex);
>> -}
>> -
>> -void sched_dynamic_klp_disable(void)
>> -{
>> -	mutex_lock(&sched_dynamic_mutex);
>> -
>> -	klp_override = false;
>> -	__sched_dynamic_update(preempt_dynamic_mode);
>> -
>> -	mutex_unlock(&sched_dynamic_mutex);
>> -}
>> -
>> -#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>> -
>>  static int __init setup_preempt_mode(char *str)
>>  {
>>  	int mode = sched_dynamic_mode(str);


--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 04/86] Revert "preempt/dynamic: Introduce preemption model accessors"
  2023-11-07 23:12   ` Steven Rostedt
@ 2023-11-08  4:59     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  4:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue,  7 Nov 2023 13:56:50 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> I know this is an RFC but I'll state it here just so that it is stated. All
> reverts need a change log description of why a revert happened, even if you
> are just cutting and pasting the reason for every commit. That's because git
> commits need to be standalone and not depend on information in other git
> commit change logs.

Ack. I will also take your suggestion in the other email and remove the
relevant code instead. Reverting is clearly the wrong mechanism for this.

And thanks for helping me with all of the process related issues.
Appreciate it.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  4:52     ` Christoph Lameter
@ 2023-11-08  5:12       ` Steven Rostedt
  2023-11-08  6:49         ` Ankur Arora
  2023-11-08  7:54         ` Vlastimil Babka
  0 siblings, 2 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08  5:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
Christoph Lameter <cl@linux.com> wrote:

> On Tue, 7 Nov 2023, Ankur Arora wrote:
> 
> > This came up in an earlier discussion (See
> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
> > that preempt_enable/_disable() overhead was relatively minimal.
> >
> > Is your point that always-on preempt_count is far too expensive?  
> 
> Yes over the years distros have traditionally delivered their kernels by 
> default without preemption because of these issues. If the overhead has 
> been minimized then that may have changed. Even if so there is still a lot 
> of code being generated that has questionable benefit and just 
> bloats the kernel.
> 
> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
> >> that are used for synchronization. You cannot remove those without changing a
> >> lot of synchronization primitives to always have to consider being preempted
> >> while operating.  
> >
> > I'm afraid I don't understand why you would need to change any
> > synchronization primitives. The code that does preempt_enable/_disable()
> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
> > CONFIG_PREEMPT_COUNT.  
> 
> In the trivial cases it is simple like that. But look f.e.
> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is
> overhead added to be able to allow the cpu to change under us. There are 
> likely other examples in the source.
> 

preempt_disable() and preempt_enable() are much lower overhead today than
they used to be.

If you are worried about changing CPUs, there's also migrate_disable().
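
(For example, a contrived sketch with made-up per-CPU variables; the point is
that migrate_disable() keeps the task on this CPU while still allowing
preemption:)

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/types.h>

static DEFINE_PER_CPU(u64, stat_a);
static DEFINE_PER_CPU(u64, stat_b);

static u64 read_stats(void)
{
	u64 a, b;

	migrate_disable();		/* may be preempted, but not migrated */
	a = this_cpu_read(stat_a);
	b = this_cpu_read(stat_b);	/* same CPU as 'a' */
	migrate_enable();

	return a + b;
}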

> And the whole business of local data 
> access via per cpu areas suffers if we cannot rely on two accesses in a 
> section being able to see consistent values.
> 
> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.  
> 
> Just for fun? Code is most efficient if it does not have to consider too 
> many side conditions like suddenly running on a different processor. This 
> introduces needless complexity into the code. It would be better to remove 
> PREEMPT_COUNT for good to just rely on voluntary preemption. We could 
> probably reduce the complexity of the kernel source significantly.

That is what caused this thread in the first place. Randomly scattered
"preemption points" do not scale!

And I'm sorry, we have latency sensitive use cases that require full
preemption.

> 
> I have never noticed a need for preemption at every instruction in the 
> kernel (if that would be possible at all... Locks etc prevent that ideal 
> scenario frequently). Preemption like that is more like a pipe dream.
> 
> High performance kernel solutions usually disable overhead like that.
> 

Please read the email from Thomas:

   https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/

This is not technically getting rid of PREEMPT_NONE. It is adding a new
NEED_RESCHED_LAZY flag, that will have the kernel preempt only when
entering or in user space. It will behave the same as PREEMPT_NONE, but
without the need for all the cond_resched() scattered randomly throughout
the kernel.

If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms at
250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you will
preempt at the next available location (preempt_count == 0).
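
(Sketched roughly, and not the series' actual code; the helper below is
hypothetical, only the TIF_NEED_RESCHED_LAZY flag comes from the series:)

#include <linux/sched.h>

/* Hypothetical helper, called from the scheduler tick. */
static void tick_escalate_lazy_resched(struct task_struct *curr)
{
	/*
	 * The task was marked for lazy rescheduling but is still in the
	 * kernel a tick later: upgrade to NEED_RESCHED so it gets preempted
	 * at the next point where preempt_count() drops to zero, or at
	 * ret-to-kernel.
	 */
	if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
		set_tsk_need_resched(curr);
}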

But yes, all locations that do not explicitly disable preemption, will now
possibly preempt (due to long running kernel threads).

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  5:12       ` Steven Rostedt
@ 2023-11-08  6:49         ` Ankur Arora
  2023-11-08  7:54         ` Vlastimil Babka
  1 sibling, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  6:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Christoph Lameter, Ankur Arora, linux-kernel, tglx, peterz,
	torvalds, paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
> Christoph Lameter <cl@linux.com> wrote:
>
>> On Tue, 7 Nov 2023, Ankur Arora wrote:
>>
>> > This came up in an earlier discussion (See
>> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
>> > that preempt_enable/_disable() overhead was relatively minimal.
>> >
>> > Is your point that always-on preempt_count is far too expensive?
>>
>> Yes over the years distros have traditionally delivered their kernels by
>> default without preemption because of these issues. If the overhead has
>> been minimized then that may have changed. Even if so there is still a lot
>> of code being generated that has questionable benefit and just
>> bloats the kernel.
>>
>> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
>> >> that are used for synchronization. You cannot remove those without changing a
>> >> lot of synchronization primitives to always have to consider being preempted
>> >> while operating.
>> >
>> > I'm afraid I don't understand why you would need to change any
>> > synchronization primitives. The code that does preempt_enable/_disable()
>> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
>> > CONFIG_PREEMPT_COUNT.
>>
>> In the trivial cases it is simple like that. But look f.e.
>> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is
>> overhead added to be able to allow the cpu to change under us. There are
>> likely other examples in the source.
>>
>
> preempt_disable() and preempt_enable() are much lower overhead today than
> it use to be.
>
> If you are worried about changing CPUs, there's also migrate_disable() too.
>
>> And the whole business of local data
>> access via per cpu areas suffers if we cannot rely on two accesses in a
>> section being able to see consistent values.
>>
>> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.
>>
>> Just for fun? Code is most efficient if it does not have to consider too
>> many side conditions like suddenly running on a different processor. This
>> introduces needless complexity into the code. It would be better to remove
>> PREEMPT_COUNT for good to just rely on voluntary preemption. We could
>> probably reduce the complexity of the kernel source significantly.
>
> That is what caused this thread in the first place. Randomly scattered
> "preemption points" does not scale!
>
> And I'm sorry, we have latency sensitive use cases that require full
> preemption.
>
>>
>> I have never noticed a need for preemption at every instruction in the
>> kernel (if that would be possible at all... Locks etc prevent that ideal
>> scenario frequently). Preemption like that is more like a pipe dream.

The intent isn't to preempt at every other instruction in the kernel.

As Thomas describes, the idea is that for voluntary preemption kernels
resched happens at cond_resched() points which have been distributed
heuristically. As a consequence you might get both too little preemption
and too much preemption.

The intent is to bring preemption in control of the scheduler which
can do a better job than randomly placed cond_resched() points.

>> High performance kernel solutions usually disable overhead like that.

You are also missing all the ways in which voluntary preemption
points are responsible for poor performance. For instance, if you look
at clear_huge_page(), it clears page by page, with a cond_resched()
call after clearing each page.
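
(Simplified illustration of that pattern; the real code in mm/memory.c is
more involved:)

#include <linux/highmem.h>
#include <linux/sched.h>

static void clear_huge_page_sketch(struct page *page, unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		clear_highpage(page + i);
		cond_resched();		/* one preemption point per base page */
	}
}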

But if you can expose the full extent to the CPU, it can optimize
differently (for the 1GB page it can now elide cacheline allocation):

  *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

> Please read the email from Thomas:
>
>    https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
>
> This is not technically getting rid of PREEMPT_NONE. It is adding a new
> NEED_RESCHED_LAZY flag, that will have the kernel preempt only when
> entering or in user space. It will behave the same as PREEMPT_NONE, but
> without the need for all the cond_resched() scattered randomly throughout
> the kernel.

And a corollary of that is that with a scheduler controlled PREEMPT_NONE
a task might end up running to completion where earlier it could have been
preempted early because it crossed a cond_resched().

> If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms at
> 250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you will
> preempt at the next available location (preempt_count == 0).
>
> But yes, all locations that do not explicitly disable preemption, will now
> possibly preempt (due to long running kernel threads).

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (58 preceding siblings ...)
  2023-11-08  4:08 ` [RFC PATCH 00/86] Make the kernel preemptible Christoph Lameter
@ 2023-11-08  7:31 ` Juergen Gross
  2023-11-08  8:51 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  62 siblings, 0 replies; 250+ messages in thread
From: Juergen Gross @ 2023-11-08  7:31 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, andrew.cooper3, mingo, bristot, mathieu.desnoyers,
	geert, glaubitz, anton.ivanov, mattst88, krypton, rostedt,
	David.Laight, richard, mjguzik


On 07.11.23 22:56, Ankur Arora wrote:
> Hi,
> 
> We have two models of preemption: voluntary and full (and RT which is
> a fuller form of full preemption.) In this series -- which is based
> on Thomas' PoC (see [1]), we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.
> 
> (Note that this is about preemption when executing in the kernel.
> Userspace is always preemptible.)
> 
> Background
> ==
> 
> Why?: both of these preemption mechanisms are almost entirely disjoint.
> There are four main sets of preemption points in the kernel:
> 
>   1. return to user
>   2. explicit preemption points (cond_resched() and its ilk)
>   3. return to kernel (tick/IPI/irq at irqexit)
>   4. end of non-preemptible sections at (preempt_count() == preempt_offset)
> 
> Voluntary preemption uses mechanisms 1 and 2. Full preemption
> uses 1, 3 and 4. In addition both use cond_resched_{rcu,lock,rwlock*}
> which can be all things to all people because they internally
> contain 2, and 4.
> 
> Now since there's no ideal placement of explicit preemption points,
> they tend to be randomly spread over code and accumulate over time,
> as they are are added when latency problems are seen. Plus fear of
> regressions makes them difficult to remove.
> (Presumably, asymptotically they would spead out evenly across the
> instruction stream!)
> 
> In voluntary models, the scheduler's job is to match the demand
> side of preemption points (a task that needs to be scheduled) with
> the supply side (a task which calls cond_resched().)
> 
> Full preemption models track preemption count so the scheduler can
> always knows if it is safe to preempt and can drive preemption
> itself (ex. via dynamic preemption points in 3.)
> 
> Design
> ==
> 
> As Thomas outlines in [1], to unify the preemption models we
> want to: always have the preempt_count enabled and allow the scheduler
> to drive preemption policy based on the model in effect.
> 
> Policies:
> 
> - preemption=none: run to completion
> - preemption=voluntary: run to completion, unless a task of higher
>    sched-class awaits
> - preemption=full: optimized for low-latency. Preempt whenever a higher
>    priority task awaits.
> 
> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
> scheduler to mark that a reschedule is needed, but is deferred until
> the task finishes executing in the kernel -- voluntary preemption
> as it were.
> 
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
> 
>           ret-to-user    ret-to-kernel    preempt_count()
> none           Y              N                N
> voluntary      Y              Y                Y
> full           Y              Y                Y
> 
> 
> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spread a long time in the kernel have no way to give
> up the CPU.
> 
> For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.
> 
> For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
> if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick.
> (This would cause preemption either at ret-to-kernel, or if the task is in
> a non-preemptible section, when it exits that section.)
> 
> Arguably this provides for much more consistent maximum latency (~2 tick
> lengths + length of non-preemptible section) as compared to the old model
> where the maximum latency depended on the dynamic distribution of
> cond_resched() points.
> 
> (As a bonus it handles code that is preemptible but cannot call cond_resched()
>   completely trivially: ex. long running Xen hypercalls, or this series
>   which started this discussion:
>   https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)
> 
> 
> Status
> ==
> 
> What works:
>   - The system seems to keep ticking over with the normal scheduling policies
>     (SCHED_OTHER). The support for the realtime policies is somewhat more
>     half-baked.
>   - The basic performance numbers seem pretty close to 6.6-rc7 baseline
> 
> What's broken:
>   - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>     lazily")
>   - Non-x86 architectures. It's trivial to support other archs (only need
>     to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>     comments on the series.
>     (From some testing on arm64, didn't find any surprises.)
>   - livepatch: livepatch depends on using _cond_resched() to provide
>     low-latency patching. That is obviously difficult with cond_resched()
>     gone. We could get a similar effect by using a static_key in
>     preempt_enable() but at least with inline locks, that might end up
>     bloating the kernel quite a bit.
>   - Documentation/ and comments mention cond_resched()
>   - ftrace support for need-resched-lazy is incomplete
> 
> What needs more discussion:
>   - Should cond_resched_lock() etc schedule out for TIF_NEED_RESCHED
>     only, or for TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
>     change to tif_need_resched(resched_t)")
>   - Tracking whether a task is in userspace or in the kernel (See patch-40
>     "context_tracking: add ct_state_cpu()")
>   - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
>     preemption")
> 
> 
> Performance
> ==
> 
> Expectation:
> 
> * perf sched bench pipe
> 
> preemption               full           none
> 
> 6.6-rc7              6.68 +- 0.10   6.69 +- 0.07
> +series              6.69 +- 0.12   6.67 +- 0.10
> 
> This is rescheduling out of idle which should and does perform identically.
> 
> * schbench, preempt=none
> 
>    * 1 group, 16 threads each
> 
>                   6.6-rc7      +series
>                   (usecs)      (usecs)
>       50.0th:         6            6
>       90.0th:         8            7
>       99.0th:        11           11
>       99.9th:        15           14
>    
>    * 8 groups, 16 threads each
> 
>                  6.6-rc7       +series
>                   (usecs)      (usecs)
>       50.0th:         6            6
>       90.0th:         8            8
>       99.0th:        12           11
>       99.9th:        20           21
> 
> 
> * schbench, preempt=full
> 
>    * 1 group, 16 threads each
> 
>                  6.6-rc7       +series
>                  (usecs)       (usecs)
>       50.0th:         6            6
>       90.0th:         8            7
>       99.0th:        11           11
>       99.9th:        14           14
> 
> 
>    * 8 groups, 16 threads each
> 
>                  6.6-rc7       +series
>                  (usecs)       (usecs)
>       50.0th:         7            7
>       90.0th:         9            9
>       99.0th:        12           12
>       99.9th:        21           22
> 
> 
>    Not much in it either way.
> 
> * kernbench, preempt=full
> 
>    * half-load (-j 128)
> 
>             6.6-rc7                                    +series
> 
>    wall        149.2  +-     27.2             wall        132.8  +-     0.4
>    utime      8097.1  +-     57.4             utime      8088.5  +-    14.1
>    stime      1165.5  +-      9.4             stime      1159.2  +-     1.9
>    %cpu       6337.6  +-   1072.8             %cpu       6959.6  +-    22.8
>    csw      237618    +-   2190.6             csw      240343    +-  1386.8
> 
> 
>    * optimal-load (-j 1024)
> 
>             6.6-rc7                                    +series
> 
>    wall        137.8 +-       0.0             wall       137.7  +-       0.8
>    utime     11115.0 +-    3306.1             utime    11041.7  +-    3235.0
>    stime      1340.0 +-     191.3             stime     1323.1  +-     179.5
>    %cpu       8846.3 +-    2830.6             %cpu      9101.3  +-    2346.7
>    csw     2099910   +- 2040080.0             csw    2068210    +- 2002450.0
> 
> 
>    The preempt=full path should effectively not see any change in
>    behaviour. The optimal-loads are pretty much identical.
>    For the half-load, however, the +series version does much better but that
>    seems to be because of much higher run to run variability in the 6.6-rc7 load.
> 
> * kernbench, preempt=none
> 
>    * half-load (-j 128)
> 
>             6.6-rc7                                    +series
> 
>    wall        134.5  +-      4.2             wall        133.6  +-     2.7
>    utime      8093.3  +-     39.3             utime      8099.0  +-    38.9
>    stime      1175.7  +-     10.6             stime      1169.1  +-     8.4
>    %cpu       6893.3  +-    233.2             %cpu       6936.3  +-   142.8
>    csw      240723    +-    423.0             csw      173152    +-  1126.8
>                                               
> 
>    * optimal-load (-j 1024)
> 
>             6.6-rc7                                    +series
> 
>    wall        139.2 +-       0.3             wall       138.8  +-       0.2
>    utime     11161.0 +-    3360.4             utime    11061.2  +-    3244.9
>    stime      1357.6 +-     199.3             stime     1366.6  +-     216.3
>    %cpu       9108.8 +-    2431.4             %cpu      9081.0  +-    2351.1
>    csw     2078599   +- 2013320.0             csw    1970610    +- 1969030.0
> 
> 
>    For both of these the wallclock, utime, stime etc are pretty much
>    identical. The one interesting difference is that the number of
>    context switches is lower. This intuitively makes sense given that
>    we reschedule threads lazily rather than rescheduling if we encounter
>    a cond_resched() and there's a thread wanting to be scheduled.
> 
>    The max-load numbers (not posted here) also behave similarly.
> 
> 
> Series
> ==
> 
> With that, this is how the series is laid out:
> 
>   - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of the infrastructure
>     used by that is via static_calls() and this is a simpler approach which
>     doesn't need any of that (and does away with cond_resched().)
> 
>     Some of the commits will be resurrected.
>         089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
>         cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
>         5693fa74f98a ("kcsan: Use preemption model accessors")
> 
>   - Patches 31-45: contain the scheduler changes to do this. Of these
>     the critical ones are:
>       patch 35 "thread_info: change to tif_need_resched(resched_t)"
>       patch 41 "sched: handle resched policy in resched_curr()"
>       patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models"
>       patch 44 "sched: voluntary preemption"
>        (this needs more work to decide when a higher sched-policy task
>         should preempt a lower sched-policy task)
>       patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily"
> 
>   - Patches 47-50: contain RCU related changes. RCU now works in both
>     PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION.
>     (Until now PREEMPTION=y => PREEMPT_RCU)
> 
>   - Patches 51-56,86: contain cond_resched() related cleanups.
>       patch 54 "sched: add cond_resched_stall()" adds a new cond_resched()
>       interface. Pitchforks?
> 
>   - Patches 57-86: remove cond_resched() from the tree.
> 
> 
> Also at: github.com/terminus/linux preemption-rfc
> 
> 
> Please review.
> 
> Thanks
> Ankur
> 
> [1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> 
> 
> Ankur Arora (86):
>    Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>    Revert "sched/core: Make sched_dynamic_mutex static"
>    Revert "ftrace: Use preemption model accessors for trace header
>      printout"
>    Revert "preempt/dynamic: Introduce preemption model accessors"
>    Revert "kcsan: Use preemption model accessors"
>    Revert "entry: Fix compile error in
>      dynamic_irqentry_exit_cond_resched()"
>    Revert "livepatch,sched: Add livepatch task switching to
>      cond_resched()"
>    Revert "arm64: Support PREEMPT_DYNAMIC"
>    Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>    Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
>      GENERIC_ENTRY"
>    Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
>    Revert "sched/preempt: Refactor sched_dynamic_update()"
>    Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
>    Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
>    Revert "preempt: Restore preemption model selection configs"
>    Revert "sched: Provide Kconfig support for default dynamic preempt
>      mode"
>    sched/preempt: remove PREEMPT_DYNAMIC from the build version
>    Revert "preempt/dynamic: Fix typo in macro conditional statement"
>    Revert "sched,preempt: Move preempt_dynamic to debug.c"
>    Revert "static_call: Relax static_call_update() function argument
>      type"
>    Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
>    Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
>    Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
>    Revert "sched: Harden PREEMPT_DYNAMIC"
>    Revert "sched: Add /debug/sched_preempt"
>    Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
>      option"
>    Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
>      call"
>    Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
>      calls"
>    Revert "preempt/dynamic: Provide cond_resched() and might_resched()
>      static calls"
>    Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
>    x86/thread_info: add TIF_NEED_RESCHED_LAZY
>    entry: handle TIF_NEED_RESCHED_LAZY
>    entry/kvm: handle TIF_NEED_RESCHED_LAZY
>    thread_info: accessors for TIF_NEED_RESCHED*
>    thread_info: change to tif_need_resched(resched_t)
>    entry: irqentry_exit only preempts TIF_NEED_RESCHED
>    sched: make test_*_tsk_thread_flag() return bool
>    sched: *_tsk_need_resched() now takes resched_t
>    sched: handle lazy resched in set_nr_*_polling()
>    context_tracking: add ct_state_cpu()
>    sched: handle resched policy in resched_curr()
>    sched: force preemption on tick expiration
>    sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
>    sched: voluntary preemption
>    preempt: ARCH_NO_PREEMPT only preempts lazily
>    tracing: handle lazy resched
>    rcu: select PREEMPT_RCU if PREEMPT
>    rcu: handle quiescent states for PREEMPT_RCU=n
>    osnoise: handle quiescent states directly
>    rcu: TASKS_RCU does not need to depend on PREEMPTION
>    preempt: disallow !PREEMPT_COUNT or !PREEMPTION
>    sched: remove CONFIG_PREEMPTION from *_needbreak()
>    sched: fixup __cond_resched_*()
>    sched: add cond_resched_stall()
>    xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
>    xarray: use cond_resched_xas*()
>    coccinelle: script to remove cond_resched()
>    treewide: x86: remove cond_resched()
>    treewide: rcu: remove cond_resched()
>    treewide: torture: remove cond_resched()
>    treewide: bpf: remove cond_resched()
>    treewide: trace: remove cond_resched()
>    treewide: futex: remove cond_resched()
>    treewide: printk: remove cond_resched()
>    treewide: task_work: remove cond_resched()
>    treewide: kernel: remove cond_resched()
>    treewide: kernel: remove cond_reshed()
>    treewide: mm: remove cond_resched()
>    treewide: io_uring: remove cond_resched()
>    treewide: ipc: remove cond_resched()
>    treewide: lib: remove cond_resched()
>    treewide: crypto: remove cond_resched()
>    treewide: security: remove cond_resched()
>    treewide: fs: remove cond_resched()
>    treewide: virt: remove cond_resched()
>    treewide: block: remove cond_resched()
>    treewide: netfilter: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: sound: remove cond_resched()
>    treewide: md: remove cond_resched()
>    treewide: mtd: remove cond_resched()
>    treewide: drm: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: drivers: remove cond_resched()
>    sched: remove cond_resched()

I'm missing the removal of the Xen parts, which were one of the reasons to start
this whole work (xen_in_preemptible_hcall etc.).


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()"
  2023-11-07 21:56 ` [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()" Ankur Arora
@ 2023-11-08  7:47   ` Greg KH
  2023-11-08  9:09     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Greg KH @ 2023-11-08  7:47 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Tue, Nov 07, 2023 at 01:56:52PM -0800, Ankur Arora wrote:
> This reverts commit 0a70045ed8516dfcff4b5728557e1ef3fd017c53.
> 

None of these reverts say "why" the revert is needed, or why you even
want to do this at all.  Reverting a compilation error feels like you
are going to be adding a compilation error to the build, which is
generally considered a bad thing :(

So, more information please.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
  2023-11-08  1:28     ` Sergey Senozhatsky
@ 2023-11-08  7:49       ` Vlastimil Babka
  2023-11-08  8:02         ` Yosry Ahmed
  0 siblings, 1 reply; 250+ messages in thread
From: Vlastimil Babka @ 2023-11-08  7:49 UTC (permalink / raw)
  To: Sergey Senozhatsky, Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, SeongJae Park, Mike Kravetz, Muchun Song,
	Andrey Ryabinin, Marco Elver, Catalin Marinas, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Naoya Horiguchi,
	Miaohe Lin, David Hildenbrand, Oscar Salvador, Mike Rapoport,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Hugh Dickins, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vitaly Wool, Minchan Kim,
	Seth Jennings, Dan Streetman

On 11/8/23 02:28, Sergey Senozhatsky wrote:
> On (23/11/07 15:08), Ankur Arora wrote:
> [..]
>> +++ b/mm/zsmalloc.c
>> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>>  			dst_zspage = NULL;
>>  
>>  			spin_unlock(&pool->lock);
>> -			cond_resched();
>>  			spin_lock(&pool->lock);
>>  		}
>>  	}
> 
> I'd personally prefer to have a comment explaining why we do that
> spin_unlock/spin_lock sequence, which may look confusing to people.

Wonder if it would make sense to have a lock operation that does the
unlock/lock as a self-documenting thing, and maybe could also be optimized
to first check if there's actually a need for it (because TIF_NEED_RESCHED
is set or the lock is contended).
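
Something like the below, perhaps (just a sketch, name made up; the
existing cond_resched_lock() already does more or less this):

	/*
	 * Hypothetical helper: cycle the lock only when there's a reason
	 * to, instead of doing it unconditionally.
	 */
	static inline void spin_lock_relax(spinlock_t *lock)
	{
		if (need_resched() || spin_needbreak(lock)) {
			spin_unlock(lock);
			cond_resched();	/* or whatever replaces it with this series */
			spin_lock(lock);
		}
	}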

> Maybe would make sense to put a nice comment in all similar cases.
> For instance:
> 
>   	rcu_read_unlock();
>  -	cond_resched();
>   	rcu_read_lock();


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  5:12       ` Steven Rostedt
  2023-11-08  6:49         ` Ankur Arora
@ 2023-11-08  7:54         ` Vlastimil Babka
  1 sibling, 0 replies; 250+ messages in thread
From: Vlastimil Babka @ 2023-11-08  7:54 UTC (permalink / raw)
  To: Steven Rostedt, Christoph Lameter
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On 11/8/23 06:12, Steven Rostedt wrote:
> On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
> Christoph Lameter <cl@linux.com> wrote:
> 
>> On Tue, 7 Nov 2023, Ankur Arora wrote:
>> 
>> > This came up in an earlier discussion (See
>> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
>> > that preempt_enable/_disable() overhead was relatively minimal.
>> >
>> > Is your point that always-on preempt_count is far too expensive?  
>> 
>> Yes over the years distros have traditionally delivered their kernels by 
>> default without preemption because of these issues. If the overhead has 
>> been minimized then that may have changed. Even if so there is still a lot 
>> of code being generated that has questionable benefit and just 
>> bloats the kernel.
>> 
>> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
>> >> that are used for synchronization. You cannot remove those without changing a
>> >> lot of synchronization primitives to always have to consider being preempted
>> >> while operating.  
>> >
>> > I'm afraid I don't understand why you would need to change any
>> > synchronization primitives. The code that does preempt_enable/_disable()
>> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
>> > CONFIG_PREEMPT_COUNT.  
>> 
>> In the trivial cases it is simple like that. But look f.e.
>> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is
>> overhead added to be able to allow the cpu to change under us. There are 
>> likely other examples in the source.
>> 
> 
> preempt_disable() and preempt_enable() are much lower overhead today than
> it use to be.
> 
> If you are worried about changing CPUs, there's also migrate_disable() too.

Note that while migrate_disable() would often be sufficient, the
implementation of it actually has more overhead (function call, does
preempt_disable()/enable() as part of it) than just preempt_disable(). See
for example the pcpu_task_pin() definition in mm/page_alloc.c
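
For reference, that definition is roughly the following (from memory, so
double-check the tree):

	/* mm/page_alloc.c: pin the task across the per-cpu pageset lookup+lock */
	#ifndef CONFIG_PREEMPT_RT
	#define pcpu_task_pin()		preempt_disable()
	#define pcpu_task_unpin()	preempt_enable()
	#else
	#define pcpu_task_pin()		migrate_disable()
	#define pcpu_task_unpin()	migrate_enable()
	#endif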




^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
  2023-11-08  7:49       ` Vlastimil Babka
@ 2023-11-08  8:02         ` Yosry Ahmed
  2023-11-08  8:54           ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Yosry Ahmed @ 2023-11-08  8:02 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sergey Senozhatsky, Ankur Arora, linux-kernel, tglx, peterz,
	torvalds, paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, SeongJae Park, Mike Kravetz, Muchun Song,
	Andrey Ryabinin, Marco Elver, Catalin Marinas, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Naoya Horiguchi,
	Miaohe Lin, David Hildenbrand, Oscar Salvador, Mike Rapoport,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Hugh Dickins, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vitaly Wool, Minchan Kim,
	Seth Jennings, Dan Streetman

On Tue, Nov 7, 2023 at 11:49 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 11/8/23 02:28, Sergey Senozhatsky wrote:
> > On (23/11/07 15:08), Ankur Arora wrote:
> > [..]
> >> +++ b/mm/zsmalloc.c
> >> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> >>                      dst_zspage = NULL;
> >>
> >>                      spin_unlock(&pool->lock);
> >> -                    cond_resched();
> >>                      spin_lock(&pool->lock);
> >>              }
> >>      }
> >
> > I'd personally prefer to have a comment explaining why we do that
> > spin_unlock/spin_lock sequence, which may look confusing to people.
>
> Wonder if it would make sense to have a lock operation that does the
> unlock/lock as a self-documenting thing, and maybe could also be optimized
> to first check if there's actually a need for it (because TIF_NEED_RESCHED
> or lock is contended).

+1, I was going to suggest this as well. It can be extended to other
locking types that disable preemption, such as RCU. Something like
spin_lock_relax(), perhaps.

>
> > Maybe would make sense to put a nice comment in all similar cases.
> > For instance:
> >
> >       rcu_read_unlock();
> >  -    cond_resched();
> >       rcu_read_lock();
>
>

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
  2023-11-07 23:19   ` [RFC PATCH 57/86] coccinelle: script to " Julia Lawall
@ 2023-11-08  8:29     ` Ankur Arora
  2023-11-08  9:49       ` Julia Lawall
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  8:29 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Nicolas Palix


Julia Lawall <julia.lawall@inria.fr> writes:

> On Tue, 7 Nov 2023, Ankur Arora wrote:
>
>> Rudimentary script to remove the straight-forward subset of
>> cond_resched() and allies:
>>
>> 1)  if (need_resched())
>> 	  cond_resched()
>>
>> 2)  expression*;
>>     cond_resched();  /* or in the reverse order */
>>
>> 3)  if (expression)
>> 	statement
>>     cond_resched();  /* or in the reverse order */
>>
>> The last two patterns depend on the control flow level to ensure
>> that the complex cond_resched() patterns (ex. conditioned ones)
>> are left alone and we only pick up ones which are only minimally
>> related to the neighbouring code.
>>
>> Cc: Julia Lawall <Julia.Lawall@inria.fr>
>> Cc: Nicolas Palix <nicolas.palix@imag.fr>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
>>  1 file changed, 53 insertions(+)
>>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
>>
>> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
>> new file mode 100644
>> index 000000000000..bf43768a8f8c
>> --- /dev/null
>> +++ b/scripts/coccinelle/api/cond_resched.cocci
>> @@ -0,0 +1,53 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/// Remove naked cond_resched() statements
>> +///
>> +//# Remove cond_resched() statements when:
>> +//#   - executing at the same control flow level as the previous or the
>> +//#     next statement (this lets us avoid complicated conditionals in
>> +//#     the neighbourhood.)
>> +//#   - they are of the form "if (need_resched()) cond_resched()" which
>> +//#     is always safe.
>> +//#
>> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
>> +//# but might need to handle other comments alluding to rescheduling.
>> +//#
>> +virtual patch
>> +virtual context
>> +
>> +@ r1 @
>> +identifier r;
>> +@@
>> +
>> +(
>> + r = cond_resched();
>> +|
>> +-if (need_resched())
>> +-	cond_resched();
>> +)
>
> This rule doesn't make sense.  The first branch of the disjunction will
> never match a a place where the second branch matches.  Anyway, in the
> second branch there is no assignment, so I don't see what the first branch
> is protecting against.
>
> The disjunction is just useless.  Whether it is there or or whether only
> the second brancha is there, doesn't have any impact on the result.
>
>> +
>> +@ r2 @
>> +expression E;
>> +statement S,T;
>> +@@
>> +(
>> + E;
>> +|
>> + if (E) S
>
> This case is not needed.  It will be matched by the next case.
>
>> +|
>> + if (E) S else T
>> +|
>> +)
>> +-cond_resched();
>> +
>> +@ r3 @
>> +expression E;
>> +statement S,T;
>> +@@
>> +-cond_resched();
>> +(
>> + E;
>> +|
>> + if (E) S
>
> As above.
>
>> +|
>> + if (E) S else T
>> +)
>
> I have the impression that you are trying to retain some cond_rescheds.
> Could you send an example of one that you are trying to keep?  Overall,
> the above rules seem a bit ad hoc.  You may be keeping some cases you
> don't want to, or removing some cases that you want to keep.

Right. I was trying to ensure that the script only handled the cases
that didn't have any "interesting" connections to the surrounding code.

Just to give you an example of the kind of constructs that I wanted
to avoid:

mm/memory.c::zap_pmd_range():

                if (addr != next)
                        pmd--;
        } while (pmd++, cond_resched(), addr != end);

mm/backing-dev.c::cleanup_offline_cgwbs_workfn()

                while (cleanup_offline_cgwb(wb))
                        cond_resched();

But from a quick check the simplest coccinelle script does a much
better job than my overly complex (and incorrect) one:

@r1@
@@
-       cond_resched();

It avoids the first one. And transforms the second to:

                while (cleanup_offline_cgwb(wb))
                        {}

which is exactly what I wanted.

> Of course, if you are confident that the job is done with this semantic
> patch as it is, then that's fine too.

Not at all. Thanks for pointing out the mistakes.



--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 45/86] preempt: ARCH_NO_PREEMPT only preempts lazily
  2023-11-08  0:07   ` Steven Rostedt
@ 2023-11-08  8:47     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  8:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, geert


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue,  7 Nov 2023 13:57:31 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Note: this commit is badly broken. Only here for discussion.
>>
>> Configurations with ARCH_NO_PREEMPT support preempt_count, but might
>> not be tested well enough under PREEMPTION and so might not be
>> demarcating the necessary non-preemptible sections.
>>
>> One way to handle this is by limiting them to PREEMPT_NONE mode, not
>> doing any tick enforcement and limiting preemption to happen only at
>> user boundary.
>>
>> Unfortunately, this is only a partial solution because eager
>> rescheduling could still happen (say, due to RCU wanting an
>> expedited quiescent period.) And, because we do not trust the
>> preempt_count accounting, this would mean preemption inside an
>> unmarked critical section.
>
> Is preempt_count accounting really not trustworthy?

I think the concern was that we might not be marking all the sections
that might be non-preemptible.

Plus, given that these archs have always been !preemption, it is
unlikely that they would work without changes. I think basically we
don't have a clue if they work or not since they haven't ever been tested.

> That is, if we preempt at preempt_count() going to zero but nowhere else,
> would that work? At least it would create some places that can be resched.

I'm not sure we can be sure. I can imagine places where it should be
preempt_disable() ; spin_lock() ; ... ; spin_unlock(); preempt_enable()
but the preempt_disable/_enable() are MIA.

I think even so it is a pretty good idea. We could decompose
ARCH_NO_PREEMPT into ARCH_NO_PREEMPT_COUNT and ARCH_NO_PREEMPT_IRQ.

And, as you imply, PREEMPTION (or rather PREEMPT_COUNT) only depends
on ARCH_NO_PREEMPT_COUNT, not the ARCH_NO_PREEMPT_IRQ.

And this change might make the task of fixing this simpler since you
would only have to worry about the neighbourhood of preempt_enable()
and the paths leading to it.

 void irqentry_exit_cond_resched(void)
 {
-       if (!preempt_count()) {
+       if (!IS_ENABLED(CONFIG_ARCH_NO_PREEMPT_IRQ) && !preempt_count()) {
                /* Sanity check RCU and thread stack */
                rcu_irq_exit_check_preempt();


Geert, if you think it might help I can send out a patch.

> What's the broken part of these archs? The assembly?

Not sure anyone knows. But, assuming m68k is representative of the other
three ARCH_NO_PREEMPT ones (might be better placed, because COLDFIRE m68k
already supports preemption) the patches Geert had sent out add:

 - preempt_enable/_disable() pairs to the cache/tlb flush logic
 - a need-resched check and call to preempt_schedule_irq() in the
   exception return path.

m68k support: https://lore.kernel.org/all/7858a184cda66e0991fd295c711dfed7e4d1248c.1696603287.git.geert@linux-m68k.org/#t

(The series itself ran into an mmput() bug which might or might not
have anything to do with preemption.)

> If that's the case, as
> long as the generic code has the preempt_count() I would think that would
> be trust worthy. I'm also guessing that in_irq() and friends are still
> reliable.

True.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (59 preceding siblings ...)
  2023-11-08  7:31 ` Juergen Gross
@ 2023-11-08  8:51 ` Peter Zijlstra
  2023-11-08  9:53   ` Daniel Bristot de Oliveira
  2023-11-08 10:04   ` Ankur Arora
  2023-11-08  9:43 ` David Laight
  2023-11-08 16:33 ` Mark Rutland
  62 siblings, 2 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  8:51 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
> Hi,
> 
> We have two models of preemption: voluntary and full 

3, also none (RT is not actually a preemption model).

> (and RT which is
> a fuller form of full preemption.)

It is not in fact a different preemption model, it is the same full
preemption, the difference with RT is that it makes a lot more stuff
preemptible, but the fundamental preemption model is the same -- full.

> In this series -- which is based
> on Thomas' PoC (see [1]), we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.

Well, you've also taken out preempt_dynamic for some obscure reason :/


> Please review.

> Ankur Arora (86):
>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>   Revert "sched/core: Make sched_dynamic_mutex static"
>   Revert "ftrace: Use preemption model accessors for trace header
>     printout"
>   Revert "preempt/dynamic: Introduce preemption model accessors"
>   Revert "kcsan: Use preemption model accessors"
>   Revert "entry: Fix compile error in
>     dynamic_irqentry_exit_cond_resched()"
>   Revert "livepatch,sched: Add livepatch task switching to
>     cond_resched()"
>   Revert "arm64: Support PREEMPT_DYNAMIC"
>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>   Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
>     GENERIC_ENTRY"
>   Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
>   Revert "sched/preempt: Refactor sched_dynamic_update()"
>   Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
>   Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
>   Revert "preempt: Restore preemption model selection configs"
>   Revert "sched: Provide Kconfig support for default dynamic preempt
>     mode"
>   sched/preempt: remove PREEMPT_DYNAMIC from the build version
>   Revert "preempt/dynamic: Fix typo in macro conditional statement"
>   Revert "sched,preempt: Move preempt_dynamic to debug.c"
>   Revert "static_call: Relax static_call_update() function argument
>     type"
>   Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
>   Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
>   Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
>   Revert "sched: Harden PREEMPT_DYNAMIC"
>   Revert "sched: Add /debug/sched_preempt"
>   Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
>     option"
>   Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
>     call"
>   Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
>     calls"
>   Revert "preempt/dynamic: Provide cond_resched() and might_resched()
>     static calls"
>   Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"

NAK

Even if you were to remove PREEMPT_NONE, which should be a separate
series, but that isn't on the table at all afaict, removing
preempt_dynamic doesn't make sense.

You still want the preempt= boot time argument and the
/debug/sched/preempt things to dynamically switch between the models.

Please, focus on the voluntary thing, gut that and then replace it with
the lazy thing, but leave everything else in place.

Re dynamic preempt, gutting the current voluntary preemption model means
getting rid of the cond_resched and might_resched toggles but you'll
gain a switch to kill the lazy stuff.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
  2023-11-08  8:02         ` Yosry Ahmed
@ 2023-11-08  8:54           ` Ankur Arora
  2023-11-08 12:58             ` Matthew Wilcox
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  8:54 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Vlastimil Babka, Sergey Senozhatsky, Ankur Arora, linux-kernel,
	tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, SeongJae Park,
	Mike Kravetz, Muchun Song, Andrey Ryabinin, Marco Elver,
	Catalin Marinas, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Naoya Horiguchi, Miaohe Lin, David Hildenbrand,
	Oscar Salvador, Mike Rapoport, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Hugh Dickins, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vitaly Wool, Minchan Kim, Seth Jennings, Dan Streetman


Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Nov 7, 2023 at 11:49 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 11/8/23 02:28, Sergey Senozhatsky wrote:
>> > On (23/11/07 15:08), Ankur Arora wrote:
>> > [..]
>> >> +++ b/mm/zsmalloc.c
>> >> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>> >>                      dst_zspage = NULL;
>> >>
>> >>                      spin_unlock(&pool->lock);
>> >> -                    cond_resched();
>> >>                      spin_lock(&pool->lock);
>> >>              }
>> >>      }
>> >
>> > I'd personally prefer to have a comment explaining why we do that
>> > spin_unlock/spin_lock sequence, which may look confusing to people.
>>
>> Wonder if it would make sense to have a lock operation that does the
>> unlock/lock as a self-documenting thing, and maybe could also be optimized
>> to first check if there's actually a need for it (because TIF_NEED_RESCHED
>> or lock is contended).
>
> +1, I was going to suggest this as well. It can be extended to other
> locking types that disable preemption as well like RCU. Something like
> spin_lock_relax() or something.

Good point. We actually do have exactly that: cond_resched_lock(). (And
similar RW lock variants.)

>> > Maybe would make sense to put a nice comment in all similar cases.
>> > For instance:
>> >
>> >       rcu_read_unlock();
>> >  -    cond_resched();
>> >       rcu_read_lock();

And we have this construct as well: cond_resched_rcu().
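
So, for instance, the zsmalloc hunk above could simply become:

	/* drops and re-takes pool->lock if rescheduling is needed or it is contended */
	cond_resched_lock(&pool->lock);

and the RCU case:

	cond_resched_rcu();	/* unlock/cond_resched/lock, where the config needs it */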

I can switch to the alternatives when I send out the next version of
the series.

Thanks

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 34/86] thread_info: accessors for TIF_NEED_RESCHED*
  2023-11-07 21:57 ` [RFC PATCH 34/86] thread_info: accessors for TIF_NEED_RESCHED* Ankur Arora
@ 2023-11-08  8:58   ` Peter Zijlstra
  2023-11-21  5:59     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  8:58 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:20PM -0800, Ankur Arora wrote:
> Add tif_resched() which will be used as an accessor for TIF_NEED_RESCHED
> and TIF_NEED_RESCHED_LAZY. The intent is to force the caller to make an
> explicit choice of how eagerly they want a reschedule.
> 
> This interface will be used almost entirely from core kernel code, so
> forcing a choice shouldn't be too onerous.
> 
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

> ---
>  include/linux/thread_info.h | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 9ea0b28068f4..4eb22b13bf64 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -59,6 +59,27 @@ enum syscall_work_bit {
>  
>  #include <asm/thread_info.h>
>  
> +#ifndef TIF_NEED_RESCHED_LAZY
> +#error "Arch needs to define TIF_NEED_RESCHED_LAZY"
> +#endif
> +
> +#define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
> +
> +typedef enum {
> +	RESCHED_eager = 0,
> +	RESCHED_lazy = TIF_NEED_RESCHED_LAZY_OFFSET,
> +} resched_t;
> +
> +static inline int tif_resched(resched_t r)
> +{
> +	return TIF_NEED_RESCHED + r;
> +}
> +
> +static inline int _tif_resched(resched_t r)
> +{
> +	return 1 << tif_resched(r);
> +}

So either I'm confused or I'm thinking this is wrong. If you want to
preempt eagerly you want to preempt more than when you're not eager to
preempt, right?

So an eager preemption site wants to include the LAZY bit.

Whereas a site that wants to lazily preempt would prefer to not preempt
until forced, and hence would not include LAZY bit.



^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 35/86] thread_info: change to tif_need_resched(resched_t)
  2023-11-07 21:57 ` [RFC PATCH 35/86] thread_info: change to tif_need_resched(resched_t) Ankur Arora
@ 2023-11-08  9:00   ` Peter Zijlstra
  0 siblings, 0 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:00 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:21PM -0800, Ankur Arora wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 95d47783ff6e..5f0d7341cb88 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2172,9 +2172,11 @@ static inline int rwlock_needbreak(rwlock_t *lock)
>  
>  static __always_inline bool need_resched(void)
>  {
> -	return unlikely(tif_need_resched());
> +	return unlikely(tif_need_resched(RESCHED_eager) ||
> +			tif_need_resched(RESCHED_lazy));
>  }
>  
> +

We really needed this extra blank line, yes? :-)

>  /*
>   * Wrappers for p->thread_info->cpu access. No-op on UP.
>   */
> diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
> index 478084f9105e..719416fe8ddc 100644
> --- a/include/linux/sched/idle.h
> +++ b/include/linux/sched/idle.h
> @@ -63,7 +63,7 @@ static __always_inline bool __must_check current_set_polling_and_test(void)
>  	 */
>  	smp_mb__after_atomic();
>  
> -	return unlikely(tif_need_resched());
> +	return unlikely(need_resched());
>  }

You're stacking unlikely's, need_resched() already has unlikely.


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 36/86] entry: irqentry_exit only preempts TIF_NEED_RESCHED
  2023-11-07 21:57 ` [RFC PATCH 36/86] entry: irqentry_exit only preempts TIF_NEED_RESCHED Ankur Arora
@ 2023-11-08  9:01   ` Peter Zijlstra
  2023-11-21  6:00     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:22PM -0800, Ankur Arora wrote:
> The scheduling policy for RESCHED_lazy (TIF_NEED_RESCHED_LAZY) is
> to let anything running in the kernel run to completion.
> Accordingly, while deciding whether to call preempt_schedule_irq()
> narrow the check to tif_need_resched(RESCHED_eager).
> 
> Also add a comment about why we need to check at all, given that we
> have aleady checked the preempt_count().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/entry/common.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 0d055c39690b..6433e6c77185 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -384,7 +384,15 @@ void irqentry_exit_cond_resched(void)
>  		rcu_irq_exit_check_preempt();
>  		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>  			WARN_ON_ONCE(!on_thread_stack());
> -		if (need_resched())
> +
> +		/*
> +		 * If the scheduler really wants us to preempt while returning
> +		 * to kernel, it would set TIF_NEED_RESCHED.
> +		 * On some archs the flag gets folded in preempt_count, and
> +		 * thus would be covered in the conditional above, but not all
> +		 * archs do that, so check explicitly.
> +		 */
> +		if (tif_need_resched(RESCHED_eager))
>  			preempt_schedule_irq();

See, I'm reading this as if we're eager to preempt, but then it's not
actually eager at all and only wants to preempt when forced.

This naming sucks...

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 37/86] sched: make test_*_tsk_thread_flag() return bool
  2023-11-07 21:57 ` [RFC PATCH 37/86] sched: make test_*_tsk_thread_flag() return bool Ankur Arora
@ 2023-11-08  9:02   ` Peter Zijlstra
  0 siblings, 0 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:02 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:23PM -0800, Ankur Arora wrote:
> All users of test_*_tsk_thread_flag() treat the result value
> as boolean. This is also true for the underlying test_and_*_bit()
> operations.
> 
> Change the return type to bool.
> 
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

You're sending 86 patches, I'm thinking you should do everything humanly
possible to reduce this patch count, so perhaps keep these in a separate
series.

This is irrelevant to the issue at hand. So send it as a separate
cleanup or whatever.

> ---
>  include/linux/sched.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5f0d7341cb88..12d0626601a0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2045,17 +2045,17 @@ static inline void update_tsk_thread_flag(struct task_struct *tsk, int flag,
>  	update_ti_thread_flag(task_thread_info(tsk), flag, value);
>  }
>  
> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>  {
>  	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
>  }
>  
> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>  {
>  	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
>  }
>  
> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
>  {
>  	return test_ti_thread_flag(task_thread_info(tsk), flag);
>  }
> @@ -2070,7 +2070,7 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
>  	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>  }
>  
> -static inline int test_tsk_need_resched(struct task_struct *tsk)
> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
>  {
>  	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
>  }
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 38/86] sched: *_tsk_need_resched() now takes resched_t
  2023-11-07 21:57 ` [RFC PATCH 38/86] sched: *_tsk_need_resched() now takes resched_t Ankur Arora
@ 2023-11-08  9:03   ` Peter Zijlstra
  0 siblings, 0 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:03 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:24PM -0800, Ankur Arora wrote:
> *_tsk_need_resched() need to test for the specific need-resched
> flag.
> 
> The only users are RCU and the scheduler. For RCU we always want
> to schedule at the earliest opportunity and that is always
> RESCHED_eager.

Why ?

> 
> For the scheduler, keep everything as RESCHED_eager for now.

Why ?

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()"
  2023-11-08  7:47   ` Greg KH
@ 2023-11-08  9:09     ` Ankur Arora
  2023-11-08 10:00       ` Greg KH
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  9:09 UTC (permalink / raw)
  To: Greg KH
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Greg KH <gregkh@linuxfoundation.org> writes:

> On Tue, Nov 07, 2023 at 01:56:52PM -0800, Ankur Arora wrote:
>> This reverts commit 0a70045ed8516dfcff4b5728557e1ef3fd017c53.
>>
>
> None of these reverts say "why" the revert is needed, or why you even
> want to do this at all.  Reverting a compilation error feels like you
> are going to be adding a compilation error to the build, which is
> generally considered a bad thing :(

Yeah, one of the many issues with this string of reverts.

I was concerned about repeating the same thing over and over enough
that I just put my explanation at the bottom of the cover-letter and
nowhere else.

The reasoning was this:

The PREEMPT_DYNAMIC code uses the static_calls to dynamically
switch between voluntary and full preemption.
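
(Concretely, that's roughly this -- from memory, so double-check:

	/* kernel/sched/core.c */
	DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);

	/* include/linux/sched.h */
	#define cond_resched()	(static_call_mod(cond_resched)())

with sched_dynamic_update() repointing cond_resched() at either
__cond_resched or a return-0 stub depending on the chosen model.)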

Thomas had outlined an approach (see https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/)
(which this series implements) where instead of depending on
cond_resched(), a none/voluntary/full preemption model could be enforced
by the scheduler. And, this could be done without needing the cond_resched()
preemption points. And, thus also wouldn't need the PREEMPT_DYNAMIC logic.

But, as Steven Rostedt pointed out to me, reverting this code was
all wrong. Since there's nothing wrong with the logic, it makes sense
to just extract out the bits that are incompatible instead of reverting
functioning code.

Will do that when I send out the next version.

Thanks

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 39/86] sched: handle lazy resched in set_nr_*_polling()
  2023-11-07 21:57 ` [RFC PATCH 39/86] sched: handle lazy resched in set_nr_*_polling() Ankur Arora
@ 2023-11-08  9:15   ` Peter Zijlstra
  0 siblings, 0 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:15 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:25PM -0800, Ankur Arora wrote:
> To trigger a reschedule on a target runqueue a few things need
> to happen first:
> 
>   1. set_tsk_need_resched(target_rq->curr, RESCHED_eager)
>   2. ensure that the target CPU sees the need-resched bit
>   3. preempt_fold_need_resched()
> 
> Most of this is done via some combination of: resched_curr(),
> set_nr_if_polling(), and set_nr_and_not_polling().
> 
> Update the last two to also handle TIF_NEED_RESCHED_LAZY.
> 
> One thing to note is that TIF_NEED_RESCHED_LAZY has run to completion
> semantics, so unlike TIF_NEED_RESCHED, we don't need to ensure that
> the caller sees it, and of course there is no preempt folding.
> 
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/sched/core.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e2215c417323..01df5ac2982c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -892,14 +892,15 @@ static inline void hrtick_rq_init(struct rq *rq)
>  
>  #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
>  /*
> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
>   * this avoids any races wrt polling state changes and thereby avoids
>   * spurious IPIs.
>   */
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
>  {
>  	struct thread_info *ti = task_thread_info(p);
> -	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> +
> +	return !(fetch_or(&ti->flags, _tif_resched(rs)) & _TIF_POLLING_NRFLAG);
>  }

Argh, this is making the whole thing even worse, because now you're
using that eager naming for setting which has the exact opposite meaning
from testing.

> @@ -916,7 +917,7 @@ static bool set_nr_if_polling(struct task_struct *p)
>  	for (;;) {
>  		if (!(val & _TIF_POLLING_NRFLAG))
>  			return false;
> -		if (val & _TIF_NEED_RESCHED)
> +		if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>  			return true;
>  		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
>  			break;

Depending on the exact semantics of LAZY this could be wrong, the
Changelog doesn't clarify.

Changing this in a different patch from resched_curr() makes it
impossible to review :/


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 71/86] treewide: lib: " Ankur Arora
@ 2023-11-08  9:15     ` Herbert Xu
  2023-11-08 15:08       ` Steven Rostedt
  2023-11-08 19:15     ` Kees Cook
  1 sibling, 1 reply; 250+ messages in thread
From: Herbert Xu @ 2023-11-08  9:15 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, David S. Miller, Kees Cook, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Tue, Nov 07, 2023 at 03:08:07PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> Almost all the cond_resched() calls are from set-1. Remove them.
> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net> 
> Cc: Kees Cook <keescook@chromium.org> 
> Cc: Eric Dumazet <edumazet@google.com> 
> Cc: Jakub Kicinski <kuba@kernel.org> 
> Cc: Paolo Abeni <pabeni@redhat.com> 
> Cc: Thomas Graf <tgraf@suug.ch>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  lib/crc32test.c          |  2 --
>  lib/crypto/mpi/mpi-pow.c |  1 -
>  lib/memcpy_kunit.c       |  5 -----
>  lib/random32.c           |  1 -
>  lib/rhashtable.c         |  2 --
>  lib/test_bpf.c           |  3 ---
>  lib/test_lockup.c        |  2 +-
>  lib/test_maple_tree.c    |  8 --------
>  lib/test_rhashtable.c    | 10 ----------
>  9 files changed, 1 insertion(+), 33 deletions(-)

Nack.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 40/86] context_tracking: add ct_state_cpu()
  2023-11-07 21:57 ` [RFC PATCH 40/86] context_tracking: add ct_state_cpu() Ankur Arora
@ 2023-11-08  9:16   ` Peter Zijlstra
  2023-11-21  6:32     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:16 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:26PM -0800, Ankur Arora wrote:
> While making up its mind about whether to reschedule a target
> runqueue eagerly or lazily, resched_curr() needs to know if the
> target is executing in the kernel or in userspace.
> 
> Add ct_state_cpu().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> 
> ---
> Using context-tracking for this seems like overkill. Is there a better
> way to achieve this? One problem with depending on user_enter() is that
> it happens much too late for our purposes. From the scheduler's
> point-of-view the exit state has effectively transitioned once the
> task exits exit_to_user_mode_loop(), so we will see stale state
> while the task is done with exit_to_user_mode_loop() but has not yet
> executed user_enter().
> 
> ---
>  include/linux/context_tracking_state.h | 21 +++++++++++++++++++++
>  kernel/Kconfig.preempt                 |  1 +
>  2 files changed, 22 insertions(+)
> 
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index bbff5f7f8803..6a8f1c7ba105 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -53,6 +53,13 @@ static __always_inline int __ct_state(void)
>  {
>  	return raw_atomic_read(this_cpu_ptr(&context_tracking.state)) & CT_STATE_MASK;
>  }
> +
> +static __always_inline int __ct_state_cpu(int cpu)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +
> +	return atomic_read(&ct->state) & CT_STATE_MASK;
> +}
>  #endif
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> @@ -139,6 +146,20 @@ static __always_inline int ct_state(void)
>  	return ret;
>  }
>  
> +static __always_inline int ct_state_cpu(int cpu)
> +{
> +	int ret;
> +
> +	if (!context_tracking_enabled_cpu(cpu))
> +		return CONTEXT_DISABLED;
> +
> +	preempt_disable();
> +	ret = __ct_state_cpu(cpu);
> +	preempt_enable();
> +
> +	return ret;
> +}

Those preempt_disable/enable are pointless.

But this patch is problematic, you do *NOT* want to rely on context
tracking. Context tracking adds atomics to the entry path, this is slow
and even with CONFIG_CONTEXT_TRACKING it is disabled until you configure
the NOHZ_FULL nonsense.

This simply cannot be.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 46/86] tracing: handle lazy resched
  2023-11-08  0:19   ` Steven Rostedt
@ 2023-11-08  9:24     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08  9:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, glaubitz,
	anton.ivanov, krypton, David.Laight, richard, mjguzik,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, linux-alpha,
	Geert Uytterhoeven, linux-m68k, Dinh Nguyen


Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue,  7 Nov 2023 13:57:32 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Tracing support.
>>
>> Note: this is quite incomplete.
>
> What's not complete? The removal of the IRQS_NOSUPPORT?
>
> Really, that's only for alpha, m68k and nios2. I think setting 'X' is not
> needed anymore, and we can use that bit for this, and for those archs, have
> 0 for interrupts disabled.

Right, that makes sense. I wasn't worried about the IRQS_NOSUPPORT.
I think I just misread the code and thought that some other tracers might
need separate support as well.

Will fix the commit message.

Thanks
Ankur

>>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  include/linux/trace_events.h |  6 +++---
>>  kernel/trace/trace.c         |  2 ++
>>  kernel/trace/trace_output.c  | 16 ++++++++++++++--
>>  3 files changed, 19 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
>> index 21ae37e49319..355d25d5e398 100644
>> --- a/include/linux/trace_events.h
>> +++ b/include/linux/trace_events.h
>> @@ -178,7 +178,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status);
>>
>>  enum trace_flag_type {
>>  	TRACE_FLAG_IRQS_OFF		= 0x01,
>> -	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
>> +	TRACE_FLAG_NEED_RESCHED_LAZY    = 0x02,
>>  	TRACE_FLAG_NEED_RESCHED		= 0x04,
>>  	TRACE_FLAG_HARDIRQ		= 0x08,
>>  	TRACE_FLAG_SOFTIRQ		= 0x10,
>> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_ctx(void)
>>
>>  static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
>>  {
>> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> +	return tracing_gen_ctx_irq_test(0);
>>  }
>>  static inline unsigned int tracing_gen_ctx(void)
>>  {
>> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> +	return tracing_gen_ctx_irq_test(0);
>>  }
>>  #endif
>>
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index 7f067ad9cf50..0776dba32c2d 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
>>
>>  	if (tif_need_resched(RESCHED_eager))
>>  		trace_flags |= TRACE_FLAG_NEED_RESCHED;
>> +	if (tif_need_resched(RESCHED_lazy))
>> +		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
>>  	if (test_preempt_need_resched())
>>  		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
>>  	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
>> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
>> index db575094c498..c251a44ad8ac 100644
>> --- a/kernel/trace/trace_output.c
>> +++ b/kernel/trace/trace_output.c
>> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
>>  		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
>>  		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
>>  		bh_off ? 'b' :
>> -		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
>> +		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
>>  		'.';
>>
>> -	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
>> +	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
>>  				TRACE_FLAG_PREEMPT_RESCHED)) {
>> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> +		need_resched = 'B';
>> +		break;
>>  	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
>>  		need_resched = 'N';
>>  		break;
>> +	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> +		need_resched = 'L';
>> +		break;
>> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
>> +		need_resched = 'b';
>> +		break;
>>  	case TRACE_FLAG_NEED_RESCHED:
>>  		need_resched = 'n';
>>  		break;
>> +	case TRACE_FLAG_NEED_RESCHED_LAZY:
>> +		need_resched = 'l';
>> +		break;
>>  	case TRACE_FLAG_PREEMPT_RESCHED:
>>  		need_resched = 'p';
>>  		break;


--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 41/86] sched: handle resched policy in resched_curr()
  2023-11-07 21:57 ` [RFC PATCH 41/86] sched: handle resched policy in resched_curr() Ankur Arora
@ 2023-11-08  9:36   ` Peter Zijlstra
  2023-11-08 10:26     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:36 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>  }
>  
>  /*
> - * resched_curr - mark rq's current task 'to be rescheduled now'.
> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>   *
> - * On UP this means the setting of the need_resched flag, on SMP it
> - * might also involve a cross-CPU call to trigger the scheduler on
> - * the target CPU.
> + * On UP this means the setting of the need_resched flag, on SMP, for
> + * eager resched it might also involve a cross-CPU call to trigger
> + * the scheduler on the target CPU.
>   */
> -void resched_curr(struct rq *rq)
> +void __resched_curr(struct rq *rq, resched_t rs)
>  {
>  	struct task_struct *curr = rq->curr;
>  	int cpu;
> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>  	cpu = cpu_of(rq);
>  
>  	if (cpu == smp_processor_id()) {
> -		set_tsk_need_resched(curr, RESCHED_eager);
> -		set_preempt_need_resched();
> +		set_tsk_need_resched(curr, rs);
> +		if (rs == RESCHED_eager)
> +			set_preempt_need_resched();
>  		return;
>  	}
>  
> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
> -		smp_send_reschedule(cpu);
> -	else
> +	if (set_nr_and_not_polling(curr, rs)) {
> +		if (rs == RESCHED_eager)
> +			smp_send_reschedule(cpu);

I think you just broke things.

Not all idle threads have POLLING support, in which case you need that
IPI to wake them up, even if it's LAZY.

> +	} else if (rs == RESCHED_eager)
>  		trace_sched_wake_idle_without_ipi(cpu);
>  }



>  
> +/*
> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
> + * or lazily according to the current policy.
> + *
> + * Always schedule eagerly, if:
> + *
> + *  - running under full preemption
> + *
> + *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
> + *    force TIF_NEED_RESCHED to be set and send a resched IPI.
> + *    (the polling case has already set TIF_NEED_RESCHED via
> + *     set_nr_if_polling()).
> + *
> + *  - in userspace: run to completion semantics are only for kernel tasks
> + *
> + * Otherwise (regardless of priority), run to completion.
> + */
> +void resched_curr(struct rq *rq)
> +{
> +	resched_t rs = RESCHED_lazy;
> +	int context;
> +
> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
> +	    (rq->curr->sched_class == &idle_sched_class)) {
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}
> +
> +	/*
> +	 * We might race with the target CPU while checking its ct_state:
> +	 *
> +	 * 1. The task might have just entered the kernel, but has not yet
> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
> +	 * send an unnecessary resched-IPI.
> +	 *
> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
> +	 * not yet called user_enter().
> +	 *
> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
> +	 * schedule it lazily. There's obviously nothing that will handle
> +	 * this need-resched bit until the thread enters the kernel next.
> +	 *
> +	 * The scheduler will still do tick accounting, but a potentially
> +	 * higher priority task waited to be scheduled for a user tick,
> +	 * instead of execution time in the kernel.
> +	 */
> +	context = ct_state_cpu(cpu_of(rq));
> +	if ((context == CONTEXT_USER) ||
> +	    (context == CONTEXT_GUEST)) {
> +
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}

Like said, this simply cannot be. You must not rely on the remote CPU
being in some state or not. Also, it's racy, you could observe USER and
then it enters KERNEL.

> +
> +resched:
> +	__resched_curr(rq, rs);
> +}
> +
>  void resched_cpu(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);

^ permalink raw reply	[flat|nested] 250+ messages in thread

* RE: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (60 preceding siblings ...)
  2023-11-08  8:51 ` Peter Zijlstra
@ 2023-11-08  9:43 ` David Laight
  2023-11-08 15:15   ` Steven Rostedt
  2023-11-08 16:33 ` Mark Rutland
  62 siblings, 1 reply; 250+ messages in thread
From: David Laight @ 2023-11-08  9:43 UTC (permalink / raw)
  To: 'Ankur Arora', linux-kernel
  Cc: tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, richard, mjguzik

From: Ankur Arora
> Sent: 07 November 2023 21:57
...
> There are four main sets of preemption points in the kernel:
> 
>  1. return to user
>  2. explicit preemption points (cond_resched() and its ilk)
>  3. return to kernel (tick/IPI/irq at irqexit)
>  4. end of non-preemptible sections at (preempt_count() == preempt_offset)
> 
...
> Policies:
> 
> A - preemption=none: run to completion
> B - preemption=voluntary: run to completion, unless a task of higher
>     sched-class awaits
> C - preemption=full: optimized for low-latency. Preempt whenever a higher
>     priority task awaits.

If you remove cond_resched() then won't both B and C require an extra IPI?
That is probably OK for RT tasks but could get expensive for
normal tasks that aren't bound to a specific cpu.

I suspect C could also lead to tasks being pre-empted just before
they sleep (eg after waking another task).
There might already be mitigation for that, I'm not sure if
a voluntary sleep can be done in a non-pre-emptible section.
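
(As a sketch of the pattern in question, with placeholder names; the point is
that under preempt=full the waker can be preempted by the task it just woke,
immediately before it was going to sleep anyway:)

        wake_up(&waitq);                /* may set need_resched on this CPU */
        /* preemption can land here under preempt=full ...                  */
        set_current_state(TASK_INTERRUPTIBLE);
        schedule();                     /* ... even though we were about to
                                           give up the CPU regardless */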

Certainly it should all help the scheduling of RT tasks - which
can currently get delayed by a non-RT task in a slow kernel path.
Although the worst one is the softint code...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
  2023-11-08  8:29     ` Ankur Arora
@ 2023-11-08  9:49       ` Julia Lawall
  0 siblings, 0 replies; 250+ messages in thread
From: Julia Lawall @ 2023-11-08  9:49 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Julia Lawall, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Nicolas Palix



On Wed, 8 Nov 2023, Ankur Arora wrote:

>
> Julia Lawall <julia.lawall@inria.fr> writes:
>
> > On Tue, 7 Nov 2023, Ankur Arora wrote:
> >
> >> Rudimentary script to remove the straight-forward subset of
> >> cond_resched() and allies:
> >>
> >> 1)  if (need_resched())
> >> 	  cond_resched()
> >>
> >> 2)  expression*;
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> 3)  if (expression)
> >> 	statement
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> The last two patterns depend on the control flow level to ensure
> >> that the complex cond_resched() patterns (ex. conditioned ones)
> >> are left alone and we only pick up ones which are only minimally
> >> related to the neighbouring code.
> >>
> >> Cc: Julia Lawall <Julia.Lawall@inria.fr>
> >> Cc: Nicolas Palix <nicolas.palix@imag.fr>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
> >>  1 file changed, 53 insertions(+)
> >>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
> >>
> >> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
> >> new file mode 100644
> >> index 000000000000..bf43768a8f8c
> >> --- /dev/null
> >> +++ b/scripts/coccinelle/api/cond_resched.cocci
> >> @@ -0,0 +1,53 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/// Remove naked cond_resched() statements
> >> +///
> >> +//# Remove cond_resched() statements when:
> >> +//#   - executing at the same control flow level as the previous or the
> >> +//#     next statement (this lets us avoid complicated conditionals in
> >> +//#     the neighbourhood.)
> >> +//#   - they are of the form "if (need_resched()) cond_resched()" which
> >> +//#     is always safe.
> >> +//#
> >> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
> >> +//# but might need to handle other comments alluding to rescheduling.
> >> +//#
> >> +virtual patch
> >> +virtual context
> >> +
> >> +@ r1 @
> >> +identifier r;
> >> +@@
> >> +
> >> +(
> >> + r = cond_resched();
> >> +|
> >> +-if (need_resched())
> >> +-	cond_resched();
> >> +)
> >
> > This rule doesn't make sense.  The first branch of the disjunction will
> > never match a place where the second branch matches.  Anyway, in the
> > second branch there is no assignment, so I don't see what the first branch
> > is protecting against.
> >
> > The disjunction is just useless.  Whether it is there or whether only
> > the second branch is there doesn't have any impact on the result.
> >
> >> +
> >> +@ r2 @
> >> +expression E;
> >> +statement S,T;
> >> +@@
> >> +(
> >> + E;
> >> +|
> >> + if (E) S
> >
> > This case is not needed.  It will be matched by the next case.
> >
> >> +|
> >> + if (E) S else T
> >> +|
> >> +)
> >> +-cond_resched();
> >> +
> >> +@ r3 @
> >> +expression E;
> >> +statement S,T;
> >> +@@
> >> +-cond_resched();
> >> +(
> >> + E;
> >> +|
> >> + if (E) S
> >
> > As above.
> >
> >> +|
> >> + if (E) S else T
> >> +)
> >
> > I have the impression that you are trying to retain some cond_rescheds.
> > Could you send an example of one that you are trying to keep?  Overall,
> > the above rules seem a bit ad hoc.  You may be keeping some cases you
> > don't want to, or removing some cases that you want to keep.
>
> Right. I was trying to ensure that the script only handled the cases
> that didn't have any "interesting" connections to the surrounding code.
>
> Just to give you an example of the kind of constructs that I wanted
> to avoid:
>
> mm/memory.c::zap_pmd_range():
>
>                 if (addr != next)
>                         pmd--;
>         } while (pmd++, cond_resched(), addr != end);
>
> mm/backing-dev.c::cleanup_offline_cgwbs_workfn()
>
>                 while (cleanup_offline_cgwb(wb))
>                         cond_resched();
>
> But from a quick check the simplest coccinelle script does a much
> better job than my overly complex (and incorrect) one:
>
> @r1@
> @@
> -       cond_resched();
>
> It avoids the first one. And transforms the second to:
>
>                 while (cleanup_offline_cgwb(wb))
>                         {}
>
> which is exactly what I wanted.

Perfect!

It could be good to run both scripts and compare the results.

julia

>
> > Of course, if you are confident that the job is done with this semantic
> > patch as it is, then that's fine too.
>
> Not at all. Thanks for pointing out the mistakes.
>
>
>
> --
> ankur
>

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  8:51 ` Peter Zijlstra
@ 2023-11-08  9:53   ` Daniel Bristot de Oliveira
  2023-11-08 10:04   ` Ankur Arora
  1 sibling, 0 replies; 250+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-08  9:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, mathieu.desnoyers,
	geert, glaubitz, anton.ivanov, mattst88, krypton, rostedt,
	David.Laight, richard, mjguzik

On 11/8/23 09:51, Peter Zijlstra wrote:
> On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> Hi,
>>
>> We have two models of preemption: voluntary and full 
> 3, also none (RT is not actually a preemption model).
> 
>> (and RT which is
>> a fuller form of full preemption.)
> It is not in fact a different preemption model, it is the same full
> preemption, the difference with RT is that it makes a lot more stuff
> preemptible, but the fundamental preemption model is the same -- full.

+1

-- Daniel

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 42/86] sched: force preemption on tick expiration
  2023-11-07 21:57 ` [RFC PATCH 42/86] sched: force preemption on tick expiration Ankur Arora
@ 2023-11-08  9:56   ` Peter Zijlstra
  2023-11-21  6:44     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:56 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:28PM -0800, Ankur Arora wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4d86c618ffa2..fe7e5e9b2207 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1016,8 +1016,11 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>   * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>   * this is probably good enough.
>   */
> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static void update_deadline(struct cfs_rq *cfs_rq,
> +			    struct sched_entity *se, bool tick)
>  {
> +	struct rq *rq = rq_of(cfs_rq);
> +
>  	if ((s64)(se->vruntime - se->deadline) < 0)
>  		return;
>  
> @@ -1033,13 +1036,19 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	 */
>  	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
>  
> +	if (cfs_rq->nr_running < 2)
> +		return;
> +
>  	/*
> -	 * The task has consumed its request, reschedule.
> +	 * The task has consumed its request, reschedule; eagerly
> +	 * if it ignored our last lazy reschedule.
>  	 */
> -	if (cfs_rq->nr_running > 1) {
> -		resched_curr(rq_of(cfs_rq));
> -		clear_buddies(cfs_rq, se);
> -	}
> +	if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
> +		__resched_curr(rq, RESCHED_eager);
> +	else
> +		resched_curr(rq);
> +
> +	clear_buddies(cfs_rq, se);
>  }
>  
>  #include "pelt.h"
> @@ -1147,7 +1156,7 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  /*
>   * Update the current task's runtime statistics.
>   */
> -static void update_curr(struct cfs_rq *cfs_rq)
> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
>  {
>  	struct sched_entity *curr = cfs_rq->curr;
>  	u64 now = rq_clock_task(rq_of(cfs_rq));
> @@ -1174,7 +1183,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  	schedstat_add(cfs_rq->exec_clock, delta_exec);
>  
>  	curr->vruntime += calc_delta_fair(delta_exec, curr);
> -	update_deadline(cfs_rq, curr);
> +	update_deadline(cfs_rq, curr, tick);
>  	update_min_vruntime(cfs_rq);
>  
>  	if (entity_is_task(curr)) {
> @@ -1188,6 +1197,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
>  
> +static void update_curr(struct cfs_rq *cfs_rq)
> +{
> +	__update_curr(cfs_rq, false);
> +}
> +
>  static void update_curr_fair(struct rq *rq)
>  {
>  	update_curr(cfs_rq_of(&rq->curr->se));
> @@ -5309,7 +5323,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	/*
>  	 * Update run-time statistics of the 'current'.
>  	 */
> -	update_curr(cfs_rq);
> +	__update_curr(cfs_rq, true);
>  
>  	/*
>  	 * Ensure that runnable average is periodically updated.

I'm thinking this will be less of a mess if you flip it around some.

(ignore the hrtick mess, I'll try and get that cleaned up)

This way you have two distinct sites to handle the preemption. The
update_curr() would be 'FULL ? force : lazy' while the tick one gets the
special magic bits.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..5399696de9e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,10 +1016,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
  * this is probably good enough.
  */
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if ((s64)(se->vruntime - se->deadline) < 0)
-		return;
+		return false;
 
 	/*
 	 * For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1037,9 +1037,11 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 * The task has consumed its request, reschedule.
 	 */
 	if (cfs_rq->nr_running > 1) {
-		resched_curr(rq_of(cfs_rq));
 		clear_buddies(cfs_rq, se);
+		return true;
 	}
+
+	return false;
 }
 
 #include "pelt.h"
@@ -1147,18 +1149,19 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 /*
  * Update the current task's runtime statistics.
  */
-static void update_curr(struct cfs_rq *cfs_rq)
+static bool __update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
 	u64 now = rq_clock_task(rq_of(cfs_rq));
 	u64 delta_exec;
+	bool ret;
 
 	if (unlikely(!curr))
-		return;
+		return false;
 
 	delta_exec = now - curr->exec_start;
 	if (unlikely((s64)delta_exec <= 0))
-		return;
+		return false;
 
 	curr->exec_start = now;
 
@@ -1174,7 +1177,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
-	update_deadline(cfs_rq, curr);
+	ret = update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -1186,6 +1189,14 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+	return ret;
+}
+
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+	if (__update_curr(cfs_rq))
+		resched_curr(rq_of(cfs_rq));
 }
 
 static void update_curr_fair(struct rq *rq)
@@ -5309,7 +5320,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Update run-time statistics of the 'current'.
 	 */
-	update_curr(cfs_rq);
+	bool resched = __update_curr(cfs_rq);
 
 	/*
 	 * Ensure that runnable average is periodically updated.
@@ -5317,22 +5328,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
 
-#ifdef CONFIG_SCHED_HRTICK
-	/*
-	 * queued ticks are scheduled to match the slice, so don't bother
-	 * validating it and just reschedule.
-	 */
-	if (queued) {
-		resched_curr(rq_of(cfs_rq));
-		return;
-	}
-	/*
-	 * don't let the period tick interfere with the hrtick preemption
-	 */
-	if (!sched_feat(DOUBLE_TICK) &&
-			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
-		return;
-#endif
+	return resched;
 }
 
 
@@ -12387,12 +12383,16 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	bool resched = false;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		entity_tick(cfs_rq, se, queued);
+		resched |= entity_tick(cfs_rq, se, queued);
 	}
 
+	if (resched)
+		resched_curr(rq);
+
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 

^ permalink raw reply related	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 43/86] sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
  2023-11-07 21:57 ` [RFC PATCH 43/86] sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models Ankur Arora
@ 2023-11-08  9:58   ` Peter Zijlstra
  0 siblings, 0 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08  9:58 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:29PM -0800, Ankur Arora wrote:
> The scheduler uses PREEMPT_COUNT and PREEMPTION to drive
> preemption: the first to demarcate non-preemptible sections and
> the second for the actual mechanics of preemption.
> 
> Enable both for voluntary preemption models.
> 
> In addition, define a new scheduler feature FORCE_PREEMPT which
> can now be used to distinguish between voluntary and full
> preemption models at runtime.
> 
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  init/Makefile           |  2 +-
>  kernel/Kconfig.preempt  | 12 ++++++++----
>  kernel/entry/common.c   |  3 +--
>  kernel/sched/core.c     | 26 +++++++++++---------------
>  kernel/sched/features.h |  6 ++++++
>  5 files changed, 27 insertions(+), 22 deletions(-)
> 
> diff --git a/init/Makefile b/init/Makefile
> index 385fd80fa2ef..99e480f24cf3 100644
> --- a/init/Makefile
> +++ b/init/Makefile
> @@ -24,7 +24,7 @@ mounts-$(CONFIG_BLK_DEV_INITRD)	+= do_mounts_initrd.o
>  #
>  
>  smp-flag-$(CONFIG_SMP)			:= SMP
> -preempt-flag-$(CONFIG_PREEMPT)          := PREEMPT
> +preempt-flag-$(CONFIG_PREEMPTION)       := PREEMPT_DYNAMIC
>  preempt-flag-$(CONFIG_PREEMPT_RT)	:= PREEMPT_RT
>  
>  build-version = $(or $(KBUILD_BUILD_VERSION), $(build-version-auto))
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index aa87b5cd3ecc..074fe5e253b5 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -6,20 +6,23 @@ choice
>  
>  config PREEMPT_NONE
>  	bool "No Forced Preemption (Server)"
> +	select PREEMPTION
>  	help
>  	  This is the traditional Linux preemption model, geared towards
>  	  throughput. It will still provide good latencies most of the
> -	  time, but there are no guarantees and occasional longer delays
> -	  are possible.
> +	  time, but occasional delays are possible.
>  
>  	  Select this option if you are building a kernel for a server or
>  	  scientific/computation system, or if you want to maximize the
>  	  raw processing power of the kernel, irrespective of scheduling
> -	  latencies.
> +	  latencies. Unless your architecture actively disables preemption,
> +	  you can always switch to one of the other preemption models
> +	  at runtime.


> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 6433e6c77185..f7f2efabb5b5 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -422,8 +422,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  		}
>  
>  		instrumentation_begin();
> -		if (IS_ENABLED(CONFIG_PREEMPTION))
> -			irqentry_exit_cond_resched();
> +		irqentry_exit_cond_resched();
>  		/* Covers both tracing and lockdep */
>  		trace_hardirqs_on();
>  		instrumentation_end();

I'm totally confused by the PREEMPT_NONE changes here. How does that
make sense?

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()"
  2023-11-08  9:09     ` Ankur Arora
@ 2023-11-08 10:00       ` Greg KH
  0 siblings, 0 replies; 250+ messages in thread
From: Greg KH @ 2023-11-08 10:00 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Wed, Nov 08, 2023 at 01:09:10AM -0800, Ankur Arora wrote:
> 
> Greg KH <gregkh@linuxfoundation.org> writes:
> 
> > On Tue, Nov 07, 2023 at 01:56:52PM -0800, Ankur Arora wrote:
> >> This reverts commit 0a70045ed8516dfcff4b5728557e1ef3fd017c53.
> >>
> >
> > None of these reverts say "why" the revert is needed, or why you even
> > want to do this at all.  Reverting a compilation error feels like you
> > are going to be adding a compilation error to the build, which is
> > generally considered a bad thing :(
> 
> Yeah, one of the many issues with this string of reverts.
> 
> I was concerned about repeating the same thing over and over enough
> that I just put my explanation at the bottom of the cover-letter and
> nowhere else.

cover letters are not in the changelog when patches are committed :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  8:51 ` Peter Zijlstra
  2023-11-08  9:53   ` Daniel Bristot de Oliveira
@ 2023-11-08 10:04   ` Ankur Arora
  2023-11-08 10:13     ` Peter Zijlstra
  1 sibling, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-08 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> Hi,
>>
>> We have two models of preemption: voluntary and full
>
> 3, also none (RT is not actually a preemption model).

I think we are just using the term "preemption model" differently.

I was trying to distinguish between how preemption happens: thus
voluntary (based on voluntary preemption points), and full
(based on preemption state).

>> (and RT which is
>> a fuller form of full preemption.)
>
> It is not in fact a different preemption model, it is the same full
> preemption, the difference with RT is that it makes a lot more stuff
> preemptible, but the fundamental preemption model is the same -- full.
>
>> In this series -- which is based
>> on Thomas' PoC (see [1]), we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> Well, you've also taken out preempt_dynamic for some obscure reason :/

Sorry about that :). I didn't mention it because I was using preemption
model in the sense I describe above. And to my mind preempt_dynamic is
a clever mechanism that switches between other preemption models.

>> Please review.
>
>> Ankur Arora (86):
>>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>>   Revert "sched/core: Make sched_dynamic_mutex static"
>>   Revert "ftrace: Use preemption model accessors for trace header
>>     printout"
>>   Revert "preempt/dynamic: Introduce preemption model accessors"
>>   Revert "kcsan: Use preemption model accessors"
>>   Revert "entry: Fix compile error in
>>     dynamic_irqentry_exit_cond_resched()"
>>   Revert "livepatch,sched: Add livepatch task switching to
>>     cond_resched()"
>>   Revert "arm64: Support PREEMPT_DYNAMIC"
>>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>>   Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
>>     GENERIC_ENTRY"
>>   Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
>>   Revert "sched/preempt: Refactor sched_dynamic_update()"
>>   Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
>>   Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
>>   Revert "preempt: Restore preemption model selection configs"
>>   Revert "sched: Provide Kconfig support for default dynamic preempt
>>     mode"
>>   sched/preempt: remove PREEMPT_DYNAMIC from the build version
>>   Revert "preempt/dynamic: Fix typo in macro conditional statement"
>>   Revert "sched,preempt: Move preempt_dynamic to debug.c"
>>   Revert "static_call: Relax static_call_update() function argument
>>     type"
>>   Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
>>   Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
>>   Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
>>   Revert "sched: Harden PREEMPT_DYNAMIC"
>>   Revert "sched: Add /debug/sched_preempt"
>>   Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
>>     option"
>>   Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
>>     call"
>>   Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
>>     calls"
>>   Revert "preempt/dynamic: Provide cond_resched() and might_resched()
>>     static calls"
>>   Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
>
> NAK
>
> Even if you were to remove PREEMPT_NONE, which should be a separate
> series, but that isn't on the table at all afaict, removing
> preempt_dynamic doesn't make sense.

Agreed. I don't intend to remove PREEMPT_NONE. And, obviously you
do want preempt_dynamic like toggling abilities.

> You still want the preempt= boot time argument and the
> /debug/sched/preempt things to dynamically switch between the models.

Also, yes.

> Please, focus on the voluntary thing, gut that and then replace it with
> the lazy thing, but leave everything else in place.
>
> Re dynamic preempt, gutting the current voluntary preemption model means
> getting rid of the cond_resched and might_resched toggles but you'll
> gain a switch to kill the lazy stuff.

Yes. I think I mostly agree with you.

And, I should have thought this whole revert thing through.
Reverting it wasn't really the plan. The plan was to revert these
patches temporarily, put in the changes you see in this series, and
then pull in the relevant bits of the preempt_dynamic.

Only I decided to push that to later. Sigh.

On your NAK, what about these patches for instance:

>>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>>   Revert "livepatch,sched: Add livepatch task switching to
>>     cond_resched()"
>>   Revert "arm64: Support PREEMPT_DYNAMIC"
>>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"

What's the best way to handle these? With the lazy bit, cond_resched()
and might_resched() are gone. So we don't need all of the static
key infrastructure for toggling etc.

The part of preempt_dynamic that makes sense to me is the one that
switches dynamically between none/voluntary/full. Here it would need
to be wired onto controls of the lazy bit.
(Right now the preemption policy is controlled by sched_feat in
patches 43, and 44 but sched/preempt is a much better interface.)

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 10:04   ` Ankur Arora
@ 2023-11-08 10:13     ` Peter Zijlstra
  2023-11-08 11:00       ` Ankur Arora
  2023-11-08 15:38       ` Thomas Gleixner
  0 siblings, 2 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08 10:13 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:

> >>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
> >>   Revert "livepatch,sched: Add livepatch task switching to
> >>     cond_resched()"
> >>   Revert "arm64: Support PREEMPT_DYNAMIC"
> >>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
> 
> What's the best way to handle these? With the lazy bit, cond_resched()
> and might_resched() are gone. So we don't need all of the static
> key infrastructure for toggling etc.
> 
> The part of preempt_dynamic that makes sense to me is the one that
> switches dynamically between none/voluntary/full. Here it would need
> to be wired onto controls of the lazy bit.
> (Right now the preemption policy is controlled by sched_feat in
> patches 43, and 44 but sched/preempt is a much better interface.)

I'm not understanding, those should stay obviously.

The current preempt_dynamic stuff has 5 toggles:

/*
 * SC:cond_resched
 * SC:might_resched
 * SC:preempt_schedule
 * SC:preempt_schedule_notrace
 * SC:irqentry_exit_cond_resched
 *
 *
 * NONE:
 *   cond_resched               <- __cond_resched
 *   might_resched              <- RET0
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP
 *
 * VOLUNTARY:
 *   cond_resched               <- __cond_resched
 *   might_resched              <- __cond_resched
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP
 *
 * FULL:
 *   cond_resched               <- RET0
 *   might_resched              <- RET0
 *   preempt_schedule           <- preempt_schedule
 *   preempt_schedule_notrace   <- preempt_schedule_notrace
 *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
 */

If you kill voluntary as we know it today, you can remove cond_resched
and might_resched, but the remaining 3 are still needed to switch
between NONE and FULL.

Additionally, you'll get one new state to enable/disable the LAZY stuff.
Neither NONE nor FULL want the LAZY thing on.

You'll then end up with something like:

/*
 * SK:preempt_lazy
 * SC:preempt_schedule
 * SC:preempt_schedule_notrace
 * SC:irqentry_exit_cond_resched
 *
 *
 * NONE:
 *   preempt_lazy		<- OFF
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP
 *
 * VOLUNTARY:
 *   preempt_lazy		<- ON
 *   preempt_schedule           <- preempt_schedule
 *   preempt_schedule_notrace   <- preempt_schedule_notrace
 *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
 *
 * FULL:
 *   preempt_lazy		<- OFF
 *   preempt_schedule           <- preempt_schedule
 *   preempt_schedule_notrace   <- preempt_schedule_notrace
 *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
 */

For the architectures that do not have static_call but instead use
static_key for everything, the SC's are obviously static_key based
wrappers around the function calls -- like now.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 41/86] sched: handle resched policy in resched_curr()
  2023-11-08  9:36   ` Peter Zijlstra
@ 2023-11-08 10:26     ` Ankur Arora
  2023-11-08 10:46       ` Peter Zijlstra
  2023-11-21  6:31       ` Ankur Arora
  0 siblings, 2 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08 10:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>>  }
>>
>>  /*
>> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>>   *
>> - * On UP this means the setting of the need_resched flag, on SMP it
>> - * might also involve a cross-CPU call to trigger the scheduler on
>> - * the target CPU.
>> + * On UP this means the setting of the need_resched flag, on SMP, for
>> + * eager resched it might also involve a cross-CPU call to trigger
>> + * the scheduler on the target CPU.
>>   */
>> -void resched_curr(struct rq *rq)
>> +void __resched_curr(struct rq *rq, resched_t rs)
>>  {
>>  	struct task_struct *curr = rq->curr;
>>  	int cpu;
>> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>>  	cpu = cpu_of(rq);
>>
>>  	if (cpu == smp_processor_id()) {
>> -		set_tsk_need_resched(curr, RESCHED_eager);
>> -		set_preempt_need_resched();
>> +		set_tsk_need_resched(curr, rs);
>> +		if (rs == RESCHED_eager)
>> +			set_preempt_need_resched();
>>  		return;
>>  	}
>>
>> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> -		smp_send_reschedule(cpu);
>> -	else
>> +	if (set_nr_and_not_polling(curr, rs)) {
>> +		if (rs == RESCHED_eager)
>> +			smp_send_reschedule(cpu);
>
> I think you just broke things.
>
> Not all idle threads have POLLING support, in which case you need that
> IPI to wake them up, even if it's LAZY.

Yes, I was concerned about that too. But doesn't this check against the
idle_sched_class in resched_curr() cover that?

>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;

>> +	} else if (rs == RESCHED_eager)
>>  		trace_sched_wake_idle_without_ipi(cpu);
>>  }
>
>
>
>>
>> +/*
>> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
>> + * or lazily according to the current policy.
>> + *
>> + * Always schedule eagerly, if:
>> + *
>> + *  - running under full preemption
>> + *
>> + *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
>> + *    force TIF_NEED_RESCHED to be set and send a resched IPI.
>> + *    (the polling case has already set TIF_NEED_RESCHED via
>> + *     set_nr_if_polling()).
>> + *
>> + *  - in userspace: run to completion semantics are only for kernel tasks
>> + *
>> + * Otherwise (regardless of priority), run to completion.
>> + */
>> +void resched_curr(struct rq *rq)
>> +{
>> +	resched_t rs = RESCHED_lazy;
>> +	int context;
>> +
>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>> +
>> +	/*
>> +	 * We might race with the target CPU while checking its ct_state:
>> +	 *
>> +	 * 1. The task might have just entered the kernel, but has not yet
>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>> +	 * send an unnecessary resched-IPI.
>> +	 *
>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>> +	 * not yet called user_enter().
>> +	 *
>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>> +	 * schedule it lazily. There's obviously nothing that will handle
>> +	 * this need-resched bit until the thread enters the kernel next.
>> +	 *
>> +	 * The scheduler will still do tick accounting, but a potentially
>> +	 * higher priority task waited to be scheduled for a user tick,
>> +	 * instead of execution time in the kernel.
>> +	 */
>> +	context = ct_state_cpu(cpu_of(rq));
>> +	if ((context == CONTEXT_USER) ||
>> +	    (context == CONTEXT_GUEST)) {
>> +
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>
> Like said, this simply cannot be. You must not rely on the remote CPU
> being in some state or not. Also, it's racy, you could observe USER and
> then it enters KERNEL.

Or worse. We might observe KERNEL and it enters USER.

I think we would be fine if we observe USER: it would be upgraded
to RESCHED_eager and we would send an unnecessary IPI.

But if we observe KERNEL and it enters USER, then we will have
set the need-resched-lazy bit which the thread might not see
(it might have left exit_to_user_mode_loop()) until the next
entry to the kernel.

But, yes I would like to avoid the ct_state as well. But
need-resched-lazy only makes sense when the task on the runqueue
is executing in the kernel...

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 41/86] sched: handle resched policy in resched_curr()
  2023-11-08 10:26     ` Ankur Arora
@ 2023-11-08 10:46       ` Peter Zijlstra
  2023-11-21  6:34         ` Ankur Arora
  2023-11-21  6:31       ` Ankur Arora
  1 sibling, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08 10:46 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Wed, Nov 08, 2023 at 02:26:37AM -0800, Ankur Arora wrote:
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
> >
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
> >>  }
> >>
> >>  /*
> >> - * resched_curr - mark rq's current task 'to be rescheduled now'.
> >> + * __resched_curr - mark rq's current task 'to be rescheduled'.
> >>   *
> >> - * On UP this means the setting of the need_resched flag, on SMP it
> >> - * might also involve a cross-CPU call to trigger the scheduler on
> >> - * the target CPU.
> >> + * On UP this means the setting of the need_resched flag, on SMP, for
> >> + * eager resched it might also involve a cross-CPU call to trigger
> >> + * the scheduler on the target CPU.
> >>   */
> >> -void resched_curr(struct rq *rq)
> >> +void __resched_curr(struct rq *rq, resched_t rs)
> >>  {
> >>  	struct task_struct *curr = rq->curr;
> >>  	int cpu;
> >> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
> >>  	cpu = cpu_of(rq);
> >>
> >>  	if (cpu == smp_processor_id()) {
> >> -		set_tsk_need_resched(curr, RESCHED_eager);
> >> -		set_preempt_need_resched();
> >> +		set_tsk_need_resched(curr, rs);
> >> +		if (rs == RESCHED_eager)
> >> +			set_preempt_need_resched();
> >>  		return;
> >>  	}
> >>
> >> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
> >> -		smp_send_reschedule(cpu);
> >> -	else
> >> +	if (set_nr_and_not_polling(curr, rs)) {
> >> +		if (rs == RESCHED_eager)
> >> +			smp_send_reschedule(cpu);
> >
> > I think you just broke things.
> >
> > Not all idle threads have POLLING support, in which case you need that
> > IPI to wake them up, even if it's LAZY.
> 
> Yes, I was concerned about that too. But doesn't this check against the
> idle_sched_class in resched_curr() cover that?

Ah, that's what that was. Hmm, maybe.

I mean, we have idle-injection too, those run as FIFO, but as such,
they can only get preempted from RT/DL, and those will already force
preempt anyway.

The way you've split and structured the code makes it very hard to
follow. Something like:

	if (set_nr_and_not_polling(curr, rs) &&
	    (rs == RESCHED_force || is_idle_task(curr)))
		smp_send_reschedule();

is *far* clearer, no?
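
(For context, the tail of __resched_curr() with that shape would read
something like the sketch below; it keeps the series' RESCHED_eager naming,
adds the cpu argument, and retains the existing trace fallback:)

        if (set_nr_and_not_polling(curr, rs) &&
            (rs == RESCHED_eager || is_idle_task(curr)))
                smp_send_reschedule(cpu);
        else if (rs == RESCHED_eager)
                trace_sched_wake_idle_without_ipi(cpu);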

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 10:13     ` Peter Zijlstra
@ 2023-11-08 11:00       ` Ankur Arora
  2023-11-08 11:14         ` Peter Zijlstra
  2023-11-08 15:38       ` Thomas Gleixner
  1 sibling, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-08 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
>
>> >>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>> >>   Revert "livepatch,sched: Add livepatch task switching to
>> >>     cond_resched()"
>> >>   Revert "arm64: Support PREEMPT_DYNAMIC"
>> >>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>>
>> What's the best way to handle these? With the lazy bit, cond_resched()
>> and might_resched() are gone. So we don't need all of the static
>> key infrastructure for toggling etc.
>>
>> The part of preempt_dynamic that makes sense to me is the one that
>> switches dynamically between none/voluntary/full. Here it would need
>> to be wired onto controls of the lazy bit.
>> (Right now the preemption policy is controlled by sched_feat in
>> patches 43, and 44 but sched/preempt is a much better interface.)
>
> I'm not understanding, those should stay obviously.
>
> The current preempt_dynamic stuff has 5 toggles:
>
> /*
>  * SC:cond_resched
>  * SC:might_resched
>  * SC:preempt_schedule
>  * SC:preempt_schedule_notrace
>  * SC:irqentry_exit_cond_resched
>  *
>  *
>  * NONE:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * VOLUNTARY:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- __cond_resched
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * FULL:
>  *   cond_resched               <- RET0
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  */
>
> If you kill voluntary as we know it today, you can remove cond_resched
> and might_resched, but the remaining 3 are still needed to switch
> between NONE and FULL.

Ah now I see what you are saying.

Quick thought: even if we were running under NONE, eventually you'll
want to forcibly preempt out a CPU hog. So we will need to have
at least this one enabled always:

>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched

These two, it might make sense to toggle them based on model.

>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace

Anyway let me think about this more and respond tomorrow.

For now, time for bed.

Thanks for clarifying btw.

Ankur

> Additionally, you'll get one new state to enable/disable the LAZY stuff.
> Neither NONE nor FULL want the LAZY thing on.
>
> You'll then end up with something like:
>
> /*
>  * SK:preempt_lazy
>  * SC:preempt_schedule
>  * SC:preempt_schedule_notrace
>  * SC:irqentry_exit_cond_resched
>  *
>  *
>  * NONE:
>  *   preempt_lazy		<- OFF
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * VOLUNTARY:
>  *   preempt_lazy		<- ON
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  *
>  * FULL:
>  *   preempt_lazy		<- OFF
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  */
>
> For the architectures that do not have static_call but instead use
> static_key for everything, the SC's are obviously static_key based
> wrappers around the function calls -- like now.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 11:00       ` Ankur Arora
@ 2023-11-08 11:14         ` Peter Zijlstra
  2023-11-08 12:16           ` Peter Zijlstra
  0 siblings, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08 11:14 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Wed, Nov 08, 2023 at 03:00:44AM -0800, Ankur Arora wrote:

> Ah now I see what you are saying.
> 
> Quick thought: even if we were running under NONE, eventually you'll

Well, NONE, you get what you pay for etc..

> want to forcibly preempt out a CPU hog. So we will need to have
> at least this one enabled always:
> 
> >  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> 
> These two, it might make sense to toggle them based on model.
> 
> >  *   preempt_schedule           <- preempt_schedule
> >  *   preempt_schedule_notrace   <- preempt_schedule_notrace
> 
> Anyway let me think about this more and respond tomorrow.

That's more or less killing NONE. At that point there's really no point
in having it.

I would really suggest you start with transforming VOLUNTARY into the
LAZY thing, keep it as simple and narrow as possible.

Once you've got that done, then you can try and argue that NONE makes no
sense and try and take it out.

Smaller step etc..

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-07 21:57 ` [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT Ankur Arora
  2023-11-08  0:27   ` Steven Rostedt
@ 2023-11-08 12:15   ` Julian Anastasov
  1 sibling, 0 replies; 250+ messages in thread
From: Julian Anastasov @ 2023-11-08 12:15 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Simon Horman, Alexei Starovoitov,
	Daniel Borkmann


	Hello,

On Tue, 7 Nov 2023, Ankur Arora wrote:

> With PREEMPTION being always-on, some configurations might prefer
> the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> as compared to PREEMPT_RCU=y.
> 
> So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE, and
> enable PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> 
> Note that the preemption model can be changed at runtime (modulo
> configurations with ARCH_NO_PREEMPT), but the RCU configuration
> is statically compiled.
> 
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> 
> ---
> CC-note: Paul had flagged some code that might be impacted
> with the proposed RCU changes:
> 
> 1. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
>    unchanged, but I must defer to the include/net/ip_vs.h people.

	Yes, IPVS_EST_TICK_CHAINS depends on the rcu_read_unlock()
and rcu_read_lock() calls in cond_resched_rcu(), so just removing
the cond_resched() call there is ok for us. Same for the other
cond_resched() calls in ipvs/.
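
	For reference, this is roughly the shape of the helper being
relied on here (simplified from include/linux/sched.h); the point is
that the rcu_read_unlock()/rcu_read_lock() pair stays even with the
cond_resched() in the middle gone:

static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
	rcu_read_unlock();
	/* the cond_resched() that used to sit here is what goes away */
	rcu_read_lock();
#endif
}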

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 79/86] treewide: net: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 79/86] " Ankur Arora
@ 2023-11-08 12:16     ` Eric Dumazet
  2023-11-08 17:11       ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Eric Dumazet @ 2023-11-08 12:16 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Marek Lindner, Simon Wunderlich,
	Antonio Quartulli, Sven Eckelmann, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Roopa Prabhu, Nikolay Aleksandrov,
	David Ahern, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Willem de Bruijn, Matthieu Baerts,
	Mat Martineau, Marcelo Ricardo Leitner, Xin Long,
	Trond Myklebust, Anna Schumaker, Jon Maloy, Ying Xue,
	Martin Schiller

On Wed, Nov 8, 2023 at 12:09 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> There are broadly three sets of uses of cond_resched():
>
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
>
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
>
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
>
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
>
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.

What about RCU callbacks ? cond_resched() was helping a bit.

>
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given its NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
>
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
>
> Most of the uses here are in set-1 (some right after we give up a
> lock or enable bottom-halves, causing an explicit preemption check.)
>
> We can remove all of them.
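
As a purely illustrative sketch (not the implementation in the series),
a cond_resched_stall() as described above could look like:

static inline void cond_resched_stall_sketch(void)
{
	if (preemptible() && need_resched())
		schedule();	/* scheduling is possible: take it */
	else
		cpu_relax();	/* otherwise just ease off the CPU */
}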

A patch series of 86 is not reasonable.

596 files changed, 881 insertions(+), 2813 deletions(-)

If cond_resched() really becomes a nop (nice!), make it so at the
definition of cond_resched(), and add nice debugging there.

Whoever needs to call a "real" cond_resched() could call a
cond_resched_for_real() (please change the name, this is only to make
a point).

Then let the removal happen whenever each maintainer decides, 6 months
later, without polluting lkml.

Imagine we have to revert this series in a month: how painful that would
be had we removed ~1400 cond_resched() calls all over the place, with
many conflicts.

Thanks

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 11:14         ` Peter Zijlstra
@ 2023-11-08 12:16           ` Peter Zijlstra
  0 siblings, 0 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08 12:16 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, torvalds, paulmck, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Wed, Nov 08, 2023 at 12:14:15PM +0100, Peter Zijlstra wrote:

> I would really suggest you start with transforming VOLUNTARY into the
> LAZY thing, keep it as simple and narrow as possible.

Possibly make it worse first, add LAZY as a fourth option, then show it
makes VOLUNTARY redundant, kill that.

> Once you've got that done, then you can try and argue that NONE makes no
> sense and try and take it out.

This also pushes out having to deal with the !PREEMPT archs until the
very last moment.

And once you're here, cond_resched() should be an obvious no-op function
and you can go delete them.

Anyway, as said, smaller steps more better. Nobody likes 86 patches in
their inbox in the morning (or at any time really).

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
  2023-11-08  8:54           ` Ankur Arora
@ 2023-11-08 12:58             ` Matthew Wilcox
  2023-11-08 14:50               ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Matthew Wilcox @ 2023-11-08 12:58 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Yosry Ahmed, Vlastimil Babka, Sergey Senozhatsky, linux-kernel,
	tglx, peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, SeongJae Park, Mike Kravetz, Muchun Song,
	Andrey Ryabinin, Marco Elver, Catalin Marinas, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Naoya Horiguchi,
	Miaohe Lin, David Hildenbrand, Oscar Salvador, Mike Rapoport,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Hugh Dickins, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vitaly Wool, Minchan Kim,
	Seth Jennings, Dan Streetman

On Wed, Nov 08, 2023 at 12:54:19AM -0800, Ankur Arora wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
> > On Tue, Nov 7, 2023 at 11:49 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> On 11/8/23 02:28, Sergey Senozhatsky wrote:
> >> > I'd personally prefer to have a comment explaining why we do that
> >> > spin_unlock/spin_lock sequence, which may look confusing to people.
> >>
> >> Wonder if it would make sense to have a lock operation that does the
> >> unlock/lock as a self-documenting thing, and maybe could also be optimized
> >> to first check if there's a actually a need for it (because TIF_NEED_RESCHED
> >> or lock is contended).
> >
> > +1, I was going to suggest this as well. It can be extended to other
> > locking types that disable preemption as well like RCU. Something like
> > spin_lock_relax() or something.
> 
> Good point. We actually do have exactly that: cond_resched_lock(). (And
> similar RW lock variants.)

That's a shame; I was going to suggest calling it spin_cycle() ...
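
For reference, a usage sketch of the existing helper mentioned above
(the pool, the object type and the cleanup function are hypothetical):

static void example_drain(struct example_pool *pool)
{
	spin_lock(&pool->lock);
	while (!list_empty(&pool->list)) {
		struct example_obj *obj =
			list_first_entry(&pool->list, struct example_obj, node);

		list_del(&obj->node);
		example_free_one(obj);		/* non-sleeping cleanup */

		/*
		 * Drops and re-acquires pool->lock if a reschedule or
		 * lock contention is pending, instead of open-coding
		 * the unlock/lock dance.
		 */
		cond_resched_lock(&pool->lock);
	}
	spin_unlock(&pool->lock);
}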

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
  2023-11-08 12:58             ` Matthew Wilcox
@ 2023-11-08 14:50               ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 14:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ankur Arora, Yosry Ahmed, Vlastimil Babka, Sergey Senozhatsky,
	linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, David.Laight, richard, mjguzik, SeongJae Park,
	Mike Kravetz, Muchun Song, Andrey Ryabinin, Marco Elver,
	Catalin Marinas, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Naoya Horiguchi, Miaohe Lin, David Hildenbrand,
	Oscar Salvador, Mike Rapoport, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Hugh Dickins, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vitaly Wool, Minchan Kim, Seth Jennings, Dan Streetman

On Wed, 8 Nov 2023 12:58:49 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> > Good point. We actually do have exactly that: cond_resched_lock(). (And
> > similar RW lock variants.)  
> 
> That's a shame; I was going to suggest calling it spin_cycle() ...

  Then wash.. rinse.. repeat!

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-08  9:15     ` Herbert Xu
@ 2023-11-08 15:08       ` Steven Rostedt
  2023-11-09  4:19         ` Herbert Xu
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 15:08 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, David S. Miller, Kees Cook, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Wed, 8 Nov 2023 17:15:11 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> >  lib/crc32test.c          |  2 --
> >  lib/crypto/mpi/mpi-pow.c |  1 -
> >  lib/memcpy_kunit.c       |  5 -----
> >  lib/random32.c           |  1 -
> >  lib/rhashtable.c         |  2 --
> >  lib/test_bpf.c           |  3 ---
> >  lib/test_lockup.c        |  2 +-
> >  lib/test_maple_tree.c    |  8 --------
> >  lib/test_rhashtable.c    | 10 ----------
> >  9 files changed, 1 insertion(+), 33 deletions(-)  
> 
> Nack.

A "Nack" with no commentary is completely useless and borderline offensive.

What is your rationale for the Nack?

The cond_resched() is going away if the patches earlier in the series get
implemented. So either it is removed from your code, or it will become a
nop, just wasting bits in the source tree. Your choice.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08  9:43 ` David Laight
@ 2023-11-08 15:15   ` Steven Rostedt
  2023-11-08 16:29     ` David Laight
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 15:15 UTC (permalink / raw)
  To: David Laight
  Cc: 'Ankur Arora',
	linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, richard, mjguzik

On Wed, 8 Nov 2023 09:43:10 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > Policies:
> > 
> > A - preemption=none: run to completion
> > B - preemption=voluntary: run to completion, unless a task of higher
> >     sched-class awaits
> > C - preemption=full: optimized for low-latency. Preempt whenever a higher
> >     priority task awaits.  
> 
> > If you remove cond_resched() then won't both B and C require an extra IPI?
> That is probably OK for RT tasks but could get expensive for
> normal tasks that aren't bound to a specific cpu.

What IPI is extra?

> 
> I suspect C could also lead to tasks being pre-empted just before
> they sleep (eg after waking another task).
> There might already be mitigation for that, I'm not sure if
> a voluntary sleep can be done in a non-pre-emptible section.

No, voluntary sleep can not be done in a preemptible section.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 10:13     ` Peter Zijlstra
  2023-11-08 11:00       ` Ankur Arora
@ 2023-11-08 15:38       ` Thomas Gleixner
  2023-11-08 16:15         ` Peter Zijlstra
                           ` (2 more replies)
  1 sibling, 3 replies; 250+ messages in thread
From: Thomas Gleixner @ 2023-11-08 15:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ankur Arora
  Cc: linux-kernel, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote:
> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
> I'm not understanding, those should stay obviously.
>
> The current preempt_dynamic stuff has 5 toggles:
>
> /*
>  * SC:cond_resched
>  * SC:might_resched
>  * SC:preempt_schedule
>  * SC:preempt_schedule_notrace
>  * SC:irqentry_exit_cond_resched
>  *
>  *
>  * NONE:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * VOLUNTARY:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- __cond_resched
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * FULL:
>  *   cond_resched               <- RET0
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  */
>
> If you kill voluntary as we know it today, you can remove cond_resched
> and might_resched, but the remaining 3 are still needed to switch
> between NONE and FULL.

No. The whole point of LAZY is to keep preempt_schedule(),
preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.

Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/

The idea is to always enable preempt count and keep _all_ preemption
points enabled.

For NONE/VOLUNTARY mode let the scheduler set TIF_NEED_RESCHED_LAZY
instead of TIF_NEED_RESCHED. In full mode set TIF_NEED_RESCHED.

Here is where the regular and the lazy flags are evaluated:

                Ret2user        Ret2kernel      PreemptCnt=0  need_resched()

NEED_RESCHED       Y                Y               Y         Y
LAZY_RESCHED       Y                N               N         Y

The trick is that LAZY is not folded into preempt_count so a 1->0
counter transition won't cause preempt_schedule() to be invoked because
the topmost bit (NEED_RESCHED) is set.
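
Roughly, the two evaluation paths look like this (illustrative only;
TIF_NEED_RESCHED_LAZY is the flag proposed here, not something in
mainline):

/* ret-to-user: both flags lead into schedule() */
static inline bool example_exit_to_user_resched(unsigned long ti_work)
{
	return ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
}

/* preempt_count() 1->0 and ret-to-kernel: only the non-lazy flag */
static inline bool example_kernel_resched_pending(void)
{
	/*
	 * Only TIF_NEED_RESCHED is folded into the preempt_count's
	 * NEED_RESCHED bit, so this fires only for a real reschedule
	 * request; a lazy request alone never triggers
	 * preempt_schedule() from preempt_enable() or irq exit.
	 */
	return should_resched(0);
}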

The scheduler can still decide to set TIF_NEED_RESCHED which will cause
an immediate preemption at the next preemption point.

This makes it possible to force out a task which loops, e.g. in a massive
copy or clear operation, if it has not reached a point where
TIF_NEED_RESCHED_LAZY is evaluated within a time defined by the scheduler
itself.

For my PoC I did:

    1) Set TIF_NEED_RESCHED_LAZY

    2) Set TIF_NEED_RESCHED when the task did not react on
       TIF_NEED_RESCHED_LAZY within a tick

I know that's crude but it just works and obviously requires quite some
refinement.
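
Something along these lines, hooked into the scheduler tick (again just
a sketch of the two steps above, not the PoC code itself):

static void example_tick_escalate(struct task_struct *curr)
{
	/* Step 1 (setting TIF_NEED_RESCHED_LAZY) happens in the
	 * resched path; here we only handle step 2. */
	if (!test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
		return;

	if (!test_tsk_thread_flag(curr, TIF_NEED_RESCHED)) {
		/*
		 * The task ran a tick past the lazy request without
		 * reaching a ret-to-user reschedule point: get out the
		 * hammer.
		 */
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
		set_preempt_need_resched();	/* fold it for preempt_enable() */
	}
}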

So the way you switch between preemption modes is to select when the
scheduler sets TIF_NEED_RESCHED/TIF_NEED_RESCHED_LAZY. No static call
switching at all.

In full preemption mode it sets always TIF_NEED_RESCHED and otherwise it
uses the LAZY bit first, grants some time and then gets out the hammer
and sets TIF_NEED_RESCHED when the task did not reach a LAZY preemption
point.

Which means that once the whole thing is in place, PREEMPT_DYNAMIC along
with NONE, VOLUNTARY, FULL can go away, together with the cond_resched()
hackery.

So I think this series is backwards.

It should add the LAZY muck with a Kconfig switch like I did in my PoC
_first_. Once that is working and agreed on, the existing muck can be
removed.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC"
  2023-11-07 21:56 ` [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC" Ankur Arora
  2023-11-07 23:17   ` Steven Rostedt
@ 2023-11-08 15:44   ` Mark Rutland
  1 sibling, 0 replies; 250+ messages in thread
From: Mark Rutland @ 2023-11-08 15:44 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Tue, Nov 07, 2023 at 01:56:54PM -0800, Ankur Arora wrote:
> This reverts commit 1b2d3451ee50a0968cb9933f726e50b368ba5073.

As the author of the commit being reverted, I'd appreciate being Cc'd on
subsequent versions of this patch (and ideally, for the series as a whole).

Mark.

> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/arm64/Kconfig               |  1 -
>  arch/arm64/include/asm/preempt.h | 19 ++-----------------
>  arch/arm64/kernel/entry-common.c | 10 +---------
>  3 files changed, 3 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 78f20e632712..856d7be2ee45 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -221,7 +221,6 @@ config ARM64
>  	select HAVE_PERF_EVENTS_NMI if ARM64_PSEUDO_NMI
>  	select HAVE_PERF_REGS
>  	select HAVE_PERF_USER_STACK_DUMP
> -	select HAVE_PREEMPT_DYNAMIC_KEY
>  	select HAVE_REGS_AND_STACK_ACCESS_API
>  	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
>  	select HAVE_FUNCTION_ARG_ACCESS_API
> diff --git a/arch/arm64/include/asm/preempt.h b/arch/arm64/include/asm/preempt.h
> index 0159b625cc7f..e83f0982b99c 100644
> --- a/arch/arm64/include/asm/preempt.h
> +++ b/arch/arm64/include/asm/preempt.h
> @@ -2,7 +2,6 @@
>  #ifndef __ASM_PREEMPT_H
>  #define __ASM_PREEMPT_H
>  
> -#include <linux/jump_label.h>
>  #include <linux/thread_info.h>
>  
>  #define PREEMPT_NEED_RESCHED	BIT(32)
> @@ -81,24 +80,10 @@ static inline bool should_resched(int preempt_offset)
>  }
>  
>  #ifdef CONFIG_PREEMPTION
> -
>  void preempt_schedule(void);
> +#define __preempt_schedule() preempt_schedule()
>  void preempt_schedule_notrace(void);
> -
> -#ifdef CONFIG_PREEMPT_DYNAMIC
> -
> -DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
> -void dynamic_preempt_schedule(void);
> -#define __preempt_schedule()		dynamic_preempt_schedule()
> -void dynamic_preempt_schedule_notrace(void);
> -#define __preempt_schedule_notrace()	dynamic_preempt_schedule_notrace()
> -
> -#else /* CONFIG_PREEMPT_DYNAMIC */
> -
> -#define __preempt_schedule()		preempt_schedule()
> -#define __preempt_schedule_notrace()	preempt_schedule_notrace()
> -
> -#endif /* CONFIG_PREEMPT_DYNAMIC */
> +#define __preempt_schedule_notrace() preempt_schedule_notrace()
>  #endif /* CONFIG_PREEMPTION */
>  
>  #endif /* __ASM_PREEMPT_H */
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 0fc94207e69a..5d9c9951562b 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -225,17 +225,9 @@ static void noinstr arm64_exit_el1_dbg(struct pt_regs *regs)
>  		lockdep_hardirqs_on(CALLER_ADDR0);
>  }
>  
> -#ifdef CONFIG_PREEMPT_DYNAMIC
> -DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
> -#define need_irq_preemption() \
> -	(static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
> -#else
> -#define need_irq_preemption()	(IS_ENABLED(CONFIG_PREEMPTION))
> -#endif
> -
>  static void __sched arm64_preempt_schedule_irq(void)
>  {
> -	if (!need_irq_preemption())
> +	if (!IS_ENABLED(CONFIG_PREEMPTION))
>  		return;
>  
>  	/*
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 15:38       ` Thomas Gleixner
@ 2023-11-08 16:15         ` Peter Zijlstra
  2023-11-08 16:22         ` Steven Rostedt
  2023-11-08 20:26         ` Ankur Arora
  2 siblings, 0 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08 16:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ankur Arora, linux-kernel, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Wed, Nov 08, 2023 at 04:38:11PM +0100, Thomas Gleixner wrote:

> No. The whole point of LAZY is to keep preempt_schedule(),
> preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.

Yeah, I got that. What wasn't at all clear is that it also wanted to
replace NONE.

0/n didn't even mention none.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 15:38       ` Thomas Gleixner
  2023-11-08 16:15         ` Peter Zijlstra
@ 2023-11-08 16:22         ` Steven Rostedt
  2023-11-08 16:49           ` Peter Zijlstra
  2023-11-08 20:26         ` Ankur Arora
  2 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 16:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Wed, 08 Nov 2023 16:38:11 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote:
> > On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
> > I'm not understanding, those should stay obviously.
> >
> > The current preempt_dynamic stuff has 5 toggles:
> >
> > /*
> >  * SC:cond_resched
> >  * SC:might_resched
> >  * SC:preempt_schedule
> >  * SC:preempt_schedule_notrace
> >  * SC:irqentry_exit_cond_resched
> >  *
> >  *
> >  * NONE:
> >  *   cond_resched               <- __cond_resched
> >  *   might_resched              <- RET0
> >  *   preempt_schedule           <- NOP
> >  *   preempt_schedule_notrace   <- NOP
> >  *   irqentry_exit_cond_resched <- NOP
> >  *
> >  * VOLUNTARY:
> >  *   cond_resched               <- __cond_resched
> >  *   might_resched              <- __cond_resched
> >  *   preempt_schedule           <- NOP
> >  *   preempt_schedule_notrace   <- NOP
> >  *   irqentry_exit_cond_resched <- NOP
> >  *
> >  * FULL:
> >  *   cond_resched               <- RET0
> >  *   might_resched              <- RET0
> >  *   preempt_schedule           <- preempt_schedule
> >  *   preempt_schedule_notrace   <- preempt_schedule_notrace
> >  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> >  */
> >
> > If you kill voluntary as we know it today, you can remove cond_resched
> > and might_resched, but the remaining 3 are still needed to switch
> > between NONE and FULL.  
> 
> No. The whole point of LAZY is to keep preempt_schedule(),
> preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.

Right.

 * NONE:
 *   cond_resched               <- __cond_resched
 *   might_resched              <- RET0
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP

Peter, how can you say we can get rid of cond_resched() in NONE when you
show that NONE still uses it? I thought the entire point of this was to get
rid of all the cond_resched() calls, and they are there for PREEMPT_NONE as
well as VOLUNTARY. As you showed above, the only difference between NONE
and VOLUNTARY was might_resched (i.e. the might_sleep() hook).

> 
> Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/

And I've been saying that many times already ;-)

Thanks Thomas for reiterating it.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 82/86] treewide: mtd: " Ankur Arora
@ 2023-11-08 16:28     ` Miquel Raynal
  2023-11-08 16:32       ` Matthew Wilcox
  0 siblings, 1 reply; 250+ messages in thread
From: Miquel Raynal @ 2023-11-08 16:28 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Vignesh Raghavendra, Kyungmin Park,
	Tudor Ambarus, Pratyush Yadav

Hi Ankur,

ankur.a.arora@oracle.com wrote on Tue,  7 Nov 2023 15:08:18 -0800:

> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given its NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> Most of the uses here are in set-1 (some right after we give up a lock
> or enable bottom-halves, causing an explicit preemption check.)
> 
> There are a few cases from set-3. Replace them with
> cond_resched_stall(). Some of those places, however, have wait-times in
> milliseconds, so maybe we should just have an msleep() there?

Yeah, I believe this should work.

> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Miquel Raynal <miquel.raynal@bootlin.com>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Vignesh Raghavendra <vigneshr@ti.com>
> Cc: Kyungmin Park <kyungmin.park@samsung.com>
> Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
> Cc: Pratyush Yadav <pratyush@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---

[...]

> --- a/drivers/mtd/nand/raw/nand_legacy.c
> +++ b/drivers/mtd/nand/raw/nand_legacy.c
> @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
>  	do {
>  		if (chip->legacy.dev_ready(chip))
>  			return;
> -		cond_resched();
> +		/*
> +		 * Use a cond_resched_stall() to avoid spinning in
> +		 * a tight loop.
> +		 * Though, given that the timeout is in milliseconds,
> +		 * maybe this should timeout or event wait?

Event waiting is precisely what we do here, with the hardware accesses
which are available in this case. So I believe this part of the comment
(in general) is not relevant. As for the timeout, I believe it is
closer to a second than a millisecond, so timing out is not
relevant either in most cases (talking about mtd/ in general).

> +		 */
> +		cond_resched_stall();
>  	} while (time_before(jiffies, timeo));

Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 250+ messages in thread

* RE: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 15:15   ` Steven Rostedt
@ 2023-11-08 16:29     ` David Laight
  0 siblings, 0 replies; 250+ messages in thread
From: David Laight @ 2023-11-08 16:29 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: 'Ankur Arora',
	linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, richard, mjguzik

From: Steven Rostedt
> Sent: 08 November 2023 15:16
> 
> On Wed, 8 Nov 2023 09:43:10 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > > Policies:
> > >
> > > A - preemption=none: run to completion
> > > B - preemption=voluntary: run to completion, unless a task of higher
> > >     sched-class awaits
> > > C - preemption=full: optimized for low-latency. Preempt whenever a higher
> > >     priority task awaits.
> >
> > If you remove cond_resched() then won't both B and C require an extra IPI?
> > That is probably OK for RT tasks but could get expensive for
> > normal tasks that aren't bound to a specific cpu.
> 
> What IPI is extra?

I was thinking that you wouldn't currently need an IPI if the target cpu
was running in-kernel because nothing would happen until cond_resched()
was called.

> > I suspect C could also lead to tasks being pre-empted just before
> > they sleep (eg after waking another task).
> > There might already be mitigation for that, I'm not sure if
> > a voluntary sleep can be done in a non-pre-emptible section.
> 
> No, voluntary sleep can not be done in a preemptible section.

I'm guessing you missed out a negation in that (or s/not/only/).

I was thinking about sequences like:
	wake_up();
	...
	set_current_state(TASK_UNINTERRUPTIBLE)
	add_wait_queue();
	spin_unlock();
	schedule();

Where you really don't want to be pre-empted by the woken up task.
For non CONFIG_RT the lock might do it - if held long enough.
Otherwise you'll need to have pre-emption disabled and enable
it just after the set_current_state().
And then quite likely disable again after the schedule()
to balance things out.

So having the scheduler save the pre-empt disable count might
be useful.

	David



^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
  2023-11-08 16:28     ` Miquel Raynal
@ 2023-11-08 16:32       ` Matthew Wilcox
  2023-11-08 17:21         ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Matthew Wilcox @ 2023-11-08 16:32 UTC (permalink / raw)
  To: Miquel Raynal
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Vignesh Raghavendra, Kyungmin Park,
	Tudor Ambarus, Pratyush Yadav

On Wed, Nov 08, 2023 at 05:28:27PM +0100, Miquel Raynal wrote:
> > --- a/drivers/mtd/nand/raw/nand_legacy.c
> > +++ b/drivers/mtd/nand/raw/nand_legacy.c
> > @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
> >  	do {
> >  		if (chip->legacy.dev_ready(chip))
> >  			return;
> > -		cond_resched();
> > +		/*
> > +		 * Use a cond_resched_stall() to avoid spinning in
> > +		 * a tight loop.
> > +		 * Though, given that the timeout is in milliseconds,
> > +		 * maybe this should timeout or event wait?
> 
> Event waiting is precisely what we do here, with the hardware access
> which are available in this case. So I believe this part of the comment
> (in general) is not relevant. Now regarding the timeout I believe it is
> closer to the second than the millisecond, so timeout-ing is not
> relevant either in most cases (talking about mtd/ in general).

I think you've misunderstood what Ankur wrote here.  What you're
currently doing is spinning in a very tight loop.  The comment is
suggesting you might want to msleep(1) or something to avoid burning CPU
cycles.  It'd be even better if the hardware could signal you somehow,
but I bet it can't.
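
Something like this, keeping the same polling structure as the quoted
hunk (untested sketch):

	do {
		if (chip->legacy.dev_ready(chip))
			return;
		/* Give up the CPU between polls; millisecond resolution
		 * is plenty for ms/second-scale timeouts. */
		msleep(1);
	} while (time_before(jiffies, timeo));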

> > +		 */
> > +		cond_resched_stall();
> >  	} while (time_before(jiffies, timeo));
> 
> Thanks,
> Miquèl
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
                   ` (61 preceding siblings ...)
  2023-11-08  9:43 ` David Laight
@ 2023-11-08 16:33 ` Mark Rutland
  2023-11-09  0:34   ` Ankur Arora
  62 siblings, 1 reply; 250+ messages in thread
From: Mark Rutland @ 2023-11-08 16:33 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
> What's broken:
>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>    lazily")
>  - Non-x86 architectures. It's trivial to support other archs (only need
>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>    comments on the series.
>    (From some testing on arm64, didn't find any surprises.)

When you say "testing on arm64, didn't find any surprises", I assume you mean
with an additional patch adding TIF_NEED_RESCHED_LAZY?

Applying this series as-is atop v6.6-rc7 and building defconfig (with GCC
13.2.0) blows up with:

| In file included from ./arch/arm64/include/asm/preempt.h:5,
|                  from ./include/linux/preempt.h:79,
|                  from ./include/linux/spinlock.h:56,
|                  from ./include/linux/mmzone.h:8,
|                  from ./include/linux/gfp.h:7,
|                  from ./include/linux/slab.h:16,
|                  from ./include/linux/resource_ext.h:11,
|                  from ./include/linux/acpi.h:13,
|                  from ./include/acpi/apei.h:9,
|                  from ./include/acpi/ghes.h:5,
|                  from ./include/linux/arm_sdei.h:8,
|                  from arch/arm64/kernel/asm-offsets.c:10:
| ./include/linux/thread_info.h:63:2: error: #error "Arch needs to define TIF_NEED_RESCHED_LAZY"
|    63 | #error "Arch needs to define TIF_NEED_RESCHED_LAZY"
|       |  ^~~~~
| ./include/linux/thread_info.h:66:42: error: 'TIF_NEED_RESCHED_LAZY' undeclared here (not in a function); did you mean 'TIF_NEED_RESCHED'?
|    66 | #define TIF_NEED_RESCHED_LAZY_OFFSET    (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
|       |                                          ^~~~~~~~~~~~~~~~~~~~~
| ./include/linux/thread_info.h:70:24: note: in expansion of macro 'TIF_NEED_RESCHED_LAZY_OFFSET'
|    70 |         RESCHED_lazy = TIF_NEED_RESCHED_LAZY_OFFSET,
|       |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
| make[2]: *** [scripts/Makefile.build:116: arch/arm64/kernel/asm-offsets.s] Error 1
| make[1]: *** [/home/mark/src/linux/Makefile:1202: prepare0] Error 2
| make: *** [Makefile:234: __sub-make] Error 2

Note that since arm64 doesn't use the generic entry code, that also requires
changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
handle TIF_NEED_RESCHED_LAZY.

>  - ftrace support for need-resched-lazy is incomplete

What exactly do we need for ftrace here?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 16:22         ` Steven Rostedt
@ 2023-11-08 16:49           ` Peter Zijlstra
  2023-11-08 17:18             ` Steven Rostedt
  2023-11-08 20:46             ` Ankur Arora
  0 siblings, 2 replies; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-08 16:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote:

> Peter, how can you say we can get rid of cond_resched() in NONE when you

Because that would fix none to actually be none. Who cares.

> > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> 
> And I've been saying that many times already ;-)

Why should I look at random patches on the interweb to make sense of
these patches here?

That just underlines these here patches are not making sense.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 79/86] treewide: net: remove cond_resched()
  2023-11-08 12:16     ` Eric Dumazet
@ 2023-11-08 17:11       ` Steven Rostedt
  2023-11-08 20:59         ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 17:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Marek Lindner, Simon Wunderlich, Antonio Quartulli,
	Sven Eckelmann, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, David Ahern,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Marcelo Ricardo Leitner, Xin Long, Trond Myklebust,
	Anna Schumaker, Jon Maloy, Ying Xue, Martin Schiller

On Wed, 8 Nov 2023 13:16:17 +0100
Eric Dumazet <edumazet@google.com> wrote:

> > Most of the uses here are in set-1 (some right after we give up a
> > lock or enable bottom-halves, causing an explicit preemption check.)
> >
> > We can remove all of them.  
> 
> A patch series of 86 is not reasonable.

Agreed. The removal of cond_resched() wasn't needed for the RFC, as no
comments are really needed once we make cond_resched() obsolete.

I think Ankur just wanted to send all the work for the RFC to let people
know what he has done. I chalk that up as a Noobie mistake.

Ankur, next time you may want to break things up to get RFCs for each step
before going to the next one.

Currently, it looks like the first thing to do is to start with Thomas's
patch, and get the kinks out of NEED_RESCHED_LAZY, as Thomas suggested.

Perhaps work on separating PREEMPT_RCU from PREEMPT.

Then you may need to work on handling the #ifndef PREEMPTION parts of the
kernel.

And so on. Each being a separate patch series that will affect the way the
rest of the changes will be done.

I want this change too, so I'm willing to help you out on this. If you
didn't start it, I would have ;-)

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 16:49           ` Peter Zijlstra
@ 2023-11-08 17:18             ` Steven Rostedt
  2023-11-08 20:46             ` Ankur Arora
  1 sibling, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 17:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Wed, 8 Nov 2023 17:49:16 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote:
> 
> > Peter, how can you say we can get rid of cond_resched() in NONE when you  
> 
> Because that would fix none to actually be none. Who cares.

Well, that would lead to regressions with PREEMPT_NONE and the watchdog
timer.

> 
> > > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/  
> > 
> > And I've been saying that many times already ;-)  
> 
> Why should I look at random patches on the interweb to make sense of
> these patches here?

I actually said it to others, this wasn't really supposed to be addressed
to you.

> 
> That just underlines these here patches are not making sense.

It's a complex series, and there's a lot of room for improvement. I'm happy
to help out here.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
  2023-11-08 16:32       ` Matthew Wilcox
@ 2023-11-08 17:21         ` Steven Rostedt
  2023-11-09  8:38           ` Miquel Raynal
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 17:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Miquel Raynal, Ankur Arora, linux-kernel, tglx, peterz, torvalds,
	paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Vignesh Raghavendra, Kyungmin Park, Tudor Ambarus,
	Pratyush Yadav

On Wed, 8 Nov 2023 16:32:36 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, Nov 08, 2023 at 05:28:27PM +0100, Miquel Raynal wrote:
> > > --- a/drivers/mtd/nand/raw/nand_legacy.c
> > > +++ b/drivers/mtd/nand/raw/nand_legacy.c
> > > @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
> > >  	do {
> > >  		if (chip->legacy.dev_ready(chip))
> > >  			return;
> > > -		cond_resched();
> > > +		/*
> > > +		 * Use a cond_resched_stall() to avoid spinning in
> > > +		 * a tight loop.
> > > +		 * Though, given that the timeout is in milliseconds,
> > > +		 * maybe this should timeout or event wait?  
> > 
> > Event waiting is precisely what we do here, with the hardware access
> > which are available in this case. So I believe this part of the comment
> > (in general) is not relevant. Now regarding the timeout I believe it is
> > closer to the second than the millisecond, so timeout-ing is not
> > relevant either in most cases (talking about mtd/ in general).  
> 
> I think you've misunderstood what Ankur wrote here.  What you're
> currently doing is spinning in a very tight loop.  The comment is
> suggesting you might want to msleep(1) or something to avoid burning CPU
> cycles.  It'd be even better if the hardware could signal you somehow,
> but I bet it can't.
> 

Oh how I wish we could bring back the old PREEMPT_RT cpu_chill()...

#define cpu_chill()	msleep(1)

;-)

-- Steve


> > > +		 */
> > > +		cond_resched_stall();
> > >  	} while (time_before(jiffies, timeo));  
> > 
> > Thanks,
> > Miquèl
> >   


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 71/86] treewide: lib: " Ankur Arora
  2023-11-08  9:15     ` Herbert Xu
@ 2023-11-08 19:15     ` Kees Cook
  2023-11-08 19:41       ` Steven Rostedt
  1 sibling, 1 reply; 250+ messages in thread
From: Kees Cook @ 2023-11-08 19:15 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Herbert Xu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Tue, Nov 07, 2023 at 03:08:07PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given its NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> Almost all the cond_resched() calls are from set-1. Remove them.

For the memcpy_kunit.c cases, I don't think there are preemption
locations in its loops. Perhaps I'm misunderstanding something? Why will
the memcpy test no longer produce softlockup splats?

-Kees

> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net> 
> Cc: Kees Cook <keescook@chromium.org> 
> Cc: Eric Dumazet <edumazet@google.com> 
> Cc: Jakub Kicinski <kuba@kernel.org> 
> Cc: Paolo Abeni <pabeni@redhat.com> 
> Cc: Thomas Graf <tgraf@suug.ch>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  lib/crc32test.c          |  2 --
>  lib/crypto/mpi/mpi-pow.c |  1 -
>  lib/memcpy_kunit.c       |  5 -----
>  lib/random32.c           |  1 -
>  lib/rhashtable.c         |  2 --
>  lib/test_bpf.c           |  3 ---
>  lib/test_lockup.c        |  2 +-
>  lib/test_maple_tree.c    |  8 --------
>  lib/test_rhashtable.c    | 10 ----------
>  9 files changed, 1 insertion(+), 33 deletions(-)
> 
> diff --git a/lib/crc32test.c b/lib/crc32test.c
> index 9b4af79412c4..3eee90482e9a 100644
> --- a/lib/crc32test.c
> +++ b/lib/crc32test.c
> @@ -729,7 +729,6 @@ static int __init crc32c_combine_test(void)
>  			      crc_full == test[i].crc32c_le))
>  				errors++;
>  			runs++;
> -			cond_resched();
>  		}
>  	}
>  
> @@ -817,7 +816,6 @@ static int __init crc32_combine_test(void)
>  			      crc_full == test[i].crc_le))
>  				errors++;
>  			runs++;
> -			cond_resched();
>  		}
>  	}
>  
> diff --git a/lib/crypto/mpi/mpi-pow.c b/lib/crypto/mpi/mpi-pow.c
> index 2fd7a46d55ec..074534900b7e 100644
> --- a/lib/crypto/mpi/mpi-pow.c
> +++ b/lib/crypto/mpi/mpi-pow.c
> @@ -242,7 +242,6 @@ int mpi_powm(MPI res, MPI base, MPI exp, MPI mod)
>  				}
>  				e <<= 1;
>  				c--;
> -				cond_resched();
>  			}
>  
>  			i--;
> diff --git a/lib/memcpy_kunit.c b/lib/memcpy_kunit.c
> index 440aee705ccc..c2a6b09fe93a 100644
> --- a/lib/memcpy_kunit.c
> +++ b/lib/memcpy_kunit.c
> @@ -361,8 +361,6 @@ static void copy_large_test(struct kunit *test, bool use_memmove)
>  			/* Zero out what we copied for the next cycle. */
>  			memset(large_dst + offset, 0, bytes);
>  		}
> -		/* Avoid stall warnings if this loop gets slow. */
> -		cond_resched();
>  	}
>  }
>  
> @@ -489,9 +487,6 @@ static void memmove_overlap_test(struct kunit *test)
>  			for (int s_off = s_start; s_off < s_end;
>  			     s_off = next_step(s_off, s_start, s_end, window_step))
>  				inner_loop(test, bytes, d_off, s_off);
> -
> -			/* Avoid stall warnings. */
> -			cond_resched();
>  		}
>  	}
>  }
> diff --git a/lib/random32.c b/lib/random32.c
> index 32060b852668..10bc804d99d6 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -287,7 +287,6 @@ static int __init prandom_state_selftest(void)
>  			errors++;
>  
>  		runs++;
> -		cond_resched();
>  	}
>  
>  	if (errors)
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 6ae2ba8e06a2..5ff0f521bf29 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -328,7 +328,6 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
>  		err = rhashtable_rehash_chain(ht, old_hash);
>  		if (err)
>  			return err;
> -		cond_resched();
>  	}
>  
>  	/* Publish the new table pointer. */
> @@ -1147,7 +1146,6 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
>  		for (i = 0; i < tbl->size; i++) {
>  			struct rhash_head *pos, *next;
>  
> -			cond_resched();
>  			for (pos = rht_ptr_exclusive(rht_bucket(tbl, i)),
>  			     next = !rht_is_a_nulls(pos) ?
>  					rht_dereference(pos->next, ht) : NULL;
> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
> index ecde4216201e..15b4d32712d8 100644
> --- a/lib/test_bpf.c
> +++ b/lib/test_bpf.c
> @@ -14758,7 +14758,6 @@ static __init int test_skb_segment(void)
>  	for (i = 0; i < ARRAY_SIZE(skb_segment_tests); i++) {
>  		const struct skb_segment_test *test = &skb_segment_tests[i];
>  
> -		cond_resched();
>  		if (exclude_test(i))
>  			continue;
>  
> @@ -14787,7 +14786,6 @@ static __init int test_bpf(void)
>  		struct bpf_prog *fp;
>  		int err;
>  
> -		cond_resched();
>  		if (exclude_test(i))
>  			continue;
>  
> @@ -15171,7 +15169,6 @@ static __init int test_tail_calls(struct bpf_array *progs)
>  		u64 duration;
>  		int ret;
>  
> -		cond_resched();
>  		if (exclude_test(i))
>  			continue;
>  
> diff --git a/lib/test_lockup.c b/lib/test_lockup.c
> index c3fd87d6c2dd..9af5d34c98f6 100644
> --- a/lib/test_lockup.c
> +++ b/lib/test_lockup.c
> @@ -381,7 +381,7 @@ static void test_lockup(bool master)
>  			touch_nmi_watchdog();
>  
>  		if (call_cond_resched)
> -			cond_resched();
> +			cond_resched_stall();
>  
>  		test_wait(cooldown_secs, cooldown_nsecs);
>  
> diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
> index 464eeb90d5ad..321fd5d8aef3 100644
> --- a/lib/test_maple_tree.c
> +++ b/lib/test_maple_tree.c
> @@ -2672,7 +2672,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Check with a value at zero, no gap */
>  	for (i = 1000; i < 2000; i++) {
> @@ -2682,7 +2681,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Check with a value at zero and unreasonably large */
>  	for (i = big_start; i < big_start + 10; i++) {
> @@ -2692,7 +2690,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Small to medium size not starting at zero*/
>  	for (i = 200; i < 1000; i++) {
> @@ -2702,7 +2699,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Unreasonably large not starting at zero*/
>  	for (i = big_start; i < big_start + 10; i++) {
> @@ -2710,7 +2706,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		check_dup_gaps(mt, i, false, 5);
>  		mtree_destroy(mt);
>  		rcu_barrier();
> -		cond_resched();
>  		mt_cache_shrink();
>  	}
>  
> @@ -2720,7 +2715,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		check_dup_gaps(mt, i, false, 5);
>  		mtree_destroy(mt);
>  		rcu_barrier();
> -		cond_resched();
>  		if (i % 2 == 0)
>  			mt_cache_shrink();
>  	}
> @@ -2732,7 +2726,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		check_dup_gaps(mt, i, true, 5);
>  		mtree_destroy(mt);
>  		rcu_barrier();
> -		cond_resched();
>  	}
>  
>  	mt_cache_shrink();
> @@ -2743,7 +2736,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		mtree_destroy(mt);
>  		rcu_barrier();
>  		mt_cache_shrink();
> -		cond_resched();
>  	}
>  }
>  
> diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> index c20f6cb4bf55..e5d1f272f2c6 100644
> --- a/lib/test_rhashtable.c
> +++ b/lib/test_rhashtable.c
> @@ -119,7 +119,6 @@ static int insert_retry(struct rhashtable *ht, struct test_obj *obj,
>  
>  	do {
>  		retries++;
> -		cond_resched();
>  		err = rhashtable_insert_fast(ht, &obj->node, params);
>  		if (err == -ENOMEM && enomem_retry) {
>  			enomem_retries++;
> @@ -253,8 +252,6 @@ static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
>  
>  			rhashtable_remove_fast(ht, &obj->node, test_rht_params);
>  		}
> -
> -		cond_resched();
>  	}
>  
>  	end = ktime_get_ns();
> @@ -371,8 +368,6 @@ static int __init test_rhltable(unsigned int entries)
>  		u32 i = get_random_u32_below(entries);
>  		u32 prand = get_random_u32_below(4);
>  
> -		cond_resched();
> -
>  		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
>  		if (test_bit(i, obj_in_table)) {
>  			clear_bit(i, obj_in_table);
> @@ -412,7 +407,6 @@ static int __init test_rhltable(unsigned int entries)
>  	}
>  
>  	for (i = 0; i < entries; i++) {
> -		cond_resched();
>  		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
>  		if (test_bit(i, obj_in_table)) {
>  			if (WARN(err, "cannot remove element at slot %d", i))
> @@ -607,8 +601,6 @@ static int thread_lookup_test(struct thread_data *tdata)
>  			       obj->value.tid, obj->value.id, key.tid, key.id);
>  			err++;
>  		}
> -
> -		cond_resched();
>  	}
>  	return err;
>  }
> @@ -660,8 +652,6 @@ static int threadfunc(void *data)
>  				goto out;
>  			}
>  			tdata->objs[i].value.id = TEST_INSERT_FAIL;
> -
> -			cond_resched();
>  		}
>  		err = thread_lookup_test(tdata);
>  		if (err) {
> -- 
> 2.31.1
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-08 19:15     ` Kees Cook
@ 2023-11-08 19:41       ` Steven Rostedt
  2023-11-08 22:16         ` Kees Cook
  2023-11-09  9:39         ` David Laight
  0 siblings, 2 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 19:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Herbert Xu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Wed, 8 Nov 2023 11:15:37 -0800
Kees Cook <keescook@chromium.org> wrote:

> For the memcpy_kunit.c cases, I don't think there are preemption
> locations in its loops. Perhaps I'm misunderstanding something? Why will
> the memcpy test no longer produce softlockup splats?

This patchset will switch over to a NEED_RESCHED_LAZY mechanism, so that the
VOLUNTARY and NONE preemption models are forced to preempt a task if it
stays in the kernel for too long.

Time slice is over: set NEED_RESCHED_LAZY

For VOLUNTARY and NONE, NEED_RESCHED_LAZY will not preempt the kernel (but
will preempt user space).

If the task is still in the kernel after one tick (1ms for 1000Hz, 4ms for
250Hz, etc.) and NEED_RESCHED_LAZY is still set, then set NEED_RESCHED.

NEED_RESCHED will now cause a reschedule in the kernel as soon as it is
possible, regardless of the preemption model. (PREEMPT_NONE will now use
preempt_disable().)

This allows us to get rid of all the cond_resched()s throughout the kernel,
as this becomes the new mechanism to keep tasks from running inside the
kernel for too long. The watchdog timeout is always longer than one tick.
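
As a rough sketch of that tick-side upgrade (illustrative only -- the helper
and the per-task counter below are made up, they are not from the series):

        /* Called from the scheduler tick for the currently running task. */
        static void upgrade_lazy_resched(struct task_struct *curr)
        {
                if (!test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
                        return;

                /* First tick with the lazy bit set: give it more time. */
                if (curr->lazy_resched_ticks++ == 0)
                        return;

                /*
                 * Still set a tick later, so the task never reached a
                 * voluntary point. Force a preemption at the next safe spot.
                 * (lazy_resched_ticks is reset when the task schedules.)
                 */
                set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
                set_preempt_need_resched();
        }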

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 15:38       ` Thomas Gleixner
  2023-11-08 16:15         ` Peter Zijlstra
  2023-11-08 16:22         ` Steven Rostedt
@ 2023-11-08 20:26         ` Ankur Arora
  2 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08 20:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote:
>> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
>> I'm not understanding, those should stay obviously.
>>
>> The current preempt_dynamic stuff has 5 toggles:
>>
>> /*
>>  * SC:cond_resched
>>  * SC:might_resched
>>  * SC:preempt_schedule
>>  * SC:preempt_schedule_notrace
>>  * SC:irqentry_exit_cond_resched
>>  *
>>  *
>>  * NONE:
>>  *   cond_resched               <- __cond_resched
>>  *   might_resched              <- RET0
>>  *   preempt_schedule           <- NOP
>>  *   preempt_schedule_notrace   <- NOP
>>  *   irqentry_exit_cond_resched <- NOP
>>  *
>>  * VOLUNTARY:
>>  *   cond_resched               <- __cond_resched
>>  *   might_resched              <- __cond_resched
>>  *   preempt_schedule           <- NOP
>>  *   preempt_schedule_notrace   <- NOP
>>  *   irqentry_exit_cond_resched <- NOP
>>  *
>>  * FULL:
>>  *   cond_resched               <- RET0
>>  *   might_resched              <- RET0
>>  *   preempt_schedule           <- preempt_schedule
>>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>>  */
>>
>> If you kill voluntary as we know it today, you can remove cond_resched
>> and might_resched, but the remaining 3 are still needed to switch
>> between NONE and FULL.
>
> No. The whole point of LAZY is to keep preempt_schedule(),
> preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.
>
> Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>
> The idea is to always enable preempt count and keep _all_ preemption
> points enabled.
>
> For NONE/VOLUNTARY mode let the scheduler set TIF_NEED_RESCHED_LAZY
> instead of TIF_NEED_RESCHED. In full mode set TIF_NEED_RESCHED.
>
> Here is where the regular and the lazy flags are evaluated:
>
>                 Ret2user        Ret2kernel      PreemptCnt=0  need_resched()
>
> NEED_RESCHED       Y                Y               Y         Y
> LAZY_RESCHED       Y                N               N         Y
>
> The trick is that LAZY is not folded into preempt_count so a 1->0
> counter transition won't cause preempt_schedule() to be invoked because
> the topmost bit (NEED_RESCHED) is set.
>
> The scheduler can still decide to set TIF_NEED_RESCHED which will cause
> an immediate preemption at the next preemption point.
>
> This allows to force out a task which loops, e.g. in a massive copy or
> clear operation, as it did not reach a point where TIF_NEED_RESCHED_LAZY
> is evaluated after a time which is defined by the scheduler itself.
>
> For my PoC I did:
>
>     1) Set TIF_NEED_RESCHED_LAZY
>
>     2) Set TIF_NEED_RESCHED when the task did not react on
>        TIF_NEED_RESCHED_LAZY within a tick
>
> I know that's crude but it just works and obviously requires quite some
> refinement.
>
> So the way how you switch between preemption modes is to select when the
> scheduler sets TIF_NEED_RESCHED/TIF_NEED_RESCHED_LAZY. No static call
> switching at all.
>
> In full preemption mode it sets always TIF_NEED_RESCHED and otherwise it
> uses the LAZY bit first, grants some time and then gets out the hammer
> and sets TIF_NEED_RESCHED when the task did not reach a LAZY preemption
> point.
>
> Which means once the whole thing is in place then the whole
> PREEMPT_DYNAMIC along with NONE, VOLUNTARY, FULL can go away along with
> the cond_resched() hackery.
>
> So I think this series is backwards.
>
> It should add the LAZY muck with a Kconfig switch like I did in my PoC
> _first_. Once that is working and agreed on, the existing muck can be
> removed.

Yeah. I should have done it in the order of your PoC. Right now I'm
doing all of the stuff you describe above, but because there are far
too many structural changes, it's not clear to anybody what the code
is doing.

Okay, so for the next version let me limit the series to just the
scheduler changes which can be orthogonal to the old models (basically
a new scheduler model PREEMPT_AUTO).

Once that is agreed on, the other models can be removed (or expressed
in terms of PREEMPT_AUTO.)

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 16:49           ` Peter Zijlstra
  2023-11-08 17:18             ` Steven Rostedt
@ 2023-11-08 20:46             ` Ankur Arora
  1 sibling, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08 20:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Thomas Gleixner, Ankur Arora, linux-kernel,
	torvalds, paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote:
>
>> Peter, how can you say we can get rid of cond_resched() in NONE when you
>
> Because that would fix none to actually be none. Who cares.
>
>> > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>
>> And I've been saying that many times already ;-)
>
> Why should I look at random patches on the interweb to make sense of
> these patches here?
>
> That just underlines these here patches are not making sense.

Yeah, I'm changing too many structural things all at once.

Let me redo this, this time limiting the changes to the scheduler: add a
preemption model which introduces the lazy bit, and which can behave like
preempt=none or preempt=full by toggling how the lazy bit is treated.

This keeps the current models as they are.

And, once that makes sense to people, then we can decide how best to
remove cond_resched() etc.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 79/86] treewide: net: remove cond_resched()
  2023-11-08 17:11       ` Steven Rostedt
@ 2023-11-08 20:59         ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-08 20:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Eric Dumazet, Ankur Arora, linux-kernel, tglx, peterz, torvalds,
	paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Marek Lindner, Simon Wunderlich, Antonio Quartulli,
	Sven Eckelmann, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, David Ahern,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Marcelo Ricardo Leitner, Xin Long, Trond Myklebust,
	Anna Schumaker, Jon Maloy, Ying Xue, Martin Schiller


Steven Rostedt <rostedt@goodmis.org> writes:

> On Wed, 8 Nov 2023 13:16:17 +0100
> Eric Dumazet <edumazet@google.com> wrote:
>
>> > Most of the uses here are in set-1 (some right after we give up a
>> > lock or enable bottom-halves, causing an explicit preemption check.)
>> >
>> > We can remove all of them.
>>
>> A patch series of 86 is not reasonable.

/me nods.

> Agreed. The removal of cond_resched() wasn't needed for the RFC, as there are
> really no comments needed once we make cond_resched() obsolete.
>
> I think Ankur just wanted to send all the work for the RFC to let people
> know what he has done. I chalk that up as a Noobie mistake.
>
> Ankur, next time you may want to break things up to get RFCs for each step
> before going to the next one.

Yeah agreed. It would have made sense to break this up into changes
touching the scheduler code first, get agreement. Rinse. Repeat.

> Currently, it looks like the first thing to do is to start with Thomas's
> patch, and get the kinks out of NEED_RESCHED_LAZY, as Thomas suggested.
>
> Perhaps work on separating PREEMPT_RCU from PREEMPT.

Agree to both of those.

> Then you may need to work on handling the #ifndef PREEMPTION parts of the
> kernel.

In other words, express the !PREEMPTION scheduler models/things that depend
on cond_resched() etc. in terms of Thomas' model?

> And so on. Each being a separate patch series that will affect the way the
> rest of the changes will be done.

Ack that.

> I want this change too, so I'm willing to help you out on this. If you
> didn't start it, I would have ;-)

Thanks and I really appreciate that.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-08 19:41       ` Steven Rostedt
@ 2023-11-08 22:16         ` Kees Cook
  2023-11-08 22:21           ` Steven Rostedt
  2023-11-09  9:39         ` David Laight
  1 sibling, 1 reply; 250+ messages in thread
From: Kees Cook @ 2023-11-08 22:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Herbert Xu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Wed, Nov 08, 2023 at 02:41:44PM -0500, Steven Rostedt wrote:
> On Wed, 8 Nov 2023 11:15:37 -0800
> Kees Cook <keescook@chromium.org> wrote:
> 
> > For the memcpy_kunit.c cases, I don't think there are preemption
> > locations in its loops. Perhaps I'm misunderstanding something? Why will
> > the memcpy test no longer produce softlockup splats?
> 
> This patchset will switch over to a NEED_RESCHED_LAZY mechanism, so that the
> VOLUNTARY and NONE preemption models are forced to preempt a task if it
> stays in the kernel for too long.
> 
> Time slice is over: set NEED_RESCHED_LAZY
> 
> For VOLUNTARY and NONE, NEED_RESCHED_LAZY will not preempt the kernel (but
> will preempt user space).
> 
> If the task is still in the kernel after one tick (1ms for 1000Hz, 4ms for
> 250Hz, etc.) and NEED_RESCHED_LAZY is still set, then set NEED_RESCHED.
> 
> NEED_RESCHED will now cause a reschedule in the kernel as soon as it is
> possible, regardless of the preemption model. (PREEMPT_NONE will now use
> preempt_disable().)
> 
> This allows us to get rid of all the cond_resched()s throughout the kernel,
> as this becomes the new mechanism to keep tasks from running inside the
> kernel for too long. The watchdog timeout is always longer than one tick.

Okay, it sounds like it's taken care of. :)

Acked-by: Kees Cook <keescook@chromium.org> # for lib/memcpy_kunit.c

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-08 22:16         ` Kees Cook
@ 2023-11-08 22:21           ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-08 22:21 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Herbert Xu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Wed, 8 Nov 2023 14:16:25 -0800
Kees Cook <keescook@chromium.org> wrote:

> Okay, it sounds like it's taken care of. :)
> 
> Acked-by: Kees Cook <keescook@chromium.org> # for lib/memcpy_kunit.c

Thanks Kees,

But I have to admit (and Ankur is now aware) that it was premature to send
the cond_resched() removal patches with this RFC. It may be a year before
we get everything straightened out with the new preempt models.

So, expect to see this patch again sometime next year ;-)

Hopefully, you can ack it then.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-08 16:33 ` Mark Rutland
@ 2023-11-09  0:34   ` Ankur Arora
  2023-11-09 11:00     ` Mark Rutland
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-09  0:34 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Mark Rutland <mark.rutland@arm.com> writes:

> On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> What's broken:
>>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>>    lazily")
>>  - Non-x86 architectures. It's trivial to support other archs (only need
>>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>>    comments on the series.
>>    (From some testing on arm64, didn't find any surprises.)
>
> When you say "testing on arm64, didn't find any surprises", I assume you mean
> with an additional patch adding TIF_NEED_RESCHED_LAZY?

Yeah. And, handling that in the user exit path.

> Note that since arm64 doesn't use the generic entry code, that also requires
> changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
> handle TIF_NEED_RESCHED_LAZY.

So, the intent (which got muddied due to this overly large series)
was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
return to user.

I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
and do_notify_resume().
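
Roughly, something like the following (just a sketch, not the actual patch;
the real _TIF_WORK_MASK carries more bits than shown here):

        /* thread_info.h: make the lazy bit part of the ret-to-user work */
        #define _TIF_WORK_MASK  (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
                                 _TIF_SIGPENDING | _TIF_NOTIFY_RESUME)

        /* do_notify_resume(): either bit means reschedule before user */
        if (thread_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
                schedule();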

(The design is much clearer in Thomas' PoC:
https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/)

>>  - ftrace support for need-resched-lazy is incomplete
>
> What exactly do we need for ftrace here?

Only support for TIF_NEED_RESCHED_LAZY, which should be complete.
That comment was based on a misreading of the code.


Thanks

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
  2023-11-08  0:48     ` Chris Packham
@ 2023-11-09  0:55       ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-09  0:55 UTC (permalink / raw)
  To: Chris Packham
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Oded Gabbay, Miguel Ojeda, Jens Axboe,
	Minchan Kim, Sergey Senozhatsky, Sudip Mukherjee,
	Theodore Ts'o, Jason A. Donenfeld, Amit Shah, Gonglei,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Davidlohr Bueso,
	Jonathan Cameron, Dave Jiang, Alison Schofield, Vishal Verma,
	Ira Weiny, Dan Williams, Sumit Semwal, Christian König,
	Andi Shyti, Ray Jui, Scott Branden, Shawn Guo, Sascha Hauer,
	Junxian Huang, Dmitry Torokhov, Will Deacon, Joerg Roedel,
	Mauro Carvalho Chehab, Srinivas Pandruvada, Hans de Goede,
	Ilpo Järvinen, Mark Gross, Finn Thain, Michael Schmitz,
	James E.J. Bottomley, Martin K. Petersen, Kashyap Desai,
	Sumit Saxena, Shivasharan S, Mark Brown, Neil Armstrong,
	Jens Wiklander, Alex Williamson, Helge Deller, David Hildenbrand


Chris Packham <Chris.Packham@alliedtelesis.co.nz> writes:

> On 8/11/23 12:08, Ankur Arora wrote:
>> There are broadly three sets of uses of cond_resched():
>>
>> 1.  Calls to cond_resched() out of the goodness of our heart,
>>      otherwise known as avoiding lockup splats.
>>
>> 2.  Open coded variants of cond_resched_lock() which call
>>      cond_resched().
>>
>> 3.  Retry or error handling loops, where cond_resched() is used as a
>>      quick alternative to spinning in a tight-loop.
>>
>> When running under a full preemption model, the cond_resched() reduces
>> to a NOP (not even a barrier) so removing it obviously cannot matter.
>>
>> But considering only voluntary preemption models (for say code that
>> has been mostly tested under those), for set-1 and set-2 the
>> scheduler can now preempt kernel tasks running beyond their time
>> quanta anywhere they are preemptible() [1]. Which removes any need
>> for these explicitly placed scheduling points.
>>
>> The cond_resched() calls in set-3 are a little more difficult.
>> To start with, given it's NOP character under full preemption, it
>> never actually saved us from a tight loop.
>> With voluntary preemption, it's not a NOP, but it might as well be --
>> for most workloads the scheduler does not have an interminable supply
>> of runnable tasks on the runqueue.
>>
>> So, cond_resched() is useful to not get softlockup splats, but not
>> terribly good for error handling. Ideally, these should be replaced
>> with some kind of timed or event wait.
>> For now we use cond_resched_stall(), which tries to schedule if
>> possible, and executes a cpu_relax() if not.
>>
>> The cond_resched() calls here are of all kinds. Those from set-1
>> or set-2 are quite straightforward to handle.
>>
>> There are quite a few from set-3, where, as noted above, we
>> use cond_resched() as if it were an amulet. Which I suppose
>> it is, in that it wards off softlockup or RCU splats.
>>
>> Those are now cond_resched_stall(), but in most cases, given
>> that the timeouts are in milliseconds, they could easily be
>> timed waits.
>
> For i2c-mpc.c:
>
It looks like the code in question could probably be converted to
readb_poll_timeout(). If I find sufficient round-tuits I might look at
that. Regardless, in the context of the tree-wide change ...
>
> Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>

Thanks Chris. It'll take a while before this lands.
I'll see if I can send a patch with cond_resched_stall() or similar
separately.

Meanwhile please feel free to make the readb_poll_timeout() change.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-08 15:08       ` Steven Rostedt
@ 2023-11-09  4:19         ` Herbert Xu
  2023-11-09  4:43           ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Herbert Xu @ 2023-11-09  4:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, David S. Miller, Kees Cook, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Wed, Nov 08, 2023 at 10:08:18AM -0500, Steven Rostedt wrote:
>
> A "Nack" with no commentary is completely useless and borderline offensive.

Well, you just sent me an email out of the blue, with zero context
about what you were doing, and you're complaining to me about giving
you a curt response?

> What is your rationale for the Nack?

Next time perhaps consider sending the cover letter and the important
patches to everyone rather than the mailing list.

> The cond_resched() is going away if the patches earlier in the series gets
> implemented. So either it is removed from your code, or it will become a
> nop, and just wasting bits in the source tree. Your choice.

This is exactly what I should have received.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-09  4:19         ` Herbert Xu
@ 2023-11-09  4:43           ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-09  4:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, David S. Miller, Kees Cook, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Thomas Graf

On Thu, 9 Nov 2023 12:19:55 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Wed, Nov 08, 2023 at 10:08:18AM -0500, Steven Rostedt wrote:
> >
> > A "Nack" with no commentary is completely useless and borderline offensive.  
> 
> Well, you just sent me an email out of the blue, with zero context
> about what you were doing, and you're complaining to me about giving
> you a curt response?

First, I didn't send the email, and your "Nack" wasn't directed at me.

Second, with lore and lei, it's trivial today to find the cover letter from
the message id. But I get it. It's annoying when you have to do that.

> 
> > What is your rationale for the Nack?  
> 
> Next time perhaps consider sending the cover letter and the important
> patches to everyone rather than the mailing list.

Then that is how you should have responded. I see other maintainers respond
as such. A "Nack" is still meaningless. You could have responded with:

 "What is this? And why are you doing it?"

Which is a much better and more meaningful response than a "Nack".

> 
> > The cond_resched() is going away if the patches earlier in the series gets
> > implemented. So either it is removed from your code, or it will become a
> > nop, and just wasting bits in the source tree. Your choice.  
> 
> This is exactly what I should have received.

Which is why I replied, as the original email author is still new at this,
but is learning.

-- Steve


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
  2023-11-08 17:21         ` Steven Rostedt
@ 2023-11-09  8:38           ` Miquel Raynal
  0 siblings, 0 replies; 250+ messages in thread
From: Miquel Raynal @ 2023-11-09  8:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Matthew Wilcox, Ankur Arora, linux-kernel, tglx, peterz,
	torvalds, paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, mgorman, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Vignesh Raghavendra, Kyungmin Park, Tudor Ambarus,
	Pratyush Yadav

Hello,

rostedt@goodmis.org wrote on Wed, 8 Nov 2023 12:21:16 -0500:

> On Wed, 8 Nov 2023 16:32:36 +0000
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Wed, Nov 08, 2023 at 05:28:27PM +0100, Miquel Raynal wrote:  
> > > > --- a/drivers/mtd/nand/raw/nand_legacy.c
> > > > +++ b/drivers/mtd/nand/raw/nand_legacy.c
> > > > @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
> > > >  	do {
> > > >  		if (chip->legacy.dev_ready(chip))
> > > >  			return;
> > > > -		cond_resched();
> > > > +		/*
> > > > +		 * Use a cond_resched_stall() to avoid spinning in
> > > > +		 * a tight loop.
> > > > +		 * Though, given that the timeout is in milliseconds,
> > > > +		 * maybe this should timeout or event wait?    
> > > 
> > > Event waiting is precisely what we do here, with the hardware access
> > > which are available in this case. So I believe this part of the comment
> > > (in general) is not relevant. Now regarding the timeout I believe it is
> > > closer to the second than the millisecond, so timeout-ing is not
> > > relevant either in most cases (talking about mtd/ in general).    
> > 
> > I think you've misunderstood what Ankur wrote here.  What you're
> > currently doing is spinning in a very tight loop.  The comment is
> > suggesting you might want to msleep(1) or something to avoid burning CPU
> > cycles.  It'd be even better if the hardware could signal you somehow,
> > but I bet it can't.

Well, I think I'm aligned with the change and the first sentence in the
comment, but not with the second sentence, which I don't find relevant.

Maybe I don't understand what "maybe this should timeout" means and Ankur
actually meant "sleeping" there, but for me a timeout is when you bail out
with an error. If sleeping is advised, then why not use more explicit
wording? As for hardware events, they are not relevant in this case, as
you noticed, which is why I asked for that part of the sentence to be
dropped.

This is a legacy part of the core, but it is still part of the core. In
general I don't mind treewide changes being slightly generic, and I won't
be bothered too much by the device driver changes, but the core is more
important in my eyes.

> Oh how I wish we could bring back the old PREEMPT_RT cpu_chill()...
> 
> #define cpu_chill()	msleep(1)

:')

Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 250+ messages in thread

* RE: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
  2023-11-08 19:41       ` Steven Rostedt
  2023-11-08 22:16         ` Kees Cook
@ 2023-11-09  9:39         ` David Laight
  1 sibling, 0 replies; 250+ messages in thread
From: David Laight @ 2023-11-09  9:39 UTC (permalink / raw)
  To: 'Steven Rostedt', Kees Cook
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, richard, mjguzik,
	Herbert Xu, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Thomas Graf

From: Steven Rostedt
> Sent: 08 November 2023 19:42
> 
> On Wed, 8 Nov 2023 11:15:37 -0800
> Kees Cook <keescook@chromium.org> wrote:
> 
> > For the memcpy_kunit.c cases, I don't think there are preemption
> > locations in its loops. Perhaps I'm misunderstanding something? Why will
> > the memcpy test no longer produce softlockup splats?
> 
> This patchset will switch over to a NEED_RESCHED_LAZY mechanism, so that the
> VOLUNTARY and NONE preemption models are forced to preempt a task if it
> stays in the kernel for too long.
> 
> Time slice is over: set NEED_RESCHED_LAZY
> 
> For VOLUNTARY and NONE, NEED_RESCHED_LAZY will not preempt the kernel (but
> will preempt user space).
> 
> If the task is still in the kernel after one tick (1ms for 1000Hz, 4ms for
> 250Hz, etc.) and NEED_RESCHED_LAZY is still set, then set NEED_RESCHED.

Delaying the reschedule that long seems like a regression.
I'm sure a lot of the cond_resched() calls were added to cause
pre-emption much earlier than 1 tick.

I doubt the distributions will change from VOLUNTARY any time soon.
So that is what most people will be using.

	David.

> 
> NEED_RESCHED will now cause a reschedule in the kernel as soon as it is
> possible, regardless of the preemption model. (PREEMPT_NONE will now use
> preempt_disable().)
> 
> This allows us to get rid of all the cond_resched()s throughout the kernel,
> as this becomes the new mechanism to keep tasks from running inside the
> kernel for too long. The watchdog timeout is always longer than one tick.
> 
> -- Steve

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-09  0:34   ` Ankur Arora
@ 2023-11-09 11:00     ` Mark Rutland
  2023-11-09 22:36       ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Mark Rutland @ 2023-11-09 11:00 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Wed, Nov 08, 2023 at 04:34:41PM -0800, Ankur Arora wrote:
> Mark Rutland <mark.rutland@arm.com> writes:
> 
> > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
> >> What's broken:
> >>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
> >>    lazily")
> >>  - Non-x86 architectures. It's trivial to support other archs (only need
> >>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
> >>    comments on the series.
> >>    (From some testing on arm64, didn't find any surprises.)
> >
> > When you say "testing on arm64, didn't find any surprises", I assume you mean
> > with an additional patch adding TIF_NEED_RESCHED_LAZY?
> 
> Yeah. And, handling that in the user exit path.
> 
> > Note that since arm64 doesn't use the generic entry code, that also requires
> > changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
> > handle TIF_NEED_RESCHED_LAZY.
> 
> So, the intent (which got muddied due to this overly large series)
> was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
> return to user.

Ah, I missed that detail -- thanks for clarifying!

> I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
> and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
> and do_notify_resume().

Digging a bit more, I think that should still work.

One slight clarification: arm64_preempt_schedule_irq() doesn't look at
TIF_NEED_RESCHED today, as it relies on the scheduler IPI calling
preempt_fold_need_resched() to propagate TIF_NEED_RESCHED into
PREEMPT_NEED_RESCHED. That should still work since this series makes
preempt_fold_need_resched() check tif_need_resched(RESCHED_eager).
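
(For reference, the fold helper today is essentially:

        static __always_inline void preempt_fold_need_resched(void)
        {
                if (tif_need_resched())
                        set_preempt_need_resched();
        }

so with tif_need_resched() taking RESCHED_eager, only the non-lazy bit ever
gets folded into the preempt count.)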

I was a bit confused because in the generic entry code,
irqentry_exit_cond_resched() explicitly checks for TIF_NEED_RESCHED, and I'm
not sure why it does that rather than relying on the scheduler IPI as above.

> (The design is much clearer in Thomas' PoC:
> https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/)
> 
> >>  - ftrace support for need-resched-lazy is incomplete
> >
> > What exactly do we need for ftrace here?
> 
> Only support for TIF_NEED_RESCHED_LAZY which should be complete.
> That comment was based on a misreading of the code.

Cool; thanks!

Mark.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 54/86] sched: add cond_resched_stall()
  2023-11-07 21:57 ` [RFC PATCH 54/86] sched: add cond_resched_stall() Ankur Arora
@ 2023-11-09 11:19   ` Thomas Gleixner
  2023-11-09 22:27     ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Thomas Gleixner @ 2023-11-09 11:19 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel
  Cc: peterz, torvalds, paulmck, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Ankur Arora

On Tue, Nov 07 2023 at 13:57, Ankur Arora wrote:
> The kernel has a lot of instances of cond_resched() where it is used
> as an alternative to spinning in a tight-loop while waiting to
> retry an operation, or while waiting for a device state to change.
>
> Unfortunately, because the scheduler is unlikely to have an
> interminable supply of runnable tasks on the runqueue, this just
> amounts to spinning in a tight-loop with a cond_resched().
> (When running in a fully preemptible kernel, cond_resched()
> calls are stubbed out so it amounts to even less.)
>
> In sum, cond_resched() in error handling/retry contexts might
> be useful in avoiding softlockup splats, but not very good at
> error handling. Ideally, these should be replaced with some kind
> of timed or event wait.
>
> For now add cond_resched_stall(), which tries to schedule if
> possible, and failing that executes a cpu_relax().

What's the point of this new variant of cond_resched()? We really do not
want it at all. 

> +int __cond_resched_stall(void)
> +{
> +	if (tif_need_resched(RESCHED_eager)) {
> +		__preempt_schedule();

Under the new model TIF_NEED_RESCHED is going to reschedule if the
preemption counter goes to zero.

So the typical

   while (readl(mmio) & BUSY)
   	cpu_relax();

will just be preempted like any other loop, no?

Confused.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-07 23:16   ` Steven Rostedt
  2023-11-08  4:55     ` Ankur Arora
@ 2023-11-09 17:26     ` Josh Poimboeuf
  2023-11-09 17:31       ` Steven Rostedt
  1 sibling, 1 reply; 250+ messages in thread
From: Josh Poimboeuf @ 2023-11-09 17:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	live-patching

On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote:
> On Tue,  7 Nov 2023 13:56:53 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
> > 
> > Note that removing this commit reintroduces "live patches failing to
> > complete within a reasonable amount of time due to CPU-bound kthreads."
> > 
> > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
> > existence of cond_resched() so this will need an alternate fix.

We definitely don't want to introduce a regression, something will need
to be figured out before removing cond_resched().

We could hook into preempt_schedule_irq(), but that wouldn't work for
non-ORC.

Another option would be to hook into schedule().  Then livepatch could
set TIF_NEED_RESCHED on remaining unpatched tasks.  But again if they go
through the preemption path then we have the same problem for non-ORC.

Worst case we'll need to sprinkle cond_livepatch() everywhere :-/

-- 
Josh

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-09 17:26     ` Josh Poimboeuf
@ 2023-11-09 17:31       ` Steven Rostedt
  2023-11-09 17:51         ` Josh Poimboeuf
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-09 17:31 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	live-patching

On Thu, 9 Nov 2023 09:26:37 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote:
> > On Tue,  7 Nov 2023 13:56:53 -0800
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >   
> > > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
> > > 
> > > Note that removing this commit reintroduces "live patches failing to
> > > complete within a reasonable amount of time due to CPU-bound kthreads."
> > > 
> > > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
> > > existence of cond_resched() so this will need an alternate fix.  
> 
> We definitely don't want to introduce a regression, something will need
> to be figured out before removing cond_resched().
> 
> We could hook into preempt_schedule_irq(), but that wouldn't work for
> non-ORC.
> 
> Another option would be to hook into schedule().  Then livepatch could
> set TIF_NEED_RESCHED on remaining unpatched tasks.  But again if they go
> through the preemption path then we have the same problem for non-ORC.
> 
> Worst case we'll need to sprinkle cond_livepatch() everywhere :-/
> 

I guess I'm not fully understanding what the cond rescheds are for. But
would an IPI to all CPUs setting NEED_RESCHED, fix it?

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-09 17:31       ` Steven Rostedt
@ 2023-11-09 17:51         ` Josh Poimboeuf
  2023-11-09 22:50           ` Ankur Arora
  2023-11-10  0:56           ` Steven Rostedt
  0 siblings, 2 replies; 250+ messages in thread
From: Josh Poimboeuf @ 2023-11-09 17:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	live-patching

On Thu, Nov 09, 2023 at 12:31:47PM -0500, Steven Rostedt wrote:
> On Thu, 9 Nov 2023 09:26:37 -0800
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> 
> > On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote:
> > > On Tue,  7 Nov 2023 13:56:53 -0800
> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > >   
> > > > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
> > > > 
> > > > Note that removing this commit reintroduces "live patches failing to
> > > > complete within a reasonable amount of time due to CPU-bound kthreads."
> > > > 
> > > > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
> > > > existence of cond_resched() so this will need an alternate fix.  
> > 
> > We definitely don't want to introduce a regression, something will need
> > to be figured out before removing cond_resched().
> > 
> > We could hook into preempt_schedule_irq(), but that wouldn't work for
> > non-ORC.
> > 
> > Another option would be to hook into schedule().  Then livepatch could
> > set TIF_NEED_RESCHED on remaining unpatched tasks.  But again if they go
> > through the preemption path then we have the same problem for non-ORC.
> > 
> > Worst case we'll need to sprinkle cond_livepatch() everywhere :-/
> > 
> 
> I guess I'm not fully understanding what the cond rescheds are for. But
> would an IPI to all CPUs setting NEED_RESCHED, fix it?

If all livepatch arches had the ORC unwinder, yes.

The problem is that frame pointer (and similar) unwinders can't reliably
unwind past an interrupt frame.

-- 
Josh

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 54/86] sched: add cond_resched_stall()
  2023-11-09 11:19   ` Thomas Gleixner
@ 2023-11-09 22:27     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-09 22:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, torvalds, paulmck, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, rostedt, David.Laight, richard, mjguzik,
	Ankur Arora


Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, Nov 07 2023 at 13:57, Ankur Arora wrote:
>> The kernel has a lot of instances of cond_resched() where it is used
>> as an alternative to spinning in a tight-loop while waiting to
>> retry an operation, or while waiting for a device state to change.
>>
>> Unfortunately, because the scheduler is unlikely to have an
>> interminable supply of runnable tasks on the runqueue, this just
>> amounts to spinning in a tight-loop with a cond_resched().
>> (When running in a fully preemptible kernel, cond_resched()
>> calls are stubbed out so it amounts to even less.)
>>
>> In sum, cond_resched() in error handling/retry contexts might
>> be useful in avoiding softlockup splats, but not very good at
>> error handling. Ideally, these should be replaced with some kind
>> of timed or event wait.
>>
>> For now add cond_resched_stall(), which tries to schedule if
>> possible, and failing that executes a cpu_relax().
>
> What's the point of this new variant of cond_resched()? We really do not
> want it at all.
>
>> +int __cond_resched_stall(void)
>> +{
>> +	if (tif_need_resched(RESCHED_eager)) {
>> +		__preempt_schedule();
>
> Under the new model TIF_NEED_RESCHED is going to reschedule if the
> preemption counter goes to zero.

Yes agreed. cond_resched_stall() was just meant to be window dressing.

> So the typical
>
>    while (readl(mmio) & BUSY)
>    	cpu_relax();
>
> will just be preempted like any other loop, no?

Yeah. But drivers could be using that right now as well. I suspect people
don't like the idea of spinning in a loop, and that's why they use
cond_resched(). Which, in loops like this, is pretty much:

     while (readl(mmio) & BUSY)
     	   ;

The reason I added cond_resched_stall() was as an analogue to
cond_resched_lock() etc., i.e. explicitly giving up the CPU.

Though, someone pointed out a much better interface to do that sort
of thing: readb_poll_timeout(). Not all but a fair number of sites
could be converted to that.
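
For example, the readl()/BUSY loop above could become something like the
following (BUSY and mmio stand in for whatever the driver actually polls,
and the intervals are made up):

        u32 reg;
        int ret;

        /* poll every 10us, sleeping in between, give up after 10ms */
        ret = readl_poll_timeout(mmio, reg, !(reg & BUSY), 10, 10 * USEC_PER_MSEC);
        if (ret)
                return ret;     /* -ETIMEDOUT */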

Ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 00/86] Make the kernel preemptible
  2023-11-09 11:00     ` Mark Rutland
@ 2023-11-09 22:36       ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-09 22:36 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Mark Rutland <mark.rutland@arm.com> writes:

> On Wed, Nov 08, 2023 at 04:34:41PM -0800, Ankur Arora wrote:
>> Mark Rutland <mark.rutland@arm.com> writes:
>>
>> > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> >> What's broken:
>> >>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>> >>    lazily")
>> >>  - Non-x86 architectures. It's trivial to support other archs (only need
>> >>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>> >>    comments on the series.
>> >>    (From some testing on arm64, didn't find any surprises.)
>> >
>> > When you say "testing on arm64, didn't find any surprises", I assume you mean
>> > with an additional patch adding TIF_NEED_RESCHED_LAZY?
>>
>> Yeah. And, handling that in the user exit path.
>>
>> > Note that since arm64 doesn't use the generic entry code, that also requires
>> > changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
>> > handle TIF_NEED_RESCHED_LAZY.
>>
>> So, the intent (which got muddied due to this overly large series)
>> was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
>> return to user.
>
> Ah, I missed that detail -- thanks for clarifying!
>
>> I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
>> and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
>> and do_notify_resume().
>
> Digging a bit more, I think that should still work.
>
> One slight clarification: arm64_preempt_schedule_irq() doesn't look at
> TIF_NEED_RESCHED today, as it relies on the scheduler IPI calling
> preempt_fold_need_resched() to propagate TIF_NEED_RESCHED into
> PREEMPT_NEED_RESCHED. That should still work since this series makes
> preempt_fold_need_resched() check tif_need_resched(RESCHED_eager).
>
> I was a bit confused because in the generic entry code,
> irqentry_exit_cond_resched() explicitly checks for TIF_NEED_RESCHED, and I'm
> not sure why it does that rather than relying on the scheduler IPI as above.

Yeah I found that confusing as well. I suspect the reason is that not
all archs do the folding and we need the explicit check for those that
don't.
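
(The generic code is roughly:

        void irqentry_exit_cond_resched(void)
        {
                if (!preempt_count()) {
                        /* RCU and thread-stack sanity checks elided */
                        if (need_resched())
                                preempt_schedule_irq();
                }
        }

and need_resched() boils down to a test of TIF_NEED_RESCHED, hence the
explicit check.)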


--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-09 17:51         ` Josh Poimboeuf
@ 2023-11-09 22:50           ` Ankur Arora
  2023-11-09 23:47             ` Josh Poimboeuf
  2023-11-10  0:56           ` Steven Rostedt
  1 sibling, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-09 22:50 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Steven Rostedt, Ankur Arora, linux-kernel, tglx, peterz,
	torvalds, paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	live-patching


Josh Poimboeuf <jpoimboe@kernel.org> writes:

> On Thu, Nov 09, 2023 at 12:31:47PM -0500, Steven Rostedt wrote:
>> On Thu, 9 Nov 2023 09:26:37 -0800
>> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>>
>> > On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote:
>> > > On Tue,  7 Nov 2023 13:56:53 -0800
>> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> > >
>> > > > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
>> > > >
>> > > > Note that removing this commit reintroduces "live patches failing to
>> > > > complete within a reasonable amount of time due to CPU-bound kthreads."
>> > > >
>> > > > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
>> > > > existence of cond_resched() so this will need an alternate fix.
>> >
>> > We definitely don't want to introduce a regression, something will need
>> > to be figured out before removing cond_resched().
>> >
>> > We could hook into preempt_schedule_irq(), but that wouldn't work for
>> > non-ORC.
>> >
>> > Another option would be to hook into schedule().  Then livepatch could
>> > set TIF_NEED_RESCHED on remaining unpatched tasks.  But again if they go
>> > through the preemption path then we have the same problem for non-ORC.
>> >
>> > Worst case we'll need to sprinkle cond_livepatch() everywhere :-/
>> >
>>
>> I guess I'm not fully understanding what the cond rescheds are for. But
>> would an IPI to all CPUs setting NEED_RESCHED, fix it?

Yeah. We could just temporarily toggle to full preemption, where
NEED_RESCHED_LAZY is always upgraded to NEED_RESCHED, which will
then send IPIs.

> If all livepatch arches had the ORC unwinder, yes.
>
> The problem is that frame pointer (and similar) unwinders can't reliably
> unwind past an interrupt frame.

Ah, I wonder if we could just disable the preempt_schedule_irq() path
temporarily? Hooking into schedule() alongside something like this:

@@ -379,7 +379,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)

 void irqentry_exit_cond_resched(void)
 {
-       if (!preempt_count()) {
+       if (klp_cond_resched_disable() && !preempt_count()) {

The problem would be tasks that don't go through any preemptible
sections.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 85/86] treewide: drivers: " Ankur Arora
  2023-11-08  0:48     ` Chris Packham
@ 2023-11-09 23:25     ` Dmitry Torokhov
  2023-11-09 23:41       ` Steven Rostedt
  2023-11-10  0:01       ` Ankur Arora
  1 sibling, 2 replies; 250+ messages in thread
From: Dmitry Torokhov @ 2023-11-09 23:25 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Oded Gabbay, Miguel Ojeda, Jens Axboe,
	Minchan Kim, Sergey Senozhatsky, Sudip Mukherjee,
	Theodore Ts'o, Jason A. Donenfeld, Amit Shah, Gonglei,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Davidlohr Bueso,
	Jonathan Cameron, Dave Jiang, Alison Schofield, Vishal Verma,
	Ira Weiny, Dan Williams, Sumit Semwal, Christian König,
	Andi Shyti, Ray Jui, Scott Branden, Chris Packham, Shawn Guo,
	Sascha Hauer, Junxian Huang, Will Deacon, Joerg Roedel,
	Mauro Carvalho Chehab, Srinivas Pandruvada, Hans de Goede,
	Ilpo Järvinen, Mark Gross, Finn Thain, Michael Schmitz,
	James E.J. Bottomley, Martin K. Petersen, Kashyap Desai,
	Sumit Saxena, Shivasharan S, Mark Brown, Neil Armstrong,
	Jens Wiklander, Alex Williamson, Helge Deller, David Hildenbrand

Hi Ankur,

On Tue, Nov 07, 2023 at 03:08:21PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.

...

What about RCU stalls? The calls to cond_resched() in evdev.c and
mousedev.c were added specifically to allow RCU to run in cases when
userspace passes a large buffer and the kernel is not fully preemptable.

Thanks.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
  2023-11-09 23:25     ` Dmitry Torokhov
@ 2023-11-09 23:41       ` Steven Rostedt
  2023-11-10  0:01       ` Ankur Arora
  1 sibling, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-09 23:41 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Oded Gabbay, Miguel Ojeda, Jens Axboe, Minchan Kim,
	Sergey Senozhatsky, Sudip Mukherjee, Theodore Ts'o,
	Jason A. Donenfeld, Amit Shah, Gonglei, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Davidlohr Bueso, Jonathan Cameron,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Sumit Semwal, Christian König, Andi Shyti,
	Ray Jui, Scott Branden, Chris Packham, Shawn Guo, Sascha Hauer,
	Junxian Huang, Will Deacon, Joerg Roedel, Mauro Carvalho Chehab,
	Srinivas Pandruvada, Hans de Goede, Ilpo Järvinen,
	Mark Gross, Finn Thain, Michael Schmitz, James E.J. Bottomley,
	Martin K. Petersen, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Mark Brown, Neil Armstrong, Jens Wiklander, Alex Williamson,
	Helge Deller, David Hildenbrand

On Thu, 9 Nov 2023 15:25:54 -0800
Dmitry Torokhov <dmitry.torokhov@gmail.com> wrote:

> Hi Ankur,
> 
> On Tue, Nov 07, 2023 at 03:08:21PM -0800, Ankur Arora wrote:
> > There are broadly three sets of uses of cond_resched():
> > 
> > 1.  Calls to cond_resched() out of the goodness of our heart,
> >     otherwise known as avoiding lockup splats.  
> 
> ...
> 
> What about RCU stalls? The calls to cond_resched() in evdev.c and
> mousedev.c were added specifically to allow RCU to run in cases when
> userspace passes a large buffer and the kernel is not fully preemptible.
> 

First, this patch is being sent out prematurely, as it depends on acceptance
of the previous patches.

When the previous patches are finished, we won't need cond_resched()
to protect against RCU stalls, because even "PREEMPT_NONE" will allow
preemption inside the kernel.

What the earlier patches do is introduce the concept of NEED_RESCHED_LAZY.
When the scheduler wants to reschedule a task, it sets that bit instead of
NEED_RESCHED (for the old PREEMPT_NONE model). For VOLUNTARY, it sets the
LAZY bit for SCHED_OTHER tasks but NEED_RESCHED for RT/DL tasks. For
PREEMPT, it always sets NEED_RESCHED.

NEED_RESCHED will always schedule, but NEED_RESCHED_LAZY only schedules
when going to user space.

Now, after one tick (depending on HZ that can be 1ms, 3.3ms, 4ms or 10ms),
if NEED_RESCHED_LAZY is still set, the scheduler will set NEED_RESCHED,
forcing a preemption at the next available moment (when preempt count is
zero).

This will be done even with the old PREEMPT_NONE configuration.

That way we no longer have to play whack-a-mole with long-running kernel
paths by inserting cond_resched() into them.
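
A rough sketch of that tick-side promotion (illustrative only -- the
helper name is made up and the details differ from the actual series):

	/* called from the scheduler tick */
	static void tick_promote_lazy_resched(struct task_struct *curr)
	{
		/*
		 * The lazy bit has been pending for a full tick: promote it
		 * to NEED_RESCHED so the task gets preempted at the next
		 * point where preempt_count() drops to zero.
		 */
		if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
			set_tsk_need_resched(curr);
	}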

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-09 22:50           ` Ankur Arora
@ 2023-11-09 23:47             ` Josh Poimboeuf
  2023-11-10  0:46               ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Josh Poimboeuf @ 2023-11-09 23:47 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Steven Rostedt, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	live-patching

On Thu, Nov 09, 2023 at 02:50:48PM -0800, Ankur Arora wrote:
> >> I guess I'm not fully understanding what the cond rescheds are for. But
> >> would an IPI to all CPUs setting NEED_RESCHED, fix it?
> 
> Yeah. We could just temporarily toggle to full preemption, when
> NEED_RESCHED_LAZY is always upgraded to NEED_RESCHED which will
> then send IPIs.
> 
> > If all livepatch arches had the ORC unwinder, yes.
> >
> > The problem is that frame pointer (and similar) unwinders can't reliably
> > unwind past an interrupt frame.
> 
> Ah, I wonder if we could just disable the preempt_schedule_irq() path
> temporarily? Hooking into schedule() alongside something like this:
> 
> @@ -379,7 +379,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> 
>  void irqentry_exit_cond_resched(void)
>  {
> -       if (!preempt_count()) {
> +       if (klp_cond_resched_disable() && !preempt_count()) {
> 
> The problem would be tasks that don't go through any preemptible
> sections.

Let me back up a bit and explain what klp is trying to do.

When a livepatch is applied, klp needs to unwind all the tasks,
preferably within a reasonable amount of time.

We can't unwind task A from task B while task A is running, since task A
could be changing the stack during the unwind.  So task A needs to be
blocked or asleep.  The only exception to that is if the unwind happens
in the context of task A itself.

The problem we were seeing was CPU-bound kthreads (e.g., vhost_worker)
not getting patched within a reasonable amount of time.  We fixed it by
hooking the klp unwind into cond_resched() so it can unwind from the
task itself.

It only worked because we had a non-preempted hook (because non-ORC
unwinders can't unwind reliably through preemption) which called klp to
unwind from the context of the task.
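
(For reference, the hook that this patch reverts looks roughly like the
below -- simplified, not the exact upstream code:)

	void __klp_sched_try_switch(void)
	{
		if (likely(!klp_patch_pending(current)))
			return;

		/*
		 * Called from cond_resched(), i.e. from a spot where the task
		 * is known not to be preempted, so it can safely unwind and
		 * transition itself.
		 */
		preempt_disable();
		klp_try_switch_task(current);
		preempt_enable();
	}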

Without something to hook into, we have a problem.  We could of course
hook into schedule(), but if the kthread never calls schedule() from a
non-preempted context then it still doesn't help.

-- 
Josh

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
  2023-11-09 23:25     ` Dmitry Torokhov
  2023-11-09 23:41       ` Steven Rostedt
@ 2023-11-10  0:01       ` Ankur Arora
  1 sibling, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-10  0:01 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Oded Gabbay, Miguel Ojeda, Jens Axboe,
	Minchan Kim, Sergey Senozhatsky, Sudip Mukherjee,
	Theodore Ts'o, Jason A. Donenfeld, Amit Shah, Gonglei,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Davidlohr Bueso,
	Jonathan Cameron, Dave Jiang, Alison Schofield, Vishal Verma,
	Ira Weiny, Dan Williams, Sumit Semwal, Christian König,
	Andi Shyti, Ray Jui, Scott Branden, Chris Packham, Shawn Guo,
	Sascha Hauer, Junxian Huang, Will Deacon, Joerg Roedel,
	Mauro Carvalho Chehab, Srinivas Pandruvada, Hans de Goede,
	Ilpo Järvinen, Mark Gross, Finn Thain, Michael Schmitz,
	James E.J. Bottomley, Martin K. Petersen, Kashyap Desai,
	Sumit Saxena, Shivasharan S, Mark Brown, Neil Armstrong,
	Jens Wiklander, Alex Williamson, Helge Deller, David Hildenbrand


Dmitry Torokhov <dmitry.torokhov@gmail.com> writes:

> Hi Ankur,
>
> On Tue, Nov 07, 2023 at 03:08:21PM -0800, Ankur Arora wrote:
>> There are broadly three sets of uses of cond_resched():
>>
>> 1.  Calls to cond_resched() out of the goodness of our heart,
>>     otherwise known as avoiding lockup splats.
>
> ...
>
> What about RCU stalls? The calls to cond_resched() in evdev.c and
> mousedev.c were added specifically to allow RCU to run in cases when
> userspace passes a large buffer and the kernel is not fully preemptible.

Hi Dmitry

The short answer is that even if the kernel isn't fully preemptible, it
will always have preempt_count enabled, which means that RCU will always
know when a read-side critical section ends.

Long version: cond_resched_rcu() is defined as:

 static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
        rcu_read_unlock();
        cond_resched();
        rcu_read_lock();
 #endif
 }

So the relevant case is PREEMPT_RCU=n.

Now, currently PREEMPT_RCU=n also implies PREEMPT_COUNT=n, and so
rcu_read_lock()/_unlock() reduce to a barrier. That's why we need the
explicit cond_resched() there.


The reason we can remove the cond_resched() after patches 43 and 47 is
that rcu_read_lock()/_unlock() will modify the preempt count, and so RCU
will have visibility into when read-side critical sections finish.
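
Concretely, with PREEMPT_RCU=n but an always-on preempt count, the
read-side markers are just preempt counter operations (roughly):

	/* !CONFIG_PREEMPT_RCU variant, simplified */
	static inline void __rcu_read_lock(void)
	{
		preempt_disable();
	}

	static inline void __rcu_read_unlock(void)
	{
		preempt_enable();
	}

so the tick handler can look at preempt_count() to tell whether it
interrupted a read-side critical section.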

That said, this series in this form isn't really going anywhere in the
short-term so none of this is imminent.

As for the calls to cond_resched() themselves: if the kernel is fully
preemptible they are a NOP, and then the code would just be polling in a
tight loop.

Would it make sense to do something like this instead?

     if (!cond_resched())
        msleep()/usleep()/cpu_relax();
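
For instance, something like this (illustrative only -- device_ready()
stands in for whatever condition the loop is retrying on):

	while (!device_ready(dev)) {
		if (!cond_resched())
			usleep_range(50, 100);
	}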


--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-09 23:47             ` Josh Poimboeuf
@ 2023-11-10  0:46               ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-10  0:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ankur Arora, Steven Rostedt, linux-kernel, tglx, peterz,
	torvalds, paulmck, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	live-patching


Josh Poimboeuf <jpoimboe@kernel.org> writes:

> On Thu, Nov 09, 2023 at 02:50:48PM -0800, Ankur Arora wrote:
>> >> I guess I'm not fully understanding what the cond rescheds are for. But
>> >> would an IPI to all CPUs setting NEED_RESCHED, fix it?
>>
>> Yeah. We could just temporarily toggle to full preemption, when
>> NEED_RESCHED_LAZY is always upgraded to NEED_RESCHED which will
>> then send IPIs.
>>
>> > If all livepatch arches had the ORC unwinder, yes.
>> >
>> > The problem is that frame pointer (and similar) unwinders can't reliably
>> > unwind past an interrupt frame.
>>
>> Ah, I wonder if we could just disable the preempt_schedule_irq() path
>> temporarily? Hooking into schedule() alongside something like this:
>>
>> @@ -379,7 +379,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>>
>>  void irqentry_exit_cond_resched(void)
>>  {
>> -       if (!preempt_count()) {
>> +       if (klp_cond_resched_disable() && !preempt_count()) {
>>
>> The problem would be tasks that don't go through any preemptible
>> sections.
>
> Let me back up a bit and explain what klp is trying to do.
>
> When a livepatch is applied, klp needs to unwind all the tasks,
> preferably within a reasonable amount of time.
>
> We can't unwind task A from task B while task A is running, since task A
> could be changing the stack during the unwind.  So task A needs to be
> blocked or asleep.  The only exception to that is if the unwind happens
> in the context of task A itself.

> The problem we were seeing was CPU-bound kthreads (e.g., vhost_worker)
> not getting patched within a reasonable amount of time.  We fixed it by
> hooking the klp unwind into cond_resched() so it can unwind from the
> task itself.

Right, so the task calls schedule() itself via cond_resched() and that
works. If the task schedules out by calling preempt_enable() that
presumably works as well.

So, that leaves two paths where we can't unwind:

 1. a task that never enters or leaves preemptible sections
 2. a task that gets preempted in irqentry_exit_cond_resched()
    (this path we could disable temporarily)

> It only worked because we had a non-preempted hook (because non-ORC
> unwinders can't unwind reliably through preemption) which called klp to
> unwind from the context of the task.
>
> Without something to hook into, we have a problem.  We could of course
> hook into schedule(), but if the kthread never calls schedule() from a
> non-preempted context then it still doesn't help.

Yeah, agreed. The first one is a problem, and it's a problem with the
removal of cond_resched() generally, because the way to fix case 1 was
typically to add a cond_resched() when softlockups were seen or during
code review.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
  2023-11-09 17:51         ` Josh Poimboeuf
  2023-11-09 22:50           ` Ankur Arora
@ 2023-11-10  0:56           ` Steven Rostedt
  1 sibling, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-10  0:56 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	live-patching

On Thu, 9 Nov 2023 09:51:18 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> > I guess I'm not fully understanding what the cond rescheds are for. But
> > would an IPI to all CPUs setting NEED_RESCHED, fix it?  
> 
> If all livepatch arches had the ORC unwinder, yes.
> 
> The problem is that frame pointer (and similar) unwinders can't reliably
> unwind past an interrupt frame.

Perhaps we can use this to push those archs with bad unwinders to port over
ORC unwinding ;-)

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 66/86] treewide: kernel: remove cond_resched()
  2023-11-07 23:08   ` [RFC PATCH 66/86] treewide: kernel: " Ankur Arora
@ 2023-11-17 18:14     ` Luis Chamberlain
  2023-11-17 19:51       ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Luis Chamberlain @ 2023-11-17 18:14 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Tejun Heo, Zefan Li, Johannes Weiner,
	Peter Oberparleiter, Eric Biederman, Will Deacon, Oleg Nesterov

On Tue, Nov 07, 2023 at 03:08:02PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> All of these are from set-1 except for the retry loops in
> task_function_call() or the mutex testing logic.
> 
> Replace these with cond_resched_stall(). The others can be removed.
> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Tejun Heo <tj@kernel.org> 
> Cc: Zefan Li <lizefan.x@bytedance.com> 
> Cc: Johannes Weiner <hannes@cmpxchg.org> 
> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> 
> Cc: Eric Biederman <ebiederm@xmission.com> 
> Cc: Will Deacon <will@kernel.org> 
> Cc: Luis Chamberlain <mcgrof@kernel.org> 
> Cc: Oleg Nesterov <oleg@redhat.com> 
> Cc: Juri Lelli <juri.lelli@redhat.com> 
> Cc: Vincent Guittot <vincent.guittot@linaro.org> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Sounds like the sort of test which should be put into linux-next to get
test coverage right away. To see what really blows up.

 Luis

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 66/86] treewide: kernel: remove cond_resched()
  2023-11-17 18:14     ` Luis Chamberlain
@ 2023-11-17 19:51       ` Steven Rostedt
  0 siblings, 0 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-11-17 19:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, paulmck,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Tejun Heo, Zefan Li, Johannes Weiner,
	Peter Oberparleiter, Eric Biederman, Will Deacon, Oleg Nesterov

On Fri, 17 Nov 2023 10:14:33 -0800
Luis Chamberlain <mcgrof@kernel.org> wrote:

> Sounds like the sort of test which should be put into linux-next to get
> test coverage right away. To see what really blows up.

No, it shouldn't be added this early in the development. It depends on
the first part of this patch series, which needs to be finished first.

Ankur just gave his full vision of the RFC. You can ignore the removal of
the cond_resched() for the time being.

Thanks for looking at it Luis!

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-08  0:27   ` Steven Rostedt
@ 2023-11-21  0:28     ` Paul E. McKenney
  2023-11-21  3:43       ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  0:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
> On Tue,  7 Nov 2023 13:57:33 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> > With PREEMPTION being always-on, some configurations might prefer
> > the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> > as compared to PREEMPT_RCU=y.
> > 
> > So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> > enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> > 
> > Note that the preemption model can be changed at runtime (modulo
> > configurations with ARCH_NO_PREEMPT), but the RCU configuration
> > is statically compiled.
> 
> I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
> when PREEMPT=y?

You mean independent of this series?  If so, I am not all that excited
about allowing a new option due to the effect on testing.  With this full
series, the number of test scenarios is preserved.

Actually, that is not exactly true, is it?  It would be if we instead had
something like this:

config PREEMPT_RCU
	bool
	default y if PREEMPT || PREEMPT_RT
	depends on !PREEMPT_NONE && !PREEMPT_VOLUNTARY
	select TREE_RCU

Any reason why this would be a problem?

Or to put it another way, do you know of anyone who really wants
a preemptible kernel (CONFIG_PREEMPT=y, CONFIG_PREEMPT_NONE=n
and CONFIG_PREEMPT_VOLUNTARY=n) but also non-preemptible RCU
(CONFIG_PREEMPT_RCU=n)?  If so, why?  I am having some difficulty seeing
how this combination could be at all helpful.  And if it is not helpful,
we should not allow people to shoot themselves in the foot with it.

> This could allow us to test this without this having to be part of this
> series.

OK, if you mean for testing purposes but not to go to mainline without
the rest of the series, I am good with that idea.

And thank you to Ankur for preserving non-preemptible RCU for those of us
using systems that are adequately but not generously endowed with memory!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-07 21:57 ` [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n Ankur Arora
@ 2023-11-21  0:38   ` Paul E. McKenney
  2023-11-21  3:26     ` Ankur Arora
  2023-11-28 17:04     ` Thomas Gleixner
  2023-11-21  3:55   ` Z qiang
  1 sibling, 2 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  0:38 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> cond_resched() is used to provide urgent quiescent states for
> read-side critical sections on PREEMPT_RCU=n configurations.
> This was necessary because lacking preempt_count, there was no
> way for the tick handler to know if we were executing in RCU
> read-side critical section or not.
> 
> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> reliably report quiescent states.
> 
> Accordingly, evaluate preempt_count() based quiescence in
> rcu_flavor_sched_clock_irq().
> 
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/rcu/tree_plugin.h |  3 ++-
>  kernel/sched/core.c      | 15 +--------------
>  2 files changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index f87191e008ff..618f055f8028 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>   */
>  static void rcu_flavor_sched_clock_irq(int user)
>  {
> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {

This looks good.

>  		/*
>  		 * Get here if this CPU took its interrupt from user
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bf5df2b866df..15db5fb7acc7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
>  		preempt_schedule_common();
>  		return 1;
>  	}
> -	/*
> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> -	 * whether the current CPU is in an RCU read-side critical section,
> -	 * so the tick can report quiescent states even for CPUs looping
> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> -	 * processes executing in kernel context might never report an
> -	 * RCU quiescent state.  Therefore, the following code causes
> -	 * cond_resched() to report a quiescent state, but only when RCU
> -	 * is in urgent need of one.
> -	 */
> -#ifndef CONFIG_PREEMPT_RCU
> -	rcu_all_qs();
> -#endif

But...

Suppose we have a long-running loop in the kernel that regularly
enables preemption, but only momentarily.  Then the added
rcu_flavor_sched_clock_irq() check would almost always fail, making
for extremely long grace periods.  Or did I miss a change that causes
preempt_enable() to help RCU out?
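
(Schematically, the concern is a loop like the following, where the tick
nearly always samples with preemption disabled; the helpers here are
stand-ins:)

	while (more_work()) {
		preempt_disable();
		process_one_item();	/* bulk of each iteration */
		preempt_enable();	/* preemptible only for an instant */
	}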

							Thanx, Paul

>  	return 0;
>  }
>  EXPORT_SYMBOL(_cond_resched);
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 50/86] rcu: TASKS_RCU does not need to depend on PREEMPTION
  2023-11-07 21:57 ` [RFC PATCH 50/86] rcu: TASKS_RCU does not need to depend on PREEMPTION Ankur Arora
@ 2023-11-21  0:38   ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  0:38 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 07, 2023 at 01:57:36PM -0800, Ankur Arora wrote:
> With PREEMPTION being always enabled, we don't need TASKS_RCU
> to be explicitly conditioned on it.
> 
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  arch/Kconfig             | 4 ++--
>  include/linux/rcupdate.h | 4 ----
>  kernel/bpf/Kconfig       | 2 +-
>  kernel/trace/Kconfig     | 4 ++--
>  4 files changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 05ce60036ecc..f5179b24072c 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -55,7 +55,7 @@ config KPROBES
>  	depends on MODULES
>  	depends on HAVE_KPROBES
>  	select KALLSYMS
> -	select TASKS_RCU if PREEMPTION
> +	select TASKS_RCU
>  	help
>  	  Kprobes allows you to trap at almost any kernel address and
>  	  execute a callback function.  register_kprobe() establishes
> @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
>  config OPTPROBES
>  	def_bool y
>  	depends on KPROBES && HAVE_OPTPROBES
> -	select TASKS_RCU if PREEMPTION
> +	select TASKS_RCU
>  
>  config KPROBES_ON_FTRACE
>  	def_bool y
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 5e5f920ade90..7246ee602b0b 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -171,10 +171,6 @@ static inline void rcu_nocb_flush_deferred_wakeup(void) { }
>  	} while (0)
>  void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
>  void synchronize_rcu_tasks(void);
> -# else
> -# define rcu_tasks_classic_qs(t, preempt) do { } while (0)
> -# define call_rcu_tasks call_rcu
> -# define synchronize_rcu_tasks synchronize_rcu
>  # endif
>  
>  # ifdef CONFIG_TASKS_TRACE_RCU
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 6a906ff93006..e3231b28e2a0 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -27,7 +27,7 @@ config BPF_SYSCALL
>  	bool "Enable bpf() system call"
>  	select BPF
>  	select IRQ_WORK
> -	select TASKS_RCU if PREEMPTION
> +	select TASKS_RCU
>  	select TASKS_TRACE_RCU
>  	select BINARY_PRINTF
>  	select NET_SOCK_MSG if NET
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 61c541c36596..e090387b1c2d 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -163,7 +163,7 @@ config TRACING
>  	select BINARY_PRINTF
>  	select EVENT_TRACING
>  	select TRACE_CLOCK
> -	select TASKS_RCU if PREEMPTION
> +	select TASKS_RCU
>  
>  config GENERIC_TRACER
>  	bool
> @@ -204,7 +204,7 @@ config FUNCTION_TRACER
>  	select GENERIC_TRACER
>  	select CONTEXT_SWITCH_TRACER
>  	select GLOB
> -	select TASKS_RCU if PREEMPTION
> +	select TASKS_RCU
>  	select TASKS_RUDE_RCU
>  	help
>  	  Enable the kernel to trace every kernel function. This is done
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
  2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
                     ` (29 preceding siblings ...)
  2023-11-07 23:19   ` [RFC PATCH 57/86] coccinelle: script to " Julia Lawall
@ 2023-11-21  0:45   ` Paul E. McKenney
  2023-11-21  5:16     ` Ankur Arora
  30 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  0:45 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Julia Lawall,
	Nicolas Palix

On Tue, Nov 07, 2023 at 03:07:53PM -0800, Ankur Arora wrote:
> Rudimentary script to remove the straight-forward subset of
> cond_resched() and allies:
> 
> 1)  if (need_resched())
> 	  cond_resched()
> 
> 2)  expression*;
>     cond_resched();  /* or in the reverse order */
> 
> 3)  if (expression)
> 	statement
>     cond_resched();  /* or in the reverse order */
> 
> The last two patterns depend on the control flow level to ensure
> that the complex cond_resched() patterns (ex. conditioned ones)
> are left alone and we only pick up ones which are only minimally
> related to the neighbouring code.

This series looks to get rid of stall warnings for long in-kernel
preempt-enabled code paths, which is of course a very good thing.
But removing all of the cond_resched() calls can actually increase
scheduling latency compared to the current CONFIG_PREEMPT_NONE=y state,
correct?

If so, it would be good to take a measured approach.  For example, it
is clear that a loop that does a cond_resched() every (say) ten jiffies
can remove that cond_resched() without penalty, at least in kernels built
with either CONFIG_NO_HZ_FULL=n or CONFIG_PREEMPT=y.  But this is not so
clear for a loop that does a cond_resched() every (say) ten microseconds.

Or am I missing something here?

							Thanx, Paul

> Cc: Julia Lawall <Julia.Lawall@inria.fr>
> Cc: Nicolas Palix <nicolas.palix@imag.fr>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
> 
> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
> new file mode 100644
> index 000000000000..bf43768a8f8c
> --- /dev/null
> +++ b/scripts/coccinelle/api/cond_resched.cocci
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/// Remove naked cond_resched() statements
> +///
> +//# Remove cond_resched() statements when:
> +//#   - executing at the same control flow level as the previous or the
> +//#     next statement (this lets us avoid complicated conditionals in
> +//#     the neighbourhood.)
> +//#   - they are of the form "if (need_resched()) cond_resched()" which
> +//#     is always safe.
> +//#
> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
> +//# but might need to handle other comments alluding to rescheduling.
> +//#
> +virtual patch
> +virtual context
> +
> +@ r1 @
> +identifier r;
> +@@
> +
> +(
> + r = cond_resched();
> +|
> +-if (need_resched())
> +-	cond_resched();
> +)
> +
> +@ r2 @
> +expression E;
> +statement S,T;
> +@@
> +(
> + E;
> +|
> + if (E) S
> +|
> + if (E) S else T
> +|
> +)
> +-cond_resched();
> +
> +@ r3 @
> +expression E;
> +statement S,T;
> +@@
> +-cond_resched();
> +(
> + E;
> +|
> + if (E) S
> +|
> + if (E) S else T
> +)
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 59/86] treewide: rcu: remove cond_resched()
  2023-11-07 23:07   ` [RFC PATCH 59/86] treewide: rcu: " Ankur Arora
@ 2023-11-21  1:01     ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  1:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik,
	Frederic Weisbecker

On Tue, Nov 07, 2023 at 03:07:55PM -0800, Ankur Arora wrote:
> All the cond_resched() calls in the RCU interfaces here are to
> drive preemption once it has reported a potentially quiescent
> state, or to exit the grace period. With PREEMPTION=y that should
> happen implicitly.
> 
> So we can remove these.
> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: "Paul E. McKenney" <paulmck@kernel.org> 
> Cc: Frederic Weisbecker <frederic@kernel.org> 
> Cc: Ingo Molnar <mingo@redhat.com> 
> Cc: Peter Zijlstra <peterz@infradead.org> 
> Cc: Juri Lelli <juri.lelli@redhat.com> 
> Cc: Vincent Guittot <vincent.guittot@linaro.org> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  include/linux/rcupdate.h | 6 ++----
>  include/linux/sched.h    | 7 ++++++-
>  kernel/hung_task.c       | 6 +++---
>  kernel/rcu/tasks.h       | 5 +----
>  4 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 7246ee602b0b..58f8c7faaa52 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -238,14 +238,12 @@ static inline bool rcu_trace_implies_rcu_gp(void) { return true; }
>  /**
>   * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
>   *
> - * This macro resembles cond_resched(), except that it is defined to
> - * report potential quiescent states to RCU-tasks even if the cond_resched()
> - * machinery were to be shut off, as some advocate for PREEMPTION kernels.
> + * This macro resembles cond_resched(), in that it reports potential
> + * quiescent states to RCU-tasks.
>   */
>  #define cond_resched_tasks_rcu_qs() \
>  do { \
>  	rcu_tasks_qs(current, false); \
> -	cond_resched(); \

I am a bit nervous about dropping the cond_resched() in a few cases,
for example, the call from rcu_tasks_trace_pregp_step() only momentarily
enables interrupts.  This should be OK given a scheduling-clock interrupt,
except that nohz_full CPUs don't necessarily have these.  At least not
unless RCU happens to be in a grace period at the time.

>  } while (0)
>  
>  /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 199f8f7211f2..bae6eed534dd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2145,7 +2145,12 @@ static inline void cond_resched_rcu(void)
>  {
>  #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
>  	rcu_read_unlock();
> -	cond_resched();
> +
> +	/*
> +	 * Might reschedule here as we exit the RCU read-side
> +	 * critical section.
> +	 */
> +
>  	rcu_read_lock();

And here I am wondering if some of my nervousness about increased
grace-period latency due to removing cond_resched() might be addressed
by making preempt_enable() take over the help-RCU functionality currently
being provided by cond_resched()...

>  #endif
>  }
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 9a24574988d2..4bdfad08a2e8 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -153,8 +153,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>   * To avoid extending the RCU grace period for an unbounded amount of time,
>   * periodically exit the critical section and enter a new one.
>   *
> - * For preemptible RCU it is sufficient to call rcu_read_unlock in order
> - * to exit the grace period. For classic RCU, a reschedule is required.
> + * Under a preemptive kernel, or with preemptible RCU, it is sufficient to
> + * call rcu_read_unlock in order to exit the grace period.
>   */
>  static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
>  {
> @@ -163,7 +163,7 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
>  	get_task_struct(g);
>  	get_task_struct(t);
>  	rcu_read_unlock();
> -	cond_resched();
> +
>  	rcu_read_lock();
>  	can_cont = pid_alive(g) && pid_alive(t);
>  	put_task_struct(t);
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 8d65f7d576a3..fa1d9aa31b36 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -541,7 +541,6 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
>  		local_bh_disable();
>  		rhp->func(rhp);
>  		local_bh_enable();
> -		cond_resched();

...and by local_bh_enable().

						Thanx, Paul

>  	}
>  	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
>  	rcu_segcblist_add_len(&rtpcp->cblist, -len);
> @@ -974,10 +973,8 @@ static void check_all_holdout_tasks(struct list_head *hop,
>  {
>  	struct task_struct *t, *t1;
>  
> -	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list) {
> +	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list)
>  		check_holdout_task(t, needreport, firstreport);
> -		cond_resched();
> -	}
>  }
>  
>  /* Finish off the Tasks-RCU grace period. */
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 60/86] treewide: torture: remove cond_resched()
  2023-11-07 23:07   ` [RFC PATCH 60/86] treewide: torture: " Ankur Arora
@ 2023-11-21  1:02     ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  1:02 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik,
	Davidlohr Bueso, Josh Triplett, Frederic Weisbecker

On Tue, Nov 07, 2023 at 03:07:56PM -0800, Ankur Arora wrote:
> Some cases changed to cond_resched_stall() to avoid changing
> the behaviour of the test too drastically.
> 
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Given lazy preemption, I am OK with dropping the cond_resched()
invocations from the various torture tests.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/rcuscale.c   | 2 --
>  kernel/rcu/rcutorture.c | 8 ++++----
>  kernel/scftorture.c     | 1 -
>  kernel/torture.c        | 1 -
>  4 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
> index ffdb30495e3c..737620bbec83 100644
> --- a/kernel/rcu/rcuscale.c
> +++ b/kernel/rcu/rcuscale.c
> @@ -672,8 +672,6 @@ kfree_scale_thread(void *arg)
>  			else
>  				kfree_rcu(alloc_ptr, rh);
>  		}
> -
> -		cond_resched();
>  	} while (!torture_must_stop() && ++loop < kfree_loops);
>  
>  	if (atomic_inc_return(&n_kfree_scale_thread_ended) >= kfree_nrealthreads) {
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index ade42d6a9d9b..158d58710b51 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -81,7 +81,7 @@ torture_param(int, fqs_stutter, 3, "Wait time between fqs bursts (s)");
>  torture_param(int, fwd_progress, 1, "Number of grace-period forward progress tasks (0 to disable)");
>  torture_param(int, fwd_progress_div, 4, "Fraction of CPU stall to wait");
>  torture_param(int, fwd_progress_holdoff, 60, "Time between forward-progress tests (s)");
> -torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched() behind need_resched()");
> +torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched_stall() behind need_resched()");
>  torture_param(bool, gp_cond, false, "Use conditional/async GP wait primitives");
>  torture_param(bool, gp_cond_exp, false, "Use conditional/async expedited GP wait primitives");
>  torture_param(bool, gp_cond_full, false, "Use conditional/async full-state GP wait primitives");
> @@ -2611,7 +2611,7 @@ static void rcu_torture_fwd_prog_cond_resched(unsigned long iter)
>  		return;
>  	}
>  	// No userspace emulation: CB invocation throttles call_rcu()
> -	cond_resched();
> +	cond_resched_stall();
>  }
>  
>  /*
> @@ -2691,7 +2691,7 @@ static void rcu_torture_fwd_prog_nr(struct rcu_fwd *rfp,
>  		udelay(10);
>  		cur_ops->readunlock(idx);
>  		if (!fwd_progress_need_resched || need_resched())
> -			cond_resched();
> +			cond_resched_stall();
>  	}
>  	(*tested_tries)++;
>  	if (!time_before(jiffies, stopat) &&
> @@ -3232,7 +3232,7 @@ static int rcu_torture_read_exit(void *unused)
>  				errexit = true;
>  				break;
>  			}
> -			cond_resched();
> +			cond_resched_stall();
>  			kthread_stop(tsp);
>  			n_read_exits++;
>  		}
> diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> index 59032aaccd18..24192fe01125 100644
> --- a/kernel/scftorture.c
> +++ b/kernel/scftorture.c
> @@ -487,7 +487,6 @@ static int scftorture_invoker(void *arg)
>  			set_cpus_allowed_ptr(current, cpumask_of(cpu));
>  			was_offline = false;
>  		}
> -		cond_resched();
>  		stutter_wait("scftorture_invoker");
>  	} while (!torture_must_stop());
>  
> diff --git a/kernel/torture.c b/kernel/torture.c
> index b28b05bbef02..0c0224c76275 100644
> --- a/kernel/torture.c
> +++ b/kernel/torture.c
> @@ -747,7 +747,6 @@ bool stutter_wait(const char *title)
>  			while (READ_ONCE(stutter_pause_test)) {
>  				if (!(i++ & 0xffff))
>  					torture_hrtimeout_us(10, 0, NULL);
> -				cond_resched();
>  			}
>  		} else {
>  			torture_hrtimeout_jiffies(round_jiffies_relative(HZ), NULL);
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21  0:38   ` Paul E. McKenney
@ 2023-11-21  3:26     ` Ankur Arora
  2023-11-21  5:17       ` Paul E. McKenney
  2023-11-28 17:04     ` Thomas Gleixner
  1 sibling, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  3:26 UTC (permalink / raw)
  To: paulmck
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
>> cond_resched() is used to provide urgent quiescent states for
>> read-side critical sections on PREEMPT_RCU=n configurations.
>> This was necessary because lacking preempt_count, there was no
>> way for the tick handler to know if we were executing in RCU
>> read-side critical section or not.
>>
>> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
>> reliably report quiescent states.
>>
>> Accordingly, evaluate preempt_count() based quiescence in
>> rcu_flavor_sched_clock_irq().
>>
>> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/rcu/tree_plugin.h |  3 ++-
>>  kernel/sched/core.c      | 15 +--------------
>>  2 files changed, 3 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index f87191e008ff..618f055f8028 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>>   */
>>  static void rcu_flavor_sched_clock_irq(int user)
>>  {
>> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
>> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
>> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
>
> This looks good.
>
>>  		/*
>>  		 * Get here if this CPU took its interrupt from user
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index bf5df2b866df..15db5fb7acc7 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
>>  		preempt_schedule_common();
>>  		return 1;
>>  	}
>> -	/*
>> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
>> -	 * whether the current CPU is in an RCU read-side critical section,
>> -	 * so the tick can report quiescent states even for CPUs looping
>> -	 * in kernel context.  In contrast, in non-preemptible kernels,
>> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
>> -	 * processes executing in kernel context might never report an
>> -	 * RCU quiescent state.  Therefore, the following code causes
>> -	 * cond_resched() to report a quiescent state, but only when RCU
>> -	 * is in urgent need of one.
>> -	 *      /
>> -#ifndef CONFIG_PREEMPT_RCU
>> -	rcu_all_qs();
>> -#endif
>
> But...
>
> Suppose we have a long-running loop in the kernel that regularly
> enables preemption, but only momentarily.  Then the added
> rcu_flavor_sched_clock_irq() check would almost always fail, making
> for extremely long grace periods.

So, my thinking was that if RCU wants to end a grace period, it would
force a context switch by setting TIF_NEED_RESCHED (and, as patch 38
mentions, RCU always uses the eager version), causing __schedule() to call
rcu_note_context_switch().
That's similar to the preempt_schedule_common() case in the
_cond_resched() above.

But I see your point: RCU might just want to register a quiescent state,
and for this long-running loop rcu_flavor_sched_clock_irq() does seem to
fall down.

> Or did I miss a change that causes preempt_enable() to help RCU out?

Something like this?

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index dc5125b9c36b..e50f358f1548 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -222,6 +222,8 @@ do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
                __preempt_schedule(); \
+       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
+               rcu_all_qs(); \
 } while (0)

Though I do wonder about the likelihood of hitting the case you describe.
Maybe, instead of adding the check to every preempt_enable(), it would be
better to force a context switch from rcu_flavor_sched_clock_irq() (as we
do in the PREEMPT_RCU=y case).
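
(Along the lines of the PREEMPT_RCU=y variant, something like the
following -- illustrative only, the rcu_urgent_qs test is just a sketch:)

	static void rcu_flavor_sched_clock_irq(int user)
	{
		if (user || rcu_is_cpu_rrupt_from_idle() ||
		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
			rcu_qs();
		} else if (__this_cpu_read(rcu_data.rcu_urgent_qs)) {
			/* in a read-side section: ask for a reschedule */
			set_tsk_need_resched(current);
			set_preempt_need_resched();
		}
	}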

Thanks

--
ankur

^ permalink raw reply related	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-21  0:28     ` Paul E. McKenney
@ 2023-11-21  3:43       ` Steven Rostedt
  2023-11-21  5:04         ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-21  3:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Mon, 20 Nov 2023 16:28:50 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
> > On Tue,  7 Nov 2023 13:57:33 -0800
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >   
> > > With PREEMPTION being always-on, some configurations might prefer
> > > the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> > > as compared to PREEMPT_RCU=y.
> > > 
> > > So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> > > enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> > > 
> > > Note that the preemption model can be changed at runtime (modulo
> > > configurations with ARCH_NO_PREEMPT), but the RCU configuration
> > > is statically compiled.  
> > 
> > I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
> > when PREEMPT=y?  
> 
> You mean independent of this series?  If so, I am not all that excited
> about allowing a new option due to the effect on testing.  With this full
> series, the number of test scenarios is preserved.
> 
> Actually, that is not exactly true, is it?  It would be if we instead had
> something like this:
> 
> config PREEMPT_RCU
> 	bool
> 	default y if PREEMPT || PREEMPT_RT
> 	depends on !PREEMPT_NONE && !PREEMPT_VOLUNTARY
> 	select TREE_RCU
> 
> Any reason why this would be a problem?

Yes, because with this series, there isn't going to be PREEMPT_NONE,
PREEMPT_VOLUNTARY and PREEMPT as a config option. I mean, you could define
the preference you want at boot up. But it could change at run time.

> 
> Or to put it another way, do you know of anyone who really wants
> a preemptible kernel (CONFIG_PREEMPT=y, CONFIG_PREEMPT_NONE=n
> and CONFIG_PREEMPT_VOLUNTARY=n) but also non-preemptible RCU
> (CONFIG_PREEMPT_RCU=n)?  If so, why?  I am having some difficulty seeing
> how this combination could be at all helpful.  And if it is not helpful,
> we should not allow people to shoot themselves in the foot with it.

With the new preemption model, NONE, VOLUNTARY and PREEMPT are now going to
determine when NEED_RESCHED is set as opposed to NEED_RESCHED_LAZY.
NEED_RESCHED_LAZY only schedules at the kernel/user space transition,
whereas NEED_RESCHED will schedule whenever possible (outside
preempt-disabled sections).

 Key: L - NEED_RESCHED_LAZY - schedule only at kernel/user boundary
      N - NEED_RESCHED - schedule whenever possible (like PREEMPT does today)

			SCHED_OTHER	REAL-TIME/DL
			  Schedule	  Schedule

NONE:			      L		     L

VOLUNTARY:		      L		     N

PREEMPT:		      N		     N


So on NONE, NEED_RESCHED_LAZY is set only on scheduling SCHED_OTHER and RT.
Which means, it will not schedule until it goes into user space (*).

On VOLUNTARY, NEED_RESCHED is set on RT/DL tasks, and LAZY on SCHED_OTHER.
So that RT and DL get scheduled just like PREEMPT does today.

On PREEMPT, NEED_RESCHED is always set on all scheduling.

(*) - caveat - After the next tick, if NEED_RESCHED_LAZY is set, then
NEED_RESCHED will be set and the kernel will schedule at the next available
moment; this is true for all three models!
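
A sketch of the bit selection described by the table above (the function
and enum names here are placeholders, not the actual code):

	static void resched_curr_policy(struct rq *rq, bool rt_or_dl)
	{
		switch (preempt_model) {		/* none/voluntary/full */
		case PREEMPT_MODEL_NONE:
			resched_curr_lazy(rq);		/* always L */
			break;
		case PREEMPT_MODEL_VOLUNTARY:
			if (rt_or_dl)
				resched_curr(rq);	/* N for RT/DL */
			else
				resched_curr_lazy(rq);	/* L for SCHED_OTHER */
			break;
		case PREEMPT_MODEL_FULL:
			resched_curr(rq);		/* always N */
			break;
		}
	}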

There may be more details to work out, but the above is basically the gist
of the idea. Now, what do you want to do with RCU_PREEMPT? At run time, we
can go from NONE to PREEMPT full! But there may be use cases that do not
want the overhead of always having RCU_PREEMPT, and will want RCU to be a
preempt_disable() section no matter what.

Unless we can switch between RCU_PREEMPT and !RCU_PREEMPT at run time, the
dependency on RCU_PREEMPT tied to PREEMPT doesn't make sense anymore.

> 
> > This could allow us to test this without this having to be part of this
> > series.  
> 
> OK, if you mean for testing purposes but not to go to mainline without
> the rest of the series, I am good with that idea.
> 
> And thank you to Ankur for preserving non-preemptible RCU for those of us
> using system that are adequately but not generously endowed with memory!

Exactly. It sounds like having non-preempt RCU is agnostic to the
preemption model of the system, which is why I think we need to make them
disjoint.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-07 21:57 ` [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n Ankur Arora
  2023-11-21  0:38   ` Paul E. McKenney
@ 2023-11-21  3:55   ` Z qiang
  1 sibling, 0 replies; 250+ messages in thread
From: Z qiang @ 2023-11-21  3:55 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, paulmck, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

>
> cond_resched() is used to provide urgent quiescent states for
> read-side critical sections on PREEMPT_RCU=n configurations.
> This was necessary because lacking preempt_count, there was no
> way for the tick handler to know if we were executing in RCU
> read-side critical section or not.
>
> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> reliably report quiescent states.
>
> Accordingly, evaluate preempt_count() based quiescence in
> rcu_flavor_sched_clock_irq().
>
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/rcu/tree_plugin.h |  3 ++-
>  kernel/sched/core.c      | 15 +--------------
>  2 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index f87191e008ff..618f055f8028 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>   */
>  static void rcu_flavor_sched_clock_irq(int user)
>  {
> -       if (user || rcu_is_cpu_rrupt_from_idle()) {
> +       if (user || rcu_is_cpu_rrupt_from_idle() ||
> +           !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {


should ensure CONFIG_PREEMPT_COUNT=y:
(IS_ENABLED(CONFIG_PREEMPT_COUNT) && !(preempt_count() &
 (PREEMPT_MASK | SOFTIRQ_MASK)))

Thanks
Zqiang


>
>                 /*
>                  * Get here if this CPU took its interrupt from user
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bf5df2b866df..15db5fb7acc7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
>                 preempt_schedule_common();
>                 return 1;
>         }
> -       /*
> -        * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> -        * whether the current CPU is in an RCU read-side critical section,
> -        * so the tick can report quiescent states even for CPUs looping
> -        * in kernel context.  In contrast, in non-preemptible kernels,
> -        * RCU readers leave no in-memory hints, which means that CPU-bound
> -        * processes executing in kernel context might never report an
> -        * RCU quiescent state.  Therefore, the following code causes
> -        * cond_resched() to report a quiescent state, but only when RCU
> -        * is in urgent need of one.
> -        */
> -#ifndef CONFIG_PREEMPT_RCU
> -       rcu_all_qs();
> -#endif
> +
>         return 0;
>  }
>  EXPORT_SYMBOL(_cond_resched);
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-21  3:43       ` Steven Rostedt
@ 2023-11-21  5:04         ` Paul E. McKenney
  2023-11-21  5:39           ` Ankur Arora
  2023-11-21 15:00           ` Steven Rostedt
  0 siblings, 2 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  5:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Mon, Nov 20, 2023 at 10:43:56PM -0500, Steven Rostedt wrote:
> On Mon, 20 Nov 2023 16:28:50 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
> > > On Tue,  7 Nov 2023 13:57:33 -0800
> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > >   
> > > > With PREEMPTION being always-on, some configurations might prefer
> > > > the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> > > > as compared to PREEMPT_RCU=y.
> > > > 
> > > > So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> > > > enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> > > > 
> > > > Note that the preemption model can be changed at runtime (modulo
> > > > configurations with ARCH_NO_PREEMPT), but the RCU configuration
> > > > is statically compiled.  
> > > 
> > > I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
> > > when PREEMPT=y?  
> > 
> > You mean independent of this series?  If so, I am not all that excited
> > about allowing a new option due to the effect on testing.  With this full
> > series, the number of test scenarios is preserved.
> > 
> > Actually, that is not exactly true, is it?  It would be if we instead had
> > something like this:
> > 
> > config PREEMPT_RCU
> > 	bool
> > 	default y if PREEMPT || PREEMPT_RT
> > 	depends on !PREEMPT_NONE && !PREEMPT_VOLUNTARY
> > 	select TREE_RCU
> > 
> > Any reason why this would be a problem?
> 
> Yes, because with this series, there aren't going to be PREEMPT_NONE,
> PREEMPT_VOLUNTARY and PREEMPT as config options. I mean, you could define
> the preference you want at boot up. But it could change at run time.

I applied the series, and there was still a PREEMPT_NONE.  Some might
consider the name to be a bit misleading, perhaps, but it was still there.

Ah, I missed patch 30/86.  The idea is to make CONFIG_PREEMPT_DYNAMIC
unconditional?  Why?  

> > Or to put it another way, do you know of anyone who really wants
> > a preemptible kernel (CONFIG_PREEMPT=y, CONFIG_PREEMPT_NONE=n
> > and CONFIG_PREEMPT_VOLUNTARY=n) but also non-preemptible RCU
> > (CONFIG_PREEMPT_RCU=y)?  If so, why?  I am having some difficulty seeing
> > how this combination could be at all helpful.  And if it is not helpful,
> > we should not allow people to shoot themselves in the foot with it.
> 
> With the new preemption model, NONE, VOLUNTARY and PREEMPT are now going to
> determine when NEED_RESCHED is set as opposed to NEED_RESCHED_LAZY.
> NEED_RESCHED_LAZY only schedules at the kernel / user space transition, while
> NEED_RESCHED will schedule when possible (when not in a preempt-disabled
> section).

So NONE really is still supposed to be there.  ;-)

>  Key: L - NEED_RESCHED_LAZY - schedule only at kernel/user boundary
>       N - NEED_RESCHED - schedule whenever possible (like PREEMPT does today)
> 
> 			SCHED_OTHER	REAL-TIME/DL
> 			  Schedule	  Schedule
> 
> NONE:			      L		     L
> 
> VOLUNTARY:		      L		     N
> 
> PREEMPT:		      N		     N
> 
> 
> So on NONE, NEED_RESCHED_LAZY is set only on scheduling SCHED_OTHER and RT.
> Which means, it will not schedule until it goes into user space (*).
> 
> On VOLUNTARY, NEED_RESCHED is set on RT/DL tasks, and LAZY on SCHED_OTHER.
> So that RT and DL get scheduled just like PREEMPT does today.
> 
> On PREEMPT, NEED_RESCHED is always set on all scheduling.
> 
> (*) - caveat - After the next tick, if NEED_RESCHED_LAZY is set, then
> NEED_RESCHED will be set and the kernel will schedule at the next available
> moment, this is true for all three models!

OK, so I see that this is now a SCHED_FEAT, and is initialized based
on CONFIG_PREEMPT_* in kernel/sched/features.h.  Huh.  OK, we can still
control this at build time, which is fine.  I don't see how to set it
at boot time, only at build time or from debugfs.  I will let those who
want to set this at boot time complain, should they choose to do so.

> There may be more details to work out, but the above is basically the gist
> of the idea. Now, what do you want to do with RCU_PREEMPT? At run time, we
> can go from NONE to PREEMPT full! But there may be use cases that do not
> want the overhead of always having RCU_PREEMPT, and will want RCU to be a
> preempt_disable() section no matter what.

Understood, actually.  And as noted in other replies, I am a bit concerned
about added latencies from too aggressively removing cond_resched().

More testing required.

> Unless we can switch between RCU_PREEMPT and !RCU_PREEMPT at run time, the
> dependency on RCU_PREEMPT tied to PREEMPT doesn't make sense anymore.

I strongly recommend against runtime switching of RCU's preemptibility,
just in case you were wondering.  ;-)

My question is different.

Would anyone want PREEMPT (N N above) in combination with non-preemptible
RCU?  I cannot see why anyone would want this.

> > > This could allow us to test this without this having to be part of this
> > > series.  
> > 
> > OK, if you mean for testing purposes but not to go to mainline without
> > the rest of the series, I am good with that idea.
> > 
> > And thank you to Ankur for preserving non-preemptible RCU for those of us
> > using system that are adequately but not generously endowed with memory!
> 
> Exactly. It sounds like having non-preempt RCU is agnostic to the
> preemption model of the system, which is why I think we need to make them
> disjoint.

How about like this, where "Y" means allowed and "N" means not allowed:

			Non-Preemptible RCU	Preemptible RCU

NONE:				Y			Y

VOLUNTARY:			Y			Y

PREEMPT:			N			Y

PREEMPT_RT:			N			Y


We need preemptible RCU for NONE and VOLUNTARY, as you say,
to allow CONFIG_PREEMPT_DYNAMIC to continue to work.  (OK, OK,
CONFIG_PREEMPT_DYNAMIC is no longer, but appears to be unconditional.)
But again, I don't see why anyone would want (much less need)
non-preemptible RCU in the PREEMPT and PREEMPT_RT cases.  And if it is
neither wanted nor needed, there is no point in enabling it, much less
testing it.
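
In Kconfig terms, that table would be roughly the following (just a
sketch, not a tested patch, and the prompt text is made up):

config PREEMPT_RCU
	# Non-preemptible RCU remains selectable only for NONE/VOLUNTARY.
	bool "Preemptible RCU" if PREEMPT_NONE || PREEMPT_VOLUNTARY
	default y
	select TREE_RCU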

Or am I missing a use case in there somewhere?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
  2023-11-21  0:45   ` Paul E. McKenney
@ 2023-11-21  5:16     ` Ankur Arora
  2023-11-21 15:26       ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  5:16 UTC (permalink / raw)
  To: paulmck
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik, Julia Lawall, Nicolas Palix


Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Nov 07, 2023 at 03:07:53PM -0800, Ankur Arora wrote:
>> Rudimentary script to remove the straight-forward subset of
>> cond_resched() and allies:
>>
>> 1)  if (need_resched())
>> 	  cond_resched()
>>
>> 2)  expression*;
>>     cond_resched();  /* or in the reverse order */
>>
>> 3)  if (expression)
>> 	statement
>>     cond_resched();  /* or in the reverse order */
>>
>> The last two patterns depend on the control flow level to ensure
>> that the complex cond_resched() patterns (ex. conditioned ones)
>> are left alone and we only pick up ones which are minimally
>> related to the neighbouring code.
>
> This series looks to get rid of stall warnings for long in-kernel
> preempt-enabled code paths, which is of course a very good thing.
> But removing all of the cond_resched() calls can actually increase
> scheduling latency compared to the current CONFIG_PREEMPT_NONE=y state,
> correct?

Not necessarily.

If TIF_NEED_RESCHED_LAZY is set, then we let the current task finish
before preempting. If that task runs for arbitrarily long (what Thomas
calls the hog problem), we currently allow it to run for up to one
extra tick (a limit which might shrink, or become a tunable.)
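
A minimal sketch of that tick-time promotion (simplified from the
update_deadline() change in the scheduler patches; the helpers are the
series', the function name here is made up):

	/* Invoked from the tick for the currently running task. */
	static void promote_lazy_resched(struct rq *rq)
	{
		/*
		 * The task already ignored a lazy request for a full
		 * tick: escalate to an immediate reschedule.
		 */
		if (test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
			__resched_curr(rq, RESCHED_eager);
		else
			resched_curr(rq);	/* may stay lazy */
	}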

If TIF_NEED_RESCHED is set, then it gets folded the same way it does now
and preemption happens at the next safe preemption point.

So, I guess the scheduling latency would always be bounded but how much
latency a task would incur would be scheduler policy dependent.

This is early days, so the policy (or really the rest of it) isn't set
in stone but having two levels of preemption -- immediate and
deferred -- does seem to give the scheduler greater freedom of policy.

Btw, are you concerned about the scheduling latencies in general or the
scheduling latency of a particular set of tasks?

> If so, it would be good to take a measured approach.  For example, it
> is clear that a loop that does a cond_resched() every (say) ten jiffies
> can remove that cond_resched() without penalty, at least in kernels built
> with either CONFIG_NO_HZ_FULL=n or CONFIG_PREEMPT=y.  But this is not so
> clear for a loop that does a cond_resched() every (say) ten microseconds.

True. Though both of those loops sound bad :).

Yeah, and as we were discussing offlist, the question is the comparative
density of points where preempt_dec_and_test() is true vs calls to
cond_resched().

And if they are similar, then we could replace the quiescent-state
reporting in cond_resched() with reporting in preempt_enable() (as you
mention elsewhere in the thread.)


Thanks

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21  3:26     ` Ankur Arora
@ 2023-11-21  5:17       ` Paul E. McKenney
  2023-11-21  5:34         ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  5:17 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> >> cond_resched() is used to provide urgent quiescent states for
> >> read-side critical sections on PREEMPT_RCU=n configurations.
> >> This was necessary because lacking preempt_count, there was no
> >> way for the tick handler to know if we were executing in RCU
> >> read-side critical section or not.
> >>
> >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> >> reliably report quiescent states.
> >>
> >> Accordingly, evaluate preempt_count() based quiescence in
> >> rcu_flavor_sched_clock_irq().
> >>
> >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >>  kernel/rcu/tree_plugin.h |  3 ++-
> >>  kernel/sched/core.c      | 15 +--------------
> >>  2 files changed, 3 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> >> index f87191e008ff..618f055f8028 100644
> >> --- a/kernel/rcu/tree_plugin.h
> >> +++ b/kernel/rcu/tree_plugin.h
> >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> >>   */
> >>  static void rcu_flavor_sched_clock_irq(int user)
> >>  {
> >> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> >> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> >> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> >
> > This looks good.
> >
> >>  		/*
> >>  		 * Get here if this CPU took its interrupt from user
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index bf5df2b866df..15db5fb7acc7 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> >>  		preempt_schedule_common();
> >>  		return 1;
> >>  	}
> >> -	/*
> >> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> >> -	 * whether the current CPU is in an RCU read-side critical section,
> >> -	 * so the tick can report quiescent states even for CPUs looping
> >> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> >> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> >> -	 * processes executing in kernel context might never report an
> >> -	 * RCU quiescent state.  Therefore, the following code causes
> >> -	 * cond_resched() to report a quiescent state, but only when RCU
> >> -	 * is in urgent need of one.
> >> -	 */
> >> -#ifndef CONFIG_PREEMPT_RCU
> >> -	rcu_all_qs();
> >> -#endif
> >
> > But...
> >
> > Suppose we have a long-running loop in the kernel that regularly
> > enables preemption, but only momentarily.  Then the added
> > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > for extremely long grace periods.
> 
> So, my thinking was that if RCU wants to end a grace period, it would
> force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> RCU always uses the eager version) causing __schedule() to call
> rcu_note_context_switch().
> That's similar to the preempt_schedule_common() case in the
> _cond_resched() above.

But that requires IPIing that CPU, correct?

> But if I see your point, RCU might just want to register a quiescent
> state and for this long-running loop rcu_flavor_sched_clock_irq() does
> seem to fall down.
> 
> > Or did I miss a change that causes preempt_enable() to help RCU out?
> 
> Something like this?
> 
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index dc5125b9c36b..e50f358f1548 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -222,6 +222,8 @@ do { \
>         barrier(); \
>         if (unlikely(preempt_count_dec_and_test())) \
>                 __preempt_schedule(); \
> +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> +               rcu_all_qs(); \
>  } while (0)

Or maybe something like this to lighten the load a bit:

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
			rcu_all_qs(); \
	} \
} while (0)

And at that point, we should be able to drop the PREEMPT_MASK, not
that it makes any difference that I am aware of:

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
		    !(preempt_count() & SOFTIRQ_MASK)) \
			rcu_all_qs(); \
	} \
} while (0)

Except that we can migrate as soon as that preempt_count_dec_and_test()
returns.  And that rcu_all_qs() disables and re-enables preemption,
which will result in undesired recursion.  Sigh.

So maybe something like this:

#define preempt_enable() \
do { \
	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
	    !(preempt_count() & SOFTIRQ_MASK)) \
		rcu_all_qs(); \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
	} \
} while (0)

Then rcu_all_qs() becomes something like this:

void rcu_all_qs(void)
{
	unsigned long flags;

	/* Load rcu_urgent_qs before other flags. */
	if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
		return;
	this_cpu_write(rcu_data.rcu_urgent_qs, false);
	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
		local_irq_save(flags);
		rcu_momentary_dyntick_idle();
		local_irq_restore(flags);
	}
	rcu_qs();
}
EXPORT_SYMBOL_GPL(rcu_all_qs);

> Though I do wonder about the likelihood of hitting the case you describe
> and maybe instead of adding the check on every preempt_enable()
> it might be better to instead force a context switch in the
> rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)

Maybe.  But rcu_all_qs() is way lighter weight than a context switch.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21  5:17       ` Paul E. McKenney
@ 2023-11-21  5:34         ` Paul E. McKenney
  2023-11-21  6:13           ` Z qiang
  2023-11-21 19:25           ` Paul E. McKenney
  0 siblings, 2 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21  5:34 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > 
> > Paul E. McKenney <paulmck@kernel.org> writes:
> > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > >> cond_resched() is used to provide urgent quiescent states for
> > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > >> This was necessary because lacking preempt_count, there was no
> > >> way for the tick handler to know if we were executing in RCU
> > >> read-side critical section or not.
> > >>
> > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > >> reliably report quiescent states.
> > >>
> > >> Accordingly, evaluate preempt_count() based quiescence in
> > >> rcu_flavor_sched_clock_irq().
> > >>
> > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > >> ---
> > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > >>  kernel/sched/core.c      | 15 +--------------
> > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > >>
> > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > >> index f87191e008ff..618f055f8028 100644
> > >> --- a/kernel/rcu/tree_plugin.h
> > >> +++ b/kernel/rcu/tree_plugin.h
> > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > >>   */
> > >>  static void rcu_flavor_sched_clock_irq(int user)
> > >>  {
> > >> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> > >> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> > >> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > >
> > > This looks good.
> > >
> > >>  		/*
> > >>  		 * Get here if this CPU took its interrupt from user
> > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > >> index bf5df2b866df..15db5fb7acc7 100644
> > >> --- a/kernel/sched/core.c
> > >> +++ b/kernel/sched/core.c
> > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > >>  		preempt_schedule_common();
> > >>  		return 1;
> > >>  	}
> > >> -	/*
> > >> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > >> -	 * whether the current CPU is in an RCU read-side critical section,
> > >> -	 * so the tick can report quiescent states even for CPUs looping
> > >> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> > >> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> > >> -	 * processes executing in kernel context might never report an
> > >> -	 * RCU quiescent state.  Therefore, the following code causes
> > >> -	 * cond_resched() to report a quiescent state, but only when RCU
> > >> -	 * is in urgent need of one.
> > >> -	 */
> > >> -#ifndef CONFIG_PREEMPT_RCU
> > >> -	rcu_all_qs();
> > >> -#endif
> > >
> > > But...
> > >
> > > Suppose we have a long-running loop in the kernel that regularly
> > > enables preemption, but only momentarily.  Then the added
> > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > for extremely long grace periods.
> > 
> > So, my thinking was that if RCU wants to end a grace period, it would
> > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > RCU always uses the eager version) causing __schedule() to call
> > rcu_note_context_switch().
> > That's similar to the preempt_schedule_common() case in the
> > _cond_resched() above.
> 
> But that requires IPIing that CPU, correct?
> 
> > But if I see your point, RCU might just want to register a quiescent
> > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > seem to fall down.
> > 
> > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > 
> > Something like this?
> > 
> > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > index dc5125b9c36b..e50f358f1548 100644
> > --- a/include/linux/preempt.h
> > +++ b/include/linux/preempt.h
> > @@ -222,6 +222,8 @@ do { \
> >         barrier(); \
> >         if (unlikely(preempt_count_dec_and_test())) \
> >                 __preempt_schedule(); \
> > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > +               rcu_all_qs(); \
> >  } while (0)
> 
> Or maybe something like this to lighten the load a bit:
> 
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> 			rcu_all_qs(); \
> 	} \
> } while (0)
> 
> And at that point, we should be able to drop the PREEMPT_MASK, not
> that it makes any difference that I am aware of:
> 
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 		    !(preempt_count() & SOFTIRQ_MASK)) \
> 			rcu_all_qs(); \
> 	} \
> } while (0)
> 
> Except that we can migrate as soon as that preempt_count_dec_and_test()
> returns.  And that rcu_all_qs() disables and re-enables preemption,
> which will result in undesired recursion.  Sigh.
> 
> So maybe something like this:
> 
> #define preempt_enable() \
> do { \
> 	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 	    !(preempt_count() & SOFTIRQ_MASK)) \

Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
but check for equality to something like (1UL << PREEMPT_SHIFT).
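
That is, something like this (completely untested):

#define preempt_enable() \
do { \
	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) == \
	    (1UL << PREEMPT_SHIFT)) \
		rcu_all_qs(); \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
	} \
} while (0)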

Clearly time to sleep.  :-/

							Thanx, Paul

> 		rcu_all_qs(); \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 	} \
> } while (0)
> 
> Then rcu_all_qs() becomes something like this:
> 
> void rcu_all_qs(void)
> {
> 	unsigned long flags;
> 
> 	/* Load rcu_urgent_qs before other flags. */
> 	if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> 		return;
> 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
> 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> 		local_irq_save(flags);
> 		rcu_momentary_dyntick_idle();
> 		local_irq_restore(flags);
> 	}
> 	rcu_qs();
> }
> EXPORT_SYMBOL_GPL(rcu_all_qs);
> 
> > Though I do wonder about the likelihood of hitting the case you describe
> > and maybe instead of adding the check on every preempt_enable()
> > it might be better to instead force a context switch in the
> > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> 
> Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-21  5:04         ` Paul E. McKenney
@ 2023-11-21  5:39           ` Ankur Arora
  2023-11-21 15:00           ` Steven Rostedt
  1 sibling, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  5:39 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, Ankur Arora, linux-kernel, tglx, peterz,
	torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann


Paul E. McKenney <paulmck@kernel.org> writes:

> On Mon, Nov 20, 2023 at 10:43:56PM -0500, Steven Rostedt wrote:
>> On Mon, 20 Nov 2023 16:28:50 -0800
>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>
>> > On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
>> > > On Tue,  7 Nov 2023 13:57:33 -0800
>> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> > >
>> With the new preemption model, NONE, VOLUNTARY and PREEMPT are now going to
>> determine when NEED_RESCHED is set as opposed to NEED_RESCHED_LAZY.
>> NEED_RESCHED_LAZY only schedules at the kernel / user space transition, while
>> NEED_RESCHED will schedule when possible (when not in a preempt-disabled
>> section).
>
> So NONE really is still supposed to be there.  ;-)
>
>>  Key: L - NEED_RESCHED_LAZY - schedule only at kernel/user boundary
>>       N - NEED_RESCHED - schedule whenever possible (like PREEMPT does today)
>>
>> 			SCHED_OTHER	REAL-TIME/DL
>> 			  Schedule	  Schedule
>>
>> NONE:		      L		     L
>>
>> VOLUNTARY:		      L		     N
>>
>> PREEMPT:		      N		     N
>>
>>
>> So on NONE, NEED_RESCHED_LAZY is set only on scheduling SCHED_OTHER and RT.
>> Which means, it will not schedule until it goes into user space (*).
>>
>> On VOLUNTARY, NEED_RESCHED is set on RT/DL tasks, and LAZY on SCHED_OTHER.
>> So that RT and DL get scheduled just like PREEMPT does today.
>>
>> On PREEMPT, NEED_RESCHED is always set on all scheduling.
>>
>> (*) - caveat - After the next tick, if NEED_RESCHED_LAZY is set, then
>> NEED_RESCHED will be set and the kernel will schedule at the next available
>> moment, this is true for all three models!
>
> OK, so I see that this is now a SCHED_FEAT, and is initialized based
> on CONFIG_PREEMPT_* in kernel/sched/features.h.  Huh.  OK, we can still
> control this at build time, which is fine.  I don't see how to set it
> at boot time, only at build time or from debugfs.  I will let those who
> want to set this at boot time complain, should they choose to do so.

Needless to say, these patches were just an RFC and so they both overreach
and also leave things out.

v1 adds a new preemption model which coexists with the current
CONFIG_PREEMPT_*, is broadly based on Thomas' PoC, and is along the lines
we have been discussing.

>> There may be more details to work out, but the above is basically the gist
>> of the idea. Now, what do you want to do with RCU_PREEMPT? At run time, we
>> can go from NONE to PREEMPT full! But there may be use cases that do not
>> want the overhead of always having RCU_PREEMPT, and will want RCU to be a
>> preempt_disable() section no matter what.
>
> Understood, actually.  And as noted in other replies, I am a bit concerned
> about added latencies from too aggressively removing cond_resched().
>
> More testing required.

Yes, agreed.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 34/86] thread_info: accessors for TIF_NEED_RESCHED*
  2023-11-08  8:58   ` Peter Zijlstra
@ 2023-11-21  5:59     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  5:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:57:20PM -0800, Ankur Arora wrote:
>> Add tif_resched() which will be used as an accessor for TIF_NEED_RESCHED
>> and TIF_NEED_RESCHED_LAZY. The intent is to force the caller to make an
>> explicit choice of how eagerly they want a reschedule.
>>
>> This interface will be used almost entirely from core kernel code, so
>> forcing a choice shouldn't be too onerous.
>>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>
>> ---
>>  include/linux/thread_info.h | 21 +++++++++++++++++++++
>>  1 file changed, 21 insertions(+)
>>
>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>> index 9ea0b28068f4..4eb22b13bf64 100644
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -59,6 +59,27 @@ enum syscall_work_bit {
>>
>>  #include <asm/thread_info.h>
>>
>> +#ifndef TIF_NEED_RESCHED_LAZY
>> +#error "Arch needs to define TIF_NEED_RESCHED_LAZY"
>> +#endif
>> +
>> +#define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
>> +
>> +typedef enum {
>> +	RESCHED_eager = 0,
>> +	RESCHED_lazy = TIF_NEED_RESCHED_LAZY_OFFSET,
>> +} resched_t;
>> +
>> +static inline int tif_resched(resched_t r)
>> +{
>> +	return TIF_NEED_RESCHED + r;
>> +}
>> +
>> +static inline int _tif_resched(resched_t r)
>> +{
>> +	return 1 << tif_resched(r);
>> +}
>
> So either I'm confused or I'm thinking this is wrong. If you want to
> preempt eagerly you want to preempt more than when you're not eager to
> preempt, right?
>
> So an eager preemption site wants to include the LAZY bit.
>
> Whereas a site that wants to lazily preempt would prefer to not preempt
> until forced, and hence would not include the LAZY bit.

This wasn't meant to be quite that sophisticated.
tif_resched(RESCHED_eager) means you preempt immediately/eagerly and
tif_resched(RESCHED_lazy) means you want deferred preemption.

I changed it to:

typedef enum {
	NR_now = 0,
	NR_lazy = TIF_NEED_RESCHED_LAZY_OFFSET,
} resched_t;

So, to get the respective bit we would have: tif_resched(NR_now) or
tif_resched(NR_lazy).

And the immediate preemption checks would be...

	if (tif_need_resched(NR_now))
		preempt_schedule_irq();

Does this read better?

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 36/86] entry: irqentry_exit only preempts TIF_NEED_RESCHED
  2023-11-08  9:01   ` Peter Zijlstra
@ 2023-11-21  6:00     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  6:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:57:22PM -0800, Ankur Arora wrote:
>> The scheduling policy for RESCHED_lazy (TIF_NEED_RESCHED_LAZY) is
>> to let anything running in the kernel run to completion.
>> Accordingly, while deciding whether to call preempt_schedule_irq()
>> narrow the check to tif_need_resched(RESCHED_eager).
>>
>> Also add a comment about why we need to check at all, given that we
>> have already checked the preempt_count().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/entry/common.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index 0d055c39690b..6433e6c77185 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -384,7 +384,15 @@ void irqentry_exit_cond_resched(void)
>>  		rcu_irq_exit_check_preempt();
>>  		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>>  			WARN_ON_ONCE(!on_thread_stack());
>> -		if (need_resched())
>> +
>> +		/*
>> +		 * If the scheduler really wants us to preempt while returning
>> +		 * to kernel, it would set TIF_NEED_RESCHED.
>> +		 * On some archs the flag gets folded in preempt_count, and
>> +		 * thus would be covered in the conditional above, but not all
>> +		 * archs do that, so check explicitly.
>> +		 */
>> +		if (tif_need_resched(RESCHED_eager))
>>  			preempt_schedule_irq();
>
> See, I'm reading this as if we're eager to preempt, but then it's not
> actually eager at all and only wants to preempt when forced.
>
> This naming sucks...

Yeah, it reads like it's trying to say something when it is just trying to
check a bit.
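
With the rename from the other subthread (NR_now being the new name for
RESCHED_eager), the hunk would presumably end up as:

		/*
		 * Check the flag explicitly since not all archs fold
		 * TIF_NEED_RESCHED into the preempt_count.
		 */
		if (tif_need_resched(NR_now))
			preempt_schedule_irq();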

Does the new one read better?

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21  5:34         ` Paul E. McKenney
@ 2023-11-21  6:13           ` Z qiang
  2023-11-21 15:32             ` Paul E. McKenney
  2023-11-21 19:25           ` Paul E. McKenney
  1 sibling, 1 reply; 250+ messages in thread
From: Z qiang @ 2023-11-21  6:13 UTC (permalink / raw)
  To: paulmck
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

>
> On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > >
> > > Paul E. McKenney <paulmck@kernel.org> writes:
> > > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > > >> cond_resched() is used to provide urgent quiescent states for
> > > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > > >> This was necessary because lacking preempt_count, there was no
> > > >> way for the tick handler to know if we were executing in RCU
> > > >> read-side critical section or not.
> > > >>
> > > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > > >> reliably report quiescent states.
> > > >>
> > > >> Accordingly, evaluate preempt_count() based quiescence in
> > > >> rcu_flavor_sched_clock_irq().
> > > >>
> > > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > >> ---
> > > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > > >>  kernel/sched/core.c      | 15 +--------------
> > > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > > >>
> > > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > >> index f87191e008ff..618f055f8028 100644
> > > >> --- a/kernel/rcu/tree_plugin.h
> > > >> +++ b/kernel/rcu/tree_plugin.h
> > > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > > >>   */
> > > >>  static void rcu_flavor_sched_clock_irq(int user)
> > > >>  {
> > > >> -        if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > >> +        if (user || rcu_is_cpu_rrupt_from_idle() ||
> > > >> +            !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > > >
> > > > This looks good.
> > > >
> > > >>                  /*
> > > >>                   * Get here if this CPU took its interrupt from user
> > > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > >> index bf5df2b866df..15db5fb7acc7 100644
> > > >> --- a/kernel/sched/core.c
> > > >> +++ b/kernel/sched/core.c
> > > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > > >>                  preempt_schedule_common();
> > > >>                  return 1;
> > > >>          }
> > > >> -        /*
> > > >> -         * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > > >> -         * whether the current CPU is in an RCU read-side critical section,
> > > >> -         * so the tick can report quiescent states even for CPUs looping
> > > >> -         * in kernel context.  In contrast, in non-preemptible kernels,
> > > >> -         * RCU readers leave no in-memory hints, which means that CPU-bound
> > > >> -         * processes executing in kernel context might never report an
> > > >> -         * RCU quiescent state.  Therefore, the following code causes
> > > >> -         * cond_resched() to report a quiescent state, but only when RCU
> > > >> -         * is in urgent need of one.
> > > >> -         */
> > > >> -#ifndef CONFIG_PREEMPT_RCU
> > > >> -        rcu_all_qs();
> > > >> -#endif
> > > >
> > > > But...
> > > >
> > > > Suppose we have a long-running loop in the kernel that regularly
> > > > enables preemption, but only momentarily.  Then the added
> > > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > > for extremely long grace periods.
> > >
> > > So, my thinking was that if RCU wants to end a grace period, it would
> > > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > > RCU always uses the eager version) causing __schedule() to call
> > > rcu_note_context_switch().
> > > That's similar to the preempt_schedule_common() case in the
> > > _cond_resched() above.
> >
> > But that requires IPIing that CPU, correct?
> >
> > > But if I see your point, RCU might just want to register a quiescent
> > > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > > seem to fall down.
> > >
> > > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > >
> > > Something like this?
> > >
> > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > index dc5125b9c36b..e50f358f1548 100644
> > > --- a/include/linux/preempt.h
> > > +++ b/include/linux/preempt.h
> > > @@ -222,6 +222,8 @@ do { \
> > >         barrier(); \
> > >         if (unlikely(preempt_count_dec_and_test())) \
> > >                 __preempt_schedule(); \
> > > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > +               rcu_all_qs(); \
> > >  } while (0)
> >
> > Or maybe something like this to lighten the load a bit:
> >
> > #define preempt_enable() \
> > do { \
> >       barrier(); \
> >       if (unlikely(preempt_count_dec_and_test())) { \
> >               __preempt_schedule(); \
> >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> >                   !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> >                       rcu_all_qs(); \
> >       } \
> > } while (0)
> >
> > And at that point, we should be able to drop the PREEMPT_MASK, not
> > that it makes any difference that I am aware of:
> >
> > #define preempt_enable() \
> > do { \
> >       barrier(); \
> >       if (unlikely(preempt_count_dec_and_test())) { \
> >               __preempt_schedule(); \
> >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> >                   !(preempt_count() & SOFTIRQ_MASK)) \
> >                       rcu_all_qs(); \
> >       } \
> > } while (0)
> >
> > Except that we can migrate as soon as that preempt_count_dec_and_test()
> > returns.  And that rcu_all_qs() disables and re-enables preemption,
> > which will result in undesired recursion.  Sigh.
> >
> > So maybe something like this:
> >
> > #define preempt_enable() \
> > do { \
> >       if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> >           !(preempt_count() & SOFTIRQ_MASK)) \
>
> Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
> but check for equality to something like (1UL << PREEMPT_SHIFT).
>

For PREEMPT_RCU=n and CONFIG_PREEMPT_COUNT=y kernels,
to report a QS in preempt_enable(), we can refer to this:

void rcu_read_unlock_strict(void)
{
        struct rcu_data *rdp;

        if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread)
                return;
        rdp = this_cpu_ptr(&rcu_data);
        rdp->cpu_no_qs.b.norm = false;
        rcu_report_qs_rdp(rdp);
        udelay(rcu_unlock_delay);
}

The case where the RCU critical section is in an NMI handler also needs
to be considered.
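
One way to cover that (a sketch on top of the check discussed above,
untested) would be to also look at the hardirq/NMI bits, e.g.:

	/* Only report a QS from preemptible task context. */
	if (raw_cpu_read(rcu_data.rcu_urgent_qs) &&
	    (preempt_count() & (NMI_MASK | HARDIRQ_MASK |
				SOFTIRQ_MASK | PREEMPT_MASK)) ==
	    (1UL << PREEMPT_SHIFT))
		rcu_all_qs();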


Thanks
Zqiang



>
> Clearly time to sleep.  :-/
>
>                                                         Thanx, Paul
>
> >               rcu_all_qs(); \
> >       barrier(); \
> >       if (unlikely(preempt_count_dec_and_test())) { \
> >               __preempt_schedule(); \
> >       } \
> > } while (0)
> >
> > Then rcu_all_qs() becomes something like this:
> >
> > void rcu_all_qs(void)
> > {
> >       unsigned long flags;
> >
> >       /* Load rcu_urgent_qs before other flags. */
> >       if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> >               return;
> >       this_cpu_write(rcu_data.rcu_urgent_qs, false);
> >       if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> >               local_irq_save(flags);
> >               rcu_momentary_dyntick_idle();
> >               local_irq_restore(flags);
> >       }
> >       rcu_qs();
> > }
> > EXPORT_SYMBOL_GPL(rcu_all_qs);
> >
> > > Though I do wonder about the likelihood of hitting the case you describe
> > > and maybe instead of adding the check on every preempt_enable()
> > > it might be better to instead force a context switch in the
> > > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> >
> > Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> >
> >                                                       Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 41/86] sched: handle resched policy in resched_curr()
  2023-11-08 10:26     ` Ankur Arora
  2023-11-08 10:46       ` Peter Zijlstra
@ 2023-11-21  6:31       ` Ankur Arora
  1 sibling, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  6:31 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Peter Zijlstra, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Ankur Arora <ankur.a.arora@oracle.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>>
>>> +	 * We might race with the target CPU while checking its ct_state:
>>> +	 *
>>> +	 * 1. The task might have just entered the kernel, but has not yet
>>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>>> +	 * send an unnecessary resched-IPI.
>>> +	 *
>>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>>> +	 * not yet called user_enter().
>>> +	 *
>>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>>> +	 * schedule it lazily. There's obviously nothing that will handle
>>> +	 * this need-resched bit until the thread enters the kernel next.
>>> +	 *
>>> +	 * The scheduler will still do tick accounting, but a potentially
>>> +	 * higher priority task waited to be scheduled for a user tick,
>>> +	 * instead of execution time in the kernel.
>>> +	 */
>>> +	context = ct_state_cpu(cpu_of(rq));
>>> +	if ((context == CONTEXT_USER) ||
>>> +	    (context == CONTEXT_GUEST)) {
>>> +
>>> +		rs = RESCHED_eager;
>>> +		goto resched;
>>> +	}
>>
>> Like said, this simply cannot be. You must not rely on the remote CPU
>> being in some state or not. Also, it's racy, you could observe USER and
>> then it enters KERNEL.
>
> Or worse. We might observe KERNEL and it enters USER.
>
> I think we would be fine if we observe USER: we would upgrade
> to RESCHED_eager and send an unnecessary IPI.
>
> But if we observe KERNEL and it enters USER, then we will have
> set the need-resched-lazy bit which the thread might not see
> (it might have left exit_to_user_mode_loop()) until the next
> entry to the kernel.
>
> But, yes I would like to avoid the ct_state as well. But
> need-resched-lazy only makes sense when the task on the runqueue
> is executing in the kernel...

So, I discussed this with Thomas offlist, and he pointed out that I'm
overengineering this.

If we decide to wake up a remote rq lazily with (!sched_feat(TTWU_QUEUE)),
and if the target is running in user space, then the resched would
happen when the process enters kernel mode.

That's somewhat similar to how in this preemption model we let a task
run for up to one extra tick while in kernel mode. So I'll drop this and
allow the same behaviour in userspace instead of solving it in
unnecessarily complicated ways.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 40/86] context_tracking: add ct_state_cpu()
  2023-11-08  9:16   ` Peter Zijlstra
@ 2023-11-21  6:32     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  6:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:57:26PM -0800, Ankur Arora wrote:
>> While making up its mind about whether to reschedule a target
>> runqueue eagerly or lazily, resched_curr() needs to know if the
>> target is executing in the kernel or in userspace.
>>
>> Add ct_state_cpu().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>
>> ---
>> Using context-tracking for this seems like overkill. Is there a better
>> way to achieve this? One problem with depending on user_enter() is that
>> it happens much too late for our purposes. From the scheduler's
>> point-of-view the exit state has effectively transitioned once the
>> task exits the exit_to_user_loop() so we will see stale state
>> while the task is done with exit_to_user_loop() but has not yet
>> executed user_enter().
>>
>> ---
>>  include/linux/context_tracking_state.h | 21 +++++++++++++++++++++
>>  kernel/Kconfig.preempt                 |  1 +
>>  2 files changed, 22 insertions(+)
>>
>> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
>> index bbff5f7f8803..6a8f1c7ba105 100644
>> --- a/include/linux/context_tracking_state.h
>> +++ b/include/linux/context_tracking_state.h
>> @@ -53,6 +53,13 @@ static __always_inline int __ct_state(void)
>>  {
>>  	return raw_atomic_read(this_cpu_ptr(&context_tracking.state)) & CT_STATE_MASK;
>>  }
>> +
>> +static __always_inline int __ct_state_cpu(int cpu)
>> +{
>> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
>> +
>> +	return atomic_read(&ct->state) & CT_STATE_MASK;
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
>> @@ -139,6 +146,20 @@ static __always_inline int ct_state(void)
>>  	return ret;
>>  }
>>
>> +static __always_inline int ct_state_cpu(int cpu)
>> +{
>> +	int ret;
>> +
>> +	if (!context_tracking_enabled_cpu(cpu))
>> +		return CONTEXT_DISABLED;
>> +
>> +	preempt_disable();
>> +	ret = __ct_state_cpu(cpu);
>> +	preempt_enable();
>> +
>> +	return ret;
>> +}
>
> Those preempt_disable/enable are pointless.
>
> But this patch is problematic, you do *NOT* want to rely on context
> tracking. Context tracking adds atomics to the entry path, this is slow
> and even with CONFIG_CONTEXT_TRACKING it is disabled until you configure
> the NOHZ_FULL nonsense.

Yeah, I had missed the fact that even though ct->state is updated
for both ct->active and !ct->active, the static branch is only enabled
with NOHZ_FULL.

Will drop.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 41/86] sched: handle resched policy in resched_curr()
  2023-11-08 10:46       ` Peter Zijlstra
@ 2023-11-21  6:34         ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  6:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 02:26:37AM -0800, Ankur Arora wrote:
>>
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>> >
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>> >>  }
>> >>
>> >>  /*
>> >> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> >> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>> >>   *
>> >> - * On UP this means the setting of the need_resched flag, on SMP it
>> >> - * might also involve a cross-CPU call to trigger the scheduler on
>> >> - * the target CPU.
>> >> + * On UP this means the setting of the need_resched flag, on SMP, for
>> >> + * eager resched it might also involve a cross-CPU call to trigger
>> >> + * the scheduler on the target CPU.
>> >>   */
>> >> -void resched_curr(struct rq *rq)
>> >> +void __resched_curr(struct rq *rq, resched_t rs)
>> >>  {
>> >>  	struct task_struct *curr = rq->curr;
>> >>  	int cpu;
>> >> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>> >>  	cpu = cpu_of(rq);
>> >>
>> >>  	if (cpu == smp_processor_id()) {
>> >> -		set_tsk_need_resched(curr, RESCHED_eager);
>> >> -		set_preempt_need_resched();
>> >> +		set_tsk_need_resched(curr, rs);
>> >> +		if (rs == RESCHED_eager)
>> >> +			set_preempt_need_resched();
>> >>  		return;
>> >>  	}
>> >>
>> >> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> >> -		smp_send_reschedule(cpu);
>> >> -	else
>> >> +	if (set_nr_and_not_polling(curr, rs)) {
>> >> +		if (rs == RESCHED_eager)
>> >> +			smp_send_reschedule(cpu);
>> >
>> > I think you just broke things.
>> >
>> > Not all idle threads have POLLING support, in which case you need that
>> > IPI to wake them up, even if it's LAZY.
>>
>> Yes, I was concerned about that too. But doesn't this check against the
>> idle_sched_class in resched_curr() cover that?
>
> I think that's what that was. Hmm, maybe.
>
> I mean, we have idle-injection too, those run as FIFO, but as such,
> they can only get preempted from RT/DL, and those will already force
> preempt anyway.

Aah yes, of course those are FIFO. Thanks, I missed that.

> The way you've split and structured the code makes it very hard to
> follow. Something like:
>
> 	if (set_nr_and_not_polling(curr, rs) &&
> 	    (rs == RESCHED_force || is_idle_task(curr)))
> 		smp_send_reschedule();
>
> is *far* clearer, no?

Nods. I was trying to separate where we decide whether we do things
eagerly or lazily. But this is way clearer.


--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 42/86] sched: force preemption on tick expiration
  2023-11-08  9:56   ` Peter Zijlstra
@ 2023-11-21  6:44     ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-11-21  6:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, linux-mm,
	x86, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:57:28PM -0800, Ankur Arora wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4d86c618ffa2..fe7e5e9b2207 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1016,8 +1016,11 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>>   * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>>   * this is probably good enough.
>>   */
>> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +static void update_deadline(struct cfs_rq *cfs_rq,
>> +			    struct sched_entity *se, bool tick)
>>  {
>> +	struct rq *rq = rq_of(cfs_rq);
>> +
>>  	if ((s64)(se->vruntime - se->deadline) < 0)
>>  		return;
>>
>> @@ -1033,13 +1036,19 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  	 */
>>  	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
>>
>> +	if (cfs_rq->nr_running < 2)
>> +		return;
>> +
>>  	/*
>> -	 * The task has consumed its request, reschedule.
>> +	 * The task has consumed its request, reschedule; eagerly
>> +	 * if it ignored our last lazy reschedule.
>>  	 */
>> -	if (cfs_rq->nr_running > 1) {
>> -		resched_curr(rq_of(cfs_rq));
>> -		clear_buddies(cfs_rq, se);
>> -	}
>> +	if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
>> +		__resched_curr(rq, RESCHED_eager);
>> +	else
>> +		resched_curr(rq);
>> +
>> +	clear_buddies(cfs_rq, se);
>>  }
>>
>>  #include "pelt.h"
>> @@ -1147,7 +1156,7 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
>>  /*
>>   * Update the current task's runtime statistics.
>>   */
>> -static void update_curr(struct cfs_rq *cfs_rq)
>> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
>>  {
>>  	struct sched_entity *curr = cfs_rq->curr;
>>  	u64 now = rq_clock_task(rq_of(cfs_rq));
>> @@ -1174,7 +1183,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>  	schedstat_add(cfs_rq->exec_clock, delta_exec);
>>
>>  	curr->vruntime += calc_delta_fair(delta_exec, curr);
>> -	update_deadline(cfs_rq, curr);
>> +	update_deadline(cfs_rq, curr, tick);
>>  	update_min_vruntime(cfs_rq);
>>
>>  	if (entity_is_task(curr)) {
>> @@ -1188,6 +1197,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>>  }
>>
>> +static void update_curr(struct cfs_rq *cfs_rq)
>> +{
>> +	__update_curr(cfs_rq, false);
>> +}
>> +
>>  static void update_curr_fair(struct rq *rq)
>>  {
>>  	update_curr(cfs_rq_of(&rq->curr->se));
>> @@ -5309,7 +5323,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>  	/*
>>  	 * Update run-time statistics of the 'current'.
>>  	 */
>> -	update_curr(cfs_rq);
>> +	__update_curr(cfs_rq, true);
>>
>>  	/*
>>  	 * Ensure that runnable average is periodically updated.
>
> I'm thinking this will be less of a mess if you flip it around some.
>
> (ignore the hrtick mess, I'll try and get that cleaned up)
>
> This way you have two distinct sites to handle the preemption. The
> update_curr() one would be 'FULL ? force : lazy' while the tick one gets the
> special magic bits.

Thanks, that simplified the changes here quite nicely.

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-21  5:04         ` Paul E. McKenney
  2023-11-21  5:39           ` Ankur Arora
@ 2023-11-21 15:00           ` Steven Rostedt
  2023-11-21 15:19             ` Paul E. McKenney
  1 sibling, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-21 15:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Mon, 20 Nov 2023 21:04:28 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> How about like this, where "Y" means allowed and "N" means not allowed:
> 
> 			Non-Preemptible RCU	Preemptible RCU
> 
> NONE:				Y			Y
> 
> VOLUNTARY:			Y			Y
> 
> PREEMPT:			N			Y
> 
> PREEMPT_RT:			N			Y
> 
> 
> We need preemptible RCU for NONE and VOLUNTARY, as you say,
> to allow CONFIG_PREEMPT_DYNAMIC to continue to work.  (OK, OK,
> CONFIG_PREEMPT_DYNAMIC is no longer, but appears to be unconditional.)
> But again, I don't see why anyone would want (much less need)
> non-preemptible RCU in the PREEMPT and PREEMPT_RT cases.  And if it is
> neither wanted nor needed, there is no point in enabling it, much less
> testing it.
> 
> Or am I missing a use case in there somewhere?

As Ankur replied, this is just an RFC, not the main goal. I'm talking about
the end product which will get rid of the PREEMPT_NONE, PREEMPT_VOLUNTARY
and PREEMPT configs, and there will *only* be the PREEMPT_DYNAMIC and
PREEMPT_RT.

And yes, this is going to be a slow and long process, to find and fix all
regressions. I too am concerned about the latency that this may add. I'm
thinking we could have NEED_RESCHED_LAZY preempt when there is no mutex or
other semi critical section held (like migrate_disable()).

Right now, the use of cond_resched() is basically a whack-a-mole game where
we need to whack all the mole loops with the cond_resched() hammer. As
Thomas said, this is backwards. It makes more sense to just not preempt in
areas that can cause pain (like holding a mutex or in an RCU critical
section), but still have the general kernel be fully preemptable.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-21 15:00           ` Steven Rostedt
@ 2023-11-21 15:19             ` Paul E. McKenney
  2023-11-28 10:53               ` Thomas Gleixner
  0 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21 15:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> On Mon, 20 Nov 2023 21:04:28 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > How about like this, where "Y" means allowed and "N" means not allowed:
> > 
> > 			Non-Preemptible RCU	Preemptible RCU
> > 
> > NONE:				Y			Y
> > 
> > VOLUNTARY:			Y			Y
> > 
> > PREEMPT:			N			Y
> > 
> > PREEMPT_RT:			N			Y
> > 
> > 
> > We need preemptible RCU for NONE and VOLUNTARY, as you say,
> > to allow CONFIG_PREEMPT_DYNAMIC to continue to work.  (OK, OK,
> > CONFIG_PREEMPT_DYNAMIC is no longer, but appears to be unconditional.)
> > But again, I don't see why anyone would want (much less need)
> > non-preemptible RCU in the PREEMPT and PREEMPT_RT cases.  And if it is
> > neither wanted nor needed, there is no point in enabling it, much less
> > testing it.
> > 
> > Or am I missing a use case in there somewhere?
> 
> As Ankur replied, this is just an RFC, not the main goal. I'm talking about
> the end product, which will get rid of the PREEMPT_NONE, PREEMPT_VOLUNTARY
> and PREEMPT configs, and there will *only* be PREEMPT_DYNAMIC and
> PREEMPT_RT.
> 
> And yes, this is going to be a slow and long process, to find and fix all
> regressions. I too am concerned about the latency that this may add. I'm
> thinking we could have NEED_RESCHED_LAZY preempt when there is no mutex or
> other semi-critical section held (like migrate_disable()).

Indeed.  For one thing, you have a lot of work to do to demonstrate
that this would actually be a good thing.  For example, what is so
horribly bad about selecting minimal preemption (NONE and/or VOLUNTARY)
at build time???

> Right now, the use of cond_resched() is basically a whack-a-mole game where
> we need to whack all the mole loops with the cond_resched() hammer. As
> Thomas said, this is backwards. It makes more sense to just not preempt in
> areas that can cause pain (like holding a mutex or in an RCU critical
> section), but still have the general kernel be fully preemptable.

Which is quite true, but that whack-a-mole game can be ended without
getting rid of build-time selection of the preemption model.  Also,
that whack-a-mole game can be ended without eliminating all calls to
cond_resched().

Additionally, if the end goal is to be fully preemptible as in eventually
eliminating lazy preemption, you have a lot more convincing to do.
For but one example, consider the high cost of the additional context
switches that this will visit on a number of performance-sensitive workloads.

So what exactly are you guys trying to accomplish here?  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
  2023-11-21  5:16     ` Ankur Arora
@ 2023-11-21 15:26       ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21 15:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik, Julia Lawall,
	Nicolas Palix

On Mon, Nov 20, 2023 at 09:16:19PM -0800, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > On Tue, Nov 07, 2023 at 03:07:53PM -0800, Ankur Arora wrote:
> >> Rudimentary script to remove the straight-forward subset of
> >> cond_resched() and allies:
> >>
> >> 1)  if (need_resched())
> >> 	  cond_resched()
> >>
> >> 2)  expression*;
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> 3)  if (expression)
> >> 	statement
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> The last two patterns depend on the control flow level to ensure
> >> that the complex cond_resched() patterns (ex. conditioned ones)
> >> are left alone and we only pick up ones which are minimally
> >> related to the neighbouring code.
> >
> > This series looks to get rid of stall warnings for long in-kernel
> > preempt-enabled code paths, which is of course a very good thing.
> > But removing all of the cond_resched() calls can actually increase
> > scheduling latency compared to the current CONFIG_PREEMPT_NONE=y state,
> > correct?
> 
> Not necessarily.
> 
> If TIF_NEED_RESCHED_LAZY is set, then we let the current task finish
> before preempting. If that task runs for arbitrarily long (what Thomas
> calls the hog problem), we currently allow it to run for up to one
> extra tick (which might shorten or become a tunable).

Agreed, and that is the easy case.  But getting rid of the cond_resched()
calls really can increase scheduling latency with this patchset compared
to status-quo mainline.

> If TIF_NEED_RESCHED is set, then it gets folded the same way it does now
> and preemption happens at the next safe preemption point.
> 
> So, I guess the scheduling latency would always be bounded but how much
> latency a task would incur would be scheduler policy dependent.
> 
> This is early days, so the policy (or really the rest of it) isn't set
> in stone but having two levels of preemption -- immediate and
> deferred -- does seem to give the scheduler greater freedom of policy.

"Give the scheduler freedom!" is a wonderful slogan, but not necessarily
a useful one-size-fits-all design principle.  The scheduler does not
and cannot know everything, after all.

> Btw, are you concerned about the scheduling latencies in general or the
> scheduling latency of a particular set of tasks?

There are a lot of workloads out there with a lot of objective functions
and constraints, but it is safe to say that both will be important, as
will other things, depending on the workload.

But you knew that already, right?  ;-)

> > If so, it would be good to take a measured approach.  For example, it
> > is clear that a loop that does a cond_resched() every (say) ten jiffies
> > can remove that cond_resched() without penalty, at least in kernels built
> > with either CONFIG_NO_HZ_FULL=n or CONFIG_PREEMPT=y.  But this is not so
> > clear for a loop that does a cond_resched() every (say) ten microseconds.
> 
> True. Though both of those loops sound bad :).

Yes, but do they sound bad enough to be useful in the real world?  ;-)

> Yeah, and as we were discussing offlist, the question is the comparative
> density of points where preempt_count_dec_and_test() returns true vs.
> calls to cond_resched().
> 
> And if they are similar then we could replace cond_resched() quiescence
> reporting with reporting in preempt_enable() (as you mention elsewhere in the
> thread.)

Here is hoping that something like that can help.

I am quite happy with the thought of reducing the number of cond_resched()
invocations, but not at the expense of the Linux kernel failing to do
its job.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21  6:13           ` Z qiang
@ 2023-11-21 15:32             ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21 15:32 UTC (permalink / raw)
  To: Z qiang
  Cc: Ankur Arora, linux-kernel, tglx, peterz, torvalds, linux-mm, x86,
	akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik

On Tue, Nov 21, 2023 at 02:13:53PM +0800, Z qiang wrote:
> >
> > On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> > > On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > > >
> > > > Paul E. McKenney <paulmck@kernel.org> writes:
> > > > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > > > >> cond_resched() is used to provide urgent quiescent states for
> > > > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > > > >> This was necessary because lacking preempt_count, there was no
> > > > >> way for the tick handler to know if we were executing in RCU
> > > > >> read-side critical section or not.
> > > > >>
> > > > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > > > >> reliably report quiescent states.
> > > > >>
> > > > >> Accordingly, evaluate preempt_count() based quiescence in
> > > > >> rcu_flavor_sched_clock_irq().
> > > > >>
> > > > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > > >> ---
> > > > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > > > >>  kernel/sched/core.c      | 15 +--------------
> > > > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > > > >>
> > > > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > >> index f87191e008ff..618f055f8028 100644
> > > > >> --- a/kernel/rcu/tree_plugin.h
> > > > >> +++ b/kernel/rcu/tree_plugin.h
> > > > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > > > >>   */
> > > > >>  static void rcu_flavor_sched_clock_irq(int user)
> > > > >>  {
> > > > >> -        if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > > >> +        if (user || rcu_is_cpu_rrupt_from_idle() ||
> > > > >> +            !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > > > >
> > > > > This looks good.
> > > > >
> > > > >>                  /*
> > > > >>                   * Get here if this CPU took its interrupt from user
> > > > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > >> index bf5df2b866df..15db5fb7acc7 100644
> > > > >> --- a/kernel/sched/core.c
> > > > >> +++ b/kernel/sched/core.c
> > > > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > > > >>                  preempt_schedule_common();
> > > > >>                  return 1;
> > > > >>          }
> > > > >> -        /*
> > > > >> -         * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > > > >> -         * whether the current CPU is in an RCU read-side critical section,
> > > > >> -         * so the tick can report quiescent states even for CPUs looping
> > > > >> -         * in kernel context.  In contrast, in non-preemptible kernels,
> > > > >> -         * RCU readers leave no in-memory hints, which means that CPU-bound
> > > > >> -         * processes executing in kernel context might never report an
> > > > >> -         * RCU quiescent state.  Therefore, the following code causes
> > > > >> -         * cond_resched() to report a quiescent state, but only when RCU
> > > > >> -         * is in urgent need of one.
> > > > >> -	 */
> > > > >> -#ifndef CONFIG_PREEMPT_RCU
> > > > >> -        rcu_all_qs();
> > > > >> -#endif
> > > > >
> > > > > But...
> > > > >
> > > > > Suppose we have a long-running loop in the kernel that regularly
> > > > > enables preemption, but only momentarily.  Then the added
> > > > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > > > for extremely long grace periods.
> > > >
> > > > So, my thinking was that if RCU wants to end a grace period, it would
> > > > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > > > RCU always uses the eager version) causing __schedule() to call
> > > > rcu_note_context_switch().
> > > > That's similar to the preempt_schedule_common() case in the
> > > > _cond_resched() above.
> > >
> > > But that requires IPIing that CPU, correct?
> > >
> > > > But if I see your point correctly, RCU might just want to register a quiescent
> > > > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > > > seem to fall down.
> > > >
> > > > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > > >
> > > > Something like this?
> > > >
> > > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > > index dc5125b9c36b..e50f358f1548 100644
> > > > --- a/include/linux/preempt.h
> > > > +++ b/include/linux/preempt.h
> > > > @@ -222,6 +222,8 @@ do { \
> > > >         barrier(); \
> > > >         if (unlikely(preempt_count_dec_and_test())) \
> > > >                 __preempt_schedule(); \
> > > > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > > +               rcu_all_qs(); \
> > > >  } while (0)
> > >
> > > Or maybe something like this to lighten the load a bit:
> > >
> > > #define preempt_enable() \
> > > do { \
> > >       barrier(); \
> > >       if (unlikely(preempt_count_dec_and_test())) { \
> > >               __preempt_schedule(); \
> > >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > >                   !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > >                       rcu_all_qs(); \
> > >       } \
> > > } while (0)
> > >
> > > And at that point, we should be able to drop the PREEMPT_MASK, not
> > > that it makes any difference that I am aware of:
> > >
> > > #define preempt_enable() \
> > > do { \
> > >       barrier(); \
> > >       if (unlikely(preempt_count_dec_and_test())) { \
> > >               __preempt_schedule(); \
> > >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > >                   !(preempt_count() & SOFTIRQ_MASK)) \
> > >                       rcu_all_qs(); \
> > >       } \
> > > } while (0)
> > >
> > > Except that we can migrate as soon as that preempt_count_dec_and_test()
> > > returns.  And that rcu_all_qs() disables and re-enables preemption,
> > > which will result in undesired recursion.  Sigh.
> > >
> > > So maybe something like this:
> > >
> > > #define preempt_enable() \
> > > do { \
> > >       if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > >           !(preempt_count() & SOFTIRQ_MASK)) \
> >
> > Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
> > but check for equality to something like (1UL << PREEMPT_SHIFT).
> >
> 
> For PREEMPT_RCU=n and CONFIG_PREEMPT_COUNT=y kernels,
> to report a QS in preempt_enable() we can refer to this:
> 
> void rcu_read_unlock_strict(void)
> {
>         struct rcu_data *rdp;
> 
>         if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread)
>                 return;
>         rdp = this_cpu_ptr(&rcu_data);
>         rdp->cpu_no_qs.b.norm = false;
>         rcu_report_qs_rdp(rdp);
>         udelay(rcu_unlock_delay);
> }
> 
> The case where the RCU critical section is in an NMI handler needs to be considered.

You are quite right, though one advantage of leveraging preempt_enable()
is that it cannot really enable preemption in an NMI handler.
But yes, that might need to be accounted for in the comparison with
preempt_count().
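
For illustration, the comparison might end up looking something like the
following (just a sketch, not a proposal):

/*
 * Report a QS from preempt_enable() only when this is the outermost
 * preempt_disable() section (hence the comparison against PREEMPT_OFFSET,
 * the count not yet having been decremented here), we are not in NMI,
 * hardirq or softirq context, and interrupts are enabled.
 */
static __always_inline bool preempt_enable_may_report_qs(void)
{
	return (preempt_count() & (NMI_MASK | HARDIRQ_MASK |
				   SOFTIRQ_MASK | PREEMPT_MASK)) == PREEMPT_OFFSET &&
	       !irqs_disabled();
}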

The actual condition needs to also allow for the possibility that
this preempt_enable() happened in a kernel built with preemptible RCU.
And probably a few other things that I have not yet thought of.

For one thing, rcu_implicit_dynticks_qs() might need adjustment.
Though I am currently hoping that it will still be able to enlist the
help of other things, for example, preempt_enable() and local_bh_enable().

Yes, it is the easiest thing in the world to just whip out the
resched_cpu() hammer earlier in the grace period, and maybe that is the
eventual solution.  But I would like to try avoiding the extra IPIs if
that can be done reasonably.  ;-)

							Thanx, Paul

> Thanks
> Zqiang
> 
> 
> 
> >
> > Clearly time to sleep.  :-/
> >
> >                                                         Thanx, Paul
> >
> > >               rcu_all_qs(); \
> > >       barrier(); \
> > >       if (unlikely(preempt_count_dec_and_test())) { \
> > >               __preempt_schedule(); \
> > >       } \
> > > } while (0)
> > >
> > > Then rcu_all_qs() becomes something like this:
> > >
> > > void rcu_all_qs(void)
> > > {
> > >       unsigned long flags;
> > >
> > >       /* Load rcu_urgent_qs before other flags. */
> > >       if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> > >               return;
> > >       this_cpu_write(rcu_data.rcu_urgent_qs, false);
> > >       if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> > >               local_irq_save(flags);
> > >               rcu_momentary_dyntick_idle();
> > >               local_irq_restore(flags);
> > >       }
> > >       rcu_qs();
> > > }
> > > EXPORT_SYMBOL_GPL(rcu_all_qs);
> > >
> > > > Though I do wonder about the likelihood of hitting the case you describe
> > > > and maybe instead of adding the check on every preempt_enable()
> > > > it might be better to instead force a context switch in the
> > > > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> > >
> > > Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> > >
> > >                                                       Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21  5:34         ` Paul E. McKenney
  2023-11-21  6:13           ` Z qiang
@ 2023-11-21 19:25           ` Paul E. McKenney
  2023-11-21 20:30             ` Peter Zijlstra
  1 sibling, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21 19:25 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, tglx, peterz, torvalds, linux-mm, x86, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

On Mon, Nov 20, 2023 at 09:34:05PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > > 
> > > Paul E. McKenney <paulmck@kernel.org> writes:
> > > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > > >> cond_resched() is used to provide urgent quiescent states for
> > > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > > >> This was necessary because lacking preempt_count, there was no
> > > >> way for the tick handler to know if we were executing in RCU
> > > >> read-side critical section or not.
> > > >>
> > > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > > >> reliably report quiescent states.
> > > >>
> > > >> Accordingly, evaluate preempt_count() based quiescence in
> > > >> rcu_flavor_sched_clock_irq().
> > > >>
> > > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > >> ---
> > > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > > >>  kernel/sched/core.c      | 15 +--------------
> > > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > > >>
> > > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > >> index f87191e008ff..618f055f8028 100644
> > > >> --- a/kernel/rcu/tree_plugin.h
> > > >> +++ b/kernel/rcu/tree_plugin.h
> > > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > > >>   */
> > > >>  static void rcu_flavor_sched_clock_irq(int user)
> > > >>  {
> > > >> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > >> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> > > >> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > > >
> > > > This looks good.
> > > >
> > > >>  		/*
> > > >>  		 * Get here if this CPU took its interrupt from user
> > > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > >> index bf5df2b866df..15db5fb7acc7 100644
> > > >> --- a/kernel/sched/core.c
> > > >> +++ b/kernel/sched/core.c
> > > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > > >>  		preempt_schedule_common();
> > > >>  		return 1;
> > > >>  	}
> > > >> -	/*
> > > >> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > > >> -	 * whether the current CPU is in an RCU read-side critical section,
> > > >> -	 * so the tick can report quiescent states even for CPUs looping
> > > >> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> > > >> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> > > >> -	 * processes executing in kernel context might never report an
> > > >> -	 * RCU quiescent state.  Therefore, the following code causes
> > > >> -	 * cond_resched() to report a quiescent state, but only when RCU
> > > >> -	 * is in urgent need of one.
> > > >> -	 */
> > > >> -#ifndef CONFIG_PREEMPT_RCU
> > > >> -	rcu_all_qs();
> > > >> -#endif
> > > >
> > > > But...
> > > >
> > > > Suppose we have a long-running loop in the kernel that regularly
> > > > enables preemption, but only momentarily.  Then the added
> > > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > > for extremely long grace periods.
> > > 
> > > So, my thinking was that if RCU wants to end a grace period, it would
> > > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > > RCU always uses the eager version) causing __schedule() to call
> > > rcu_note_context_switch().
> > > That's similar to the preempt_schedule_common() case in the
> > > _cond_resched() above.
> > 
> > But that requires IPIing that CPU, correct?
> > 
> > > But if I see your point correctly, RCU might just want to register a quiescent
> > > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > > seem to fall down.
> > > 
> > > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > > 
> > > Something like this?
> > > 
> > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > index dc5125b9c36b..e50f358f1548 100644
> > > --- a/include/linux/preempt.h
> > > +++ b/include/linux/preempt.h
> > > @@ -222,6 +222,8 @@ do { \
> > >         barrier(); \
> > >         if (unlikely(preempt_count_dec_and_test())) \
> > >                 __preempt_schedule(); \
> > > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > +               rcu_all_qs(); \
> > >  } while (0)
> > 
> > Or maybe something like this to lighten the load a bit:
> > 
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > 			rcu_all_qs(); \
> > 	} \
> > } while (0)
> > 
> > And at that point, we should be able to drop the PREEMPT_MASK, not
> > that it makes any difference that I am aware of:
> > 
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 		    !(preempt_count() & SOFTIRQ_MASK)) \
> > 			rcu_all_qs(); \
> > 	} \
> > } while (0)
> > 
> > Except that we can migrate as soon as that preempt_count_dec_and_test()
> > returns.  And that rcu_all_qs() disables and re-enables preemption,
> > which will result in undesired recursion.  Sigh.
> > 
> > So maybe something like this:
> > 
> > #define preempt_enable() \
> > do { \
> > 	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 	    !(preempt_count() & SOFTIRQ_MASK)) \
> 
> Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
> but check for equality to something like (1UL << PREEMPT_SHIFT).
> 
> Clearly time to sleep.  :-/

Maybe this might actually work:

#define preempt_enable() \
do { \
	barrier(); \
	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
	    ((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) == PREEMPT_OFFSET) && \
	    !irqs_disabled()) \
		rcu_all_qs(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
	} \
} while (0)

And the rcu_all_qs() below might also work.

							Thanx, Paul

> > 		rcu_all_qs(); \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 	} \
> > } while (0)
> > 
> > Then rcu_all_qs() becomes something like this:
> > 
> > void rcu_all_qs(void)
> > {
> > 	unsigned long flags;
> > 
> > 	/* Load rcu_urgent_qs before other flags. */
> > 	if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> > 		return;
> > 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
> > 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> > 		local_irq_save(flags);
> > 		rcu_momentary_dyntick_idle();
> > 		local_irq_restore(flags);
> > 	}
> > 	rcu_qs();
> > }
> > EXPORT_SYMBOL_GPL(rcu_all_qs);
> > 
> > > Though I do wonder about the likelihood of hitting the case you describe
> > > and maybe instead of adding the check on every preempt_enable()
> > > it might be better to instead force a context switch in the
> > > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> > 
> > Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> > 
> > 							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21 19:25           ` Paul E. McKenney
@ 2023-11-21 20:30             ` Peter Zijlstra
  2023-11-21 21:14               ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Peter Zijlstra @ 2023-11-21 20:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> 	    !irqs_disabled()) \
> 		rcu_all_qs(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 	} \
> } while (0)

Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
the minimal thing it is today, this will make it blow up into something
giant again.

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21 20:30             ` Peter Zijlstra
@ 2023-11-21 21:14               ` Paul E. McKenney
  2023-11-21 21:38                 ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21 21:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, tglx, torvalds, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > 	    !irqs_disabled()) \
> > 		rcu_all_qs(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 	} \
> > } while (0)
> 
> Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
> the minimal thing it is today, this will make it blow up into something
> giant again.

IPIs, then.  Or retaining some cond_resched() calls, as needed.

Or is there another way?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21 21:14               ` Paul E. McKenney
@ 2023-11-21 21:38                 ` Steven Rostedt
  2023-11-21 22:26                   ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-21 21:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, tglx, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, 21 Nov 2023 13:14:16 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:  
> > > #define preempt_enable() \
> > > do { \
> > > 	barrier(); \
> > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > 	    !irqs_disabled()) \

Could we make the above an else case of the below if ?

> > > 		rcu_all_qs(); \
> > > 	if (unlikely(preempt_count_dec_and_test())) { \
> > > 		__preempt_schedule(); \
> > > 	} \
> > > } while (0)  
> > 
> > Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
> > the minimal thing it is today, this will make it blow up into something
> > giant again.  

Note, the above is only true with "CONFIG_PREEMPT_RCU is not set", which
keeps preempt_enable() still minimal for preemptible kernels with PREEMPT_RCU.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21 21:38                 ` Steven Rostedt
@ 2023-11-21 22:26                   ` Paul E. McKenney
  2023-11-21 22:52                     ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-21 22:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, tglx, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, Nov 21, 2023 at 04:38:34PM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 13:14:16 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:  
> > > > #define preempt_enable() \
> > > > do { \
> > > > 	barrier(); \
> > > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > > 	    !irqs_disabled()) \
> 
> Could we make the above an else case of the below if ?

Wouldn't that cause the above preempt_count() test to always fail?

Another approach is to bury the test in preempt_count_dec_and_test(),
but I suspect that this would not make Peter any more happy than my
earlier suggestion.  ;-)
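
Something like this, perhaps (sketch only, with the same hand-waving about
rcu_data visibility as in the earlier macro):

static __always_inline bool preempt_count_dec_and_test(void)
{
	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) &&
	    raw_cpu_read(rcu_data.rcu_urgent_qs) &&
	    (preempt_count() & (NMI_MASK | HARDIRQ_MASK |
				SOFTIRQ_MASK | PREEMPT_MASK)) == PREEMPT_OFFSET &&
	    !irqs_disabled())
		rcu_all_qs();
	return __preempt_count_dec_and_test();
}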

> > > > 		rcu_all_qs(); \
> > > > 	if (unlikely(preempt_count_dec_and_test())) { \
> > > > 		__preempt_schedule(); \
> > > > 	} \
> > > > } while (0)  
> > > 
> > > Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
> > > the minimal thing it is today, this will make it blow up into something
> > > giant again.  
> 
> Note, the above is only true with "CONFIG_PREEMPT_RCU is not set", which
> keeps preempt_enable() still minimal for preemptible kernels with PREEMPT_RCU.

Agreed, and there is probably some workload that does not like this.
After all, current CONFIG_PREEMPT_DYNAMIC=y booted with preempt=none
would have those cond_resched() invocations.  I was leery of checking
dynamic information, but maybe sched_feat() is faster than I am thinking?
(It should be with the static_branch, but not sure about the other two
access modes.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21 22:26                   ` Paul E. McKenney
@ 2023-11-21 22:52                     ` Steven Rostedt
  2023-11-22  0:01                       ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-21 22:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, tglx, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, 21 Nov 2023 14:26:33 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 21, 2023 at 04:38:34PM -0500, Steven Rostedt wrote:
> > On Tue, 21 Nov 2023 13:14:16 -0800
> > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >   
> > > On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:  
> > > > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:    
> > > > > #define preempt_enable() \
> > > > > do { \
> > > > > 	barrier(); \
> > > > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > > > 	    !irqs_disabled()) \  
> > 
> > Could we make the above an else case of the below if ?  
> 
> Wouldn't that cause the above preempt_count() test to always fail?

preempt_count_dec_and_test() returns true if the decremented preempt_count()
is zero, which happens only if NEED_RESCHED is set and the rest of
preempt_count() is not set (the NEED_RESCHED bit in preempt_count() is really
the inverse of NEED_RESCHED). Do we need to call rcu_all_qs() when we call
the scheduler?
Isn't scheduling a quiescent state for most RCU flavors?
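
(For reference, the inversion mentioned above works out like this --
illustrative values, x86 layout:)

/* The NEED_RESCHED bit lives *inverted* in the per-CPU preempt count: */
#define PREEMPT_NEED_RESCHED	0x80000000	/* clear == resched needed */

/* nesting 1, no resched wanted:  count == 0x80000001
 *   dec -> 0x80000000 != 0  =>  preempt_count_dec_and_test() is false */

/* nesting 1, resched wanted:     count == 0x00000001 (inverted bit clear)
 *   dec -> 0x00000000 == 0  =>  preempt_count_dec_and_test() is true  */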

I thought this was to help move the quiescent states along without
cond_resched() calls being added all around; cond_resched() currently has:

int __sched __cond_resched(void)
{
	if (should_resched(0)) {
		preempt_schedule_common();
		return 1;
	}
	/*
	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
	 * whether the current CPU is in an RCU read-side critical section,
	 * so the tick can report quiescent states even for CPUs looping
	 * in kernel context.  In contrast, in non-preemptible kernels,
	 * RCU readers leave no in-memory hints, which means that CPU-bound
	 * processes executing in kernel context might never report an
	 * RCU quiescent state.  Therefore, the following code causes
	 * cond_resched() to report a quiescent state, but only when RCU
	 * is in urgent need of one.
	 */
#ifndef CONFIG_PREEMPT_RCU
	rcu_all_qs();
#endif
	return 0;
}

Where if we schedule, we don't call rcu_all_qs().

I stand by that being in the else statement. It looks like that would keep
the previous work flow.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21 22:52                     ` Steven Rostedt
@ 2023-11-22  0:01                       ` Paul E. McKenney
  2023-11-22  0:12                         ` Steven Rostedt
  0 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-22  0:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, tglx, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, Nov 21, 2023 at 05:52:09PM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 14:26:33 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 21, 2023 at 04:38:34PM -0500, Steven Rostedt wrote:
> > > On Tue, 21 Nov 2023 13:14:16 -0800
> > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > >   
> > > > On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:  
> > > > > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:    
> > > > > > #define preempt_enable() \
> > > > > > do { \
> > > > > > 	barrier(); \
> > > > > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > > > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > > > > 	    !irqs_disabled()) \  
> > > 
> > > Could we make the above an else case of the below if ?  
> > 
> > Wouldn't that cause the above preempt_count() test to always fail?
> 
> preempt_count_dec_and_test() returns true if preempt_count() is zero, which
> happens only if NEED_RESCHED is set, and the rest of preempt_count() is not
> set. (NEED_RESCHED bit in preempt_count() is really the inverse of
> NEED_RESCHED). Do we need to call rcu_all_qs() when we call the scheduler?
> Isn't scheduling a quiescent state for most RCU flavors?
> 
> I thought this was to help move along the quiescent states without added
> cond_resched() around, which has:
> 
> int __sched __cond_resched(void)
> {
> 	if (should_resched(0)) {
> 		preempt_schedule_common();
> 		return 1;
> 	}
> 	/*
> 	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> 	 * whether the current CPU is in an RCU read-side critical section,
> 	 * so the tick can report quiescent states even for CPUs looping
> 	 * in kernel context.  In contrast, in non-preemptible kernels,
> 	 * RCU readers leave no in-memory hints, which means that CPU-bound
> 	 * processes executing in kernel context might never report an
> 	 * RCU quiescent state.  Therefore, the following code causes
> 	 * cond_resched() to report a quiescent state, but only when RCU
> 	 * is in urgent need of one.
> 	 */
> #ifndef CONFIG_PREEMPT_RCU
> 	rcu_all_qs();
> #endif
> 	return 0;
> }
> 
> Where if we schedule, we don't call rcu_all_qs().

True enough, but in this __cond_resched() case we know that we are in
an RCU quiescent state regardless of what should_resched() says.

In contrast, with preempt_enable(), we are only in a quiescent state
if __preempt_count_dec_and_test() returns true, and even then only if
interrupts are enabled.

> I stand by that being in the else statement. It looks like that would keep
> the previous work flow.

Ah, because PREEMPT_NEED_RESCHED is zero when we need to reschedule,
so that when __preempt_count_dec_and_test() returns false, we might
still be in an RCU quiescent state in the case where there was no need
to reschedule.  Good point!

In which case...

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
	else if (!sched_feat(FORCE_PREEMPT) && \
		 ((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) == PREEMPT_OFFSET) && \
		 !irqs_disabled()) \
			rcu_all_qs(); \
} while (0)

Keeping rcu_all_qs() pretty much as is.  Or some or all of the "else if"
condition could be pushed down into rcu_all_qs(), depending on whether
Peter's objection was call-site object code size, execution path length,
or both.  ;-)

If the objection is both call-site object code size and execution path
length, then maybe all but the preempt_count() check should be pushed
into rcu_all_qs().

Was that what you had in mind, or am I missing your point?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-22  0:01                       ` Paul E. McKenney
@ 2023-11-22  0:12                         ` Steven Rostedt
  2023-11-22  1:09                           ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-11-22  0:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, tglx, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, 21 Nov 2023 16:01:24 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> 
> > I stand by that being in the else statement. It looks like that would keep
> > the previous work flow.  
> 
> Ah, because PREEMPT_NEED_RESCHED is zero when we need to reschedule,
> so that when __preempt_count_dec_and_test() returns false, we might
> still be in an RCU quiescent state in the case where there was no need
> to reschedule.  Good point!
> 
> In which case...
> 
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) \
> 		__preempt_schedule(); \
> 	else if (!sched_feat(FORCE_PREEMPT) && \
> 		 (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) && \
> 		 !irqs_disabled()) \
> ) \
> 			rcu_all_qs(); \
> } while (0)
> 
> Keeping rcu_all_qs() pretty much as is.  Or some or all of the "else if"
> condition could be pushed down into rcu_all_qs(), depending on whether
> Peter's objection was call-site object code size, execution path length,
> or both.  ;-)
> 
> If the objection is both call-site object code size and execution path
> length, then maybe all but the preempt_count() check should be pushed
> into rcu_all_qs().
> 
> Was that what you had in mind, or am I missing your point?

Yes, that is what I had in mind.

Should we also keep the !IS_ENABLED(CONFIG_PREEMPT_RCU) check, which makes
the entire thing optimized out when PREEMPT_RCU is enabled?

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-22  0:12                         ` Steven Rostedt
@ 2023-11-22  1:09                           ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-11-22  1:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ankur Arora, linux-kernel, tglx, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Tue, Nov 21, 2023 at 07:12:32PM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 16:01:24 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > 
> > > I stand by that being in the else statement. It looks like that would keep
> > > the previous work flow.  
> > 
> > Ah, because PREEMPT_NEED_RESCHED is zero when we need to reschedule,
> > so that when __preempt_count_dec_and_test() returns false, we might
> > still be in an RCU quiescent state in the case where there was no need
> > to reschedule.  Good point!
> > 
> > In which case...
> > 
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) \
> > 		__preempt_schedule(); \
> > 	else if (!sched_feat(FORCE_PREEMPT) && \
> > 		 (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) && \
> > 		 !irqs_disabled()) \
> > ) \
> > 			rcu_all_qs(); \
> > } while (0)
> > 
> > Keeping rcu_all_qs() pretty much as is.  Or some or all of the "else if"
> > condition could be pushed down into rcu_all_qs(), depending on whether
> > Peter's objection was call-site object code size, execution path length,
> > or both.  ;-)
> > 
> > If the objection is both call-site object code size and execution path
> > length, then maybe all but the preempt_count() check should be pushed
> > into rcu_all_qs().
> > 
> > Was that what you had in mind, or am I missing your point?
> 
> Yes, that is what I had in mind.
> 
> Should we also keep the !IS_ENABLED(CONFIG_PREEMPT_RCU) check, which makes
> the entire thing optimized out when PREEMPT_RCU is enabled?

I substituted !sched_feat(FORCE_PREEMPT) for this because, as I
understand it, sites currently using CONFIG_PREEMPT_DYNAMIC=y (which is
the default) and booting with preempt=none have their grace periods
helped by cond_resched(), so they will likely also need help, perhaps
from preempt_enable().

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-21 15:19             ` Paul E. McKenney
@ 2023-11-28 10:53               ` Thomas Gleixner
  2023-11-28 18:30                 ` Ankur Arora
  2023-12-05  1:01                 ` Paul E. McKenney
  0 siblings, 2 replies; 250+ messages in thread
From: Thomas Gleixner @ 2023-11-28 10:53 UTC (permalink / raw)
  To: paulmck, Steven Rostedt
  Cc: Ankur Arora, linux-kernel, peterz, torvalds, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, David.Laight, richard, mjguzik, Simon Horman,
	Julian Anastasov, Alexei Starovoitov, Daniel Borkmann

Paul!

On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
>> Right now, the use of cond_resched() is basically a whack-a-mole game where
>> we need to whack all the mole loops with the cond_resched() hammer. As
>> Thomas said, this is backwards. It makes more sense to just not preempt in
>> areas that can cause pain (like holding a mutex or in an RCU critical
>> section), but still have the general kernel be fully preemptable.
>
> Which is quite true, but that whack-a-mole game can be ended without
> getting rid of build-time selection of the preemption model.  Also,
> that whack-a-mole game can be ended without eliminating all calls to
> cond_resched().

Which calls to cond_resched() should not be eliminated?

They all suck and keeping some of them is just counterproductive as
again people will sprinkle them all over the place for the very wrong
reasons.

> Additionally, if the end goal is to be fully preemptible as in eventually
> eliminating lazy preemption, you have a lot more convincing to do.

That's absolutely not the case. Even RT uses the lazy mode to prevent
overeager preemption for non RT tasks.

The whole point of the exercise is to keep the kernel always fully
preemptible, but only enforce the immediate preemption at the next
possible preemption point when necessary.

The decision when it is necessary is made by the scheduler and not
delegated to the whim of cond/might_resched() placement.

That is serving both worlds best IMO:

  1) LAZY preemption prevents the negative side effects of overeager
     preemption, aka. lock contention and pointless context switching.

     The whole thing behaves like a NONE kernel unless there are
     real-time tasks or a task did not comply to the lazy request within
     a given time.

  2) It does not prevent the scheduler from making decisions to preempt
     at the next possible preemption point in order to get some
     important computation on the CPU.

     A NONE kernel sucks vs. any sporadic [real-time] task. Just run
     NONE and watch the latencies. The latencies are determined by the
     interrupted context, the placement of the cond_resched() call and
     the length of the loop which is running.

     People have complained about that and the only way out for them is
     to switch to VOLUNTARY or FULL preemption and thereby pay the
     price for overeager preemption.

     A price which you don't want to pay for good reasons, but at the
     same time you care about latencies in some respects, and the only
     answer you have for that is cond_resched() or similar, which is not
     an answer at all.

  3) Looking at the initial problem Ankur was trying to solve, there is
     absolutely no acceptable way to solve it unless you think
     that the semantically inverse 'allow_preempt()/disallow_preempt()'
     is anywhere near acceptable.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-21  0:38   ` Paul E. McKenney
  2023-11-21  3:26     ` Ankur Arora
@ 2023-11-28 17:04     ` Thomas Gleixner
  2023-12-05  1:33       ` Paul E. McKenney
  1 sibling, 1 reply; 250+ messages in thread
From: Thomas Gleixner @ 2023-11-28 17:04 UTC (permalink / raw)
  To: paulmck, Ankur Arora
  Cc: linux-kernel, peterz, torvalds, linux-mm, x86, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, vincent.guittot, willy,
	mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, jgross, andrew.cooper3, mingo, bristot,
	mathieu.desnoyers, geert, glaubitz, anton.ivanov, mattst88,
	krypton, rostedt, David.Laight, richard, mjguzik

Paul!

On Mon, Nov 20 2023 at 16:38, Paul E. McKenney wrote:
> But...
>
> Suppose we have a long-running loop in the kernel that regularly
> enables preemption, but only momentarily.  Then the added
> rcu_flavor_sched_clock_irq() check would almost always fail, making
> for extremely long grace periods.  Or did I miss a change that causes
> preempt_enable() to help RCU out?

So first of all this is not any different from today and even with
RCU_PREEMPT=y a tight loop:

    do {
    	preempt_disable();
        do_stuff();
        preempt_enable();
    }

will not allow rcu_flavor_sched_clock_irq() to detect QS reliably. All
it can do is to force reschedule/preemption after some time, which in
turn ends up in a QS.

The current NONE/VOLUNTARY models, which imply RCU_PREEMPT=n, cannot do
that at all because preempt_enable() is a NOOP and there is no
preemption point at return from interrupt to kernel.

    do {
        do_stuff();
    }

So the only thing which makes that "work" is slapping a cond_resched()
into the loop:

    do {
        do_stuff();
        cond_resched();
    }

But the whole concept behind LAZY is that the loop will always be:

    do {
    	preempt_disable();
        do_stuff();
        preempt_enable();
    }

and the preempt_enable() will always be a functional preemption point.

So let's look at the simple case where more than one task is runnable on
a given CPU:

    loop()

      preempt_disable();

      --> tick interrupt
          set LAZY_NEED_RESCHED

      preempt_enable() -> Does nothing because NEED_RESCHED is not set

      preempt_disable();

      --> tick interrupt
          set NEED_RESCHED

      preempt_enable()
        preempt_schedule()
          schedule()
            report_QS()

which means that on the second tick a quiescent state is
reported. Whether a full tick is really granted before that happens is a
scheduler decision and implementation detail, and not really
relevant for discussing the concept.

Now the problematic case is when there is only one task runnable on a
given CPU because then the tick interrupt will set neither of the
preemption bits. Which is fine from a scheduler perspective, but not so
much from an RCU perspective.

But the whole point of LAZY is to be able to enforce rescheduling at the
next possible preemption point. So RCU can utilize that too:

rcu_flavor_sched_clock_irq(bool user)
{
	if (user || rcu_is_cpu_rrupt_from_idle() ||
	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
		rcu_qs();
                return;
	}

        if (this_cpu_read(rcu_data.rcu_urgent_qs))
        	set_need_resched();
}

So:

    loop()

      preempt_disable();

      --> tick interrupt
            rcu_flavor_sched_clock_irq()
                sets NEED_RESCHED

      preempt_enable()
        preempt_schedule()
          schedule()
            report_QS()

See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.

The above rcu_flavor_sched_clock_irq() check for rcu_data.rcu_urgent_qs
is not really fundamentally different from the check in rcu_all_qs(). The
main difference is that it is bound to the tick, so the detection/action
might be delayed by a tick. If that turns out to be a problem, then this
stuff has far more serious issues underneath.

So now you might argue that for a loop like this:

    do {
        mutex_lock();
        do_stuff();
        mutex_unlock();
    }

the ideal preemption point is post mutex_unlock(), which is where
someone would mindfully (*cough*) place a cond_resched(), right?

So if that turns out to matter in reality and not just by academic
inspection, then we are far better off to annotate such code with:

    do {
        preempt_lazy_disable();
        mutex_lock();
        do_stuff();
        mutex_unlock();
        preempt_lazy_enable();
    }

and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.
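
(For illustration only, a sketch of what those annotations might look like;
the lazy_nesting field is made up here:)

static inline void preempt_lazy_disable(void)
{
	current->lazy_nesting++;		/* made-up per-task counter */
	barrier();
}

static inline void preempt_lazy_enable(void)
{
	barrier();
	if (!--current->lazy_nesting &&
	    test_thread_flag(TIF_NEED_RESCHED_LAZY))
		schedule();		/* honour the deferred request now */
}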

Then rcu_flavor_sched_clock_irq(bool user) can then use a two stage
approach like the scheduler:

rcu_flavor_sched_clock_irq(bool user)
{
	if (user || rcu_is_cpu_rrupt_from_idle() ||
	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
		rcu_qs();
                return;
	}

        if (this_cpu_read(rcu_data.rcu_urgent_qs)) {
        	if (!need_resched_lazy())
                	set_need_resched_lazy();
                else
                	set_need_resched();
	}
}

But for a start I would just use the trivial

        if (this_cpu_read(rcu_data.rcu_urgent_qs))
        	set_need_resched();

approach and see where this gets us.

With the approach I suggested to Ankur, i.e. having PREEMPT_AUTO (or
LAZY) as a config option, we can work on the details of the AUTO and
RCU_PREEMPT=n flavour up to the point where we are happy to get rid
of the whole zoo of config options altogether.

Just insisting that RCU_PREEMPT=n requires cond_resched() and the like
is not really getting us anywhere.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-28 10:53               ` Thomas Gleixner
@ 2023-11-28 18:30                 ` Ankur Arora
  2023-12-05  1:03                   ` Paul E. McKenney
  2023-12-05  1:01                 ` Paul E. McKenney
  1 sibling, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-11-28 18:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: paulmck, Steven Rostedt, Ankur Arora, linux-kernel, peterz,
	torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann


Thomas Gleixner <tglx@linutronix.de> writes:

> Paul!
>
> On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
>> On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
>>> Right now, the use of cond_resched() is basically a whack-a-mole game where
>>> we need to whack all the mole loops with the cond_resched() hammer. As
>>> Thomas said, this is backwards. It makes more sense to just not preempt in
>>> areas that can cause pain (like holding a mutex or in an RCU critical
>>> section), but still have the general kernel be fully preemptable.
>>
>> Which is quite true, but that whack-a-mole game can be ended without
>> getting rid of build-time selection of the preemption model.  Also,
>> that whack-a-mole game can be ended without eliminating all calls to
>> cond_resched().
>
> Which calls to cond_resched() should not be eliminated?
>
> They all suck and keeping some of them is just counterproductive as
> again people will sprinkle them all over the place for the very wrong
> reasons.

And, as Thomas alludes to here, cond_resched() is not always cost free.
Needing to call cond_resched() forces us to restructure hot paths in
ways that result in worse performance and more complex code.

One example is clear_huge_page(), where removing the need to call
cond_resched() every once in a while allows the processor to optimize
differently.

  *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
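
To make the restructuring concrete, the difference is roughly the
following (a simplified sketch, not the actual mm code; clear_pages()
stands in for whatever wide-clearing primitive the architecture
provides):

    /* Today: clear one base page at a time so a cond_resched() can be
     * dropped in between iterations to bound latency. */
    static void clear_huge_page_chunked(void *addr, unsigned int npages)
    {
            unsigned int i;

            for (i = 0; i < npages; i++) {
                    clear_page(addr + i * PAGE_SIZE);
                    cond_resched();
            }
    }

    /* Without the need for explicit preemption points the whole extent
     * can be handed to a single optimized clearing loop (e.g. rep;stosb
     * on x86), letting the CPU stream the stores. */
    static void clear_huge_page_straight(void *addr, unsigned int npages)
    {
            clear_pages(addr, npages);      /* hypothetical wide-clear helper */
    }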

And, that's one of the simpler examples from mm. We do this kind of arbitrary
batching all over the place.

Or see the filemap_read() example that Linus gives here:
 https://lore.kernel.org/lkml/CAHk-=whpYjm_AizQij6XEfTd7xvGjrVCx5gzHcHm=2Xijt+Kyg@mail.gmail.com/#t

Thanks
--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-28 10:53               ` Thomas Gleixner
  2023-11-28 18:30                 ` Ankur Arora
@ 2023-12-05  1:01                 ` Paul E. McKenney
  2023-12-05 15:01                   ` Steven Rostedt
  1 sibling, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-05  1:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> >> Right now, the use of cond_resched() is basically a whack-a-mole game where
> >> we need to whack all the mole loops with the cond_resched() hammer. As
> >> Thomas said, this is backwards. It makes more sense to just not preempt in
> >> areas that can cause pain (like holding a mutex or in an RCU critical
> >> section), but still have the general kernel be fully preemptable.
> >
> > Which is quite true, but that whack-a-mole game can be ended without
> > getting rid of build-time selection of the preemption model.  Also,
> > that whack-a-mole game can be ended without eliminating all calls to
> > cond_resched().
> 
> Which calls to cond_resched() should not be eliminated?

The ones which, if eliminated, will result in excessive latencies.

This question is going to take some time to answer.  One type of potential
issue is where the cond_resched() precedes something like mutex_lock(),
where that mutex_lock() takes the fast path and preemption follows
shortly thereafter.  It would clearly have been better to have preempted
before acquisition.
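
In code form, the ordering of concern is roughly (sketch only; the lock
and the work are placeholders):

    cond_resched();         /* NEED_RESCHED not yet set: this is a no-op */
    mutex_lock(&lock);      /* uncontended fast path                     */
    /* ... preemption arrives here, while the mutex is held ...          */
    do_work();
    mutex_unlock(&lock);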

Another is the aforementioned situations where removing the cond_resched()
increases latency.  Yes, capping the preemption latency is a wonderful
thing, and the people I chatted with are all for that, but it is only
natural that there would be a corresponding level of concern about the
cases where removing the cond_resched() calls increases latency.

There might be others as well.  These are the possibilities that have
come up thus far.

> They all suck and keeping some of them is just counterproductive as
> again people will sprinkle them all over the place for the very wrong
> reasons.

Yes, but do they suck enough and are they counterproductive enough to
be useful and necessary?  ;-)

> > Additionally, if the end goal is to be fully preemptible as in eventually
> > eliminating lazy preemption, you have a lot more convincing to do.
> 
> That's absolutely not the case. Even RT uses the lazy mode to prevent
> overeager preemption for non RT tasks.

OK, that is very good to hear.

> The whole point of the exercise is to keep the kernel always fully
> preemptible, but only enforce the immediate preemption at the next
> possible preemption point when necessary.
> 
> The decision when it is necessary is made by the scheduler and not
> delegated to the whim of cond/might_resched() placement.

I am not arguing that the developer placing a given cond_resched()
always knows best, but you have some work to do to convince me that the
scheduler always knows best.

> That is serving both worlds best IMO:
> 
>   1) LAZY preemption prevents the negative side effects of overeager
>      preemption, aka. lock contention and pointless context switching.
> 
>      The whole thing behaves like a NONE kernel unless there are
>      real-time tasks or a task did not comply to the lazy request within
>      a given time.

Almost, give or take the potential issues called out above for the
possible downsides of removing all of the cond_resched() invocations.

>   2) It does not prevent the scheduler from making decisions to preempt
>      at the next possible preemption point in order to get some
>      important computation on the CPU.
> 
>      A NONE kernel sucks vs. any sporadic [real-time] task. Just run
>      NONE and watch the latencies. The latencies are determined by the
>      interrupted context, the placement of the cond_resched() call and
>      the length of the loop which is running.
> 
>      People have complained about that and the only way out for them is
>      to switch to VOLUNTARY or FULL preemption and thereby paying the
>      price for overeager preemption.
> 
>      A price which you don't want to pay for good reasons but at the
>      same time you care about latencies in some aspects and the only
>      answer you have for that is cond_resched() or similar which is not
>      an answer at all.

All good points, but none of them are in conflict with the possibility
of leaving some cond_resched() calls behind if they are needed.

>   3) Looking at the initial problem Ankur was trying to solve there is
>      absolutely no acceptable solution to solve that unless you think
>      that the semantically inverse 'allow_preempt()/disallow_preempt()'
>      is anywhere near acceptable.

I am not arguing for allow_preempt()/disallow_preempt(), so for that
argument, you need to find someone else to argue with.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-11-28 18:30                 ` Ankur Arora
@ 2023-12-05  1:03                   ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-05  1:03 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Thomas Gleixner, Steven Rostedt, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, Nov 28, 2023 at 10:30:53AM -0800, Ankur Arora wrote:
> 
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > Paul!
> >
> > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> >> On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> >>> Right now, the use of cond_resched() is basically a whack-a-mole game where
> >>> we need to whack all the mole loops with the cond_resched() hammer. As
> >>> Thomas said, this is backwards. It makes more sense to just not preempt in
> >>> areas that can cause pain (like holding a mutex or in an RCU critical
> >>> section), but still have the general kernel be fully preemptable.
> >>
> >> Which is quite true, but that whack-a-mole game can be ended without
> >> getting rid of build-time selection of the preemption model.  Also,
> >> that whack-a-mole game can be ended without eliminating all calls to
> >> cond_resched().
> >
> > Which calls to cond_resched() should not be eliminated?
> >
> > They all suck and keeping some of them is just counterproductive as
> > again people will sprinkle them all over the place for the very wrong
> > reasons.
> 
> And, as Thomas alludes to here, cond_resched() is not always cost-free.
> Needing to call cond_resched() forces us to restructure hot paths in
> ways that result in worse performance and more complex code.
> 
> One example is clear_huge_page(), where removing the need to call
> cond_resched() every once in a while allows the processor to optimize
> differently.
> 
>   *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>                           (GB/s)                (GB/s)
> 
>   pg-sz=2MB                14.55                 19.29    +32.5%
>   pg-sz=1GB                19.34                 49.60   +156.4%
> 
> (See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
> 
> And, that's one of the simpler examples from mm. We do this kind of arbitrary
> batching all over the place.
> 
> Or see the filemap_read() example that Linus gives here:
>  https://lore.kernel.org/lkml/CAHk-=whpYjm_AizQij6XEfTd7xvGjrVCx5gzHcHm=2Xijt+Kyg@mail.gmail.com/#t

I already agree that some cond_resched() calls can cause difficulties.
But that is not the same as proving that they *all* should be removed.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-11-28 17:04     ` Thomas Gleixner
@ 2023-12-05  1:33       ` Paul E. McKenney
  2023-12-06 15:10         ` Thomas Gleixner
  2023-12-07  1:31         ` Ankur Arora
  0 siblings, 2 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-05  1:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ankur Arora, linux-kernel, peterz, torvalds, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, rostedt, David.Laight, richard, mjguzik

On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Mon, Nov 20 2023 at 16:38, Paul E. McKenney wrote:
> > But...
> >
> > Suppose we have a long-running loop in the kernel that regularly
> > enables preemption, but only momentarily.  Then the added
> > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > for extremely long grace periods.  Or did I miss a change that causes
> > preempt_enable() to help RCU out?
> 
> So first of all this is not any different from today and even with
> RCU_PREEMPT=y a tight loop:
> 
>     do {
>     	preempt_disable();
>         do_stuff();
>         preempt_enable();
>     }
> 
> will not allow rcu_flavor_sched_clock_irq() to detect QS reliably. All
> it can do is to force reschedule/preemption after some time, which in
> turn ends up in a QS.

True, but we don't run RCU_PREEMPT=y on the fleet.  So although this
argument should offer comfort to those who would like to switch from
forced preemption to lazy preemption, it doesn't help for those of us
running NONE/VOLUNTARY.

I can of course compensate if need be by making RCU more aggressive with
the resched_cpu() hammer, which includes an IPI.  For non-nohz_full CPUs,
it currently waits halfway to the stall-warning timeout.

> The current NONE/VOLUNTARY models, which imply RCU_PREEMPT=n, cannot do
> that at all because the preempt_enable() is a NOOP and there is no
> preemption point at return from interrupt to kernel.
> 
>     do {
>         do_stuff();
>     }
> 
> So the only thing which makes that "work" is slapping a cond_resched()
> into the loop:
> 
>     do {
>         do_stuff();
>         cond_resched();
>     }

Yes, exactly.

> But the whole concept behind LAZY is that the loop will always be:
> 
>     do {
>     	preempt_disable();
>         do_stuff();
>         preempt_enable();
>     }
> 
> and the preempt_enable() will always be a functional preemption point.

Understood.  And if preempt_enable() can interact with RCU when requested,
I would expect that this could make quite a few calls to cond_resched()
provably unnecessary.  There was some discussion of this:

https://lore.kernel.org/all/0d6a8e80-c89b-4ded-8de1-8c946874f787@paulmck-laptop/

There were objections to an earlier version.  Is this version OK?

> So let's look at the simple case where more than one task is runnable on
> a given CPU:
> 
>     loop()
> 
>       preempt_disable();
> 
>       --> tick interrupt
>           set LAZY_NEED_RESCHED
> 
>       preempt_enable() -> Does nothing because NEED_RESCHED is not set
> 
>       preempt_disable();
> 
>       --> tick interrupt
>           set NEED_RESCHED
> 
>       preempt_enable()
>         preempt_schedule()
>           schedule()
>             report_QS()
> 
> which means that on the second tick a quiescent state is
> reported. Whether a full tick is really granted is a scheduler
> decision and implementation detail, and not really relevant for
> discussing the concept.

In my experience, the implementation details make themselves relevant
sooner or later, and often sooner...

> Now the problematic case is when there is only one task runnable on a
> given CPU because then the tick interrupt will set neither of the
> preemption bits. Which is fine from a scheduler perspective, but not so
> much from a RCU perspective.
> 
> But the whole point of LAZY is to be able to enforce rescheduling at the
> next possible preemption point. So RCU can utilize that too:
> 
> rcu_flavor_sched_clock_irq(bool user)
> {
> 	if (user || rcu_is_cpu_rrupt_from_idle() ||
> 	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> 		rcu_qs();
>                 return;
> 	}
> 
>         if (this_cpu_read(rcu_data.rcu_urgent_qs))
>         	set_need_resched();
> }

Yes, that is one of the changes that would be good to make, as discussed
previously.

> So:
> 
>     loop()
> 
>       preempt_disable();
> 
>       --> tick interrupt
>             rcu_flavor_sched_clock_irq()
>                 sets NEED_RESCHED
> 
>       preempt_enable()
>         preempt_schedule()
>           schedule()
>             report_QS()
> 
> See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.

Understood, but that does delay detection of that quiescent state by up
to one tick.

> The above rcu_flavor_sched_clock_irq() check for rcu_data.rcu_urgent_qs
> is not really fundamentally different from the check in rcu_all_qs(). The
> main difference is that it is bound to the tick, so the detection/action
> might be delayed by a tick. If that turns out to be a problem, then this
> stuff has far more serious issues underneath.

Again, as I have stated before, one advantage of PREEMPT_COUNT=y is this
ability, so yes, believe it or not, I really do understand this.  ;-)

> So now you might argue that for a loop like this:
> 
>     do {
>         mutex_lock();
>         do_stuff();
>         mutex_unlock();
>     }
> 
> the ideal preemption point is post mutex_unlock(), which is where
> someone would mindfully (*cough*) place a cond_resched(), right?
> 
> So if that turns out to matter in reality and not just by academic
> inspection, then we are far better off to annotate such code with:
> 
>     do {
>         preempt_lazy_disable();
>         mutex_lock();
>         do_stuff();
>         mutex_unlock();
>         preempt_lazy_enable();
>     }
> 
> and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.

I am not exactly sure what semantics you are proposing with this pairing
as opposed to "this would be a good time to preempt in response to the
pending lazy request".  But I do agree that something like this could
replace at least a few more instances of cond_resched(), so that is good.
Not necessarily all of them, though.

> Then rcu_flavor_sched_clock_irq(bool user) can then use a two stage
> approach like the scheduler:
> 
> rcu_flavor_sched_clock_irq(bool user)
> {
> 	if (user || rcu_is_cpu_rrupt_from_idle() ||
> 	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> 		rcu_qs();
>                 return;
> 	}
> 
>         if (this_cpu_read(rcu_data.rcu_urgent_qs)) {
>         	if (!need_resched_lazy())
>                 	set_need_resched_lazy();
>                 else
>                 	set_need_resched();
> 	}
> }
> 
> But for a start I would just use the trivial
> 
>         if (this_cpu_read(rcu_data.rcu_urgent_qs))
>         	set_need_resched();
> 
> approach and see where this gets us.

Agreed, I would start with the plain set_need_resched().  This shouldn't
happen all that often; on default x86 builds it kicks in nine milliseconds
into the grace period.

> With the approach I suggested to Ankur, i.e. having PREEMPT_AUTO (or
> LAZY) as a config option, we can work on the details of the AUTO and
> RCU_PREEMPT=n flavour up to the point where we are happy to get rid
> of the whole zoo of config options altogether.

I agree that some simplification should be possible and would be
desirable.

> Just insisting that RCU_PREEMPT=n requires cond_resched() and whatsoever
> is not really getting us anywhere.

Except that this is not what is happening, Thomas.  ;-)

You are asserting that all of the cond_resched() calls can safely be
eliminated.  That might well be, but more than assertion is required.
You have come up with some good ways of getting rid of some classes of
them, which is a very good and very welcome thing.  But that is not the
same as having proved that all of them may be safely removed.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-05  1:01                 ` Paul E. McKenney
@ 2023-12-05 15:01                   ` Steven Rostedt
  2023-12-05 19:38                     ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-12-05 15:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Mon, 4 Dec 2023 17:01:21 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> > Paul!
> > 
> > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:  
> > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:  
> > >> Right now, the use of cond_resched() is basically a whack-a-mole game where
> > >> we need to whack all the mole loops with the cond_resched() hammer. As
> > >> Thomas said, this is backwards. It makes more sense to just not preempt in
> > >> areas that can cause pain (like holding a mutex or in an RCU critical
> > >> section), but still have the general kernel be fully preemptable.  
> > >
> > > Which is quite true, but that whack-a-mole game can be ended without
> > > getting rid of build-time selection of the preemption model.  Also,
> > > that whack-a-mole game can be ended without eliminating all calls to
> > > cond_resched().  
> > 
> > Which calls to cond_resched() should not be eliminated?  
> 
> The ones which, if eliminated, will result in excessive latencies.
> 
> This question is going to take some time to answer.  One type of potential
> issue is where the cond_resched() precedes something like mutex_lock(),
> where that mutex_lock() takes the fast path and preemption follows
> shortly thereafter.  It would clearly have been better to have preempted
> before acquisition.

Note that the new preemption model is a new paradigm and we need to start
thinking a bit differently if we go to it.

One thing I would like to look into with the new work is to have holding a
mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
converted to mutex in the RT kernel). That way you are less likely to be
preempted while holding a mutex.
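
A rough sketch of that idea, parallel to the preempt_lazy_disable()/
preempt_lazy_enable() annotation discussed elsewhere in the thread but
driven by the mutex itself (the per-task counter and helper names are
made up):

    /* Incremented in mutex_lock(), decremented in mutex_unlock(). */
    static inline void mutex_lazy_hold(void)
    {
            current->mutexes_held++;
    }

    static inline void mutex_lazy_release(void)
    {
            /* Once the last mutex is dropped, honour a deferred lazy
             * reschedule request. */
            if (!--current->mutexes_held &&
                test_thread_flag(TIF_NEED_RESCHED_LAZY))
                    schedule();
    }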

> 
> Another is the aforementioned situations where removing the cond_resched()
> increases latency.  Yes, capping the preemption latency is a wonderful
> thing, and the people I chatted with are all for that, but it is only
> natural that there would be a corresponding level of concern about the
> cases where removing the cond_resched() calls increases latency.

With the "capped preemption" I'm not sure that would still be the case.
cond_resched() currently only preempts if NEED_RESCHED is set. That means
the system had to already be in a situation that a schedule needs to
happen. There's lots of places in the kernel that run for over a tick
without any cond_resched(). The cond_resched() is usually added for
locations that show tremendous latency (where either a watchdog triggered,
or the location showed up in some analysis with a latency much greater
than a tick).
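
For reference, today's cond_resched() boils down to roughly this
(simplified from kernel/sched/core.c; the exact form depends on
PREEMPT_DYNAMIC and the RCU configuration):

    int __sched __cond_resched(void)
    {
            /* should_resched() is true only when the scheduler has
             * already set NEED_RESCHED and preemption is otherwise
             * permitted. */
            if (should_resched(0)) {
                    preempt_schedule_common();
                    return 1;
            }
            return 0;
    }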

The point is, if/when we switch to the new preemption model, we would need
to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
done to prevent regressions. But the reasons I see cond_resched() being
added today should no longer exist with this new model.

> 
> There might be others as well.  These are the possibilities that have
> come up thus far.
> 
> > They all suck and keeping some of them is just counterproductive as
> > again people will sprinkle them all over the place for the very wrong
> > reasons.  
> 
> Yes, but do they suck enough and are they counterproductive enough to
> be useful and necessary?  ;-)

They are only useful and necessary because of the way we handle preemption
today. With the new preemption model, they are all likely to be useless and
unnecessary ;-)

> 
> > > Additionally, if the end goal is to be fully preemptible as in
> > > eventually eliminating lazy preemption, you have a lot more
> > > convincing to do.  
> > 
> > That's absolutely not the case. Even RT uses the lazy mode to prevent
> > overeager preemption for non RT tasks.  
> 
> OK, that is very good to hear.

But the paradigm is changing. The kernel will be fully preemptible, it just
won't be preempting often. That is, if the CPU is running kernel code for
too long, and the scheduler tick wants a reschedule, the kernel has one
more tick to get back to user space before it will become fully
preemptible. That is, we force a "cond_resched()".
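
In scheduler-tick terms, that "one more tick" rule is roughly the
following (a sketch; the helper that decides whether a reschedule is
wanted at all is hypothetical):

    /* Called from the scheduler tick for the current task. */
    static void tick_escalate_resched(struct rq *rq, struct task_struct *curr)
    {
            if (!scheduler_wants_resched(rq, curr))         /* hypothetical */
                    return;

            if (!test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY)) {
                    /* First tick: ask politely, act at exit to user. */
                    set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
            } else {
                    /* Still in the kernel a tick later: force it. */
                    set_tsk_need_resched(curr);
            }
    }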

> 
> > The whole point of the exercise is to keep the kernel always fully
> > preemptible, but only enforce the immediate preemption at the next
> > possible preemption point when necessary.
> > 
> > The decision when it is necessary is made by the scheduler and not
> > delegated to the whim of cond/might_resched() placement.  
> 
> I am not arguing that the developer placing a given cond_resched()
> always knows best, but you have some work to do to convince me that the
> scheduler always knows best.

The cond_resched() already expects the scheduler to know best. It doesn't
resched unless NEED_RESCHED is set and that's determined by the scheduler.
If the code knows best, then it should just call schedule() and be done
with it.

> 
> > That is serving both worlds best IMO:
> > 
> >   1) LAZY preemption prevents the negative side effects of overeager
> >      preemption, aka. lock contention and pointless context switching.
> > 
> >      The whole thing behaves like a NONE kernel unless there are
> >      real-time tasks or a task did not comply to the lazy request within
> >      a given time.  
> 
> Almost, give or take the potential issues called out above for the
> possible downsides of removing all of the cond_resched() invocations.

I still don't believe there are any issues "called out above", as I called
out those called outs.

> 
> >   2) It does not prevent the scheduler from making decisions to preempt
> >      at the next possible preemption point in order to get some
> >      important computation on the CPU.
> > 
> >      A NONE kernel sucks vs. any sporadic [real-time] task. Just run
> >      NONE and watch the latencies. The latencies are determined by the
> >      interrupted context, the placement of the cond_resched() call and
> >      the length of the loop which is running.
> > 
> >      People have complained about that and the only way out for them is
> >      to switch to VOLUNTARY or FULL preemption and thereby paying the
> >      price for overeager preemption.
> > 
> >      A price which you don't want to pay for good reasons but at the
> >      same time you care about latencies in some aspects and the only
> >      answer you have for that is cond_resched() or similar which is not
> >      an answer at all.  
> 
> All good points, but none of them are in conflict with the possibility
> of leaving some cond_resched() calls behind if they are needed.

The conflict is with the new paradigm (I love that word! It's so "buzzy").
As I mentioned above, cond_resched() is usually added when a problem was
seen. I really believe that those problems would never had been seen if
the new paradigm had already been in place.

> 
> >   3) Looking at the initial problem Ankur was trying to solve there is
> >      absolutely no acceptable solution to solve that unless you think
> >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
> >      is anywhere near acceptable.  
> 
> I am not arguing for allow_preempt()/disallow_preempt(), so for that
> argument, you need to find someone else to argue with.  ;-)

Anyway, there's still a long path before cond_resched() can be removed. It
was a mistake by Ankur to add those removals this early (and he has
acknowledged that mistake).

First we need to get the new preemption model implemented. When it is, it
can be just a config option at first. Then when that config option is set,
you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
between them at run time as they are just a way to tell the scheduler when
to set NEED_RESCHED_LAZY vs NEED_RESCHED.

At that moment, when that config is set, the cond_resched() can turn into a
nop. This will allow for testing to make sure there are no regressions in
latency, even with the NONE mode enabled.

The real test is implementing the code and seeing how it affects things in
the real world. Us arguing about it isn't going to get anywhere. I just
don't want blind NACK. A NACK to a removal of a cond_resched() needs to
show that there was a real regression with that removal.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-05 15:01                   ` Steven Rostedt
@ 2023-12-05 19:38                     ` Paul E. McKenney
  2023-12-05 20:18                       ` Ankur Arora
  2023-12-05 20:45                       ` Steven Rostedt
  0 siblings, 2 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-05 19:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
> On Mon, 4 Dec 2023 17:01:21 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> > > Paul!
> > > 
> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:  
> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:  
> > > >> Right now, the use of cond_resched() is basically a whack-a-mole game where
> > > >> we need to whack all the mole loops with the cond_resched() hammer. As
> > > >> Thomas said, this is backwards. It makes more sense to just not preempt in
> > > >> areas that can cause pain (like holding a mutex or in an RCU critical
> > > >> section), but still have the general kernel be fully preemptable.  
> > > >
> > > > Which is quite true, but that whack-a-mole game can be ended without
> > > > getting rid of build-time selection of the preemption model.  Also,
> > > > that whack-a-mole game can be ended without eliminating all calls to
> > > > cond_resched().  
> > > 
> > > Which calls to cond_resched() should not be eliminated?  
> > 
> > The ones which, if eliminated, will result in excessive latencies.
> > 
> > This question is going to take some time to answer.  One type of potential
> > issue is where the cond_resched() precedes something like mutex_lock(),
> > where that mutex_lock() takes the fast path and preemption follows
> > shortly thereafter.  It would clearly have been better to have preempted
> > before acquisition.
> 
> Note that the new preemption model is a new paradigm and we need to start
> thinking a bit differently if we go to it.

We can of course think differently, but existing hardware and software
will probably be a bit more stubborn.

> One thing I would like to look into with the new work is to have holding a
> mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> converted to mutex in the RT kernel). That way you are less likely to be
> preempted while holding a mutex.

I like the concept, but those with mutex_lock() of rarely-held mutexes
in their fastpaths might have workloads that have a contrary opinion.

> > Another is the aforementioned situations where removing the cond_resched()
> > increases latency.  Yes, capping the preemption latency is a wonderful
> > thing, and the people I chatted with are all for that, but it is only
> > natural that there would be a corresponding level of concern about the
> > cases where removing the cond_resched() calls increases latency.
> 
> With the "capped preemption" I'm not sure that would still be the case.
> cond_resched() currently only preempts if NEED_RESCHED is set. That means
> the system had to already be in a situation that a schedule needs to
> happen. There's lots of places in the kernel that run for over a tick
> without any cond_resched(). The cond_resched() is usually added for
> locations that show tremendous latency (where either a watchdog triggered,
> or the location showed up in some analysis with a latency much greater
> than a tick).

For non-real-time workloads, the average case is important, not just the
worst case.  In the new lazily preemptible mode of thought, a preemption
by a non-real-time task will wait a tick.  Earlier, it would have waited
for the next cond_resched().  Which, in the average case, might have
arrived much sooner than one tick.

> The point is, if/when we switch to the new preemption model, we would need
> to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> done to prevent regressions. But the reasons I see cond_resched() being
> added today, should no longer exist with this new model.

This I agree with.  Also, with the new paradigm and new mode of thought
in place, it should be safe to drop any cond_resched() that is in a loop
that consumes more than a tick of CPU time per iteration.

> > There might be others as well.  These are the possibilities that have
> > come up thus far.
> > 
> > > They all suck and keeping some of them is just counterproductive as
> > > again people will sprinkle them all over the place for the very wrong
> > > reasons.  
> > 
> > Yes, but do they suck enough and are they counterproductive enough to
> > be useful and necessary?  ;-)
> 
> They are only useful and necessary because of the way we handle preemption
> today. With the new preemption model, they are all likely to be useless and
> unnecessary ;-)

The "all likely" needs some demonstration.  I agree that a great many
of them would be useless and unnecessary.  Maybe even the vast majority.
But that is different than "all".  ;-)

> > > > Additionally, if the end goal is to be fully preemptible as in
> > > > eventually eliminating lazy preemption, you have a lot more
> > > > convincing to do.  
> > > 
> > > That's absolutely not the case. Even RT uses the lazy mode to prevent
> > > overeager preemption for non RT tasks.  
> > 
> > OK, that is very good to hear.
> 
> But the paradigm is changing. The kernel will be fully preemptible, it just
> won't be preempting often. That is, if the CPU is running kernel code for
> too long, and the scheduler tick wants a reschedule, the kernel has one
> more tick to get back to user space before it will become fully
> preemptible. That is, we force a "cond_resched()".

And as stated quite a few times previously in this and earlier threads,
yes, removing the need to drop cond_resched() into longer-than-average
loops is a very good thing.

> > > The whole point of the exercise is to keep the kernel always fully
> > > preemptible, but only enforce the immediate preemption at the next
> > > possible preemption point when necessary.
> > > 
> > > The decision when it is necessary is made by the scheduler and not
> > > delegated to the whim of cond/might_resched() placement.  
> > 
> > I am not arguing that the developer placing a given cond_resched()
> > always knows best, but you have some work to do to convince me that the
> > scheduler always knows best.
> 
> The cond_resched() already expects the scheduler to know best. It doesn't
> resched unless NEED_RESCHED is set and that's determined by the scheduler.
> If the code knows best, then it should just call schedule() and be done
> with it.

A distinction without a difference.  After all, if the scheduler really
knew best, it would be able to intuit the cond_resched() without that
cond_resched() actually being there.  Which is arguably the whole point
of this patch series, aside from mutexes, the possibility of extending
what are now short preemption times, and who knows what all else.

> > > That is serving both worlds best IMO:
> > > 
> > >   1) LAZY preemption prevents the negative side effects of overeager
> > >      preemption, aka. lock contention and pointless context switching.
> > > 
> > >      The whole thing behaves like a NONE kernel unless there are
> > >      real-time tasks or a task did not comply to the lazy request within
> > >      a given time.  
> > 
> > Almost, give or take the potential issues called out above for the
> > possible downsides of removing all of the cond_resched() invocations.
> 
> I still don't believe there are any issues "called out above", as I called
> out those called outs.

Well, you did write some words, if that is what you meant.  ;-)

> > >   2) It does not prevent the scheduler from making decisions to preempt
> > >      at the next possible preemption point in order to get some
> > >      important computation on the CPU.
> > > 
> > >      A NONE kernel sucks vs. any sporadic [real-time] task. Just run
> > >      NONE and watch the latencies. The latencies are determined by the
> > >      interrupted context, the placement of the cond_resched() call and
> > >      the length of the loop which is running.
> > > 
> > >      People have complained about that and the only way out for them is
> > >      to switch to VOLUNTARY or FULL preemption and thereby paying the
> > >      price for overeager preemption.
> > > 
> > >      A price which you don't want to pay for good reasons but at the
> > >      same time you care about latencies in some aspects and the only
> > >      answer you have for that is cond_resched() or similar which is not
> > >      an answer at all.  
> > 
> > All good points, but none of them are in conflict with the possibility
> of leaving some cond_resched() calls behind if they are needed.
> 
> The conflict is with the new paradigm (I love that word! It's so "buzzy").
> As I mentioned above, cond_resched() is usually added when a problem was
> seen. I really believe that those problems would never had been seen if
> the new paradigm had already been in place.

Indeed, that sort of wording does quite the opposite of raising my
confidence levels.  ;-)

You know, the ancient Romans would have had no problem dealing with the
dot-com boom, cryptocurrency, some of the shadier areas of artificial
intelligence and machine learning, and who knows what all else.  As the
Romans used to say, "Beware of geeks bearing grifts."

> > >   3) Looking at the initial problem Ankur was trying to solve there is
> > >      absolutely no acceptable solution to solve that unless you think
> > >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
> > >      is anywhere near acceptable.  
> > 
> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > argument, you need to find someone else to argue with.  ;-)
> 
> Anyway, there's still a long path before cond_resched() can be removed. It
> was a mistake by Ankur to add those removals this early (and he has
> acknowledged that mistake).

OK, that I can live with.  But that seems to be a bit different of a
take than that of some earlier emails in this thread.  ;-)

> First we need to get the new preemption model implemented. When it is, it
> can be just a config option at first. Then when that config option is set,
> you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> between them at run time as they are just a way to tell the scheduler when
> to set NEED_RESCHED_LAZY vs NEED_RESCHED.

Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
the runtime switching needs to be limited to NONE and VOLUNTARY.
Which is fine.

> At that moment, when that config is set, the cond_resched() can turn into a
> nop. This will allow for testing to make sure there are no regressions in
> latency, even with the NONE mode enabled.

And once it appears to be reasonably stable (in concept as well as
implementation), heavy testing should get underway.

> The real test is implementing the code and seeing how it affects things in
> the real world. Us arguing about it isn't going to get anywhere.

Indeed, the opinion of the objective universe always wins.  It all too
often takes longer than necessary for the people arguing with each other
to realize this, but such is life.

>                                                                  I just
> don't want blind NACK. A NACK to a removal of a cond_resched() needs to
> show that there was a real regression with that removal.

Fair enough, although a single commit bulk removing a large number of
cond_resched() calls will likely get a bulk NAK.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-05 19:38                     ` Paul E. McKenney
@ 2023-12-05 20:18                       ` Ankur Arora
  2023-12-06  4:07                         ` Paul E. McKenney
  2023-12-05 20:45                       ` Steven Rostedt
  1 sibling, 1 reply; 250+ messages in thread
From: Ankur Arora @ 2023-12-05 20:18 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, Thomas Gleixner, Ankur Arora, linux-kernel,
	peterz, torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann


Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
>> On Mon, 4 Dec 2023 17:01:21 -0800
>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>
>> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
>> > > Paul!
>> > >
>> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
>> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
...
>> > >   3) Looking at the initial problem Ankur was trying to solve there is
>> > >      absolutely no acceptable solution to solve that unless you think
>> > >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
>> > >      is anywhere near acceptable.
>> >
>> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
>> > argument, you need to find someone else to argue with.  ;-)
>>
>> Anyway, there's still a long path before cond_resched() can be removed. It
>> was a mistake by Ankur to add those removals this early (and he has
>> acknowledged that mistake).
>
> OK, that I can live with.  But that seems to be a bit different of a
> take than that of some earlier emails in this thread.  ;-)

Heh I think it's just that this thread goes to (far) too many places :).

As Steven says, the initial series touching everything all together
was a mistake. V1 adds the new preemption model alongside the existing
ones and locally defines cond_resched() as a nop.

That'll allow us to experiment and figure out where there are latency
gaps.

Ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-05 19:38                     ` Paul E. McKenney
  2023-12-05 20:18                       ` Ankur Arora
@ 2023-12-05 20:45                       ` Steven Rostedt
  2023-12-06 10:08                         ` David Laight
  2023-12-07  4:34                         ` Paul E. McKenney
  1 sibling, 2 replies; 250+ messages in thread
From: Steven Rostedt @ 2023-12-05 20:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, 5 Dec 2023 11:38:42 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > 
> > Note that the new preemption model is a new paradigm and we need to start
> > thinking a bit differently if we go to it.  
> 
> We can of course think differently, but existing hardware and software
> will probably be a bit more stubborn.

Not at all. I don't see how hardware plays a role here, but how software is
designed does sometimes require thinking differently.

> 
> > One thing I would like to look into with the new work is to have holding a
> > mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> > converted to mutex in the RT kernel). That way you are less likely to be
> > preempted while holding a mutex.  
> 
> I like the concept, but those with mutex_lock() of rarely-held mutexes
> in their fastpaths might have workloads that have a contrary opinion.

I don't understand your above statement. Maybe I wasn't clear with my
statement? The above is more about PREEMPT_FULL, as it currently will
preempt immediately. My above comment is that we can have an option for
PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
it would at least hold off until there's no mutex held. Who cares if it's a
fast path when a task needs to give up the CPU for another task? What I
worry about is scheduling out while holding a mutex which increases the
chance of that mutex being contended upon. Which does have drastic impact
on performance.

> 
> > > Another is the aforementioned situations where removing the cond_resched()
> > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > thing, and the people I chatted with are all for that, but it is only
> > > natural that there would be a corresponding level of concern about the
> > > cases where removing the cond_resched() calls increases latency.  
> > 
> > With the "capped preemption" I'm not sure that would still be the case.
> > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > the system had to already be in a situation that a schedule needs to
> > happen. There's lots of places in the kernel that run for over a tick
> > without any cond_resched(). The cond_resched() is usually added for
> > locations that show tremendous latency (where either a watchdog triggered,
> > or the location showed up in some analysis with a latency much greater
> > than a tick).
> 
> For non-real-time workloads, the average case is important, not just the
> worst case.  In the new lazily preemptible mode of thought, a preemption
> by a non-real-time task will wait a tick.  Earlier, it would have waited
> for the next cond_resched().  Which, in the average case, might have
> arrived much sooner than one tick.

Or much later. It's random. And what's nice about this model is that we can
add more models than just "NONE", "VOLUNTARY", and "FULL". We could have a
way to say "this task needs to preempt immediately" and not just for RT tasks.

This allows the user to decide which task preempts more and which does not
(defined by the scheduler), instead of some random cond_resched() that can
also preempt a higher priority task that just finished its quota in order to
run a low priority task, causing latency for the higher priority task.

This is what I mean by "think differently".

> 
> > The point is, if/when we switch to the new preemption model, we would need
> > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > done to prevent regressions. But the reasons I see cond_resched() being
> > added today, should no longer exist with this new model.  
> 
> This I agree with.  Also, with the new paradigm and new mode of thought
> in place, it should be safe to drop any cond_resched() that is in a loop
> that consumes more than a tick of CPU time per iteration.

Why does that matter? Is the loop not important? Why stop it from finishing
for some random task that may not be important, and cond_resched() has no
idea if it is or not.

> 
> > > There might be others as well.  These are the possibilities that have
> > > come up thus far.
> > >   
> > > > They all suck and keeping some of them is just counterproductive as
> > > > again people will sprinkle them all over the place for the very wrong
> > > > reasons.    
> > > 
> > > Yes, but do they suck enough and are they counterproductive enough to
> > > be useful and necessary?  ;-)  
> > 
> > They are only useful and necessary because of the way we handle preemption
> > today. With the new preemption model, they are all likely to be useless and
> > unnecessary ;-)  
> 
> The "all likely" needs some demonstration.  I agree that a great many
> of them would be useless and unnecessary.  Maybe even the vast majority.
> But that is different than "all".  ;-)

I'm betting it is "all" ;-) But I also agree that this "needs some
demonstration". We are not there yet, and likely will not be until the
second half of next year. So we have plenty of time to speak rhetorically
to each other!



> > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > As I mentioned above, cond_resched() is usually added when a problem was
> > seen. I really believe that those problems would never have been seen if
> > the new paradigm had already been in place.  
> 
> Indeed, that sort of wording does quite the opposite of raising my
> confidence levels.  ;-)

Yes, I admit the "manager speak" isn't something to brag about here. But I
really do like that word. It's just fun to say (and spell)! Paradigm,
paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
be like gnu, and pronounce it when speaking about free software? Although,
that makes the word sound worse. :-p

> 
> You know, the ancient Romans would have had no problem dealing with the
> dot-com boom, cryptocurrency, some of the shadier areas of artificial
> intelligence and machine learning, and who knows what all else.  As the
> Romans used to say, "Beware of geeks bearing grifts."
> 
> > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > >      absolutely no acceptable solution to solve that unless you think
> > > >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
> > > >      is anywhere near acceptable.    
> > > 
> > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > argument, you need to find someone else to argue with.  ;-)  
> > 
> > Anyway, there's still a long path before cond_resched() can be removed. It
> > was a mistake by Ankur to add those removals this early (and he has
> > acknowledged that mistake).  
> 
> OK, that I can live with.  But that seems to be a bit different of a
> take than that of some earlier emails in this thread.  ;-)

Well, we are also stating the final goal. I think there's some confusion
about what's going to happen immediately and what's going to happen in the
long run.

> 
> > First we need to get the new preemption model implemented. When it is, it
> > can be just a config option at first. Then when that config option is set,
> > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > between them at run time as they are just a way to tell the scheduler when
> > to set NEED_RESCHED_LAZY vs NEED_RESCHED.
> 
> Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> the runtime switching needs to be limited to NONE and VOLUNTARY.
> Which is fine.

But why? Because the run time switches of NONE and VOLUNTARY are no
different than FULL.

Why I say that? Because:

For all modes, once NEED_RESCHED_LAZY is set, the kernel has one tick to get
out or NEED_RESCHED will be set (of course that one tick may be configurable).
Once NEED_RESCHED is set, the kernel is effectively converted to PREEMPT_FULL.

Even if the user sets the mode to "NONE", after the above scenario (one tick
after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
than PREEMPT_FULL.

So why make the distinction for CONFIG_PREEMPT_RCU=n and limit it to only
NONE and VOLUNTARY? It must work with FULL or it will be broken for NONE
and VOLUNTARY one tick after NEED_RESCHED_LAZY is set.

> 
> > At that moment, when that config is set, the cond_resched() can turn into a
> > nop. This will allow for testing to make sure there are no regressions in
> > latency, even with the NONE mode enabled.  
> 
> And once it appears to be reasonably stable (in concept as well as
> implementation), heavy testing should get underway.

Agreed.

> 
> > The real test is implementing the code and seeing how it affects things in
> > the real world. Us arguing about it isn't going to get anywhere.  
> 
> Indeed, the opinion of the objective universe always wins.  It all too
> often takes longer than necessary for the people arguing with each other
> to realize this, but such is life.
> 
> >                                                                  I just
> > don't want blind NACK. A NACK to a removal of a cond_resched() needs to
> > show that there was a real regression with that removal.  
> 
> Fair enough, although a single commit bulk removing a large number of
> cond_resched() calls will likely get a bulk NAK.

We'll see. I now have a goal to hit!

-- Steve


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-05 20:18                       ` Ankur Arora
@ 2023-12-06  4:07                         ` Paul E. McKenney
  2023-12-07  1:33                           ` Ankur Arora
  0 siblings, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-06  4:07 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Steven Rostedt, Thomas Gleixner, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, Dec 05, 2023 at 12:18:26PM -0800, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
> >> On Mon, 4 Dec 2023 17:01:21 -0800
> >> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >>
> >> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> >> > > Paul!
> >> > >
> >> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> >> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> ...
> >> > >   3) Looking at the initial problem Ankur was trying to solve there is
> >> > >      absolutely no acceptable solution to solve that unless you think
> >> > >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
> >> > >      is anywhere near acceptable.
> >> >
> >> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> >> > argument, you need to find someone else to argue with.  ;-)
> >>
> >> Anyway, there's still a long path before cond_resched() can be removed. It
> >> was a mistake by Ankur to add those removals this early (and he has
> >> acknowledged that mistake).
> >
> > OK, that I can live with.  But that seems to be a bit different of a
> > take than that of some earlier emails in this thread.  ;-)
> 
> Heh I think it's just that this thread goes to (far) too many places :).
> 
> As Steven says, the initial series touching everything all together
> was a mistake. V1 adds the new preemption model alongside the existing
> ones and locally defines cond_resched() as a nop.
> 
> That'll allow us to experiment and figure out where there are latency
> gaps.

Sounds very good!

Again, I am very supportive of the overall direction.  Devils and details
and all that.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* RE: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-05 20:45                       ` Steven Rostedt
@ 2023-12-06 10:08                         ` David Laight
  2023-12-07  4:34                         ` Paul E. McKenney
  1 sibling, 0 replies; 250+ messages in thread
From: David Laight @ 2023-12-06 10:08 UTC (permalink / raw)
  To: 'Steven Rostedt', Paul E. McKenney
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, richard, mjguzik,
	Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

...
> > > One thing I would like to look into with the new work is to have holding a
> > > mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> > > converted to mutex in the RT kernel). That way you are less likely to be
> > > preempted while holding a mutex.
> >
> > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > in their fastpaths might have workloads that have a contrary opinion.
> 
> I don't understand your above statement. Maybe I wasn't clear with my
> statement? The above is more about PREEMPT_FULL, as it currently will
> preempt immediately. My above comment is that we can have an option for
> PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> it would at least hold off until there's no mutex held. Who cares if it's a
> fast path when a task needs to give up the CPU for another task?
>
> What I
> worry about is scheduling out while holding a mutex which increases the
> chance of that mutex being contended upon. Which does have drastic impact
> on performance.

Indeed.
You really don't want to preempt with a mutex held if it is about to be
released. Unfortunately this really requires a CBU (Crystal Ball Unit).

But I don't think the scheduler timer ticks can be anywhere near frequent
enough to do the 'slightly delayed preemption' that seems to be under
discussion - not without a massive overhead.

I can think of two typical uses of a mutex.
One is a short code path that doesn't usually sleep, but calls kmalloc()
so might do so. That could be quite 'hot' and you wouldn't really want
extra 'random' preemption.
The other is long paths that are going to wait for IO, any concurrent
call is expected to sleep. Extra preemption here probably won't matter.

So maybe the software should give the scheduler a hint.
Perhaps as a property of the mutex or the acquire request?
Or feeding the time the mutex has been held into the mix.
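
Purely as an illustration of the "hint on the acquire request" idea, a
minimal sketch (every name below, including the lazy_nopreempt task
field, is invented for discussion; nothing like this exists today):

#include <linux/mutex.h>
#include <linux/sched.h>

/* Hypothetical per-acquisition hint. */
enum mutex_lazy_hint {
	MUTEX_HINT_SHORT,	/* short, rarely-sleeping section: defer lazy preemption */
	MUTEX_HINT_LONG,	/* likely to block for I/O: preempt as usual */
};

static inline void mutex_lock_hinted(struct mutex *lock, enum mutex_lazy_hint hint)
{
	mutex_lock(lock);
	if (hint == MUTEX_HINT_SHORT)
		current->lazy_nopreempt++;	/* invented field the scheduler could honour */
}

static inline void mutex_unlock_hinted(struct mutex *lock, enum mutex_lazy_hint hint)
{
	if (hint == MUTEX_HINT_SHORT)
		current->lazy_nopreempt--;
	mutex_unlock(lock);
}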

It is all a bit like the difference between using spinlock() and
spinlock_irqsave() for a lock that will never be held by an ISR.
If the lock is ever likely to be contended you (IMHO) really want
to use the 'irqsave' version in order to stop the waiting thread
spinning for the duration of the ISR (and following softint).
There is (probably) little point worrying about IRQ latency; a
single readl() to our FPGA-based PCIe slave will spin a 3GHz cpu
for around 3000 clocks - that is a lot of normal code.
I've had terrible problems avoiding the extra pthread_mutex() hold
times caused by ISRs and softints; they must have the same effect on
spin_lock().
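
For reference, the difference is just the usual choice between the two
flavours; a trivial sketch (the lock and counter below are examples,
not real kernel objects):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(stats_lock);	/* example lock, never taken from IRQ context */
static unsigned long stats_counter;	/* example shared data */

static void stats_inc_plain(void)
{
	/* A waiter on another CPU keeps spinning through any ISR and
	 * softint this CPU takes while the lock is held. */
	spin_lock(&stats_lock);
	stats_counter++;
	spin_unlock(&stats_lock);
}

static void stats_inc_irqsave(void)
{
	unsigned long flags;

	/* Local interrupts off: the hold time, and hence the remote
	 * spinning, is bounded by the critical section itself. */
	spin_lock_irqsave(&stats_lock, flags);
	stats_counter++;
	spin_unlock_irqrestore(&stats_lock, flags);
}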

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-12-05  1:33       ` Paul E. McKenney
@ 2023-12-06 15:10         ` Thomas Gleixner
  2023-12-07  4:17           ` Paul E. McKenney
  2023-12-07  1:31         ` Ankur Arora
  1 sibling, 1 reply; 250+ messages in thread
From: Thomas Gleixner @ 2023-12-06 15:10 UTC (permalink / raw)
  To: paulmck
  Cc: Ankur Arora, linux-kernel, peterz, torvalds, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, rostedt, David.Laight, richard, mjguzik

Paul!

On Mon, Dec 04 2023 at 17:33, Paul E. McKenney wrote:
> On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
>> So:
>> 
>>     loop()
>> 
>>       preempt_disable();
>> 
>>       --> tick interrupt
>>             rcu_flavor_sched_clock_irq()
>>                 sets NEED_RESCHED
>> 
>>       preempt_enable()
>>         preempt_schedule()
>>           schedule()
>>             report_QS()
>> 
>> See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.
>
> Understood, but that does delay detection of that quiescent state by up
> to one tick.

Sure, but does that really matter in practice?

>> So if that turns out to matter in reality and not just by academic
>> inspection, then we are far better off to annotate such code with:
>> 
>>     do {
>>         preempt_lazy_disable();
>>         mutex_lock();
>>         do_stuff();
>>         mutex_unlock();
>>         preempt_lazy_enable();
>>     }
>> 
>> and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.
>
> I am not exactly sure what semantics you are proposing with this pairing
> as opposed to "this would be a good time to preempt in response to the
> pending lazy request".  But I do agree that something like this could
> replace at least a few more instance of cond_resched(), so that is good.
> Not necessarily all of them, though.

The main semantic difference is that such a mechanism is properly
nesting and can be eventually subsumed into the actual locking
constructs.
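
A minimal sketch of what "properly nesting" could look like (the
helpers and the per-task counter are made up for illustration; only
TIF_NEED_RESCHED_LAZY comes from this series, and the real thing would
mirror preempt_schedule() rather than call schedule() directly):

/* Hypothetical nesting counter, analogous in spirit to preempt_count(). */
static inline void preempt_lazy_disable(void)
{
	current->lazy_count++;		/* invented per-task field */
	barrier();
}

static inline void preempt_lazy_enable(void)
{
	barrier();
	if (!--current->lazy_count &&
	    test_thread_flag(TIF_NEED_RESCHED_LAZY))
		schedule();		/* simplified, see note above */
}

/*
 * Because this nests, mutex_lock()/mutex_unlock() could invoke the
 * pair internally, which is how the annotation could eventually be
 * subsumed into the locking constructs themselves.
 */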

>> Just insisting that RCU_PREEMPT=n requires cond_resched() and whatsoever
>> is not really getting us anywhere.
>
> Except that this is not what is happening, Thomas.  ;-)
>
> You are asserting that all of the cond_resched() calls can safely be
> eliminated.  That might well be, but more than assertion is required.
> You have come up with some good ways of getting rid of some classes of
> them, which is a very good and very welcome thing.  But that is not the
> same as having proved that all of them may be safely removed.

Neither have you proven that any of them will be required with the new
PREEMPT_LAZY model. :)

Your experience and knowledge in this area is certainly appreciated, but
under the changed semantics of LAZY it's debatable whether observations
and assumptions which are based on PREEMPT_NONE behaviour still apply.

We'll see.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-12-05  1:33       ` Paul E. McKenney
  2023-12-06 15:10         ` Thomas Gleixner
@ 2023-12-07  1:31         ` Ankur Arora
  2023-12-07  2:10           ` Steven Rostedt
  2023-12-07 14:22           ` Thomas Gleixner
  1 sibling, 2 replies; 250+ messages in thread
From: Ankur Arora @ 2023-12-07  1:31 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, rostedt, David.Laight,
	richard, mjguzik


Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
>> Paul!
>>
>> On Mon, Nov 20 2023 at 16:38, Paul E. McKenney wrote:
>> > But...
>> >
>> > Suppose we have a long-running loop in the kernel that regularly
>> > enables preemption, but only momentarily.  Then the added
>> > rcu_flavor_sched_clock_irq() check would almost always fail, making
>> > for extremely long grace periods.  Or did I miss a change that causes
>> > preempt_enable() to help RCU out?
>>
>> So first of all this is not any different from today and even with
>> RCU_PREEMPT=y a tight loop:
>>
>>     do {
>>     	preempt_disable();
>>         do_stuff();
>>         preempt_enable();
>>     }
>>
>> will not allow rcu_flavor_sched_clock_irq() to detect QS reliably. All
>> it can do is to force reschedule/preemption after some time, which in
>> turn ends up in a QS.
>
> True, but we don't run RCU_PREEMPT=y on the fleet.  So although this
> argument should offer comfort to those who would like to switch from
> forced preemption to lazy preemption, it doesn't help for those of us
> running NONE/VOLUNTARY.
>
> I can of course compensate if need be by making RCU more aggressive with
> the resched_cpu() hammer, which includes an IPI.  For non-nohz_full CPUs,
> it currently waits halfway to the stall-warning timeout.
>
>> The current NONE/VOLUNTARY models, which imply RCU_PRREMPT=n cannot do
>> that at all because the preempt_enable() is a NOOP and there is no
>> preemption point at return from interrupt to kernel.
>>
>>     do {
>>         do_stuff();
>>     }
>>
>> So the only thing which makes that "work" is slapping a cond_resched()
>> into the loop:
>>
>>     do {
>>         do_stuff();
>>         cond_resched();
>>     }
>
> Yes, exactly.
>
>> But the whole concept behind LAZY is that the loop will always be:
>>
>>     do {
>>     	preempt_disable();
>>         do_stuff();
>>         preempt_enable();
>>     }
>>
>> and the preempt_enable() will always be a functional preemption point.
>
> Understood.  And if preempt_enable() can interact with RCU when requested,
> I would expect that this could make quite a few calls to cond_resched()
> provably unnecessary.  There was some discussion of this:
>
> https://lore.kernel.org/all/0d6a8e80-c89b-4ded-8de1-8c946874f787@paulmck-laptop/
>
> There were objections to an earlier version.  Is this version OK?

Copying that version here for discussion purposes:

        #define preempt_enable() \
        do { \
                barrier(); \
                if (unlikely(preempt_count_dec_and_test())) \
                        __preempt_schedule(); \
                else if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && \
                         ((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) == PREEMPT_OFFSET) && \
                         !irqs_disabled()) \
                                rcu_all_qs(); \
        } while (0)

(sched_feat is not exposed outside the scheduler so I'm using the
!CONFIG_PREEMPT_RCU version here.)


I have two-fold objections to this: as PeterZ pointed out, this is
quite a bit heavier than the fairly minimal preempt_enable() -- both
conceptually, since the preemption logic now needs to know when to
check for a specific RCU quiescent state, and in terms of code size
(it seems to add about a cacheline's worth) at every preempt_enable()
site.

If we end up needing this, is it valid to just optimistically check whether
a quiescent state needs to be registered (see below)?
This version exposes rcu_data.rcu_urgent_qs outside RCU, though maybe
we can encapsulate that in linux/rcupdate.h.
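
One possible shape of that encapsulation, purely as a sketch (the
helper name is invented, and whether even this much belongs in
preempt_enable() is exactly the overhead question above):

/* include/linux/rcupdate.h (sketch) */
#ifdef CONFIG_PREEMPT_RCU
static inline bool rcu_urgent_qs_pending(void) { return false; }
#else
bool rcu_urgent_qs_pending(void);
#endif

/* kernel/rcu/tree_plugin.h (sketch) */
bool rcu_urgent_qs_pending(void)
{
	return unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs));
}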

For V1 I will go with this simple check in rcu_flavor_sched_clock_irq()
and see where that gets us:

>         if (this_cpu_read(rcu_data.rcu_urgent_qs))
>         	set_need_resched();

---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 9aa6358a1a16..d8139cda8814 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -226,9 +226,11 @@ do { \
 #ifdef CONFIG_PREEMPTION
 #define preempt_enable() \
 do { \
 	barrier(); \
 	if (unlikely(preempt_count_dec_and_test())) \
 		__preempt_schedule(); \
+	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \
+		rcu_all_qs_check(); \
 } while (0)

 #define preempt_enable_notrace() \
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 41021080ad25..2ba2743d7ba3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -887,6 +887,17 @@ void rcu_all_qs(void)
 }
 EXPORT_SYMBOL_GPL(rcu_all_qs);

+void rcu_all_qs_check(void)
+{
+	if (((preempt_count() &
+	      (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) == PREEMPT_OFFSET) &&
+	    !irqs_disabled())
+		rcu_all_qs();
+}
+EXPORT_SYMBOL_GPL(rcu_all_qs_check);
+
 /*
  * Note a PREEMPTION=n context switch. The caller must have disabled interrupts.
  */


--
ankur

^ permalink raw reply related	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-06  4:07                         ` Paul E. McKenney
@ 2023-12-07  1:33                           ` Ankur Arora
  0 siblings, 0 replies; 250+ messages in thread
From: Ankur Arora @ 2023-12-07  1:33 UTC (permalink / raw)
  To: paulmck
  Cc: Ankur Arora, Steven Rostedt, Thomas Gleixner, linux-kernel,
	peterz, torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen,
	hpa, mingo, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk,
	jgross, andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann


Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Dec 05, 2023 at 12:18:26PM -0800, Ankur Arora wrote:
>>
>> Paul E. McKenney <paulmck@kernel.org> writes:
>>
>> > On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
>> >> On Mon, 4 Dec 2023 17:01:21 -0800
>> >> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>> >>
>> >> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
>> >> > > Paul!
>> >> > >
>> >> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
>> >> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
>> ...
>> >> > >   3) Looking at the initial problem Ankur was trying to solve there is
>> >> > >      absolutely no acceptable solution to solve that unless you think
>> >> > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
>> >> > >      is anywhere near acceptable.
>> >> >
>> >> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
>> >> > argument, you need to find someone else to argue with.  ;-)
>> >>
>> >> Anyway, there's still a long path before cond_resched() can be removed. It
>> >> was a mistake by Ankur to add those removals this early (and he has
>> >> acknowledged that mistake).
>> >
>> > OK, that I can live with.  But that seems to be a bit different of a
>> > take than that of some earlier emails in this thread.  ;-)
>>
>> Heh I think it's just that this thread goes to (far) too many places :).
>>
>> As Steven says, the initial series touching everything all together
>> was a mistake. V1 adds the new preemption model alongside the existing
>> ones and locally defines cond_resched() as a nop.
>>
>> That'll allow us to experiment and figure out where there are latency
>> gaps.
>
> Sounds very good!
>
> Again, I am very supportive of the overall direction.  Devils and details
> and all that.  ;-)

Agreed. And thanks!

--
ankur

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-12-07  1:31         ` Ankur Arora
@ 2023-12-07  2:10           ` Steven Rostedt
  2023-12-07  4:37             ` Paul E. McKenney
  2023-12-07 14:22           ` Thomas Gleixner
  1 sibling, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-12-07  2:10 UTC (permalink / raw)
  To: Ankur Arora
  Cc: paulmck, Thomas Gleixner, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Wed, 06 Dec 2023 17:31:30 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> ---
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index 9aa6358a1a16..d8139cda8814 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -226,9 +226,11 @@ do { \
>  #ifdef CONFIG_PREEMPTION
>  #define preempt_enable() \
>  do { \
>  	barrier(); \
>  	if (unlikely(preempt_count_dec_and_test())) \
>  		__preempt_schedule(); \
> +	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \

Shouldn't this still have the:

	else if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && \

That is, is it needed when PREEMPT_RCU is set?

-- Steve


> +		rcu_all_qs_check();
>  } while (0)
> 
>  #define preempt_enable_notrace() \

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-12-06 15:10         ` Thomas Gleixner
@ 2023-12-07  4:17           ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-07  4:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ankur Arora, linux-kernel, peterz, torvalds, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, rostedt, David.Laight, richard, mjguzik

On Wed, Dec 06, 2023 at 04:10:18PM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Mon, Dec 04 2023 at 17:33, Paul E. McKenney wrote:
> > On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
> >> So:
> >> 
> >>     loop()
> >> 
> >>       preempt_disable();
> >> 
> >>       --> tick interrupt
> >>             rcu_flavor_sched_clock_irq()
> >>                 sets NEED_RESCHED
> >> 
> >>       preempt_enable()
> >>         preempt_schedule()
> >>           schedule()
> >>             report_QS()
> >> 
> >> See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.
> >
> > Understood, but that does delay detection of that quiescent state by up
> > to one tick.
> 
> Sure, but does that really matter in practice?

It might, but yes, I would expect it to matter far less than the other
things I have been calling out.

> >> So if that turns out to matter in reality and not just by academic
> >> inspection, then we are far better off to annotate such code with:
> >> 
> >>     do {
> >>         preempt_lazy_disable();
> >>         mutex_lock();
> >>         do_stuff();
> >>         mutex_unlock();
> >>         preempt_lazy_enable();
> >>     }
> >> 
> >> and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.
> >
> > I am not exactly sure what semantics you are proposing with this pairing
> > as opposed to "this would be a good time to preempt in response to the
> > pending lazy request".  But I do agree that something like this could
> > replace at least a few more instance of cond_resched(), so that is good.
> > Not necessarily all of them, though.
> 
> The main semantic difference is that such a mechanism is properly
> nesting and can be eventually subsumed into the actual locking
> constructs.

OK, fair enough.

And noting that testing should include workloads that exercise things
like mutex_lock() and mutex_trylock() fastpaths.

> >> Just insisting that RCU_PREEMPT=n requires cond_resched() and whatsoever
> >> is not really getting us anywhere.
> >
> > Except that this is not what is happening, Thomas.  ;-)
> >
> > You are asserting that all of the cond_resched() calls can safely be
> > eliminated.  That might well be, but more than assertion is required.
> > You have come up with some good ways of getting rid of some classes of
> > them, which is a very good and very welcome thing.  But that is not the
> > same as having proved that all of them may be safely removed.
> 
> Neither have you proven that any of them will be required with the new
> PREEMPT_LAZY model. :)

True.  But nor have you proven them unnecessary.  That will need to
wait for larger-scale testing.

> Your experience and knowledge in this area is certainly appreciated, but
> under the changed semantics of LAZY it's debatable whether observations
> and assumptions which are based on PREEMPT_NONE behaviour still apply.
> 
> We'll see.

That we will!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-05 20:45                       ` Steven Rostedt
  2023-12-06 10:08                         ` David Laight
@ 2023-12-07  4:34                         ` Paul E. McKenney
  2023-12-07 13:44                           ` Steven Rostedt
  1 sibling, 1 reply; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-07  4:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Tue, Dec 05, 2023 at 03:45:18PM -0500, Steven Rostedt wrote:
> On Tue, 5 Dec 2023 11:38:42 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > 
> > > Note that the new preemption model is a new paradigm and we need to start
> > > thinking a bit differently if we go to it.  
> > 
> > We can of course think differently, but existing hardware and software
> > will probably be a bit more stubborn.
> 
> Not at all. I don't see how hardware plays a role here, but how software is
> designed does sometimes require thinking differently.

The hardware runs the software and so gets its say.  And I of course do
agree that changes in software sometimes require thinking differently,
but I can also personally attest to how much work it is and how long it
takes to induce changes in thinking.  ;-)

> > > One thing I would like to look into with the new work is to have holding a
> > > mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> > > converted to mutex in the RT kernel). That way you are less likely to be
> > > preempted while holding a mutex.  
> > 
> > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > in their fastpaths might have workloads that have a contrary opinion.
> 
> I don't understand your above statement. Maybe I wasn't clear with my
> statement? The above is more about PREEMPT_FULL, as it currently will
> preempt immediately. My above comment is that we can have an option for
> PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> it would at least hold off until there's no mutex held. Who cares if it's a
> fast path when a task needs to give up the CPU for another task? What I
> worry about is scheduling out while holding a mutex which increases the
> chance of that mutex being contended upon. Which does have drastic impact
> on performance.

As I understand the current mutex_lock() code, the fastpaths leave no
scheduler-visible clue that a mutex is in fact held.  If there is no
such clue, it is quite likely that those fastpaths will need to do some
additional clue-leaving work, increasing their overhead.  And while it
is always possible that this overhead will be down in the noise, if it
was too far down in the noise there would be no need for those fastpaths.

So it is possible (but by no means certain) that some workloads will end
up caring.
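
To make that concrete, the "clue-leaving work" would presumably be
something of this shape (entirely hypothetical, including the
mutexes_held field; today's fastpath is just a cmpxchg of the owner
and keeps no such count):

/* Hypothetical: make held mutexes visible to the scheduler. */
static __always_inline void mutex_clue_acquire(void)
{
	current->mutexes_held++;	/* invented per-task counter */
}

static __always_inline void mutex_clue_release(void)
{
	current->mutexes_held--;
}

/*
 * The scheduler could then prefer NEED_RESCHED_LAZY while the count is
 * non-zero -- at the price of an extra task_struct word plus an
 * inc/dec on every mutex fastpath.
 */
static inline bool task_holds_mutex(struct task_struct *p)
{
	return READ_ONCE(p->mutexes_held) != 0;
}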

> > > > Another is the aforementioned situations where removing the cond_resched()
> > > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > > thing, and the people I chatted with are all for that, but it is only
> > > > natural that there would be a corresponding level of concern about the
> > > > cases where removing the cond_resched() calls increases latency.  
> > > 
> > > With the "capped preemption" I'm not sure that would still be the case.
> > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > the system had to already be in a situation that a schedule needs to
> > > happen. There's lots of places in the kernel that run for over a tick
> > > without any cond_resched(). The cond_resched() is usually added for
> > > locations that show tremendous latency (where either a watchdog triggered,
> > > or showed up in some analysis that had a latency that was much greater than
> > > a tick).  
> > 
> > For non-real-time workloads, the average case is important, not just the
> > worst case.  In the new lazily preemptible mode of thought, a preemption
> > by a non-real-time task will wait a tick.  Earlier, it would have waited
> > for the next cond_resched().  Which, in the average case, might have
> > arrived much sooner than one tick.
> 
> Or much later. It's random. And what's nice about this model, we can add
> more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> say "this task needs to preempt immediately" and not just for RT tasks.
> 
> This allows the user to decide which task preempts more and which does not
> (defined by the scheduler), instead of some random cond_resched() that can
> also preempt a higher priority task that just finished its quota to run a
> low priority task causing latency for the higher priority task.
> 
> This is what I mean by "think differently".

I did understand your meaning, and it is a source of some concern.  ;-)

When things become sufficiently stable, larger-scale tests will of course
be needed, not just different thought.

> > > The point is, if/when we switch to the new preemption model, we would need
> > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > done to prevent regressions. But the reasons I see cond_resched() being
> > > added today, should no longer exist with this new model.  
> > 
> > This I agree with.  Also, with the new paradigm and new mode of thought
> > in place, it should be safe to drop any cond_resched() that is in a loop
> > that consumes more than a tick of CPU time per iteration.
> 
> Why does that matter? Is the loop not important? Why stop it from finishing
> for some random task that may not be important, and cond_resched() has no
> idea if it is or not.

Because if it takes more than a tick to reach the next cond_resched(),
lazy preemption is likely to preempt before that cond_resched() is
reached.  Which suggests that such a cond_resched() would not be all
that valuable in the new thought paradigm.  Give or take potential issues
with exactly where the preemption happens.
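
In other words, the comparison is between today's pattern and the lazy
one; a sketch, with the list and loop body as placeholders:

/* Today: an explicit preemption point in a long-running loop. */
list_for_each_entry(obj, &big_list, node) {
	process(obj);		/* placeholder; may run for more than a tick */
	cond_resched();		/* only acts once NEED_RESCHED is already set */
}

/*
 * Lazy model: no explicit point.  If the loop runs past a tick with
 * NEED_RESCHED_LAZY pending, the tick promotes it to NEED_RESCHED and
 * the next preemption point (irqexit or preempt_enable()) reschedules.
 */
list_for_each_entry(obj, &big_list, node)
	process(obj);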

> > > > There might be others as well.  These are the possibilities that have
> > > > come up thus far.
> > > >   
> > > > > They all suck and keeping some of them is just counterproductive as
> > > > > again people will sprinkle them all over the place for the very wrong
> > > > > reasons.    
> > > > 
> > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > be useful and necessary?  ;-)  
> > > 
> > > They are only useful and necessary because of the way we handle preemption
> > > today. With the new preemption model, they are all likely to be useless and
> > > unnecessary ;-)  
> > 
> > The "all likely" needs some demonstration.  I agree that a great many
> > of them would be useless and unnecessary.  Maybe even the vast majority.
> > But that is different than "all".  ;-)
> 
> I'm betting it is "all" ;-) But I also agree that this "needs some
> demonstration". We are not there yet, and likely will not be until the
> second half of next year. So we have plenty of time to speak rhetorically
> to each other!

You know, we usually find time to engage in rhetorical conversation.  ;-)

> > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > As I mentioned above, cond_resched() is usually added when a problem was
> > > > seen. I really believe that those problems would never have been seen if
> > > the new paradigm had already been in place.  
> > 
> > Indeed, that sort of wording does quite the opposite of raising my
> > confidence levels.  ;-)
> 
> Yes, I admit the "manager speak" isn't something to brag about here. But I
> really do like that word. It's just fun to say (and spell)! Paradigm,
> paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> be like gnu, and pronounce it when speaking about free software? Although,
> that makes the word sound worse. :-p

Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!

> > You know, the ancient Romans would have had no problem dealing with the
> > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > intelligence and machine learning, and who knows what all else.  As the
> > Romans used to say, "Beware of geeks bearing grifts."
> > 
> > > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > > >      absolutely no acceptable solution to solve that unless you think
> > > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > > >      is anywhere near acceptable.    
> > > > 
> > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > argument, you need to find someone else to argue with.  ;-)  
> > > 
> > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > was a mistake by Ankur to add those removals this early (and he has
> > > acknowledged that mistake).  
> > 
> > OK, that I can live with.  But that seems to be a bit different of a
> > take than that of some earlier emails in this thread.  ;-)
> 
> Well, we are also stating the final goal as well. I think there's some
> confusion to what's going to happen immediately and what's going to happen
> in the long run.

If I didn't know better, I might suspect that in addition to the
confusion, there are a few differences of opinion.  ;-)

> > > First we need to get the new preemption model implemented. When it is, it
> > > can be just a config option at first. Then when that config option is set,
> > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > between them at run time as they are just a way to tell the scheduler when
> > > to set NEED_RESCHED_LAZY vs NEED_RESCHED.
> > 
> > Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > Which is fine.
> 
> But why? Because the run time switches of NONE and VOLUNTARY are no
> different than FULL.
> 
> Why I say that? Because:
> 
> For all modes, NEED_RESCHED_LAZY is set, the kernel has one tick to get out
> or NEED_RESCHED will be set (of course that one tick may be configurable).
> Once the NEED_RESCHED is set, then the kernel is converted to PREEMPT_FULL.
> 
> Even if the user sets the mode to "NONE", after the above scenario (one tick
> after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
> than PREEMPT_FULL.
> 
> So why make the difference between CONFIG_PREEMPT_RCU=n and limit to only
> NONE and VOLUNTARY. It must work with FULL or it will be broken for NONE
> and VOLUNTARY after one tick from NEED_RESCHED_LAZY being set.

Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
combination.  All of the gains from PREEMPT_FULL=y are more than lost
due to PREEMPT_RCU=n, especially when the kernel decides to do something
like walk a long task list under RCU protection.  We should not waste
people's time getting burned by this combination, nor should we waste
cycles testing it.

> > > At that moment, when that config is set, the cond_resched() can turn into a
> > > nop. This will allow for testing to make sure there are no regressions in
> > > latency, even with the NONE mode enabled.  
> > 
> > And once it appears to be reasonably stable (in concept as well as
> > implementation), heavy testing should get underway.
> 
> Agreed.
> 
> > 
> > > The real test is implementing the code and seeing how it affects things in
> > > the real world. Us arguing about it isn't going to get anywhere.  
> > 
> > Indeed, the opinion of the objective universe always wins.  It all too
> > often takes longer than necessary for the people arguing with each other
> > to realize this, but such is life.
> > 
> > >                                                                  I just
> > > don't want blind NACK. A NACK to a removal of a cond_resched() needs to
> > > show that there was a real regression with that removal.  
> > 
> > Fair enough, although a single commit bulk removing a large number of
> > cond_resched() calls will likely get a bulk NAK.
> 
> We'll see. I now have a goal to hit!

;-) ;-) ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-12-07  2:10           ` Steven Rostedt
@ 2023-12-07  4:37             ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-07  4:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ankur Arora, Thomas Gleixner, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik

On Wed, Dec 06, 2023 at 09:10:22PM -0500, Steven Rostedt wrote:
> On Wed, 06 Dec 2023 17:31:30 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> > ---
> > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > index 9aa6358a1a16..d8139cda8814 100644
> > --- a/include/linux/preempt.h
> > +++ b/include/linux/preempt.h
> > @@ -226,9 +226,11 @@ do { \
> >  #ifdef CONFIG_PREEMPTION
> >  #define preempt_enable() \
> >  do { \
> >  	barrier(); \
> >  	if (unlikely(preempt_count_dec_and_test())) \
> >  		__preempt_schedule(); \
> > +	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \
> 
> Shouldn't this still have the:
> 
> 	else if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && \
> 
> That is, is it needed when PREEMPT_RCU is set?

Given that PREEMPT_RCU has been getting along fine without it, I agree
with Steve on this one.  Unless and until someone demonstrates otherwise,
but such a demonstration would almost certainly affect current code,
not just the lazy-preemption changes.

							Thanx, Paul

> -- Steve
> 
> 
> > +		rcu_all_qs_check();
> >  } while (0)
> > 
> >  #define preempt_enable_notrace() \

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-07  4:34                         ` Paul E. McKenney
@ 2023-12-07 13:44                           ` Steven Rostedt
  2023-12-08  4:28                             ` Paul E. McKenney
  0 siblings, 1 reply; 250+ messages in thread
From: Steven Rostedt @ 2023-12-07 13:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Wed, 6 Dec 2023 20:34:11 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > > in their fastpaths might have workloads that have a contrary opinion.  
> > 
> > I don't understand your above statement. Maybe I wasn't clear with my
> > statement? The above is more about PREEMPT_FULL, as it currently will
> > preempt immediately. My above comment is that we can have an option for
> > PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> > it would at least hold off until there's no mutex held. Who cares if it's a
> > fast path when a task needs to give up the CPU for another task? What I
> > worry about is scheduling out while holding a mutex which increases the
> > chance of that mutex being contended upon. Which does have drastic impact
> > on performance.  
> 
> As I understand the current mutex_lock() code, the fastpaths leave no
> scheduler-visible clue that a mutex is in fact held.  If there is no
> such clue, it is quite likely that those fastpaths will need to do some
> additional clue-leaving work, increasing their overhead.  And while it
> is always possible that this overhead will be down in the noise, if it
> was too far down in the noise there would be no need for those fastpaths.
> 
> So it is possible (but by no means certain) that some workloads will end
> up caring.

OK, that makes more sense, and I do agree with that statement. It would
need to do something like spin locks do with preempt disable, but I agree,
this would need to be done in a way that does not cause performance
regressions.


> 
> > > > > Another is the aforementioned situations where removing the cond_resched()
> > > > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > > > thing, and the people I chatted with are all for that, but it is only
> > > > > natural that there would be a corresponding level of concern about the
> > > > > cases where removing the cond_resched() calls increases latency.    
> > > > 
> > > > With the "capped preemption" I'm not sure that would still be the case.
> > > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > > the system had to already be in a situation that a schedule needs to
> > > > happen. There's lots of places in the kernel that run for over a tick
> > > > without any cond_resched(). The cond_resched() is usually added for
> > > > locations that show tremendous latency (where either a watchdog triggered,
> > > > or showed up in some analysis that had a latency that was much greater than
> > > > a tick).    
> > > 
> > > For non-real-time workloads, the average case is important, not just the
> > > worst case.  In the new lazily preemptible mode of thought, a preemption
> > > by a non-real-time task will wait a tick.  Earlier, it would have waited
> > > for the next cond_resched().  Which, in the average case, might have
> > > arrived much sooner than one tick.  
> > 
> > Or much later. It's random. And what's nice about this model, we can add
> > more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> > say "this task needs to preempt immediately" and not just for RT tasks.
> > 
> > This allows the user to decide which task preempts more and which does not
> > (defined by the scheduler), instead of some random cond_resched() that can
> > also preempt a higher priority task that just finished its quota to run a
> > low priority task causing latency for the higher priority task.
> > 
> > This is what I mean by "think differently".  
> 
> I did understand your meaning, and it is a source of some concern.  ;-)
> 
> When things become sufficiently stable, larger-scale tests will of course
> be needed, not just different thought..

Fair enough.

> 
> > > > The point is, if/when we switch to the new preemption model, we would need
> > > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > > done to prevent regressions. But the reasons I see cond_resched() being
> > > > added today, should no longer exist with this new model.    
> > > 
> > > This I agree with.  Also, with the new paradigm and new mode of thought
> > > in place, it should be safe to drop any cond_resched() that is in a loop
> > > that consumes more than a tick of CPU time per iteration.  
> > 
> > Why does that matter? Is the loop not important? Why stop it from finishing
> > for some random task that may not be important, and cond_resched() has no
> > idea if it is or not.  
> 
> Because if it takes more than a tick to reach the next cond_resched(),
> lazy preemption is likely to preempt before that cond_resched() is
> reached.  Which suggests that such a cond_resched() would not be all
> that valuable in the new thought paradigm.  Give or take potential issues
> with exactly where the preemption happens.

I'm just saying there are lots of places where the above happens, which is
why we are still scattering cond_resched() all over the place.

> 
> > > > > There might be others as well.  These are the possibilities that have
> > > > > come up thus far.
> > > > >     
> > > > > > They all suck and keeping some of them is just counterproductive as
> > > > > > again people will sprinkle them all over the place for the very wrong
> > > > > > reasons.      
> > > > > 
> > > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > > be useful and necessary?  ;-)    
> > > > 
> > > > They are only useful and necessary because of the way we handle preemption
> > > > today. With the new preemption model, they are all likely to be useless and
> > > > unnecessary ;-)    
> > > 
> > > The "all likely" needs some demonstration.  I agree that a great many
> > > of them would be useless and unnecessary.  Maybe even the vast majority.
> > > But that is different than "all".  ;-)  
> > 
> > I'm betting it is "all" ;-) But I also agree that this "needs some
> > demonstration". We are not there yet, and likely will not be until the
> > second half of next year. So we have plenty of time to speak rhetorically
> > to each other!  
> 
> You know, we usually find time to engage in rhetorical conversation.  ;-)
> 
> > > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > > As I mentioned above, cond_resched() is usually added when a problem was
> > > > seen. I really believe that those problems would never had been seen if
> > > > the new paradigm had already been in place.    
> > > 
> > > Indeed, that sort of wording does quite the opposite of raising my
> > > confidence levels.  ;-)  
> > 
> > Yes, I admit the "manager speak" isn't something to brag about here. But I
> > really do like that word. It's just fun to say (and spell)! Paradigm,
> > paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> > be like gnu, and pronounce it when speaking about free software? Although,
> > that makes the word sound worse. :-p  
> 
> Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!

 Pair a' two-bits : that's all it's worth

Or

 Pair a' two-cents : as it's my two cents that I'm giving.


> 
> > > You know, the ancient Romans would have had no problem dealing with the
> > > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > > intelligence and machine learning, and who knows what all else.  As the
> > > Romans used to say, "Beware of geeks bearing grifts."
> > >   
> > > > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > > > >      absolutely no acceptable solution to solve that unless you think
> > > > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > > > >      is anywhere near acceptable.      
> > > > > 
> > > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > > argument, you need to find someone else to argue with.  ;-)    
> > > > 
> > > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > > was a mistake by Ankur to add those removals this early (and he has
> > > > acknowledged that mistake).    
> > > 
> > > OK, that I can live with.  But that seems to be a bit different of a
> > > take than that of some earlier emails in this thread.  ;-)  
> > 
> > Well, we are also stating the final goal as well. I think there's some
> > confusion to what's going to happen immediately and what's going to happen
> > in the long run.  
> 
> If I didn't know better, I might suspect that in addition to the
> confusion, there are a few differences of opinion.  ;-)

Confusion enhances differences of opinion.

> 
> > > > > First we need to get the new preemption model implemented. When it is, it
> > > > can be just a config option at first. Then when that config option is set,
> > > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > > between them at run time as they are just a way to tell the scheduler when
> > > > > to set NEED_RESCHED_LAZY vs NEED_RESCHED.
> > > 
> > > Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> > > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > > Which is fine.  
> > 
> > But why? Because the run time switches of NONE and VOLUNTARY are no
> > different than FULL.
> > 
> > Why I say that? Because:
> > 
> > For all modes, NEED_RESCHED_LAZY is set, the kernel has one tick to get out
> > or NEED_RESCHED will be set (of course that one tick may be configurable).
> > Once the NEED_RESCHED is set, then the kernel is converted to PREEMPT_FULL.
> > 
> > Even if the user sets the mode to "NONE", after the above scenario (one tick
> > after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
> > than PREEMPT_FULL.
> > 
> > So why make the difference between CONFIG_PREEMPT_RCU=n and limit to only
> > NONE and VOLUNTARY. It must work with FULL or it will be broken for NONE
> > and VOLUNTARY after one tick from NEED_RESCHED_LAZY being set.  
> 
> Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> combination.  All of the gains from PREEMPT_FULL=y are more than lost
> due to PREEMPT_RCU=n, especially when the kernel decides to do something
> like walk a long task list under RCU protection.  We should not waste
> people's time getting burned by this combination, nor should we waste
> cycles testing it.

The issue I see here is that PREEMPT_RCU is not something that we can
convert at run time, whereas NONE, VOLUNTARY, FULL (and more to come) can
be. And you have stated that PREEMPT_RCU adds some more overhead that
people may not care about. But even though you say PREEMPT_RCU=n makes no
sense with PREEMPT_FULL, it doesn't mean we should not allow it. Especially
if we have to make sure that it still works (even NONE and VOLUNTARY turn
to FULL after that one-tick).

Remember, what we are looking at is having:

N : NEED_RESCHED - schedule at next possible location
L : NEED_RESCHED_LAZY - schedule when going into user space.

When to set what for a task needing to schedule?

 Model           SCHED_OTHER         RT/DL(or user specified)
 -----           -----------         ------------------------
 NONE                 L                         L
 VOLUNTARY            L                         N
 FULL                 N                         N

By saying FULL, you are saying that you want the SCHED_OTHER as well as
RT/DL tasks to schedule as soon as possible and not wait until going into
user space. This is still applicable even with PREEMPT_RCU=n.
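
Spelled out as code, the table amounts to something like this sketch
(the model enum and function name are illustrative; resched_curr() and
the TIF flags are the real mechanisms under discussion):

/* Sketch of the decision above, e.g. from the tick or a wakeup. */
static void resched_curr_by_model(struct rq *rq, bool wants_immediate)
{
	switch (preempt_model) {		/* illustrative runtime knob */
	case PREEMPT_MODEL_NONE:
		set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
		break;
	case PREEMPT_MODEL_VOLUNTARY:
		if (wants_immediate)		/* RT/DL or user-specified */
			resched_curr(rq);	/* sets TIF_NEED_RESCHED */
		else
			set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
		break;
	case PREEMPT_MODEL_FULL:
		resched_curr(rq);
		break;
	}
}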

It may be that someone wants better latency for all tasks (like VOLUNTARY)
but not the overhead that PREEMPT_RCU gives, and is OK with the added
latency as a result.

-- Steve

^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n
  2023-12-07  1:31         ` Ankur Arora
  2023-12-07  2:10           ` Steven Rostedt
@ 2023-12-07 14:22           ` Thomas Gleixner
  1 sibling, 0 replies; 250+ messages in thread
From: Thomas Gleixner @ 2023-12-07 14:22 UTC (permalink / raw)
  To: Ankur Arora, paulmck
  Cc: Ankur Arora, linux-kernel, peterz, torvalds, linux-mm, x86, akpm,
	luto, bp, dave.hansen, hpa, mingo, juri.lelli, vincent.guittot,
	willy, mgorman, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, mingo,
	bristot, mathieu.desnoyers, geert, glaubitz, anton.ivanov,
	mattst88, krypton, rostedt, David.Laight, richard, mjguzik

On Wed, Dec 06 2023 at 17:31, Ankur Arora wrote:
> If we end up needing this, is it valid to just optimistically check if
> a quiescent state needs to be registered (see below)?
> Though this version exposes rcu_data.rcu_urgent_qs outside RCU but maybe
> we can encapsulate that in linux/rcupdate.h.

>  #ifdef CONFIG_PREEMPTION
>  #define preempt_enable() \
>  do { \
>  	barrier(); \
>  	if (unlikely(preempt_count_dec_and_test())) \
>  		__preempt_schedule(); \
> +	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \
> +		rcu_all_qs_check();

It's still bloat and we can debate this once we come to the conclusion
that the simple forced reschedule is not sufficient. Until then debating
this is just an academic exercise.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 250+ messages in thread

* Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
  2023-12-07 13:44                           ` Steven Rostedt
@ 2023-12-08  4:28                             ` Paul E. McKenney
  0 siblings, 0 replies; 250+ messages in thread
From: Paul E. McKenney @ 2023-12-08  4:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ankur Arora, linux-kernel, peterz, torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, mingo,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, mingo, bristot, mathieu.desnoyers, geert,
	glaubitz, anton.ivanov, mattst88, krypton, David.Laight, richard,
	mjguzik, Simon Horman, Julian Anastasov, Alexei Starovoitov,
	Daniel Borkmann

On Thu, Dec 07, 2023 at 08:44:57AM -0500, Steven Rostedt wrote:
> On Wed, 6 Dec 2023 20:34:11 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > > > in their fastpaths might have workloads that have a contrary opinion.  
> > > 
> > > I don't understand your above statement. Maybe I wasn't clear with my
> > > statement? The above is more about PREEMPT_FULL, as it currently will
> > > preempt immediately. My above comment is that we can have an option for
> > > PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> > > it would at least hold off until there's no mutex held. Who cares if it's a
> > > fast path when a task needs to give up the CPU for another task? What I
> > > worry about is scheduling out while holding a mutex which increases the
> > > chance of that mutex being contended upon. Which does have drastic impact
> > > on performance.  
> > 
> > As I understand the current mutex_lock() code, the fastpaths leave no
> > scheduler-visible clue that a mutex is in fact held.  If there is no
> > such clue, it is quite likely that those fastpaths will need to do some
> > additional clue-leaving work, increasing their overhead.  And while it
> > is always possible that this overhead will be down in the noise, if it
> > was too far down in the noise there would be no need for those fastpaths.
> > 
> > So it is possible (but by no means certain) that some workloads will end
> > up caring.
> 
> OK, that makes more sense, and I do agree with that statement. It would
> need to do something like spin locks do with preempt disable, but I agree,
> this would need to be done in a way not to cause performance regressions.

Whew!  ;-)

> > > > > > Another is the aforementioned situations where removing the cond_resched()
> > > > > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > > > > thing, and the people I chatted with are all for that, but it is only
> > > > > > natural that there would be a corresponding level of concern about the
> > > > > > cases where removing the cond_resched() calls increases latency.    
> > > > > 
> > > > > With the "capped preemption" I'm not sure that would still be the case.
> > > > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > > > the system had to already be in a situation that a schedule needs to
> > > > > happen. There's lots of places in the kernel that run for over a tick
> > > > > without any cond_resched(). The cond_resched() is usually added for
> > > > > locations that show tremendous latency (where either a watchdog triggered,
> > > > > or showed up in some analysis that had a latency that was much greater than
> > > > > a tick).    
> > > > 
> > > > For non-real-time workloads, the average case is important, not just the
> > > > worst case.  In the new lazily preemptible mode of thought, a preemption
> > > > by a non-real-time task will wait a tick.  Earlier, it would have waited
> > > > for the next cond_resched().  Which, in the average case, might have
> > > > arrived much sooner than one tick.  
> > > 
> > > Or much later. It's random. And what's nice about this model, we can add
> > > more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> > > say "this task needs to preempt immediately" and not just for RT tasks.
> > > 
> > > This allows the user to decide which task preempts more and which does not
> > > (defined by the scheduler), instead of some random cond_resched() that can
> > > also preempt a higher priority task that just finished its quota to run a
> > > low priority task causing latency for the higher priority task.
> > > 
> > > This is what I mean by "think differently".  
> > 
> > I did understand your meaning, and it is a source of some concern.  ;-)
> > 
> > When things become sufficiently stable, larger-scale tests will of course
> > be needed, not just different thought..
> 
> Fair enough.
> 
> > 
> > > > > The point is, if/when we switch to the new preemption model, we would need
> > > > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > > > done to prevent regressions. But the reasons I see cond_resched() being
> > > > > added today, should no longer exist with this new model.    
> > > > 
> > > > This I agree with.  Also, with the new paradigm and new mode of thought
> > > > in place, it should be safe to drop any cond_resched() that is in a loop
> > > > that consumes more than a tick of CPU time per iteration.  
> > > 
> > > Why does that matter? Is the loop not important? Why stop it from finishing
> > > for some random task that may not be important, and cond_resched() has no
> > > idea if it is or not.  
> > 
> > Because if it takes more than a tick to reach the next cond_resched(),
> > lazy preemption is likely to preempt before that cond_resched() is
> > reached.  Which suggests that such a cond_resched() would not be all
> > that valuable in the new thought paradigm.  Give or take potential issues
> > with exactly where the preemption happens.
> 
> I'm just saying there's lots of places that the above happens, which is why
> we are still scattering cond_resched() all over the place.

And I agree that greatly reducing (if not eliminating) such scattering
is a great benefit of lazy preemption.

> > > > > > There might be others as well.  These are the possibilities that have
> > > > > > come up thus far.
> > > > > >     
> > > > > > > They all suck and keeping some of them is just counterproductive as
> > > > > > > again people will sprinkle them all over the place for the very wrong
> > > > > > > reasons.      
> > > > > > 
> > > > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > > > be useful and necessary?  ;-)    
> > > > > 
> > > > > They are only useful and necessary because of the way we handle preemption
> > > > > today. With the new preemption model, they are all likely to be useless and
> > > > > unnecessary ;-)    
> > > > 
> > > > The "all likely" needs some demonstration.  I agree that a great many
> > > > of them would be useless and unnecessary.  Maybe even the vast majority.
> > > > But that is different than "all".  ;-)  
> > > 
> > > I'm betting it is "all" ;-) But I also agree that this "needs some
> > > demonstration". We are not there yet, and likely will not be until the
> > > second half of next year. So we have plenty of time to speak rhetorically
> > > to each other!  
> > 
> > You know, we usually find time to engage in rhetorical conversation.  ;-)
> > 
> > > > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > > > As I mentioned above, cond_resched() is usually added when a problem was
> > > > > seen. I really believe that those problems would never have been seen if
> > > > > the new paradigm had already been in place.    
> > > > 
> > > > Indeed, that sort of wording does quite the opposite of raising my
> > > > confidence levels.  ;-)  
> > > 
> > > Yes, I admit the "manager speak" isn't something to brag about here. But I
> > > really do like that word. It's just fun to say (and spell)! Paradigm,
> > > paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> > > be like gnu, and pronounce it when speaking about free software? Although,
> > > that makes the word sound worse. :-p  
> > 
> > Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!
> 
>  Pair a' two-bits : that's all it's worth
> 
> Or
> 
>  Pair a' two-cents : as it's my two cents that I'm giving.

I must confess that the occasional transliteration of paradigm to
pair-of-dimes has been a great sanity-preservation device over the
decades.  ;-)

> > > > You know, the ancient Romans would have had no problem dealing with the
> > > > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > > > intelligence and machine learning, and who knows what all else.  As the
> > > > Romans used to say, "Beware of geeks bearing grifts."
> > > >   
> > > > > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > > > > >      absolutely no acceptable solution to solve that unless you think
> > > > > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > > > > >      is anywhere near acceptable.      
> > > > > > 
> > > > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > > > argument, you need to find someone else to argue with.  ;-)    
> > > > > 
> > > > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > > > was a mistake by Ankur to add those removals this early (and he has
> > > > > acknowledged that mistake).    
> > > > 
> > > > OK, that I can live with.  But that seems to be a bit different of a
> > > > take than that of some earlier emails in this thread.  ;-)  
> > > 
> > > Well, we are also stating the final goal as well. I think there's some
> > > confusion to what's going to happen immediately and what's going to happen
> > > in the long run.  
> > 
> > If I didn't know better, I might suspect that in addition to the
> > confusion, there are a few differences of opinion.  ;-)
> 
> Confusion enhances differences of opinion.

That can happen, but then again confusion can also result in the
mere appearance of agreement.  ;-)

> > > > > First we need to get the new preemption model implemented. When it is, it
> > > > > can be just a config option at first. Then when that config option is set,
> > > > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > > > between them at run time as they are just a way to tell the scheduler when
> > > > > to set NEED_RESCHED_LAZY vs NEED_RSECHED.    
> > > > 
> > > > Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> > > > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > > > Which is fine.  
> > > 
> > > But why? Because the run-time switches to NONE and VOLUNTARY are no
> > > different than FULL.
> > > 
> > > Why do I say that? Because:
> > > 
> > > In all modes, once NEED_RESCHED_LAZY is set, the kernel has one tick to get
> > > out or NEED_RESCHED will be set (of course that one tick may be
> > > configurable). Once NEED_RESCHED is set, the kernel is effectively
> > > converted to PREEMPT_FULL.
> > > 
> > > Even if the user sets the mode to "NONE", after the above scenario (one tick
> > > after NEED_RESCHED_LAZY is set) the kernel will behave no differently
> > > than PREEMPT_FULL.
> > > 
> > > So why make CONFIG_PREEMPT_RCU=n a special case limited to only NONE and
> > > VOLUNTARY? It must work with FULL, or it will be broken for NONE and
> > > VOLUNTARY one tick after NEED_RESCHED_LAZY is set.  
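
A rough sketch of the one-tick escalation described above (the TIF_* flag
names are the ones used in this series; the helper and its call site are
purely illustrative, not the actual code):

#include <linux/sched.h>

/*
 * Hypothetical helper, called from the scheduler tick: if the current
 * task was marked for lazy rescheduling and is still in the kernel a
 * tick later, escalate to NEED_RESCHED so the next preemption point
 * (irq exit, preempt_enable(), ...) reschedules immediately -- from
 * that moment the task is effectively running under PREEMPT_FULL.
 */
static void escalate_lazy_resched(struct task_struct *curr)
{
        if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
            !test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
                set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
}

In the series itself the escalation is driven from the tick path (patch 42,
"sched: force preemption on tick expiration").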
> > 
> > Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> > combination.  All of the gains from PREEMPT_FULL=y are more than lost
> > due to PREEMPT_RCU=n, especially when the kernel decides to do something
> > like walk a long task list under RCU protection.  We should not waste
> > people's time getting burned by this combination, nor should we waste
> > cycles testing it.
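
For illustration, the kind of walk meant here (a sketch, not code from the
series; walk_all_tasks() is a made-up name): with PREEMPT_RCU=n,
rcu_read_lock() reduces to preempt_disable(), so the whole loop is one
non-preemptible section no matter how long the task list is.

#include <linux/rcupdate.h>
#include <linux/sched/signal.h>

static void walk_all_tasks(void)
{
        struct task_struct *p;

        rcu_read_lock();        /* with PREEMPT_RCU=n: preempt_disable() */
        for_each_process(p) {
                /* examine p; no preemption until rcu_read_unlock() */
        }
        rcu_read_unlock();      /* with PREEMPT_RCU=n: preempt_enable() */
}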
> 
> The issue I see here is that PREEMPT_RCU is not something that we can
> convert at run time, whereas NONE, VOLUNTARY, FULL (and more to come) can
> be. And you have stated that PREEMPT_RCU adds some overhead that people
> may not want. But even though you say PREEMPT_RCU=n makes no sense with
> PREEMPT_FULL, it doesn't mean we should not allow it. Especially since we
> have to make sure that it still works (even NONE and VOLUNTARY turn into
> FULL after that one tick).
> 
> Remember, what we are looking at is having:
> 
> N : NEED_RESCHED - schedule at next possible location
> L : NEED_RESCHED_LAZY - schedule when going into user space.
> 
> When to set what for a task needing to schedule?
> 
>  Model           SCHED_OTHER         RT/DL(or user specified)
>  -----           -----------         ------------------------
>  NONE                 L                         L
>  VOLUNTARY            L                         N
>  FULL                 N                         N
> 
> By saying FULL, you are saying that you want SCHED_OTHER as well as RT/DL
> tasks to schedule as soon as possible and not wait until they return to
> user space. This is still applicable even with PREEMPT_RCU=n.
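
The table above, transcribed into a self-contained sketch (the enum and the
helper are illustrative only, not code from the series):

#include <linux/types.h>

enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };

/*
 * Returns true when the deferred flag (NEED_RESCHED_LAZY, "L" in the
 * table) should be set, false when the immediate flag (NEED_RESCHED,
 * "N") should be set, for a task of the given class under the given
 * preemption model.
 */
static bool use_lazy_resched(enum preempt_model model, bool rt_or_dl)
{
        switch (model) {
        case MODEL_NONE:
                return true;            /* L for everything */
        case MODEL_VOLUNTARY:
                return !rt_or_dl;       /* N only for RT/DL */
        case MODEL_FULL:
        default:
                return false;           /* N for everything */
        }
}

In the series the actual decision is made in resched_curr() (patch 41,
"sched: handle resched policy in resched_curr()"); the sketch only restates
the table.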
> 
> It may be that someone wants better latency for all tasks (not just RT/DL
> as with VOLUNTARY) but not the overhead that PREEMPT_RCU brings, and is OK
> with the added latency that results from PREEMPT_RCU=n.

Given the additional testing burden and given the likelihood that it won't
do what people want, let's find someone who really needs it (as opposed
to someone who merely wants it) before allowing it to be selected.
It is, after all, an easy check, far from any fastpath, to prevent the
PREEMPT_RCU=n plus PREEMPT_FULL=y combination.
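
A sketch of where such a check could live -- in the slow path that switches
the preemption model at run time, far from any fastpath (the function name
is hypothetical, not from the series):

#include <linux/errno.h>
#include <linux/kconfig.h>

/* Hypothetical slow-path handler for switching the model to FULL. */
static int preempt_model_set_full(void)
{
        /* Reject FULL on a kernel built with PREEMPT_RCU=n. */
        if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
                return -EINVAL;

        /* ... actually switch the scheduler over to FULL ... */
        return 0;
}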

							Thanx, Paul


Thread overview: 250+ messages
2023-11-07 21:56 [RFC PATCH 00/86] Make the kernel preemptible Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 01/86] Revert "riscv: support PREEMPT_DYNAMIC with static keys" Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 02/86] Revert "sched/core: Make sched_dynamic_mutex static" Ankur Arora
2023-11-07 23:04   ` Steven Rostedt
2023-11-07 21:56 ` [RFC PATCH 03/86] Revert "ftrace: Use preemption model accessors for trace header printout" Ankur Arora
2023-11-07 23:10   ` Steven Rostedt
2023-11-07 23:23     ` Ankur Arora
2023-11-07 23:31       ` Steven Rostedt
2023-11-07 23:34         ` Steven Rostedt
2023-11-08  0:12           ` Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 04/86] Revert "preempt/dynamic: Introduce preemption model accessors" Ankur Arora
2023-11-07 23:12   ` Steven Rostedt
2023-11-08  4:59     ` Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 05/86] Revert "kcsan: Use " Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 06/86] Revert "entry: Fix compile error in dynamic_irqentry_exit_cond_resched()" Ankur Arora
2023-11-08  7:47   ` Greg KH
2023-11-08  9:09     ` Ankur Arora
2023-11-08 10:00       ` Greg KH
2023-11-07 21:56 ` [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()" Ankur Arora
2023-11-07 23:16   ` Steven Rostedt
2023-11-08  4:55     ` Ankur Arora
2023-11-09 17:26     ` Josh Poimboeuf
2023-11-09 17:31       ` Steven Rostedt
2023-11-09 17:51         ` Josh Poimboeuf
2023-11-09 22:50           ` Ankur Arora
2023-11-09 23:47             ` Josh Poimboeuf
2023-11-10  0:46               ` Ankur Arora
2023-11-10  0:56           ` Steven Rostedt
2023-11-07 21:56 ` [RFC PATCH 08/86] Revert "arm64: Support PREEMPT_DYNAMIC" Ankur Arora
2023-11-07 23:17   ` Steven Rostedt
2023-11-08 15:44   ` Mark Rutland
2023-11-07 21:56 ` [RFC PATCH 09/86] Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 10/86] Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY" Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 11/86] Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers" Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 12/86] Revert "sched/preempt: Refactor sched_dynamic_update()" Ankur Arora
2023-11-07 21:56 ` [RFC PATCH 13/86] Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 14/86] Revert "preempt/dynamic: Fix setup_preempt_mode() return value" Ankur Arora
2023-11-07 23:20   ` Steven Rostedt
2023-11-07 21:57 ` [RFC PATCH 15/86] Revert "preempt: Restore preemption model selection configs" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 16/86] Revert "sched: Provide Kconfig support for default dynamic preempt mode" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 17/86] sched/preempt: remove PREEMPT_DYNAMIC from the build version Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 18/86] Revert "preempt/dynamic: Fix typo in macro conditional statement" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 19/86] Revert "sched,preempt: Move preempt_dynamic to debug.c" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 20/86] Revert "static_call: Relax static_call_update() function argument type" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 21/86] Revert "sched/core: Use -EINVAL in sched_dynamic_mode()" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 22/86] Revert "sched/core: Stop using magic values " Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 23/86] Revert "sched,x86: Allow !PREEMPT_DYNAMIC" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 24/86] Revert "sched: Harden PREEMPT_DYNAMIC" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 25/86] Revert "sched: Add /debug/sched_preempt" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 26/86] Revert "preempt/dynamic: Support dynamic preempt with preempt= boot option" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 27/86] Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static call" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 28/86] Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static calls" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 29/86] Revert "preempt/dynamic: Provide cond_resched() and might_resched() " Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 30/86] Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC" Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 31/86] x86/thread_info: add TIF_NEED_RESCHED_LAZY Ankur Arora
2023-11-07 23:26   ` Steven Rostedt
2023-11-07 21:57 ` [RFC PATCH 32/86] entry: handle TIF_NEED_RESCHED_LAZY Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 33/86] entry/kvm: " Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 34/86] thread_info: accessors for TIF_NEED_RESCHED* Ankur Arora
2023-11-08  8:58   ` Peter Zijlstra
2023-11-21  5:59     ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 35/86] thread_info: change to tif_need_resched(resched_t) Ankur Arora
2023-11-08  9:00   ` Peter Zijlstra
2023-11-07 21:57 ` [RFC PATCH 36/86] entry: irqentry_exit only preempts TIF_NEED_RESCHED Ankur Arora
2023-11-08  9:01   ` Peter Zijlstra
2023-11-21  6:00     ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 37/86] sched: make test_*_tsk_thread_flag() return bool Ankur Arora
2023-11-08  9:02   ` Peter Zijlstra
2023-11-07 21:57 ` [RFC PATCH 38/86] sched: *_tsk_need_resched() now takes resched_t Ankur Arora
2023-11-08  9:03   ` Peter Zijlstra
2023-11-07 21:57 ` [RFC PATCH 39/86] sched: handle lazy resched in set_nr_*_polling() Ankur Arora
2023-11-08  9:15   ` Peter Zijlstra
2023-11-07 21:57 ` [RFC PATCH 40/86] context_tracking: add ct_state_cpu() Ankur Arora
2023-11-08  9:16   ` Peter Zijlstra
2023-11-21  6:32     ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 41/86] sched: handle resched policy in resched_curr() Ankur Arora
2023-11-08  9:36   ` Peter Zijlstra
2023-11-08 10:26     ` Ankur Arora
2023-11-08 10:46       ` Peter Zijlstra
2023-11-21  6:34         ` Ankur Arora
2023-11-21  6:31       ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 42/86] sched: force preemption on tick expiration Ankur Arora
2023-11-08  9:56   ` Peter Zijlstra
2023-11-21  6:44     ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 43/86] sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models Ankur Arora
2023-11-08  9:58   ` Peter Zijlstra
2023-11-07 21:57 ` [RFC PATCH 44/86] sched: voluntary preemption Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 45/86] preempt: ARCH_NO_PREEMPT only preempts lazily Ankur Arora
2023-11-08  0:07   ` Steven Rostedt
2023-11-08  8:47     ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 46/86] tracing: handle lazy resched Ankur Arora
2023-11-08  0:19   ` Steven Rostedt
2023-11-08  9:24     ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT Ankur Arora
2023-11-08  0:27   ` Steven Rostedt
2023-11-21  0:28     ` Paul E. McKenney
2023-11-21  3:43       ` Steven Rostedt
2023-11-21  5:04         ` Paul E. McKenney
2023-11-21  5:39           ` Ankur Arora
2023-11-21 15:00           ` Steven Rostedt
2023-11-21 15:19             ` Paul E. McKenney
2023-11-28 10:53               ` Thomas Gleixner
2023-11-28 18:30                 ` Ankur Arora
2023-12-05  1:03                   ` Paul E. McKenney
2023-12-05  1:01                 ` Paul E. McKenney
2023-12-05 15:01                   ` Steven Rostedt
2023-12-05 19:38                     ` Paul E. McKenney
2023-12-05 20:18                       ` Ankur Arora
2023-12-06  4:07                         ` Paul E. McKenney
2023-12-07  1:33                           ` Ankur Arora
2023-12-05 20:45                       ` Steven Rostedt
2023-12-06 10:08                         ` David Laight
2023-12-07  4:34                         ` Paul E. McKenney
2023-12-07 13:44                           ` Steven Rostedt
2023-12-08  4:28                             ` Paul E. McKenney
2023-11-08 12:15   ` Julian Anastasov
2023-11-07 21:57 ` [RFC PATCH 48/86] rcu: handle quiescent states for PREEMPT_RCU=n Ankur Arora
2023-11-21  0:38   ` Paul E. McKenney
2023-11-21  3:26     ` Ankur Arora
2023-11-21  5:17       ` Paul E. McKenney
2023-11-21  5:34         ` Paul E. McKenney
2023-11-21  6:13           ` Z qiang
2023-11-21 15:32             ` Paul E. McKenney
2023-11-21 19:25           ` Paul E. McKenney
2023-11-21 20:30             ` Peter Zijlstra
2023-11-21 21:14               ` Paul E. McKenney
2023-11-21 21:38                 ` Steven Rostedt
2023-11-21 22:26                   ` Paul E. McKenney
2023-11-21 22:52                     ` Steven Rostedt
2023-11-22  0:01                       ` Paul E. McKenney
2023-11-22  0:12                         ` Steven Rostedt
2023-11-22  1:09                           ` Paul E. McKenney
2023-11-28 17:04     ` Thomas Gleixner
2023-12-05  1:33       ` Paul E. McKenney
2023-12-06 15:10         ` Thomas Gleixner
2023-12-07  4:17           ` Paul E. McKenney
2023-12-07  1:31         ` Ankur Arora
2023-12-07  2:10           ` Steven Rostedt
2023-12-07  4:37             ` Paul E. McKenney
2023-12-07 14:22           ` Thomas Gleixner
2023-11-21  3:55   ` Z qiang
2023-11-07 21:57 ` [RFC PATCH 49/86] osnoise: handle quiescent states directly Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 50/86] rcu: TASKS_RCU does not need to depend on PREEMPTION Ankur Arora
2023-11-21  0:38   ` Paul E. McKenney
2023-11-07 21:57 ` [RFC PATCH 51/86] preempt: disallow !PREEMPT_COUNT or !PREEMPTION Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 52/86] sched: remove CONFIG_PREEMPTION from *_needbreak() Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 53/86] sched: fixup __cond_resched_*() Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 54/86] sched: add cond_resched_stall() Ankur Arora
2023-11-09 11:19   ` Thomas Gleixner
2023-11-09 22:27     ` Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 55/86] xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq() Ankur Arora
2023-11-07 21:57 ` [RFC PATCH 56/86] xarray: use cond_resched_xas*() Ankur Arora
2023-11-07 23:01 ` [RFC PATCH 00/86] Make the kernel preemptible Steven Rostedt
2023-11-07 23:43   ` Ankur Arora
2023-11-08  0:00     ` Steven Rostedt
2023-11-07 23:07 ` [RFC PATCH 57/86] coccinelle: script to remove cond_resched() Ankur Arora
2023-11-07 23:07   ` [RFC PATCH 58/86] treewide: x86: " Ankur Arora
2023-11-07 23:07   ` [RFC PATCH 59/86] treewide: rcu: " Ankur Arora
2023-11-21  1:01     ` Paul E. McKenney
2023-11-07 23:07   ` [RFC PATCH 60/86] treewide: torture: " Ankur Arora
2023-11-21  1:02     ` Paul E. McKenney
2023-11-07 23:07   ` [RFC PATCH 61/86] treewide: bpf: " Ankur Arora
2023-11-07 23:07   ` [RFC PATCH 62/86] treewide: trace: " Ankur Arora
2023-11-07 23:07   ` [RFC PATCH 63/86] treewide: futex: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 64/86] treewide: printk: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 65/86] treewide: task_work: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 66/86] treewide: kernel: " Ankur Arora
2023-11-17 18:14     ` Luis Chamberlain
2023-11-17 19:51       ` Steven Rostedt
2023-11-07 23:08   ` [RFC PATCH 67/86] treewide: kernel: remove cond_reshed() Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 68/86] treewide: mm: remove cond_resched() Ankur Arora
2023-11-08  1:28     ` Sergey Senozhatsky
2023-11-08  7:49       ` Vlastimil Babka
2023-11-08  8:02         ` Yosry Ahmed
2023-11-08  8:54           ` Ankur Arora
2023-11-08 12:58             ` Matthew Wilcox
2023-11-08 14:50               ` Steven Rostedt
2023-11-07 23:08   ` [RFC PATCH 69/86] treewide: io_uring: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 70/86] treewide: ipc: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 71/86] treewide: lib: " Ankur Arora
2023-11-08  9:15     ` Herbert Xu
2023-11-08 15:08       ` Steven Rostedt
2023-11-09  4:19         ` Herbert Xu
2023-11-09  4:43           ` Steven Rostedt
2023-11-08 19:15     ` Kees Cook
2023-11-08 19:41       ` Steven Rostedt
2023-11-08 22:16         ` Kees Cook
2023-11-08 22:21           ` Steven Rostedt
2023-11-09  9:39         ` David Laight
2023-11-07 23:08   ` [RFC PATCH 72/86] treewide: crypto: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 73/86] treewide: security: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 74/86] treewide: fs: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 75/86] treewide: virt: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 76/86] treewide: block: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 77/86] treewide: netfilter: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 78/86] treewide: net: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 79/86] " Ankur Arora
2023-11-08 12:16     ` Eric Dumazet
2023-11-08 17:11       ` Steven Rostedt
2023-11-08 20:59         ` Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 80/86] treewide: sound: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 81/86] treewide: md: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 82/86] treewide: mtd: " Ankur Arora
2023-11-08 16:28     ` Miquel Raynal
2023-11-08 16:32       ` Matthew Wilcox
2023-11-08 17:21         ` Steven Rostedt
2023-11-09  8:38           ` Miquel Raynal
2023-11-07 23:08   ` [RFC PATCH 83/86] treewide: drm: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 84/86] treewide: net: " Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 85/86] treewide: drivers: " Ankur Arora
2023-11-08  0:48     ` Chris Packham
2023-11-09  0:55       ` Ankur Arora
2023-11-09 23:25     ` Dmitry Torokhov
2023-11-09 23:41       ` Steven Rostedt
2023-11-10  0:01       ` Ankur Arora
2023-11-07 23:08   ` [RFC PATCH 86/86] sched: " Ankur Arora
2023-11-07 23:19   ` [RFC PATCH 57/86] coccinelle: script to " Julia Lawall
2023-11-08  8:29     ` Ankur Arora
2023-11-08  9:49       ` Julia Lawall
2023-11-21  0:45   ` Paul E. McKenney
2023-11-21  5:16     ` Ankur Arora
2023-11-21 15:26       ` Paul E. McKenney
2023-11-08  4:08 ` [RFC PATCH 00/86] Make the kernel preemptible Christoph Lameter
2023-11-08  4:33   ` Ankur Arora
2023-11-08  4:52     ` Christoph Lameter
2023-11-08  5:12       ` Steven Rostedt
2023-11-08  6:49         ` Ankur Arora
2023-11-08  7:54         ` Vlastimil Babka
2023-11-08  7:31 ` Juergen Gross
2023-11-08  8:51 ` Peter Zijlstra
2023-11-08  9:53   ` Daniel Bristot de Oliveira
2023-11-08 10:04   ` Ankur Arora
2023-11-08 10:13     ` Peter Zijlstra
2023-11-08 11:00       ` Ankur Arora
2023-11-08 11:14         ` Peter Zijlstra
2023-11-08 12:16           ` Peter Zijlstra
2023-11-08 15:38       ` Thomas Gleixner
2023-11-08 16:15         ` Peter Zijlstra
2023-11-08 16:22         ` Steven Rostedt
2023-11-08 16:49           ` Peter Zijlstra
2023-11-08 17:18             ` Steven Rostedt
2023-11-08 20:46             ` Ankur Arora
2023-11-08 20:26         ` Ankur Arora
2023-11-08  9:43 ` David Laight
2023-11-08 15:15   ` Steven Rostedt
2023-11-08 16:29     ` David Laight
2023-11-08 16:33 ` Mark Rutland
2023-11-09  0:34   ` Ankur Arora
2023-11-09 11:00     ` Mark Rutland
2023-11-09 22:36       ` Ankur Arora
