linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v7 00/23] Core scheduling v7
@ 2020-08-28 19:51 Julien Desfossez
  2020-08-28 19:51 ` [RFC PATCH v7 01/23] sched: Wrap rq::lock access Julien Desfossez
                   ` (22 more replies)
  0 siblings, 23 replies; 66+ messages in thread
From: Julien Desfossez @ 2020-08-28 19:51 UTC (permalink / raw)
  To: Peter Zijlstra, Vineeth Pillai, Joel Fernandes, Tim Chen,
	Aaron Lu, Aubrey Li, Dhaval Giani, Chris Hyser,
	Nishanth Aravamudan
  Cc: Julien Desfossez, mingo, tglx, pjt, torvalds, linux-kernel,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, joel, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, rostedt, derkling, benbjiang

Seventh iteration of the Core-Scheduling feature.

(Note that this iteration is a repost of the combined v6 and v6+ series,
and does not include the new interface discussed at LPC).

Core scheduling is a feature that allows only trusted tasks to run
concurrently on cpus sharing compute resources (eg: hyperthreads on a
core). The goal is to mitigate the core-level side-channel attacks
without requiring to disable SMT (which has a significant impact on
performance in some situations). Core scheduling (as of v7) mitigates
user-space to user-space attacks and user to kernel attack when one of
the siblings enters the kernel via interrupts or system call.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature can
also be beneficial for RT and performance applications where we want to
control how tasks make use of SMT dynamically.

This iteration focuses on the kernel protection when kernel runs along
with untrusted user mode tasks on a core. This was a main gap in the
security aspects provided by core scheduling feature in mitigating the
side-channel attacks. In v6, we had a minimal version where kernel was
protected from untrusted tasks when a sibling enters the kernel via
irq/softirq. System calls were not protected in the v6 patch and the
implementation had corner cases.  With this iteration, user mode tasks
running on siblings in a core are paused until the last sibling exits
kernel and thus guarantees that kernel code does not run concurrently
with untrusted user mode tasks. This protects both syscalls and IRQs vs
usermode and guest. It is also based on Thomas’s x86/entry branch which
moves generic entry code out of arch/

This iteration also fixes the hotplug crash and hang issues present in
v6. In v6, previous hotplug fixes were removed as they were getting
complicated. This iteration addresses the root cause of hot plug issues
and fixes them.

In terms of performance, the active VM-based workloads are still less
impacted by core scheduling than by disabling SMT. There are corner
cases and at this stage, since we are getting closer to merger, we are
actively looking for feedback and reproducers of those so we can address
them.

Performance tests:
sysbench is used to test the performance of the patch series. Used a 8
cpu/4 Core VM and ran 2 sysbench tests in parallel. Each sysbench test
runs 4 tasks:
sysbench --test=cpu --cpu-max-prime=100000 --num-threads=4 run

Compared the performance results for various combinations as below.
The metric below is 'events per second':

1. Coresched disabled
    sysbench-1/sysbench-2 => 175.7/175.6
2. Coreched enabled, both sysbench tagged
  sysbench-1/sysbench-2 => 168.8/165.6
3. Coresched enabled, sysbench-1 tagged and sysbench-2 untagged
    sysbench-1/sysbench-2 => 96.4/176.9
4. smt off
    sysbench-1/sysbench-2 => 97.9/98.8

When both sysbench-es are tagged, there is a perf drop of ~4%. With a
tagged/untagged case, the tagged one suffers because it always gets
stalled when the sibiling enters kernel. But this is no worse than
smtoff.

Also a modified rcutorture was used to heavily stress the kernel to make
sure there is not crash or instability.

Larger scale performance tests were performed where the host is running
the core scheduling kernel, each VM is in its own tag.

With 3 12-vcpus VMs on a 36 hardware thread NUMA node running linpack
(CPU-intensive), we go from 1:1 to 2:1 vcpus to hardware threads when we
disable SMT, the performance impact of core scheduling is -9.52% whereas
it is -25.41% with nosmt.

With 2 12-vcpus VMs database servers and 192 1-vcpus semi-idle VMs
running on 2 36 hardware threads NUMA nodes (3:1), the performance
impact is -47% with core scheduling and -66% with nosmt.

More details on those tests at the end of this presentation:
https://www.linuxplumbersconf.org/event/7/contributions/648/attachments/555/981/CoreScheduling_LPC2020.pdf

v7 is rebased on 5.9-rc2(d3ac97072ec5)
https://github.com/digitalocean/linux-coresched/tree/coresched/v7-v5.9-rc

TODO
----
(Mostly same as in v6)
- MAJOR: Core wide vruntime comparison re-work
  https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net/
  https://lwn.net/ml/linux-kernel/20200514130248.GD2940@hirez.programming.kicks-ass.net/
- MAJOR: Decide on the interfaces/API for exposing the feature to userland.
  - LPC 2020 gained consensus on a simple task level interface as a starter
  - Investigate ideas on overcoming unified hierarchy limitations with
    cgroup v2 controller
- MAJOR: Load balancing/Migration fixes for core scheduling.
  With v6, Load balancing is partially coresched aware, but has some
  issues w.r.t process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain/
- Investigate the source of the overhead even when no tasks are tagged:
  https://lkml.org/lkml/2019/10/29/242
- Core scheduling test framework: kselftests, torture tests etc

Changes in v7
-------------
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
-------------
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
-------------
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
-------------
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
-------------
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

--

Aaron Lu (2):
  sched/fair: wrapper for cfs_rq->min_vruntime
  sched/fair: core wide cfs task priority comparison

Aubrey Li (1):
  sched: migration changes for core scheduling

Joel Fernandes (Google) (6):
  irq_work: Add support to detect if work is pending
  entry/idle: Add a common function for activites during idle entry/exit
  arch/x86: Add a new TIF flag for untrusted tasks
  kernel/entry: Add support for core-wide protection of kernel-mode
  entry/idle: Enter and exit kernel protection during idle entry and
    exit
  Documentation: Add documentation on core scheduling

Peter Zijlstra (9):
  sched: Wrap rq::lock access
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched/fair: Add a few assertions
  sched: Basic tracking of matching tasks
  sched: Add core wide task selection and scheduling.
  sched: Trivial forced-newidle balancer
  sched: cgroup tagging interface for core scheduling
  sched: Debug bits...

Vineeth Pillai (5):
  bitops: Introduce find_next_or_bit
  cpumask: Introduce a new iterator for_each_cpu_wrap_or
  sched/fair: Fix forced idle sibling starvation corner case
  entry/kvm: Protect the kernel when entering from guest
  sched/coresched: config option for kernel protection

 .../admin-guide/hw-vuln/core-scheduling.rst   |  253 ++++
 Documentation/admin-guide/hw-vuln/index.rst   |    1 +
 .../admin-guide/kernel-parameters.txt         |    9 +
 arch/x86/include/asm/thread_info.h            |    2 +
 arch/x86/kvm/x86.c                            |    3 +
 include/asm-generic/bitops/find.h             |   16 +
 include/linux/cpumask.h                       |   42 +
 include/linux/entry-common.h                  |   22 +
 include/linux/entry-kvm.h                     |   12 +
 include/linux/irq_work.h                      |    1 +
 include/linux/sched.h                         |   21 +-
 kernel/Kconfig.preempt                        |   19 +
 kernel/entry/common.c                         |   90 +-
 kernel/entry/kvm.c                            |   12 +
 kernel/irq_work.c                             |   11 +
 kernel/sched/core.c                           | 1310 +++++++++++++++--
 kernel/sched/cpuacct.c                        |   12 +-
 kernel/sched/deadline.c                       |   34 +-
 kernel/sched/debug.c                          |    4 +-
 kernel/sched/fair.c                           |  400 +++--
 kernel/sched/idle.c                           |   30 +-
 kernel/sched/pelt.h                           |    2 +-
 kernel/sched/rt.c                             |   22 +-
 kernel/sched/sched.h                          |  253 +++-
 kernel/sched/stop_task.c                      |   13 +-
 kernel/sched/topology.c                       |    4 +-
 lib/cpumask.c                                 |   53 +
 lib/find_bit.c                                |   58 +-
 28 files changed, 2381 insertions(+), 328 deletions(-)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

--
2.17.1


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2020-09-25 15:02 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-28 19:51 [RFC PATCH v7 00/23] Core scheduling v7 Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 01/23] sched: Wrap rq::lock access Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 02/23] sched: Introduce sched_class::pick_task() Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 03/23] sched: Core-wide rq->lock Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 04/23] sched/fair: Add a few assertions Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 05/23] sched: Basic tracking of matching tasks Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 06/23] bitops: Introduce find_next_or_bit Julien Desfossez
2020-09-03  5:13   ` Randy Dunlap
2020-08-28 19:51 ` [RFC PATCH v7 07/23] cpumask: Introduce a new iterator for_each_cpu_wrap_or Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling Julien Desfossez
2020-08-28 20:51   ` Peter Zijlstra
2020-08-28 22:02     ` Vineeth Pillai
2020-08-28 22:23       ` Joel Fernandes
2020-08-29  7:47       ` peterz
2020-08-31 13:01         ` Vineeth Pillai
2020-08-31 14:24         ` Joel Fernandes
2020-09-01  3:38         ` Joel Fernandes
2020-09-01  5:10         ` Joel Fernandes
2020-09-01 12:34           ` Vineeth Pillai
2020-09-01 17:30             ` Joel Fernandes
2020-09-01 21:23               ` Vineeth Pillai
2020-09-02  1:11                 ` Joel Fernandes
2020-08-28 20:55   ` Peter Zijlstra
2020-08-28 22:15     ` Vineeth Pillai
2020-09-15 20:08   ` Joel Fernandes
2020-08-28 19:51 ` [RFC PATCH v7 09/23] sched/fair: Fix forced idle sibling starvation corner case Julien Desfossez
2020-08-28 21:25   ` Peter Zijlstra
2020-08-28 23:24     ` Vineeth Pillai
2020-08-28 19:51 ` [RFC PATCH v7 10/23] sched/fair: wrapper for cfs_rq->min_vruntime Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison Julien Desfossez
2020-08-28 21:29   ` Peter Zijlstra
2020-09-17 14:15     ` Vineeth Pillai
2020-09-17 20:39       ` Vineeth Pillai
2020-09-23  1:46     ` Joel Fernandes
2020-09-23  1:52       ` Joel Fernandes
2020-09-25 15:02         ` Joel Fernandes
2020-09-15 21:49   ` chris hyser
     [not found]     ` <81b208ad-b9e6-bfbf-631e-02e9f75d73a2@linux.intel.com>
2020-09-16 14:24       ` chris hyser
2020-09-16 20:53         ` chris hyser
2020-09-17  1:09           ` Li, Aubrey
2020-08-28 19:51 ` [RFC PATCH v7 12/23] sched: Trivial forced-newidle balancer Julien Desfossez
2020-09-02  7:08   ` Pavan Kondeti
2020-08-28 19:51 ` [RFC PATCH v7 13/23] sched: migration changes for core scheduling Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 14/23] irq_work: Add support to detect if work is pending Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 15/23] entry/idle: Add a common function for activites during idle entry/exit Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 16/23] arch/x86: Add a new TIF flag for untrusted tasks Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode Julien Desfossez
2020-09-01 15:54   ` Thomas Gleixner
2020-09-01 16:50     ` Joel Fernandes
2020-09-01 20:02       ` Thomas Gleixner
2020-09-02  1:29         ` Joel Fernandes
2020-09-02  7:53           ` Thomas Gleixner
2020-09-02 15:12             ` Joel Fernandes
2020-09-02 16:57             ` Dario Faggioli
2020-09-03  4:34               ` Joel Fernandes
2020-09-03 11:05                 ` Vineeth Pillai
2020-09-03 13:20                 ` Thomas Gleixner
2020-09-03 20:30                   ` Joel Fernandes
2020-09-03 13:43                 ` Dario Faggioli
2020-09-03 20:25                   ` Joel Fernandes
2020-08-28 19:51 ` [RFC PATCH v7 18/23] entry/idle: Enter and exit kernel protection during idle entry and exit Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 19/23] entry/kvm: Protect the kernel when entering from guest Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 20/23] sched/coresched: config option for kernel protection Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 21/23] sched: cgroup tagging interface for core scheduling Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 22/23] Documentation: Add documentation on " Julien Desfossez
2020-08-28 19:51 ` [RFC PATCH v7 23/23] sched: Debug bits Julien Desfossez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).