All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v8 0/3] Add scheduler overview documentation
@ 2020-09-02 16:26 John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew

This patch series updates the scheduler documentation to add more topics
wrt to scheduler overview. New sections are added to provide a brief
overview of the kernel structs used by the scheduler, scheduler invocation,
and context switch. Previous version of the patch was reviewed at:
https://lore.kernel.org/lkml/20200527084421.4673-1-John.Mathew@unikie.com/

version 8:
 - Rebase

version 7:
 -Fix overview description
 -Removed rst headers
 -Removed kernel-doc for struct rq and meged it as struct
  member comments

version 6:
 -Fix typos.

version 5:
 -Fix description error on CAS

version 4:
 -Added section on Capacity-Aware Scheduling
 -Reworded CFS recently added features.
 -Removed vruntime description from scheduler structs
 -Added description of idle and stopper sched classses

version 3:
 -Fix spelling, spacing and typo errors.

version 2:
- Remove :c:func: directive as it was redundant
- Limit document width (line symbol count) to 75
- Replace dot file with ASCII art
- Describe prepare_task_switch(), ASID use, 
  kernel/user transtion, MIPS FPU affinity correctly
- Add missing references to files
- Removed internal APIs from scheduler API reference
- Described rq struct member as kernel-doc comments
- Replaced CFS history with CFS current status
- Added documentation for sched_class fields
- Refined explanation of context swtich functionality
- Replace CFS history with recent changes
- Added kernel-doc comments for struct rq

John Mathew (3):
  docs: scheduler: Restructure scheduler documentation.
  docs: scheduler: Add scheduler overview  documentation
  docs: scheduler: Add introduction to scheduler context-switch

 Documentation/scheduler/arch-specific.rst     |  14 +
 Documentation/scheduler/cfs-overview.rst      |  59 ++++
 Documentation/scheduler/context-switching.rst | 126 ++++++++
 Documentation/scheduler/index.rst             |  34 +-
 .../scheduler/mips-context-switch.rst         |  89 ++++++
 Documentation/scheduler/overview.rst          | 297 ++++++++++++++++++
 .../scheduler/sched-data-structs.rst          | 176 +++++++++++
 Documentation/scheduler/sched-debugging.rst   |  14 +
 Documentation/scheduler/sched-features.rst    |  25 ++
 Documentation/scheduler/scheduler-api.rst     |  24 ++
 .../scheduler/x86-context-switch.rst          |  55 ++++
 kernel/sched/core.c                           |  21 +-
 kernel/sched/sched.h                          |  63 +++-
 13 files changed, 978 insertions(+), 19 deletions(-)
 create mode 100644 Documentation/scheduler/arch-specific.rst
 create mode 100644 Documentation/scheduler/cfs-overview.rst
 create mode 100644 Documentation/scheduler/context-switching.rst
 create mode 100644 Documentation/scheduler/mips-context-switch.rst
 create mode 100644 Documentation/scheduler/overview.rst
 create mode 100644 Documentation/scheduler/sched-data-structs.rst
 create mode 100644 Documentation/scheduler/sched-debugging.rst
 create mode 100644 Documentation/scheduler/sched-features.rst
 create mode 100644 Documentation/scheduler/scheduler-api.rst
 create mode 100644 Documentation/scheduler/x86-context-switch.rst

-- 
2.17.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation.
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
@ 2020-09-02 16:26 ` John Mathew
  2020-09-09 10:28   ` Lukas Bulwahn
  2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew

Add new sections to enable addition of new documentation on
the scheduler. Existing documentation is moved under the related
new sections. The sections are
  - overview
  - sched-features
  - arch-specific.rst
  - sched-debugging.rst

Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: John Mathew <john.mathew@unikie.com>
---
 Documentation/scheduler/arch-specific.rst   | 12 +++++++++
 Documentation/scheduler/index.rst           | 30 +++++++++++----------
 Documentation/scheduler/overview.rst        |  5 ++++
 Documentation/scheduler/sched-debugging.rst | 14 ++++++++++
 Documentation/scheduler/sched-features.rst  | 25 +++++++++++++++++
 5 files changed, 72 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/scheduler/arch-specific.rst
 create mode 100644 Documentation/scheduler/overview.rst
 create mode 100644 Documentation/scheduler/sched-debugging.rst
 create mode 100644 Documentation/scheduler/sched-features.rst

diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
new file mode 100644
index 000000000000..3e5af3a0695e
--- /dev/null
+++ b/Documentation/scheduler/arch-specific.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Architecture Specific Scheduler Implementation Differences
+==========================================================
+
+.. class:: toc-title
+
+	   Table of contents
+
+.. toctree::
+   :maxdepth: 2
+
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 88900aabdbf7..6e88a070c503 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -1,24 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
 ===============
 Linux Scheduler
 ===============
 
-.. toctree::
-    :maxdepth: 1
+This documentation outlines the Linux kernel scheduler with its concepts,
+details about the scheduler design and its data structures and architecture
+specific implementation differences.
+
 
+.. class:: toc-title
+
+    Table of contents
+
+.. toctree::
+    :maxdepth: 2
 
-    completion
-    sched-arch
-    sched-bwc
-    sched-deadline
+    overview
     sched-design-CFS
-    sched-domains
-    sched-capacity
-    sched-energy
-    sched-nice-design
-    sched-rt-group
-    sched-stats
-
-    text_files
+    sched-features
+    arch-specific
+    sched-debugging
 
 .. only::  subproject and html
 
diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
new file mode 100644
index 000000000000..a1d2d26629eb
--- /dev/null
+++ b/Documentation/scheduler/overview.rst
@@ -0,0 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================
+Scheduler overview
+====================
\ No newline at end of file
diff --git a/Documentation/scheduler/sched-debugging.rst b/Documentation/scheduler/sched-debugging.rst
new file mode 100644
index 000000000000..e332069f99d6
--- /dev/null
+++ b/Documentation/scheduler/sched-debugging.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Scheduler Debugging Interface
+==============================
+
+.. class:: toc-title
+
+	   Table of contents
+
+.. toctree::
+   :maxdepth: 2
+
+   sched-stats
+   text_files
diff --git a/Documentation/scheduler/sched-features.rst b/Documentation/scheduler/sched-features.rst
new file mode 100644
index 000000000000..8eb90e86e489
--- /dev/null
+++ b/Documentation/scheduler/sched-features.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Scheduler Features
+===================
+
+.. class:: toc-title
+
+	Table of contents
+
+.. toctree::
+   :maxdepth: 1
+
+   completion
+   sched-arch
+   sched-bwc
+   sched-deadline
+   sched-domains
+   sched-capacity
+   sched-energy
+   sched-nice-design
+   sched-rt-group
+   sched-stats
+
+   text_files
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview  documentation
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
@ 2020-09-02 16:26 ` John Mathew
  2020-09-09 11:23   ` Lukas Bulwahn
  2020-09-02 16:26 ` [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch John Mathew
  2020-09-09  8:37 ` [RFC PATCH v8 0/3] Add scheduler overview documentation Lukas Bulwahn
  3 siblings, 1 reply; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew, Mostafa Chamanara, Oleg Tsymbal

Add documentation for
 -scheduler overview
 -scheduler state transtion
 -CFS overview
 -scheduler data structs

Add rst for scheduler APIs and modify sched/core.c
to add kernel-doc comments.

Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Co-developed-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Signed-off-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Co-developed-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: John Mathew <john.mathew@unikie.com>
---
 Documentation/scheduler/cfs-overview.rst      |  59 ++++
 Documentation/scheduler/index.rst             |   3 +
 Documentation/scheduler/overview.rst          | 294 +++++++++++++++++-
 .../scheduler/sched-data-structs.rst          | 176 +++++++++++
 Documentation/scheduler/scheduler-api.rst     |  24 ++
 kernel/sched/core.c                           |  21 +-
 kernel/sched/sched.h                          |  63 +++-
 7 files changed, 634 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/scheduler/cfs-overview.rst
 create mode 100644 Documentation/scheduler/sched-data-structs.rst
 create mode 100644 Documentation/scheduler/scheduler-api.rst

diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
new file mode 100644
index 000000000000..1524c24da897
--- /dev/null
+++ b/Documentation/scheduler/cfs-overview.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============
+CFS Overview
+=============
+
+Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
+Scheduler (CFS) implemented as a scheduling module. A brief overview of the
+CFS design is provided in :doc:`sched-design-CFS`
+
+In addition there have been many improvements to the CFS, a few of which are
+
+Tracking available capacity
+---------------------------
+Scale CPU capacity mechanism for CFS so it knows how much CPU capacity is left
+for its use after higher priority sched classes (RT, DL), IRQs and
+'Thermal Pressure' have reduced the 'original' CPU capacity.
+Thermal pressure on a CPU means the maximum possible capacity is
+unavailable due to thermal events.
+
+NUMA balancing
+--------------
+Attempt to migrate tasks to the NUMA Node where the frequently accessed memory
+pages belongs. The scheduler gets information about memory placement through the
+paging mechanism. Scheduler periodically scans the virtual memory of the tasks
+and make them inaccessible by changing the memory protection. The flag
+MM_CP_PROT_NUMA indicates this purpose. When the task attempts to access
+the memory again a page fault occurs. Scheduler traps the fault and increments
+the counters in a task specific array corresponding to the NUMA node id.
+There array is divided in to four regions: faults_memory, faults_cpu,
+faults_memory_buffer and faults_cpu_buffer, where faults_memory is the
+exponential decaying average of faults on a per-node basis. The 'preferred
+node' is found by looping through the array and finding the node with the
+highest number of faults. Migration to the preferred node is done periodically
+by either swapping two tasks tasks between their respective CPUs or
+just moving a task to its preferred node CPU. It the migration or move fails
+it will be retried.
+
+Energy Aware Scheduling
+-----------------------
+For asymmetric CPU capacity topologies, an Energy Model is used to figure out
+which of the CPU candidates is the most energy-efficient. Capacity is the
+amount of work which a CPU can perform at its highest frequency which is
+calculated by the Per-Entity Load Tracking (PELT) mechanism.
+EAS is described at :doc:`sched-energy`
+
+Capacity Aware Scheduling
+--------------------------
+Migrate a task to a CPU which meets its compute demand. In asymmetric CPU
+capacity topologies CFS scheduler frequently updates the 'Misfit' status of
+tasks and migrate them to CPU's of higher capacity. Also during wakeups the
+a CPU with sufficient capacity is found for executing the task. CAS is
+described at :doc:`sched-capacity`
+
+
+
+
+
+
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 6e88a070c503..e3b1d4fc1604 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -17,10 +17,13 @@ specific implementation differences.
     :maxdepth: 2
 
     overview
+    sched-data-structs
+    cfs-overview
     sched-design-CFS
     sched-features
     arch-specific
     sched-debugging
+    scheduler-api
 
 .. only::  subproject and html
 
diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
index a1d2d26629eb..f2fb8f419919 100644
--- a/Documentation/scheduler/overview.rst
+++ b/Documentation/scheduler/overview.rst
@@ -2,4 +2,296 @@
 
 ====================
 Scheduler overview
-====================
\ No newline at end of file
+====================
+
+Linux kernel implements priority-based scheduling. More than one process are
+allowed to run at any given time and each process is allowed to run as if it
+were the only process on the system. The process scheduler coordinates which
+process runs when. In that context, it has the following tasks:
+
+  - share CPUs equally among all currently running processes.
+  - pick appropriate process to run next if required, considering scheduling
+    class/policy and process priorities.
+  - balance processes between multiple CPUs in SMP systems.
+
+The scheduler attempts to be responsive for I/O bound processes and efficient
+for CPU bound processes. The scheduler uses different scheduling policies
+for real time and normal processes based on their respective policy
+enumerations. Scheduler adds support for each policy through scheduling class
+implementations for each. The five scheduling classes which scheduler provides
+are:
+
+  - stop_sched_class:
+    It is a per-cpu maximum priority CPU monopolization mechanism. It is
+    exposed as a SCHED_FIFO task ('migration/X') with static priority of 99
+    in the user space. This is done to make it compatible with user space and
+    thus to avoid growing the ABI. It is used by one CPU to stop another
+    in order to run a specific function, so it is only available on SMP
+    systems. This class is used by the scheduler for task migration between
+    CPUs.
+
+  - dl_sched_class:
+    Implements the SCHED_DEADLINE scheduling policy. It has static priority
+    of -1 in kernel space. This policy schedules each task according to the
+    task's deadline. The task with the earliest deadline will be served first.
+
+  - rt_sched_class:
+    Implements the SCHED_RR and SCHED_FIFO policies. Real time static
+    priorities range from 1(low)..99 in the user space. (priority is inverted
+    in kernel space). It is the only scheduling class that makes use of the
+    static priority of the task. SCHED_FIFO is a simple scheduling algorithm
+    without time slicing. A SCHED_FIFO thread runs until either it is blocked
+    by an I/O request, it is preempted by a higher priority thread, or it
+    calls sched_yield(). SCHED_RR is a simple enhancement of SCHED_FIFO where
+    a thread is allowed to run only for a maximum time quantum.
+
+  - fair_sched_class:
+    Implements the SCHED_NORMAL SCHED_BATCH and SCHED_IDLE  policies. Static
+    priority is always 0 in the user space. A dynamic priority based on
+    'nice' value is used to schedule these tasks. This priority increases each
+    time the the task  is scheduled to run but denied to run by scheduler.
+    This ensures fair scheduling between these tasks. Nice value is an
+    attribute which can be set by the user to influence scheduler to favour
+    a particular task. SCHED_BATCH is similar to SCHED_NORMAL with the
+    difference that the policy causes the scheduler to assume that the task
+    is CPU-intensive. SCHED_IDLE policy also has static priority 0. Nice
+    value has no effect on this policy. Weight mapping is not done, instead
+    weight is set at a constant minimal weight WEIGHT_IDLEPRIO. Used to
+    run tasks at extremely low priority.
+
+  - idle_sched_class:
+    Priority for idle task is irrelevant. This class is not related to
+    SCHED_IDLE policy. Idle tasks run when there are no other runnable tasks
+    on a CPU. The execute the idle loop which is responsible to put a CPU
+    in one of its idle states.
+
+
+Process Management
+==================
+
+Each process in the system is represented by struct task_struct. When a
+process/thread is created, the kernel allocates a new task_struct for it.
+The kernel then stores this task_struct in an RCU list. Macro next_task()
+allows a process to obtain its next task and for_each_process() macro enables
+traversal of the list.
+
+Frequently used fields of the task struct are:
+
+ - state: The running state of the task. The possible states are:
+
+    - TASK_RUNNING: The task is currently running or in a run queue waiting
+      to run.
+    - TASK_INTERRUPTIBLE: The task is sleeping waiting for some event to occur.
+      This task can be interrupted by signals. On waking up the task transitions
+      to TASK_RUNNING.
+    - TASK_UNINTERRUPTIBLE: Similar to TASK_INTERRUPTIBLE but does not wake
+      up on signals. Needs an explicit wake-up call to be woken up. Contributes
+      to loadavg.
+    - __TASK_TRACED: Task is being traced by another task like a debugger.
+    - __TASK_STOPPED: Task execution has stopped and not eligible to run.
+      SIGSTOP, SIGTSTP etc causes this state.  The task can be continued by
+      the signal SIGCONT.
+    - TASK_PARKED: State to support kthread parking/unparking.
+    - TASK_DEAD: If a task dies, then it sets TASK_DEAD in tsk->state and calls
+      schedule one last time. The task will be never ran again.
+    - TASK_WAKEKILL: It works like TASK_UNINTERRUPTIBLE with the bonus that it
+      can respond to fatal signals.
+    - TASK_WAKING: To handle concurrent waking of the same task for SMP.
+      Indicates that someone is already waking the task.
+    - TASK_NOLOAD: To be used along with TASK_UNINTERRUPTIBLE to indicate
+      an idle task which does not contribute to loadavg.
+    - TASK_NEW: Set during fork(), to guarantee that no one will run the task,
+      a signal or any other wake event cannot wake it up and insert it on
+      the runqueue.
+
+ - exit_state : The exiting state of the task. The possible states are:
+
+    - EXIT_ZOMBIE: The task is terminated and waiting for parent to collect
+      the exit information of the task.
+    - EXIT_DEAD: After collecting the exit information the task is put to
+      this state and removed from the system.
+
+ - static_prio: Used by the fair scheduling class to encode the nice level.
+   It does not have any effect on the SCHED_DEADLINE, SCHED_FIFO or SCHED_RR
+   policy tasks.
+
+ - prio: The value of this field is used to:
+
+    - distinguish scheduling classes.
+    - in the RR/FIFO static priority scheduler.
+
+ - normal_prio: Expected priority of a task. The value of static_prio
+   and normal_prio are the same for non-real-time processes. For real time
+   processes value of prio is used.
+
+ - rt_priority: Field used to set priority of real time tasks. Not used by the
+   rt_sched_class.
+
+ - sched_class: Pointer to sched_class structure of the policy that the task
+   belongs to.
+
+ - sched_entity: Pointer to sched_entity CFS structure.
+
+ - policy: scheduling policy of the task. See above.
+
+ - nr_cpus_allowed: Hamming weight of the bitmask retrieved from cpumask pointer.
+
+New tasks are created using the fork() system call which is described
+at manpage `FORK(2)` or the clone system call described at manpage `CLONE(2)`.
+Users can create threads within a process to achieve parallelism. Threads
+share address space, open files and other resources of the process. Threads
+are created like normal tasks with their unique task_struct, but clone()
+is provided with flags that enable the sharing of resources such as address
+space ::
+
+	clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
+
+The scheduler schedules task_structs so from scheduler perspective there is
+no difference between threads and processes. Threads are created using
+the system call pthread_create described at its manpage `PTHREAD_CREATE(3)`
+POSIX threads creation is described at its manpage `PTHREADS(7)`
+
+The Scheduler Entry Point
+=========================
+
+The main scheduler entry point is an architecture independent schedule()
+function defined in kernel/sched/core.c. Its objective is to find a process in
+the runqueue list and then assign the CPU to it. It is invoked, directly
+or in a lazy (deferred) way from many different places in the kernel. A lazy
+invocation does not call the function by its name, but gives the kernel a
+hint by setting a flag TIF_NEED_RESCHED. The flag is a message to the kernel
+that the scheduler should be invoked as soon as possible because another
+process deserves to run. The flag should not be modified directly.
+
+Following are some places that notify the kernel to schedule which can be
+classified based on the type of operations:
+
+  - Blocking operations: Suspends the current task and directly call into
+    the scheduler to find something else to do. Some blocking operations are:
+
+      - mutex_lock()
+      - wait_event()
+      - do_exit()
+      - preempt_schedule_irq()
+
+  - Co-operative or voluntary preemptions: Allows another task to run at that
+    point subject to preemption model. Voluntary preemption model can be
+    set through the kernel config option: CONFIG_PREEMPT_VOLUNTARY. The
+    operations are:
+
+      - cond_resched()
+      - cond_resched_lock()
+      - yield()
+      - preempt_enable()
+
+  - Involuntary preemption: Marks TIF_NEED_RESCHED and wait for action
+    depending on preemption model. Involuntary preemption operations are:
+
+      - scheduler_tick()
+      - wake_up_process()
+
+Calling functions mentioned above leads to a call to __schedule(). Note
+that preemption must be disabled before it is called and enabled after
+the call using preempt_disable and preempt_enable functions family.
+
+
+The steps during invocation are:
+--------------------------------
+1. Disable preemption to avoid another task preempting the scheduling
+   thread itself.
+2. Retrieve the runqueue of current processor and its lock is obtained to
+   allow only one thread to modify the runqueue at a time.
+3. The state of the previously executed task when the schedule()
+   was called is examined. If it is not runnable and has not been
+   preempted in kernel mode, it is removed from the runqueue. If the
+   previous task has non-blocked pending signals, its state is set to
+   TASK_RUNNING and left in the runqueue.
+4. Scheduler classes are iterated and the corresponding class hook to
+   pick the next suitable task to be scheduled on the CPU is called.
+   Since most tasks are handled by the sched_fair class, a shortcut to this
+   class is implemented in the beginning of the function.
+5. TIF_NEED_RESCHED and architecture specific need_resched flags are cleared.
+6. If the scheduler class picks a different task from what was running
+   before, a context switch is performed by calling context_switch().
+   Internally, context_switch() switches to the new task's memory map and
+   swaps the register state and stack. If scheduler class picked the same
+   task as the previous task, no task switch is performed and the current
+   task keeps running.
+7. Balance callback list is processed. Each scheduling class can migrate tasks
+   between CPUs to balance load. These load balancing operations are queued
+   on a Balance callback list which get executed when balance_callback() is
+   called.
+8. The runqueue is unlocked and preemption is re-enabled. In case
+   preemption was requested during the time in which it was disabled,
+   schedule() is run again right away.
+
+Scheduler State Transition
+==========================
+
+A very high level scheduler state transition flow with a few states can
+be depicted as follows. ::
+
+                                       *
+                                       |
+                                       | task
+                                       | forks
+                                       v
+                        +------------------------------+
+                        |           TASK_NEW           |
+                        |        (Ready to run)        |
+                        +------------------------------+
+                                       |
+                                       |
+                                       v
+                     +------------------------------------+
+                     |            TASK_RUNNING            |
+   +---------------> |           (Ready to run)           | <--+
+   |                 +------------------------------------+    |
+   |                   |                                       |
+   |                   | schedule() calls context_switch()     | task is preempted
+   |                   v                                       |
+   |                 +------------------------------------+    |
+   |                 |            TASK_RUNNING            |    |
+   |                 |             (Running)              | ---+
+   | event occurred  +------------------------------------+
+   |                   |
+   |                   | task needs to wait for event
+   |                   v
+   |                 +------------------------------------+
+   |                 |         TASK_INTERRUPTIBLE         |
+   |                 |        TASK_UNINTERRUPTIBLE        |
+   +-----------------|           TASK_WAKEKILL            |
+                     +------------------------------------+
+                                       |
+                                       | task exits via do_exit()
+                                       v
+                        +------------------------------+
+                        |          TASK_DEAD           |
+                        |         EXIT_ZOMBIE          |
+                        +------------------------------+
+
+
+Scheduler provides trace events tracing all major events of the scheduler.
+The trace events are defined in ::
+
+  include/trace/events/sched.h
+
+Using these trace events it is possible to model the scheduler state transition
+in an automata model. The following journal paper discusses such modeling:
+
+Daniel B. de Oliveira, Rômulo S. de Oliveira, Tommaso Cucinotta, **A thread
+synchronization model for the PREEMPT_RT Linux kernel**, *Journal of Systems
+Architecture*, Volume 107, 2020, 101729, ISSN 1383-7621,
+https://doi.org/10.1016/j.sysarc.2020.101729.
+
+To model the scheduler efficiently the system was divided in to generators
+and specifications. Some of the generators used were "need_resched",
+"sleepable" and "runnable", "thread_context" and "scheduling context".
+The specifications are the necessary and sufficient conditions to call
+the scheduler. New trace events were added to specify the generators
+and specifications. In case a kernel event referred to more than one
+event, extra fields of the kernel event was used to distinguish between
+automation events. The final model was generated from parallel composition
+of all generators and specifications which composed of 34 events,
+12 generators and 33 specifications. This resulted in 9017 states, and
+20103 transitions.
diff --git a/Documentation/scheduler/sched-data-structs.rst b/Documentation/scheduler/sched-data-structs.rst
new file mode 100644
index 000000000000..b8d968c70bfc
--- /dev/null
+++ b/Documentation/scheduler/sched-data-structs.rst
@@ -0,0 +1,176 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=========================
+Scheduler Data Structures
+=========================
+
+The main parts of the Linux scheduler are:
+
+Runqueue
+~~~~~~~~
+
+struct rq is the central data structure of process scheduling. It keeps track
+of tasks that are in a runnable state assigned for a particular processor.
+Each CPU has its own run queue and stored in a per CPU array::
+
+    DEFINE_PER_CPU(struct rq, runqueues);
+
+Access to the queue requires locking and lock acquire operations must be
+ordered by ascending runqueue. Macros for accessing and locking the runqueue
+are provided in::
+
+    kernel/sched/sched.h
+
+The runqueue contains scheduling class specific queues and several scheduling
+statistics.
+
+Scheduling entity
+~~~~~~~~~~~~~~~~~
+Scheduler uses scheduling entities which contain sufficient information to
+actually accomplish the scheduling job of a task or a task-group. The
+scheduling entity may be a group of tasks or a single task. Every task is
+associated with a sched_entity structure. CFS adds support for nesting of
+tasks and task groups. Each scheduling entity may be run from its parents
+runqueue. The scheduler traverses the sched_entity hierarchy to pick the
+next task to run on the CPU. The entity gets picked up from the cfs_rq on
+which it is queued and its time slice is divided among all the tasks on its my_q.
+
+Scheduler classes
+~~~~~~~~~~~~~~~~~
+It is an extensible hierarchy of scheduler modules. The modules encapsulate
+scheduling policy details. They are called from the core code which is
+independent. Scheduling classes are implemented through the sched_class
+structure. The five scheduling classes are stop_sched_class, dl_sched_class,
+rt_sched_class, fair_sched_class and idle_sched_class. The important methods
+of scheduler class are:
+
+  - sched_class::enqueue_task()
+  - sched_class::dequeue_task()
+    These functions are used to put and remove tasks from the runqueue
+    respectively to change a property of a task. This is referred to as
+    change pattern. Change is defined as the following sequence of calls:
+
+      - dequeue_task()
+      - put_prev_task()
+      - change a property
+      - enqueue_task()
+      - set_next_task()
+
+    The enqueue_task function takes the runqueue, the task which needs to
+    be enqueued/dequeued and a bit mask of flags as parameters. The main
+    purpose of the flags is to describe why the enqueue or dequeue is being
+    called. The different flags used are described in ::
+
+        kernel/sched/sched.h
+
+    Some places where the enqueue_task and dequeue_task are called for
+    changing task properties are:
+
+      - When migrating a task from one CPU's runqueue to another.
+      - When changing a tasks CPU affinity.
+      - When changing the priority of a task.
+      - When changing the nice value of the task.
+      - When changing the scheduling policy and/or RT priority of a thread.
+
+  - sched_class::pick_next_task()
+    Called by the scheduler to pick the next best task to run. The scheduler
+    iterates through the corresponding functions of the scheduler classes
+    in priority order to pick up the next best task to run. Since tasks
+    belonging to the idle class and fair class are frequent, the scheduler
+    optimizes the picking of next task to call the pick_next_task_fair()
+    if the previous task was of the similar scheduling class.
+
+  - sched_class::put_prev_task()
+    Called by the scheduler when a running task is being taken off a CPU.
+    The behavior of this function depends on individual scheduling classes.
+    In CFS class this function is used to put the currently running task back
+    into the CFS RB tree. When a task is running it is dequeued from the tree.
+    This is to prevent redundant enqueue's and dequeue's for updating its
+    vruntime. vruntime of tasks on the tree needs to be updated by update_curr()
+    to keep the tree in sync. In SCHED_DEADLINE and RT classes additional tree
+    is maintained to push tasks from the current CPU to another CPU where the
+    task can preempt and start executing. Task will be added to this queue
+    if it is present on the scheduling class rq and the task has affinity
+    to more than one CPU. set_next_task() is called on the task returned from
+    this function.
+
+  - sched_class::set_next_task()
+    Pairs with the put_prev_task(), this function is called when the next
+    task is set to run on the CPU. This function is called in all the places
+    where put_prev_task is called to complete the 'change pattern'. In case
+    of CFS scheduling class, it will set current scheduling entity to the
+    picked task and accounts bandwidth usage on the cfs_rq. In addition it
+    will also remove the current entity from the CFS runqueue for the vruntime
+    update optimization, opposite to what was done in put_prev_task.
+    For the SCHED_DEADLINE and RT classes it will remove the task from the
+    tree of pushable tasks trigger the balance callback to push another task
+    which is non running on the current CPU for execution on another CPU.
+
+      - dequeue the picked task from the tree of pushable tasks.
+      - update the load average in case the previous task belonged to another
+        class.
+      - queues the function to push tasks from current runqueue to other CPUs
+        which can preempt and start execution. Balance callback list is used.
+
+  - sched_class::task_tick()
+    Called from scheduler_tick(), hrtick() and sched_tick_remote() to update
+    the current task statistics and load averages. Also restarting the high
+    resolution tick timer is done if high resolution timers are enabled.
+    scheduler_tick() runs at 1/HZ and is called from the timer interrupt
+    handler of the Kernel internal timers.
+    hrtick() is called from high resolution timers to deliver an accurate
+    preemption tick as the regular scheduler tick that runs at 1/HZ can be
+    too coarse when nice levels are used.
+    sched_tick_remote() gets called by the offloaded residual 1Hz scheduler
+    tick. In order to reduce interruptions to bare metal tasks, it is possible
+    to outsource these scheduler ticks to the global workqueue so that a
+    housekeeping CPU handles those remotely.
+
+  - sched_class::select_task_rq()
+    Called by scheduler to get the CPU to assign a task to and migrating
+    tasks between CPUs. Flags describe the reason the function was called.
+    Called by try_to_wake_up() with SD_BALANCE_WAKE flag which wakes up a
+    sleeping task.
+    Called by wake_up_new_task() with SD_BALANCE_FORK flag which wakes up a
+    newly forked task.
+    Called by sched_exec() with SD_BALANCE_EXEC which is called from execv
+    syscall.
+    SCHED_DEADLINE class decides the CPU on which the task should be woken
+    up based on the deadline. RT class decides based on the RT priority. Fair
+    scheduling class balances load by selecting the idlest CPU in the
+    idlest group, or under certain conditions an idle sibling CPU if the
+    domain has SD_WAKE_AFFINE set.
+
+  - sched_class::balance()
+    Called by pick_next_task() from scheduler to enable scheduling classes
+    to pull tasks from runqueues of other CPUs for balancing task execution
+    between the CPUs.
+
+  - sched_class::task_fork()
+    Called from sched_fork() of scheduler which assigns a task to a CPU.
+    Fair scheduling class updates runqueue clock, runtime statistics and
+    vruntime for the scheduling entity.
+
+  - sched_class::yield_task()
+    Called from SYSCALL sched_yield to yield the CPU to other tasks.
+    SCHED_DEADLINE class forces the runtime of the task to zero using a special
+    flag and dequeues the task from its trees. RT class requeues the task
+    entities to the end of the run list. Fair scheduling class implements
+    the buddy mechanism. This allows skipping onto the next highest priority
+    scheduling entity at every level in the CFS tree, unless doing so would
+    introduce gross unfairness in CPU time distribution.
+
+  - sched_class::check_preempt_curr()
+    Check whether the task that woke up should preempt the currently
+    running task. Called by scheduler
+
+      - when moving queued task to new runqueue
+      - ttwu()
+      - when waking up newly created task for the first time.
+
+    SCHED_DEADLINE class compares the deadlines of the tasks and calls
+    scheduler function resched_curr() if the preemption is needed. In case
+    the deadlines are equal, migratability of the tasks is used a criteria
+    for preemption.
+    RT class behaves the same except it uses RT priority for comparison.
+    Fair class sets the buddy hints before calling resched_curr() to preempt.
diff --git a/Documentation/scheduler/scheduler-api.rst b/Documentation/scheduler/scheduler-api.rst
new file mode 100644
index 000000000000..6b9cb03d0d86
--- /dev/null
+++ b/Documentation/scheduler/scheduler-api.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================
+Scheduler related functions
+=============================
+
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: scheduler_tick
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: try_to_wake_up
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: do_task_dead
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: preempt_schedule_irq
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: prepare_task_switch
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: finish_task_switch
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d95dc3f4644..99a3b5ab4039 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3976,9 +3976,13 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
-/*
+/**
+ * scheduler_tick - sched tick timer handler
+ *
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
+ *
+ * Return: 0.
  */
 void scheduler_tick(void)
 {
@@ -4367,8 +4371,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
-/*
- * __schedule() is the main scheduler function.
+/**
+ * __schedule() - the main scheduler function.
+ *
+ * @preempt: preemption enabled/disabled
  *
  * The main means of driving the scheduler and thus entering this function are:
  *
@@ -4533,6 +4539,12 @@ static void __sched notrace __schedule(bool preempt)
 	balance_callback(rq);
 }
 
+/**
+ * do_task_dead - handle task exit
+ *
+ * Changes the task state to TASK_DEAD and calls
+ * schedule to pick next task to run.
+ */
 void __noreturn do_task_dead(void)
 {
 	/* Causes final put_task_struct in finish_task_switch(): */
@@ -4764,7 +4776,8 @@ EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
 #endif /* CONFIG_PREEMPTION */
 
-/*
+/**
+ * preempt_schedule_irq - schedule from irq context
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.
  * Note, that this is called and return with irqs disabled. This will
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..11b2125df8b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -902,16 +902,25 @@ struct rq {
 	 */
 	unsigned int		nr_running;
 #ifdef CONFIG_NUMA_BALANCING
+    /* Number of tasks running that care about their NUMA placement */
 	unsigned int		nr_numa_running;
+	/* Number of tasks that are optimally NUMA placed */
 	unsigned int		nr_preferred_running;
+	/*
+	 * Per runqueue variable to check if NUMA-balance is active on the
+	 * run-queue
+	 */
 	unsigned int		numa_migrate_on;
 #endif
 #ifdef CONFIG_NO_HZ_COMMON
 #ifdef CONFIG_SMP
+    /* Tick stamp for decay of blocked load */
 	unsigned long		last_blocked_load_update_tick;
+	/* Idle CPU has blocked load */
 	unsigned int		has_blocked_load;
 	call_single_data_t	nohz_csd;
 #endif /* CONFIG_SMP */
+    /* CPU is going idle with tick stopped */
 	unsigned int		nohz_tick_stopped;
 	atomic_t		nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
@@ -927,14 +936,17 @@ struct rq {
 	unsigned int		uclamp_flags;
 #define UCLAMP_FLAG_IDLE 0x01
 #endif
-
+	/* Fair scheduling class runqueue */
 	struct cfs_rq		cfs;
+	/* RT scheduling class runqueue */
 	struct rt_rq		rt;
+	/* Deadline scheduing class runqueue */
 	struct dl_rq		dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
+	/* Reference to add child before its parent in leaf_cfs_rq_list */
 	struct list_head	*tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -946,19 +958,28 @@ struct rq {
 	 */
 	unsigned long		nr_uninterruptible;
 
+	/* Currently running task of this rq */
 	struct task_struct __rcu	*curr;
+	/* Idle task of this rq */
 	struct task_struct	*idle;
+	/* Stop task of this rq */
 	struct task_struct	*stop;
 	unsigned long		next_balance;
 	struct mm_struct	*prev_mm;
 
+	/* RQCF clock_update_flags bits */
 	unsigned int		clock_update_flags;
+	/* sched_clock() value for the queue */
 	u64			clock;
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
 	u64			clock_pelt;
 	unsigned long		lost_idle_time;
 
+	/*
+	 * Account the idle time that we could have spend running if it were
+	 * not for IO
+	 */
 	atomic_t		nr_iowait;
 
 #ifdef CONFIG_MEMBARRIER
@@ -967,9 +988,18 @@ struct rq {
 
 #ifdef CONFIG_SMP
 	struct root_domain		*rd;
+	/* A domain heirarchy of CPU groups to balance process load among them */
 	struct sched_domain __rcu	*sd;
 
+	/*
+	 * Information about CPUs heterogeneity used for CPU performance
+	 * scaling
+	 */
 	unsigned long		cpu_capacity;
+	/*
+	 * Original capacity of a CPU before being altered by rt tasks
+	 * and/or IRQ
+	 */
 	unsigned long		cpu_capacity_orig;
 
 	struct callback_head	*balance_callback;
@@ -977,6 +1007,11 @@ struct rq {
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
+	/*
+	 * Set whenever the current running task has a utilization greater
+	 * than 80% of rq->cpu_capacity. A non-zero value in this field
+	 * enables misfit load balancing
+	 */
 	unsigned long		misfit_task_load;
 
 	/* For active balancing */
@@ -988,16 +1023,42 @@ struct rq {
 	int			cpu;
 	int			online;
 
+	/*
+	 * An MRU list used for load balancing, sorted (except
+	 * woken tasks) starting from recently given CPU time tasks
+	 * toward tasks with max wait time in a run-queue
+	 */
 	struct list_head cfs_tasks;
 
+	/*
+	 * Track the utilization of RT tasks for a more accurate
+	 * view of the utilization of the CPU when overloaded by CFS and
+	 * RT tasks
+	 */
 	struct sched_avg	avg_rt;
+
+	/*
+	 * Track the utilization of DL tasks as CFS tasks can be preempted
+	 * by DL tasks and the CFS's utilization might no longer describe
+	 * the real utilization level
+	 */
 	struct sched_avg	avg_dl;
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
+	/*
+	 * Track the utilization of interrupt to give a more accurate
+	 * level of utilization of CPU taking into account the time spent
+	 * under interrupt context when rq's clock is updated
+	 */
 	struct sched_avg	avg_irq;
 #endif
 #ifdef CONFIG_SCHED_THERMAL_PRESSURE
+	/*
+	 * Tracks thermal pressure which is the reduction in maximum
+	 * possible capacity due to thermal events
+	 */
 	struct sched_avg	avg_thermal;
 #endif
+	/* Time stamp at which idle load balance started for this rq */
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
@ 2020-09-02 16:26 ` John Mathew
  2020-09-09  8:37 ` [RFC PATCH v8 0/3] Add scheduler overview documentation Lukas Bulwahn
  3 siblings, 0 replies; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew, Mostafa Chamanara, Oleg Tsymbal

Add documentation for introduction to
 -context-switch
 -x86 context-switch
 -MIPS context switch

Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Co-developed-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Signed-off-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Co-developed-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: John Mathew <john.mathew@unikie.com>
---
 Documentation/scheduler/arch-specific.rst     |   2 +
 Documentation/scheduler/context-switching.rst | 126 ++++++++++++++++++
 Documentation/scheduler/index.rst             |   1 +
 .../scheduler/mips-context-switch.rst         |  89 +++++++++++++
 .../scheduler/x86-context-switch.rst          |  55 ++++++++
 5 files changed, 273 insertions(+)
 create mode 100644 Documentation/scheduler/context-switching.rst
 create mode 100644 Documentation/scheduler/mips-context-switch.rst
 create mode 100644 Documentation/scheduler/x86-context-switch.rst

diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
index 3e5af3a0695e..65dc393b605f 100644
--- a/Documentation/scheduler/arch-specific.rst
+++ b/Documentation/scheduler/arch-specific.rst
@@ -10,3 +10,5 @@ Architecture Specific Scheduler Implementation Differences
 .. toctree::
    :maxdepth: 2
 
+   x86-context-switch
+   mips-context-switch
diff --git a/Documentation/scheduler/context-switching.rst b/Documentation/scheduler/context-switching.rst
new file mode 100644
index 000000000000..04f97bff08e1
--- /dev/null
+++ b/Documentation/scheduler/context-switching.rst
@@ -0,0 +1,126 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==========================
+Process context switching
+==========================
+
+Context Switching
+-----------------
+
+Context switching, the switching from a running task to another,
+is done by the context_switch() function defined in kernel/sched/core.c.
+It is called by __schedule() when a new process has been selected to run.
+The execution flow is as follows:
+
+ - prepare_task_switch() performs necessary kernel preparations for the
+   context switch and then calls prepare_arch_switch() for architecture
+   specific context switch preparation. This call must be paired with a
+   subsequent finish_task_switch() after the context switch. The various
+   steps are:
+
+    - Prepare kcov for context switch. Context switch does switch_mm() to the
+      next task's mm, then switch_to() that new task. This means vmalloc'd
+      regions which had previously been faulted in can transiently disappear in
+      the context of the prev task. Functions instrumented by KCOV may try to
+      access a vmalloc'd kcov_area during this window, and result in a recursive
+      fault. This is avoided by setting a new flag: KCOV_IN_CTXSW in kcov_mode
+      prior to switching the mm, and cleared once the new task is live.
+    - Update sched_info statistics for both the prev and next tasks.
+    - Handle perf subsystem context switch from previous task to next.
+      The various steps are:
+
+        - Remove perf events for the task being context-switched out.
+        - Stop each perf event and update the event value in event->count.
+        - Call the context switch callback for PMU with flag indicating
+          schedule out.
+        - Create a PERF_RECORD_MISC_SWITCH_OUT perf event.
+        - Context switch the perf event contexts between the current and next tasks.
+        - Schedule out current cgroup events if cgroup perf events exist on the
+          CPU.
+
+    - Set TIF_NOTIFY_RESUME flag on the current thread for the Restartable
+      sequence mechanism. Restartable sequences allow user-space to perform
+      update operations on per-cpu data without requiring heavy-weight atomic
+      operations.
+    - Fire preempt notifiers. A task can request the scheduler to notify it
+      whenever it is preempted or scheduled back in. This allows the task to
+      swap any special-purpose registers like the FPU or Intel's VT registers.
+    - Claim the next task as running to prevent load balancing run on it.
+
+ - arch_start_context_switch() batches the reload of page tables and other
+   process state with the actual context switch code for paravirtualized
+   guests.
+
+ - Transfer the real and anonymous address spaces between the switching tasks.
+   Four possible transfer types are:
+
+    - kernel task switching to another kernel task
+    - user task switching to a kernel task
+    - kernel task switching to user task
+    - user task switching to user task
+
+    For a kernel task switching to kernel task enter_lazy_tlb() is called
+    which is an architecture specific implementation to handle a context
+    without an mm. Architectures implement lazy tricks to minimize TLB
+    flushes here. The active address space from the previous task is
+    borrowed (transferred) to the next task.
+
+    For a user task switching to kernel task it will have a real address
+    space and so its anonymous users counter is incremented. This makes
+    sure that the address space will not get freed even after the previous
+    task exits.
+
+    For a user task switching to user task the architecture specific
+    switch_mm_irqs_off() or switch_mm() functions are called.  The main
+    functionality of these calls is to switch the address space between
+    the user space processes.  This includes switching the page table pointers
+    either via retrieved valid ASID for the process or page mapping in the TLB.
+
+    For a kernel task switching to a user task, switch_mm_irqs_off()
+    replaces the address space of prev kernel task(last active_mm) with the
+    next (next mm) from the user task. The context_switch() function saves the
+    active_mm to the runqueue’s prev_mm field to drop this mm later in
+    the finish_task_switch().
+
+ - prepare_lock_switch() releases lockdep of the runqueue lock to handle
+   the special case of the scheduler context switch where the runqueue lock
+   will be released by the next task.
+
+ - Architecture specific implementation of switch_to() switches the
+   register state and the stack. This involves saving and restoring stack
+   information and the processor registers and any other
+   architecture-specific state that must be managed and restored on a
+   per-process basis.
+
+ - finish_task_switch() performs the final steps of the context switch:
+
+    - Emit a warning if the preempt count is corrupted and set the preempt count
+      to FORK_PREEMPT_COUNT.
+    - Reset the pointer to the memory descriptor used by prev which was set in
+      context_switch().
+    - Store the state of the previous task to handle the possibility of a DEAD
+      task.
+    - Do virtual CPU time accounting for the previous task.
+    - Handle perf subsystem context switch from previous task to current:
+
+       - Add perf events for the current task.
+       - Schedule in current cgroup events if cgroup perf events exist on the
+         CPU.
+       - Context switch the perf event contexts between the prev and current
+         tasks.
+       - Clear the PERF_RECORD_MISC_SWITCH_OUT perf event.
+       - Call the context switch callback for PMU with flag indicating
+         schedule in.
+
+    - Free the task for load balancing run on it.
+    - Unlock the rq lock.
+    - Clear the KCOV_IN_CTXSW in kcov_mode which was set in prepare_task_switch
+      now that the new task is live.
+    - Fire preempt notifiers to notify about task scheduled back in.
+    - If the prev task state indicated that it was dead, the corresponding
+      scheduler class task_dead hook is called. Function-return probe
+      instances associated with the task are removed and put back on the
+      free list. Stack for the task is freed and drop the RCU references.
+    - Evaluate the need for No idle tick due to the context switch and do the
+      idle tick if needed.
+
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index e3b1d4fc1604..fc7ee056f3bb 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -20,6 +20,7 @@ specific implementation differences.
     sched-data-structs
     cfs-overview
     sched-design-CFS
+    context-switching
     sched-features
     arch-specific
     sched-debugging
diff --git a/Documentation/scheduler/mips-context-switch.rst b/Documentation/scheduler/mips-context-switch.rst
new file mode 100644
index 000000000000..42a3454a06f0
--- /dev/null
+++ b/Documentation/scheduler/mips-context-switch.rst
@@ -0,0 +1,89 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============================================
+MIPS Architecture And Scheduler implementation
+==============================================
+
+Multi-threading in MIPS CPUs
+-----------------------------
+The MIPS architecture defines four coprocessors.
+
+ - CP0: supports virtual memory system and exception handling.
+ - CP1: reserved for the floating point coprocessor, the FPU.
+ - CP2: available for specific implementations.
+ - CP3: reserved for floating point operations in the release 1
+   implementation of MIPS64.
+
+MIPS32 and MIPS64 architectures provide support for optional components
+known as Modules or Application Specific Extensions. The MT module
+enables the architecture to support multi-threaded implementations.
+This includes support for virtual processors and lightweight thread
+contexts. Implementation of MT features depends on the individual MIPS
+cores. The virtual processing element (VPE) maintains a complete copy
+of the processor state as seen by the software system which includes
+interrupts, register set, and MMU. This enables a single processor to
+appear to an SMP operating system like two separate cores if it has
+2 VPE's. For example two separate OSes can run on each VPE such as Linux
+and and an RTOS.
+
+A lighter version of VPE enables threading at the user/application
+software level.  It is called Thread Context (TC). TC is the hardware
+state necessary to support a thread of execution. This includes a set
+of general purpose registers (GPRs), a program counter (PC), and some
+multiplier and coprocessor state.  TCs have common execution unit.
+MIPS ISA provides instructions to utilize TC.
+
+The Quality of service block of the MT module allows the allocation of
+processor cycles to threads, and sets relative thread priorities. This
+enables 2 thread prioritization mechanisms. The user can prioritize one
+thread over the other as well as allocate a specific ratio of the cycles
+to specific threads. These mechanisms allocate bandwidth to a set
+of threads effectively. QoS block improves system level determinism
+and predictability. Qos block can be replaced by more application
+specific blocks.
+
+MIPS Context Switch
+-------------------
+
+Context switch behavior specific to MIPS begins in the way the switch_to()
+macro is implemented. The main steps in the MIPS implementation of the macro
+are:
+
+ - Handle the FPU affinity management feature. This feature is enabled
+   by the config option CONFIG_MIPS_MT_FPAFF at build time. The macro checks
+   if the FPU was used in the most recent time slice. In case FPU was not
+   used, the restriction of having to run on a CPU with FPU is removed.
+ - Disable the FPU and clear the bit indicating the FPU was used in this
+   quantum for the task for the previous task.
+ - If FPU is enabled in the next task, check FCSR for any unmasked
+   exceptions pending, clear them and send a signal.
+ - If MIPS DSP modules is enabled, save the DSP context of the previous
+   task and restore the dsp context of the next task.
+ - If coprocessor 2 is present set the access allowed field of the
+   coprocessor 2.
+ - If coprocessor 2 access allowed field was set in previous task, clear it.
+ - Clear the the access allowed field of the coprocessor 2.
+ - Clear the llbit on MIPS release 6 such that instruction eretnc can be
+   used unconditionally when returning to userland in entry.S.
+   LLbit is used to specify operation for instructions that provide atomic
+   read-modify-write. LLbit is set when a linked load occurs and is tested
+   by the conditional store.  It is cleared, during other CPU operation,
+   when a store to the location would no longer be atomic. In particular,
+   it is cleared by exception return instructions.  eretnc instruction
+   enables to return from interrupt, exception, or error trap without
+   clearing the LLbit.
+ - Clear the global variable ll_bit used by MIPS exception handler.
+ - Write the thread pointer to the MIPS userlocal register if the CPU
+   supports this feature. This register is not interpreted by hardware and
+   can be used to share data between privileged and unprivileged software.
+ - If hardware watchpoint feature is enabled during build, the watchpoint
+   registers are restored from the next task.
+ - Finally the MIPS processor specific implementation of the resume()
+   function is called. It restores the registers of the next task including
+   the stack pointer. The implementation is in assembly in the following
+   architecutre specific files ::
+
+        arch/mips/kernel/r4k_switch.S
+        arch/mips/kernel/r2300_switch.S
+        arch/mips/kernel/octeon_switch.S
+
diff --git a/Documentation/scheduler/x86-context-switch.rst b/Documentation/scheduler/x86-context-switch.rst
new file mode 100644
index 000000000000..96941d9435fc
--- /dev/null
+++ b/Documentation/scheduler/x86-context-switch.rst
@@ -0,0 +1,55 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+X86 Context Switch
+------------------
+
+The x86 architecture context switching logic is as follows.
+After the switching of MM in the scheduler, context_switch() calls the x86
+implementation of switch_to() macro. For x86 arch it is located at ::
+
+    arch/x86/include/asm/switch_to.h
+
+The macro calls the the assembly implementation of __switch_to_asm() in the
+assembly files ::
+
+    arch/x86/entry/entry_64.S
+    arch/x86/entry/entry_32.S
+
+The main steps of the assembly function __switch_to_asm() are:
+
+ - store the callee saved registers to the old stack which will be switched
+   away from.
+ - swap the stack pointers between the old and the new task.
+ - move the stack canary value to the current CPU's interrupt stack
+ - if return trampoline is enabled, overwrite all entries in the RSB on
+   exiting a guest, to prevent malicious branch target predictions from
+   affecting the host kernel.
+ - restore all registers from the new stack previously pushed in reverse
+   order.
+ - jump to a C implementation of __switch_to(). The sources are located in::
+
+      arch/x86/kernel/process_64.c
+      arch/x86/kernel/process_32.c
+
+
+The main steps of the C function __switch_to() which is effectively
+the new task running are as follows:
+
+ - retrieve the thread_struct and fpu struct from the next and previous tasks.
+ - get the current CPU tss_struct.
+ - save the current FPU state while on the old task.
+ - store the FS and GS segment registers before changing the thread local
+   storage.
+ - reload the GDT for the new tasks TLS.
+
+   Following is effectively arch_end_context_switch().
+ - save the ES and DS segments of the previous task and load the same from
+   the nest task.
+ - load the FS and GS segment registers.
+ - update the current task of the CPU.
+ - update the top of stack pointer for the CPU for entry trampoline.
+ - initialize FPU state for next task.
+ - set sp0 to point to the entry trampoline stack.
+ - call _switch_to_xtra() to handle debug registers, I/O
+   bitmaps and speculation mitigation.
+ - write the task's CLOSid/RMID to IA32_PQR_MSR.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v8 0/3] Add scheduler overview documentation
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
                   ` (2 preceding siblings ...)
  2020-09-02 16:26 ` [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch John Mathew
@ 2020-09-09  8:37 ` Lukas Bulwahn
  3 siblings, 0 replies; 7+ messages in thread
From: Lukas Bulwahn @ 2020-09-09  8:37 UTC (permalink / raw)
  To: John Mathew
  Cc: linux-doc, linux-kernel, corbet, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tsbogend, lukas.bulwahn, x86, linux-mips, tglx, willy,
	valentin.schneider



On Wed, 2 Sep 2020, John Mathew wrote:

> This patch series updates the scheduler documentation to add more topics
> wrt to scheduler overview. New sections are added to provide a brief
> overview of the kernel structs used by the scheduler, scheduler invocation,
> and context switch. Previous version of the patch was reviewed at:
> https://lore.kernel.org/lkml/20200527084421.4673-1-John.Mathew@unikie.com/
>

John, here is some first feedback to get the ball rolling:

I tried to apply your patches on v5.9-rc4, and I got those warnings:

Applying: docs: scheduler: Restructure scheduler documentation.
.git/rebase-apply/patch:30: new blank line at EOF.
+
.git/rebase-apply/patch:137: new blank line at EOF.
+
warning: 2 lines add whitespace errors.
Applying: docs: scheduler: Add scheduler overview documentation
.git/rebase-apply/patch:73: new blank line at EOF.
+
warning: 1 line adds whitespace errors.
Applying: docs: scheduler: Add introduction to scheduler context-switch
.git/rebase-apply/patch:153: new blank line at EOF.
+
.git/rebase-apply/patch:260: new blank line at EOF.
+
warning: 2 lines add whitespace errors.


You might want to look into this. I also checked that the patch also 
applies on linux-next, i.e., next-20200908; so, it does not clash in an 
obvious way with other changes at the moment.

I did run checkpatch.pl and it warned about:
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?

No action required here. 


Documentation generation (make htmldocs) shows these two new warnings with 
your patches applied to v5.9-rc4:

  ./kernel/sched/core.c:17: WARNING: Definition list ends without a blank 
line; unexpected unindent.
  ./kernel/sched/core.c:21: WARNING: Unexpected indentation.


You might want to put those minor fixes on your remaining TODO list for 
this patchset as well.

I will continue to comment with more editorial points in the next hours 
and days, while proof-reading your additions to the documentation.


Lukas

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation.
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
@ 2020-09-09 10:28   ` Lukas Bulwahn
  0 siblings, 0 replies; 7+ messages in thread
From: Lukas Bulwahn @ 2020-09-09 10:28 UTC (permalink / raw)
  To: John Mathew
  Cc: linux-doc, linux-kernel, corbet, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tsbogend, lukas.bulwahn, x86, linux-mips, tglx, willy,
	valentin.schneider



On Wed, 2 Sep 2020, John Mathew wrote:

> Add new sections to enable addition of new documentation on
> the scheduler. Existing documentation is moved under the related
> new sections. The sections are
>   - overview
>   - sched-features
>   - arch-specific.rst
>   - sched-debugging.rst
> 
> Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Signed-off-by: John Mathew <john.mathew@unikie.com>
> ---
>  Documentation/scheduler/arch-specific.rst   | 12 +++++++++
>  Documentation/scheduler/index.rst           | 30 +++++++++++----------
>  Documentation/scheduler/overview.rst        |  5 ++++
>  Documentation/scheduler/sched-debugging.rst | 14 ++++++++++
>  Documentation/scheduler/sched-features.rst  | 25 +++++++++++++++++
>  5 files changed, 72 insertions(+), 14 deletions(-)
>  create mode 100644 Documentation/scheduler/arch-specific.rst
>  create mode 100644 Documentation/scheduler/overview.rst
>  create mode 100644 Documentation/scheduler/sched-debugging.rst
>  create mode 100644 Documentation/scheduler/sched-features.rst
> 
> diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
> new file mode 100644
> index 000000000000..3e5af3a0695e
> --- /dev/null
> +++ b/Documentation/scheduler/arch-specific.rst
> @@ -0,0 +1,12 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +Architecture Specific Scheduler Implementation Differences
> +==========================================================

That is a terribly long title, how about Architecture Specifics?

I am wondering if this should be on the toplevel documentation structure
directly under Linux Scheduler.

I think the x86 and MIPS context switch documentation could be placed
under Process context switching in a section Architecture Specifics.

> +
> +.. class:: toc-title
> +
> +	   Table of contents
> +
> +.. toctree::
> +   :maxdepth: 2
> +
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 88900aabdbf7..6e88a070c503 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -1,24 +1,26 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
>  ===============
>  Linux Scheduler
>  ===============
>  
> -.. toctree::
> -    :maxdepth: 1
> +This documentation outlines the Linux kernel scheduler with its concepts,
> +details about the scheduler design and its data structures and architecture
> +specific implementation differences.
> +
>  
> +.. class:: toc-title
> +
> +    Table of contents
> +
> +.. toctree::
> +    :maxdepth: 2
>  
> -    completion
> -    sched-arch
> -    sched-bwc
> -    sched-deadline
> +    overview
>      sched-design-CFS
> -    sched-domains
> -    sched-capacity
> -    sched-energy
> -    sched-nice-design
> -    sched-rt-group
> -    sched-stats
> -
> -    text_files
> +    sched-features
> +    arch-specific
> +    sched-debugging
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
> new file mode 100644
> index 000000000000..a1d2d26629eb
> --- /dev/null
> +++ b/Documentation/scheduler/overview.rst
> @@ -0,0 +1,5 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +====================
> +Scheduler overview

s/Scheduler overview/Scheduler Overview/

for some more consistent capitalisation.

> +====================
> \ No newline at end of file

That could be the cause for the git am errors.

> diff --git a/Documentation/scheduler/sched-debugging.rst b/Documentation/scheduler/sched-debugging.rst
> new file mode 100644
> index 000000000000..e332069f99d6
> --- /dev/null
> +++ b/Documentation/scheduler/sched-debugging.rst
> @@ -0,0 +1,14 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +Scheduler Debugging Interface
> +==============================
> +
> +.. class:: toc-title
> +
> +	   Table of contents
> +
> +.. toctree::
> +   :maxdepth: 2
> +
> +   sched-stats
> +   text_files
> diff --git a/Documentation/scheduler/sched-features.rst b/Documentation/scheduler/sched-features.rst
> new file mode 100644
> index 000000000000..8eb90e86e489
> --- /dev/null
> +++ b/Documentation/scheduler/sched-features.rst
> @@ -0,0 +1,25 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +Scheduler Features
> +===================
> +
> +.. class:: toc-title
> +
> +	Table of contents
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   completion
> +   sched-arch
> +   sched-bwc
> +   sched-deadline
> +   sched-domains
> +   sched-capacity
> +   sched-energy
> +   sched-nice-design
> +   sched-rt-group
> +   sched-stats
> +
> +   text_files

I guess it is fine to place everything here for now, but more clean-up 
would probably move those to the appropriate documentation structure, 
right?


> +
> -- 
> 2.17.1
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview  documentation
  2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
@ 2020-09-09 11:23   ` Lukas Bulwahn
  0 siblings, 0 replies; 7+ messages in thread
From: Lukas Bulwahn @ 2020-09-09 11:23 UTC (permalink / raw)
  To: John Mathew
  Cc: linux-doc, linux-kernel, corbet, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tsbogend, lukas.bulwahn, x86, linux-mips, tglx, willy,
	valentin.schneider, Mostafa Chamanara, Oleg Tsymbal

[-- Attachment #1: Type: text/plain, Size: 23115 bytes --]



On Wed, 2 Sep 2020, John Mathew wrote:

> Add documentation for
>  -scheduler overview
>  -scheduler state transtion
>  -CFS overview
>  -scheduler data structs
> 
> Add rst for scheduler APIs and modify sched/core.c
> to add kernel-doc comments.
> 
> Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Co-developed-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
> Signed-off-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
> Co-developed-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
> Signed-off-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
> Signed-off-by: John Mathew <john.mathew@unikie.com>
> ---
>  Documentation/scheduler/cfs-overview.rst      |  59 ++++
>  Documentation/scheduler/index.rst             |   3 +
>  Documentation/scheduler/overview.rst          | 294 +++++++++++++++++-
>  .../scheduler/sched-data-structs.rst          | 176 +++++++++++
>  Documentation/scheduler/scheduler-api.rst     |  24 ++
>  kernel/sched/core.c                           |  21 +-
>  kernel/sched/sched.h                          |  63 +++-
>  7 files changed, 634 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/scheduler/cfs-overview.rst
>  create mode 100644 Documentation/scheduler/sched-data-structs.rst
>  create mode 100644 Documentation/scheduler/scheduler-api.rst
> 
> diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> new file mode 100644
> index 000000000000..1524c24da897
> --- /dev/null
> +++ b/Documentation/scheduler/cfs-overview.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +=============
> +CFS Overview
> +=============
> +
> +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> +Scheduler (CFS) implemented as a scheduling module. A brief overview of the
> +CFS design is provided in :doc:`sched-design-CFS`
> +
> +In addition there have been many improvements to the CFS, a few of which are

This can be shortened to:

In addition there have been many improvements to the CFS:

> +
> +Tracking available capacity
> +---------------------------

Capitalize title for local consistency with the sections below.

This below is not a full sentence:

> +Scale CPU capacity mechanism for CFS so it knows how much CPU capacity is left

The "it" here refers to what?

> +for its use after higher priority sched classes (RT, DL), IRQs and
> +'Thermal Pressure' have reduced the 'original' CPU capacity.

Why are putting thermal pressure and orginal in quotes?


> +Thermal pressure on a CPU means the maximum possible capacity is
> +unavailable due to thermal events.
> +
> +NUMA balancing
> +--------------

Capitalize.

Again, this below is not a full sentence:

> +Attempt to migrate tasks to the NUMA Node where the frequently accessed memory

why is Node capitalized here?

> +pages belongs. The scheduler gets information about memory placement through the

belongs? You mean is closest placed to, right?

s/through the paging mechanism/through paging/

> +paging mechanism. Scheduler periodically scans the virtual memory of the tasks

Maybe add: This works as follows:

1. The scheduler scans ...
s/Scheduler/The scheduler/

> +and make them inaccessible by changing the memory protection. The flag

s/make/makes/

> +MM_CP_PROT_NUMA indicates this purpose. When the task attempts to access

I think the detail on the flag is too much here for the overview.

2. When the task attempts to access the memory, this triggers a page fault 
and the scheduler reacts with recording some statistics on the use for the 
specific NUMA nodes.

3. On a periodic basis, the scheduler then migrates the task to the 
preffered node, i.e., the node that encountered the most memory faults.

> +the memory again a page fault occurs. Scheduler traps the fault and increments
> +the counters in a task specific array corresponding to the NUMA node id.
> +There array is divided in to four regions: faults_memory, faults_cpu,
> +faults_memory_buffer and faults_cpu_buffer, where faults_memory is the
> +exponential decaying average of faults on a per-node basis. The 'preferred
> +node' is found by looping through the array and finding the node with the
> +highest number of faults. Migration to the preferred node is done periodically
> +by either swapping two tasks tasks between their respective CPUs or
> +just moving a task to its preferred node CPU. It the migration or move fails
> +it will be retried.
> +
> +Energy Aware Scheduling
> +-----------------------
> +For asymmetric CPU capacity topologies, an Energy Model is used to figure out
> +which of the CPU candidates is the most energy-efficient. Capacity is the
> +amount of work which a CPU can perform at its highest frequency which is
> +calculated by the Per-Entity Load Tracking (PELT) mechanism.
> +EAS is described at :doc:`sched-energy`
> +
> +Capacity Aware Scheduling
> +--------------------------
> +Migrate a task to a CPU which meets its compute demand. In asymmetric CPU
> +capacity topologies CFS scheduler frequently updates the 'Misfit' status of

s/CFS scheduler/, the CFS scheduler/

> +tasks and migrate them to CPU's of higher capacity. Also during wakeups the

the a?

> +a CPU with sufficient capacity is found for executing the task. CAS is

I guess it is better to use active here, rather than passive. Who finds 
the CPU?

Do not use an abbreviation here.

> +described at :doc:`sched-capacity`
> +
> +
> +
> +
> +
> +
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 6e88a070c503..e3b1d4fc1604 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -17,10 +17,13 @@ specific implementation differences.
>      :maxdepth: 2
>  
>      overview
> +    sched-data-structs
> +    cfs-overview
>      sched-design-CFS
>      sched-features
>      arch-specific
>      sched-debugging
> +    scheduler-api
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
> index a1d2d26629eb..f2fb8f419919 100644
> --- a/Documentation/scheduler/overview.rst
> +++ b/Documentation/scheduler/overview.rst
> @@ -2,4 +2,296 @@
>  
>  ====================
>  Scheduler overview
> -====================
> \ No newline at end of file
> +====================
> +
> +Linux kernel implements priority-based scheduling. More than one process are

s/Linux kernel/The Linux kernel/

> +allowed to run at any given time and each process is allowed to run as if it
> +were the only process on the system. The process scheduler coordinates which
> +process runs when. In that context, it has the following tasks:
> +
> +  - share CPUs equally among all currently running processes.

equally? That is not true, right?

> +  - pick appropriate process to run next if required, considering scheduling
> +    class/policy and process priorities.
> +  - balance processes between multiple CPUs in SMP systems.
> +
> +The scheduler attempts to be responsive for I/O bound processes and efficient
> +for CPU bound processes. The scheduler uses different scheduling policies
> +for real time and normal processes based on their respective policy
> +enumerations. Scheduler adds support for each policy through scheduling class

... for each policy through a scheduling class and a dedicated 
implementation for each scheduling class.

> +implementations for each. The five scheduling classes which scheduler provides
> +are:

This can be shortened to "The five scheduling classes are:"

> +
> +  - stop_sched_class:
> +    It is a per-cpu maximum priority CPU monopolization mechanism. It is
> +    exposed as a SCHED_FIFO task ('migration/X') with static priority of 99
> +    in the user space. This is done to make it compatible with user space and
> +    thus to avoid growing the ABI. It is used by one CPU to stop another
> +    in order to run a specific function, so it is only available on SMP
> +    systems. This class is used by the scheduler for task migration between
> +    CPUs.
> +
> +  - dl_sched_class:
> +    Implements the SCHED_DEADLINE scheduling policy. It has static priority

Here, you use "scheduling policy".

> +    of -1 in kernel space. This policy schedules each task according to the
> +    task's deadline. The task with the earliest deadline will be served first.
> +
> +  - rt_sched_class:
> +    Implements the SCHED_RR and SCHED_FIFO policies. Real time static

... and here only "policy". Be consistent.

> +    priorities range from 1(low)..99 in the user space. (priority is inverted
> +    in kernel space). It is the only scheduling class that makes use of the
> +    static priority of the task. SCHED_FIFO is a simple scheduling algorithm
> +    without time slicing. A SCHED_FIFO thread runs until either it is blocked
> +    by an I/O request, it is preempted by a higher priority thread, or it
> +    calls sched_yield(). SCHED_RR is a simple enhancement of SCHED_FIFO where
> +    a thread is allowed to run only for a maximum time quantum.
> +
> +  - fair_sched_class:
> +    Implements the SCHED_NORMAL SCHED_BATCH and SCHED_IDLE  policies. Static

double spacing before policies.

> +    priority is always 0 in the user space. A dynamic priority based on
> +    'nice' value is used to schedule these tasks. This priority increases each
> +    time the the task  is scheduled to run but denied to run by scheduler.

... the the... and then spacing.

> +    This ensures fair scheduling between these tasks. Nice value is an
> +    attribute which can be set by the user to influence scheduler to favour
> +    a particular task. SCHED_BATCH is similar to SCHED_NORMAL with the
> +    difference that the policy causes the scheduler to assume that the task
> +    is CPU-intensive. SCHED_IDLE policy also has static priority 0. Nice
> +    value has no effect on this policy. Weight mapping is not done, instead
> +    weight is set at a constant minimal weight WEIGHT_IDLEPRIO. Used to
> +    run tasks at extremely low priority.
> +
> +  - idle_sched_class:
> +    Priority for idle task is irrelevant. This class is not related to
> +    SCHED_IDLE policy. Idle tasks run when there are no other runnable tasks
> +    on a CPU. The execute the idle loop which is responsible to put a CPU
> +    in one of its idle states.
> +

This last sentence above is totally broken; I cannot parse it.

> +
> +Process Management
> +==================
> +
> +Each process in the system is represented by struct task_struct. When a
> +process/thread is created, the kernel allocates a new task_struct for it.

Each process or each thread?

> +The kernel then stores this task_struct in an RCU list. Macro next_task()
> +allows a process to obtain its next task and for_each_process() macro enables
> +traversal of the list.
> +

This is too much detail at this point of the overview.

> +Frequently used fields of the task struct are:
> +
> + - state: The running state of the task. The possible states are:
> +
> +    - TASK_RUNNING: The task is currently running or in a run queue waiting
> +      to run.
> +    - TASK_INTERRUPTIBLE: The task is sleeping waiting for some event to occur.
> +      This task can be interrupted by signals. On waking up the task transitions
> +      to TASK_RUNNING.
> +    - TASK_UNINTERRUPTIBLE: Similar to TASK_INTERRUPTIBLE but does not wake
> +      up on signals. Needs an explicit wake-up call to be woken up. Contributes
> +      to loadavg.
> +    - __TASK_TRACED: Task is being traced by another task like a debugger.
> +    - __TASK_STOPPED: Task execution has stopped and not eligible to run.
> +      SIGSTOP, SIGTSTP etc causes this state.  The task can be continued by
> +      the signal SIGCONT.
> +    - TASK_PARKED: State to support kthread parking/unparking.
> +    - TASK_DEAD: If a task dies, then it sets TASK_DEAD in tsk->state and calls
> +      schedule one last time. The task will be never ran again.
> +    - TASK_WAKEKILL: It works like TASK_UNINTERRUPTIBLE with the bonus that it
> +      can respond to fatal signals.
> +    - TASK_WAKING: To handle concurrent waking of the same task for SMP.
> +      Indicates that someone is already waking the task.
> +    - TASK_NOLOAD: To be used along with TASK_UNINTERRUPTIBLE to indicate
> +      an idle task which does not contribute to loadavg.
> +    - TASK_NEW: Set during fork(), to guarantee that no one will run the task,
> +      a signal or any other wake event cannot wake it up and insert it on
> +      the runqueue.
> +
> + - exit_state : The exiting state of the task. The possible states are:
> +
> +    - EXIT_ZOMBIE: The task is terminated and waiting for parent to collect
> +      the exit information of the task.
> +    - EXIT_DEAD: After collecting the exit information the task is put to
> +      this state and removed from the system.
> +
> + - static_prio: Used by the fair scheduling class to encode the nice level.
> +   It does not have any effect on the SCHED_DEADLINE, SCHED_FIFO or SCHED_RR
> +   policy tasks.
> +
> + - prio: The value of this field is used to:
> +
> +    - distinguish scheduling classes.
> +    - in the RR/FIFO static priority scheduler.
> +
> + - normal_prio: Expected priority of a task. The value of static_prio
> +   and normal_prio are the same for non-real-time processes. For real time
> +   processes value of prio is used.
> +
> + - rt_priority: Field used to set priority of real time tasks. Not used by the
> +   rt_sched_class.
> +
> + - sched_class: Pointer to sched_class structure of the policy that the task
> +   belongs to.
> +
> + - sched_entity: Pointer to sched_entity CFS structure.
> +
> + - policy: scheduling policy of the task. See above.
> +
> + - nr_cpus_allowed: Hamming weight of the bitmask retrieved from cpumask pointer.
> +
> +New tasks are created using the fork() system call which is described
> +at manpage `FORK(2)` or the clone system call described at manpage `CLONE(2)`.

Is there a better way to refer to a manpage here? Maybe an URL?

> +Users can create threads within a process to achieve parallelism. Threads
> +share address space, open files and other resources of the process. Threads
> +are created like normal tasks with their unique task_struct, but clone()
> +is provided with flags that enable the sharing of resources such as address
> +space ::
> +
> +	clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
> +
> +The scheduler schedules task_structs so from scheduler perspective there is
> +no difference between threads and processes. Threads are created using
> +the system call pthread_create described at its manpage `PTHREAD_CREATE(3)`
> +POSIX threads creation is described at its manpage `PTHREADS(7)`
> +
> +The Scheduler Entry Point
> +=========================
> +
> +The main scheduler entry point is an architecture independent schedule()
> +function defined in kernel/sched/core.c. Its objective is to find a process in
> +the runqueue list and then assign the CPU to it. It is invoked, directly
> +or in a lazy (deferred) way from many different places in the kernel. A lazy
> +invocation does not call the function by its name, but gives the kernel a
> +hint by setting a flag TIF_NEED_RESCHED. The flag is a message to the kernel

s/a flag/the flag/
> +that the scheduler should be invoked as soon as possible because another
> +process deserves to run. The flag should not be modified directly.

Well, then let us know how to do that correctly.

> +
> +Following are some places that notify the kernel to schedule which can be
> +classified based on the type of operations:
> +

I cannot follow the jump from the previous explanation to this list now. 
You lost me here.

> +  - Blocking operations: Suspends the current task and directly call into
> +    the scheduler to find something else to do. Some blocking operations are:
> +
> +      - mutex_lock()
> +      - wait_event()
> +      - do_exit()
> +      - preempt_schedule_irq()
> +
> +  - Co-operative or voluntary preemptions: Allows another task to run at that
> +    point subject to preemption model. Voluntary preemption model can be
> +    set through the kernel config option: CONFIG_PREEMPT_VOLUNTARY. The
> +    operations are:
> +
> +      - cond_resched()
> +      - cond_resched_lock()
> +      - yield()
> +      - preempt_enable()
> +
> +  - Involuntary preemption: Marks TIF_NEED_RESCHED and wait for action
> +    depending on preemption model. Involuntary preemption operations are:
> +
> +      - scheduler_tick()
> +      - wake_up_process()
> +
> +Calling functions mentioned above leads to a call to __schedule(). Note
> +that preemption must be disabled before it is called and enabled after
> +the call using preempt_disable and preempt_enable functions family.
> +
> +
> +The steps during invocation are:
> +--------------------------------

I would not put the "half" sentence as subsection.

> +1. Disable preemption to avoid another task preempting the scheduling
> +   thread itself.
> +2. Retrieve the runqueue of current processor and its lock is obtained to
> +   allow only one thread to modify the runqueue at a time.

Okay, 1. and 2. are written in imperative.

> +3. The state of the previously executed task when the schedule()
> +   was called is examined. If it is not runnable and has not been
> +   preempted in kernel mode, it is removed from the runqueue. If the
> +   previous task has non-blocked pending signals, its state is set to
> +   TASK_RUNNING and left in the runqueue.

Now, passive?

> +4. Scheduler classes are iterated and the corresponding class hook to
> +   pick the next suitable task to be scheduled on the CPU is called.
> +   Since most tasks are handled by the sched_fair class, a shortcut to this
> +   class is implemented in the beginning of the function.

Now passive.

> +5. TIF_NEED_RESCHED and architecture specific need_resched flags are cleared.

Now passive, again.

> +6. If the scheduler class picks a different task from what was running
> +   before, a context switch is performed by calling context_switch().
> +   Internally, context_switch() switches to the new task's memory map and
> +   swaps the register state and stack. If scheduler class picked the same
> +   task as the previous task, no task switch is performed and the current
> +   task keeps running.

Passive.

> +7. Balance callback list is processed. Each scheduling class can migrate tasks
> +   between CPUs to balance load. These load balancing operations are queued
> +   on a Balance callback list which get executed when balance_callback() is
> +   called.

Passive.

> +8. The runqueue is unlocked and preemption is re-enabled. In case
> +   preemption was requested during the time in which it was disabled,
> +   schedule() is run again right away.
> +

Passive.

It should be consistent and I think writing it imperative is MUCH better, 
e.g., Process balance callback list, Unlock runqueue, etc.

> +Scheduler State Transition
> +==========================
> +
> +A very high level scheduler state transition flow with a few states can
> +be depicted as follows. ::
> +
> +                                       *
> +                                       |
> +                                       | task
> +                                       | forks
> +                                       v
> +                        +------------------------------+
> +                        |           TASK_NEW           |
> +                        |        (Ready to run)        |
> +                        +------------------------------+
> +                                       |
> +                                       |
> +                                       v
> +                     +------------------------------------+
> +                     |            TASK_RUNNING            |
> +   +---------------> |           (Ready to run)           | <--+
> +   |                 +------------------------------------+    |
> +   |                   |                                       |
> +   |                   | schedule() calls context_switch()     | task is preempted
> +   |                   v                                       |
> +   |                 +------------------------------------+    |
> +   |                 |            TASK_RUNNING            |    |
> +   |                 |             (Running)              | ---+
> +   | event occurred  +------------------------------------+
> +   |                   |
> +   |                   | task needs to wait for event
> +   |                   v
> +   |                 +------------------------------------+
> +   |                 |         TASK_INTERRUPTIBLE         |
> +   |                 |        TASK_UNINTERRUPTIBLE        |
> +   +-----------------|           TASK_WAKEKILL            |
> +                     +------------------------------------+
> +                                       |
> +                                       | task exits via do_exit()
> +                                       v
> +                        +------------------------------+
> +                        |          TASK_DEAD           |
> +                        |         EXIT_ZOMBIE          |
> +                        +------------------------------+
> +
> +

I hope we can refine this high-level description to Daniel's model.

> +Scheduler provides trace events tracing all major events of the scheduler.
> +The trace events are defined in ::
> +
> +  include/trace/events/sched.h
> +

John, can you explain the trace events that would occur for the 
transitions above in your high-level state transition?

> +Using these trace events it is possible to model the scheduler state transition
> +in an automata model. The following journal paper discusses such modeling:
> +
> +Daniel B. de Oliveira, Rômulo S. de Oliveira, Tommaso Cucinotta, **A thread
> +synchronization model for the PREEMPT_RT Linux kernel**, *Journal of Systems
> +Architecture*, Volume 107, 2020, 101729, ISSN 1383-7621,
> +https://doi.org/10.1016/j.sysarc.2020.101729.
> +
> +To model the scheduler efficiently the system was divided in to generators
> +and specifications. Some of the generators used were "need_resched",
> +"sleepable" and "runnable", "thread_context" and "scheduling context".
> +The specifications are the necessary and sufficient conditions to call
> +the scheduler. New trace events were added to specify the generators
> +and specifications. In case a kernel event referred to more than one
> +event, extra fields of the kernel event was used to distinguish between
> +automation events. The final model was generated from parallel composition
> +of all generators and specifications which composed of 34 events,
> +12 generators and 33 specifications. This resulted in 9017 states, and
> +20103 transitions.

That is how far I got with my first review round.

It reads nicely so far; I think a bit of stylistic improvement is needed 
but you did not make me tired within few minutes (so it is readable) and I 
think I learned something about the scheduler :)

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-09-09 11:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
2020-09-09 10:28   ` Lukas Bulwahn
2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
2020-09-09 11:23   ` Lukas Bulwahn
2020-09-02 16:26 ` [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch John Mathew
2020-09-09  8:37 ` [RFC PATCH v8 0/3] Add scheduler overview documentation Lukas Bulwahn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.