linux-mips.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v8 0/3] Add scheduler overview documentation
@ 2020-09-02 16:26 John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew

This patch series updates the scheduler documentation to add more topics
wrt to scheduler overview. New sections are added to provide a brief
overview of the kernel structs used by the scheduler, scheduler invocation,
and context switch. Previous version of the patch was reviewed at:
https://lore.kernel.org/lkml/20200527084421.4673-1-John.Mathew@unikie.com/

version 8:
 - Rebase

version 7:
 -Fix overview description
 -Removed rst headers
 -Removed kernel-doc for struct rq and meged it as struct
  member comments

version 6:
 -Fix typos.

version 5:
 -Fix description error on CAS

version 4:
 -Added section on Capacity-Aware Scheduling
 -Reworded CFS recently added features.
 -Removed vruntime description from scheduler structs
 -Added description of idle and stopper sched classses

version 3:
 -Fix spelling, spacing and typo errors.

version 2:
- Remove :c:func: directive as it was redundant
- Limit document width (line symbol count) to 75
- Replace dot file with ASCII art
- Describe prepare_task_switch(), ASID use, 
  kernel/user transtion, MIPS FPU affinity correctly
- Add missing references to files
- Removed internal APIs from scheduler API reference
- Described rq struct member as kernel-doc comments
- Replaced CFS history with CFS current status
- Added documentation for sched_class fields
- Refined explanation of context swtich functionality
- Replace CFS history with recent changes
- Added kernel-doc comments for struct rq

John Mathew (3):
  docs: scheduler: Restructure scheduler documentation.
  docs: scheduler: Add scheduler overview  documentation
  docs: scheduler: Add introduction to scheduler context-switch

 Documentation/scheduler/arch-specific.rst     |  14 +
 Documentation/scheduler/cfs-overview.rst      |  59 ++++
 Documentation/scheduler/context-switching.rst | 126 ++++++++
 Documentation/scheduler/index.rst             |  34 +-
 .../scheduler/mips-context-switch.rst         |  89 ++++++
 Documentation/scheduler/overview.rst          | 297 ++++++++++++++++++
 .../scheduler/sched-data-structs.rst          | 176 +++++++++++
 Documentation/scheduler/sched-debugging.rst   |  14 +
 Documentation/scheduler/sched-features.rst    |  25 ++
 Documentation/scheduler/scheduler-api.rst     |  24 ++
 .../scheduler/x86-context-switch.rst          |  55 ++++
 kernel/sched/core.c                           |  21 +-
 kernel/sched/sched.h                          |  63 +++-
 13 files changed, 978 insertions(+), 19 deletions(-)
 create mode 100644 Documentation/scheduler/arch-specific.rst
 create mode 100644 Documentation/scheduler/cfs-overview.rst
 create mode 100644 Documentation/scheduler/context-switching.rst
 create mode 100644 Documentation/scheduler/mips-context-switch.rst
 create mode 100644 Documentation/scheduler/overview.rst
 create mode 100644 Documentation/scheduler/sched-data-structs.rst
 create mode 100644 Documentation/scheduler/sched-debugging.rst
 create mode 100644 Documentation/scheduler/sched-features.rst
 create mode 100644 Documentation/scheduler/scheduler-api.rst
 create mode 100644 Documentation/scheduler/x86-context-switch.rst

-- 
2.17.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation.
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
@ 2020-09-02 16:26 ` John Mathew
  2020-09-09 10:28   ` Lukas Bulwahn
  2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew

Add new sections to enable addition of new documentation on
the scheduler. Existing documentation is moved under the related
new sections. The sections are
  - overview
  - sched-features
  - arch-specific.rst
  - sched-debugging.rst

Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: John Mathew <john.mathew@unikie.com>
---
 Documentation/scheduler/arch-specific.rst   | 12 +++++++++
 Documentation/scheduler/index.rst           | 30 +++++++++++----------
 Documentation/scheduler/overview.rst        |  5 ++++
 Documentation/scheduler/sched-debugging.rst | 14 ++++++++++
 Documentation/scheduler/sched-features.rst  | 25 +++++++++++++++++
 5 files changed, 72 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/scheduler/arch-specific.rst
 create mode 100644 Documentation/scheduler/overview.rst
 create mode 100644 Documentation/scheduler/sched-debugging.rst
 create mode 100644 Documentation/scheduler/sched-features.rst

diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
new file mode 100644
index 000000000000..3e5af3a0695e
--- /dev/null
+++ b/Documentation/scheduler/arch-specific.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Architecture Specific Scheduler Implementation Differences
+==========================================================
+
+.. class:: toc-title
+
+	   Table of contents
+
+.. toctree::
+   :maxdepth: 2
+
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 88900aabdbf7..6e88a070c503 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -1,24 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
 ===============
 Linux Scheduler
 ===============
 
-.. toctree::
-    :maxdepth: 1
+This documentation outlines the Linux kernel scheduler with its concepts,
+details about the scheduler design and its data structures and architecture
+specific implementation differences.
+
 
+.. class:: toc-title
+
+    Table of contents
+
+.. toctree::
+    :maxdepth: 2
 
-    completion
-    sched-arch
-    sched-bwc
-    sched-deadline
+    overview
     sched-design-CFS
-    sched-domains
-    sched-capacity
-    sched-energy
-    sched-nice-design
-    sched-rt-group
-    sched-stats
-
-    text_files
+    sched-features
+    arch-specific
+    sched-debugging
 
 .. only::  subproject and html
 
diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
new file mode 100644
index 000000000000..a1d2d26629eb
--- /dev/null
+++ b/Documentation/scheduler/overview.rst
@@ -0,0 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================
+Scheduler overview
+====================
\ No newline at end of file
diff --git a/Documentation/scheduler/sched-debugging.rst b/Documentation/scheduler/sched-debugging.rst
new file mode 100644
index 000000000000..e332069f99d6
--- /dev/null
+++ b/Documentation/scheduler/sched-debugging.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Scheduler Debugging Interface
+==============================
+
+.. class:: toc-title
+
+	   Table of contents
+
+.. toctree::
+   :maxdepth: 2
+
+   sched-stats
+   text_files
diff --git a/Documentation/scheduler/sched-features.rst b/Documentation/scheduler/sched-features.rst
new file mode 100644
index 000000000000..8eb90e86e489
--- /dev/null
+++ b/Documentation/scheduler/sched-features.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Scheduler Features
+===================
+
+.. class:: toc-title
+
+	Table of contents
+
+.. toctree::
+   :maxdepth: 1
+
+   completion
+   sched-arch
+   sched-bwc
+   sched-deadline
+   sched-domains
+   sched-capacity
+   sched-energy
+   sched-nice-design
+   sched-rt-group
+   sched-stats
+
+   text_files
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview  documentation
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
@ 2020-09-02 16:26 ` John Mathew
  2020-09-09 11:23   ` Lukas Bulwahn
  2020-09-02 16:26 ` [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch John Mathew
  2020-09-09  8:37 ` [RFC PATCH v8 0/3] Add scheduler overview documentation Lukas Bulwahn
  3 siblings, 1 reply; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew, Mostafa Chamanara, Oleg Tsymbal

Add documentation for
 -scheduler overview
 -scheduler state transtion
 -CFS overview
 -scheduler data structs

Add rst for scheduler APIs and modify sched/core.c
to add kernel-doc comments.

Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Co-developed-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Signed-off-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Co-developed-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: John Mathew <john.mathew@unikie.com>
---
 Documentation/scheduler/cfs-overview.rst      |  59 ++++
 Documentation/scheduler/index.rst             |   3 +
 Documentation/scheduler/overview.rst          | 294 +++++++++++++++++-
 .../scheduler/sched-data-structs.rst          | 176 +++++++++++
 Documentation/scheduler/scheduler-api.rst     |  24 ++
 kernel/sched/core.c                           |  21 +-
 kernel/sched/sched.h                          |  63 +++-
 7 files changed, 634 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/scheduler/cfs-overview.rst
 create mode 100644 Documentation/scheduler/sched-data-structs.rst
 create mode 100644 Documentation/scheduler/scheduler-api.rst

diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
new file mode 100644
index 000000000000..1524c24da897
--- /dev/null
+++ b/Documentation/scheduler/cfs-overview.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============
+CFS Overview
+=============
+
+Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
+Scheduler (CFS) implemented as a scheduling module. A brief overview of the
+CFS design is provided in :doc:`sched-design-CFS`
+
+In addition there have been many improvements to the CFS, a few of which are
+
+Tracking available capacity
+---------------------------
+Scale CPU capacity mechanism for CFS so it knows how much CPU capacity is left
+for its use after higher priority sched classes (RT, DL), IRQs and
+'Thermal Pressure' have reduced the 'original' CPU capacity.
+Thermal pressure on a CPU means the maximum possible capacity is
+unavailable due to thermal events.
+
+NUMA balancing
+--------------
+Attempt to migrate tasks to the NUMA Node where the frequently accessed memory
+pages belongs. The scheduler gets information about memory placement through the
+paging mechanism. Scheduler periodically scans the virtual memory of the tasks
+and make them inaccessible by changing the memory protection. The flag
+MM_CP_PROT_NUMA indicates this purpose. When the task attempts to access
+the memory again a page fault occurs. Scheduler traps the fault and increments
+the counters in a task specific array corresponding to the NUMA node id.
+There array is divided in to four regions: faults_memory, faults_cpu,
+faults_memory_buffer and faults_cpu_buffer, where faults_memory is the
+exponential decaying average of faults on a per-node basis. The 'preferred
+node' is found by looping through the array and finding the node with the
+highest number of faults. Migration to the preferred node is done periodically
+by either swapping two tasks tasks between their respective CPUs or
+just moving a task to its preferred node CPU. It the migration or move fails
+it will be retried.
+
+Energy Aware Scheduling
+-----------------------
+For asymmetric CPU capacity topologies, an Energy Model is used to figure out
+which of the CPU candidates is the most energy-efficient. Capacity is the
+amount of work which a CPU can perform at its highest frequency which is
+calculated by the Per-Entity Load Tracking (PELT) mechanism.
+EAS is described at :doc:`sched-energy`
+
+Capacity Aware Scheduling
+--------------------------
+Migrate a task to a CPU which meets its compute demand. In asymmetric CPU
+capacity topologies CFS scheduler frequently updates the 'Misfit' status of
+tasks and migrate them to CPU's of higher capacity. Also during wakeups the
+a CPU with sufficient capacity is found for executing the task. CAS is
+described at :doc:`sched-capacity`
+
+
+
+
+
+
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 6e88a070c503..e3b1d4fc1604 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -17,10 +17,13 @@ specific implementation differences.
     :maxdepth: 2
 
     overview
+    sched-data-structs
+    cfs-overview
     sched-design-CFS
     sched-features
     arch-specific
     sched-debugging
+    scheduler-api
 
 .. only::  subproject and html
 
diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
index a1d2d26629eb..f2fb8f419919 100644
--- a/Documentation/scheduler/overview.rst
+++ b/Documentation/scheduler/overview.rst
@@ -2,4 +2,296 @@
 
 ====================
 Scheduler overview
-====================
\ No newline at end of file
+====================
+
+Linux kernel implements priority-based scheduling. More than one process are
+allowed to run at any given time and each process is allowed to run as if it
+were the only process on the system. The process scheduler coordinates which
+process runs when. In that context, it has the following tasks:
+
+  - share CPUs equally among all currently running processes.
+  - pick appropriate process to run next if required, considering scheduling
+    class/policy and process priorities.
+  - balance processes between multiple CPUs in SMP systems.
+
+The scheduler attempts to be responsive for I/O bound processes and efficient
+for CPU bound processes. The scheduler uses different scheduling policies
+for real time and normal processes based on their respective policy
+enumerations. Scheduler adds support for each policy through scheduling class
+implementations for each. The five scheduling classes which scheduler provides
+are:
+
+  - stop_sched_class:
+    It is a per-cpu maximum priority CPU monopolization mechanism. It is
+    exposed as a SCHED_FIFO task ('migration/X') with static priority of 99
+    in the user space. This is done to make it compatible with user space and
+    thus to avoid growing the ABI. It is used by one CPU to stop another
+    in order to run a specific function, so it is only available on SMP
+    systems. This class is used by the scheduler for task migration between
+    CPUs.
+
+  - dl_sched_class:
+    Implements the SCHED_DEADLINE scheduling policy. It has static priority
+    of -1 in kernel space. This policy schedules each task according to the
+    task's deadline. The task with the earliest deadline will be served first.
+
+  - rt_sched_class:
+    Implements the SCHED_RR and SCHED_FIFO policies. Real time static
+    priorities range from 1(low)..99 in the user space. (priority is inverted
+    in kernel space). It is the only scheduling class that makes use of the
+    static priority of the task. SCHED_FIFO is a simple scheduling algorithm
+    without time slicing. A SCHED_FIFO thread runs until either it is blocked
+    by an I/O request, it is preempted by a higher priority thread, or it
+    calls sched_yield(). SCHED_RR is a simple enhancement of SCHED_FIFO where
+    a thread is allowed to run only for a maximum time quantum.
+
+  - fair_sched_class:
+    Implements the SCHED_NORMAL SCHED_BATCH and SCHED_IDLE  policies. Static
+    priority is always 0 in the user space. A dynamic priority based on
+    'nice' value is used to schedule these tasks. This priority increases each
+    time the the task  is scheduled to run but denied to run by scheduler.
+    This ensures fair scheduling between these tasks. Nice value is an
+    attribute which can be set by the user to influence scheduler to favour
+    a particular task. SCHED_BATCH is similar to SCHED_NORMAL with the
+    difference that the policy causes the scheduler to assume that the task
+    is CPU-intensive. SCHED_IDLE policy also has static priority 0. Nice
+    value has no effect on this policy. Weight mapping is not done, instead
+    weight is set at a constant minimal weight WEIGHT_IDLEPRIO. Used to
+    run tasks at extremely low priority.
+
+  - idle_sched_class:
+    Priority for idle task is irrelevant. This class is not related to
+    SCHED_IDLE policy. Idle tasks run when there are no other runnable tasks
+    on a CPU. The execute the idle loop which is responsible to put a CPU
+    in one of its idle states.
+
+
+Process Management
+==================
+
+Each process in the system is represented by struct task_struct. When a
+process/thread is created, the kernel allocates a new task_struct for it.
+The kernel then stores this task_struct in an RCU list. Macro next_task()
+allows a process to obtain its next task and for_each_process() macro enables
+traversal of the list.
+
+Frequently used fields of the task struct are:
+
+ - state: The running state of the task. The possible states are:
+
+    - TASK_RUNNING: The task is currently running or in a run queue waiting
+      to run.
+    - TASK_INTERRUPTIBLE: The task is sleeping waiting for some event to occur.
+      This task can be interrupted by signals. On waking up the task transitions
+      to TASK_RUNNING.
+    - TASK_UNINTERRUPTIBLE: Similar to TASK_INTERRUPTIBLE but does not wake
+      up on signals. Needs an explicit wake-up call to be woken up. Contributes
+      to loadavg.
+    - __TASK_TRACED: Task is being traced by another task like a debugger.
+    - __TASK_STOPPED: Task execution has stopped and not eligible to run.
+      SIGSTOP, SIGTSTP etc causes this state.  The task can be continued by
+      the signal SIGCONT.
+    - TASK_PARKED: State to support kthread parking/unparking.
+    - TASK_DEAD: If a task dies, then it sets TASK_DEAD in tsk->state and calls
+      schedule one last time. The task will be never ran again.
+    - TASK_WAKEKILL: It works like TASK_UNINTERRUPTIBLE with the bonus that it
+      can respond to fatal signals.
+    - TASK_WAKING: To handle concurrent waking of the same task for SMP.
+      Indicates that someone is already waking the task.
+    - TASK_NOLOAD: To be used along with TASK_UNINTERRUPTIBLE to indicate
+      an idle task which does not contribute to loadavg.
+    - TASK_NEW: Set during fork(), to guarantee that no one will run the task,
+      a signal or any other wake event cannot wake it up and insert it on
+      the runqueue.
+
+ - exit_state : The exiting state of the task. The possible states are:
+
+    - EXIT_ZOMBIE: The task is terminated and waiting for parent to collect
+      the exit information of the task.
+    - EXIT_DEAD: After collecting the exit information the task is put to
+      this state and removed from the system.
+
+ - static_prio: Used by the fair scheduling class to encode the nice level.
+   It does not have any effect on the SCHED_DEADLINE, SCHED_FIFO or SCHED_RR
+   policy tasks.
+
+ - prio: The value of this field is used to:
+
+    - distinguish scheduling classes.
+    - in the RR/FIFO static priority scheduler.
+
+ - normal_prio: Expected priority of a task. The value of static_prio
+   and normal_prio are the same for non-real-time processes. For real time
+   processes value of prio is used.
+
+ - rt_priority: Field used to set priority of real time tasks. Not used by the
+   rt_sched_class.
+
+ - sched_class: Pointer to sched_class structure of the policy that the task
+   belongs to.
+
+ - sched_entity: Pointer to sched_entity CFS structure.
+
+ - policy: scheduling policy of the task. See above.
+
+ - nr_cpus_allowed: Hamming weight of the bitmask retrieved from cpumask pointer.
+
+New tasks are created using the fork() system call which is described
+at manpage `FORK(2)` or the clone system call described at manpage `CLONE(2)`.
+Users can create threads within a process to achieve parallelism. Threads
+share address space, open files and other resources of the process. Threads
+are created like normal tasks with their unique task_struct, but clone()
+is provided with flags that enable the sharing of resources such as address
+space ::
+
+	clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
+
+The scheduler schedules task_structs so from scheduler perspective there is
+no difference between threads and processes. Threads are created using
+the system call pthread_create described at its manpage `PTHREAD_CREATE(3)`
+POSIX threads creation is described at its manpage `PTHREADS(7)`
+
+The Scheduler Entry Point
+=========================
+
+The main scheduler entry point is an architecture independent schedule()
+function defined in kernel/sched/core.c. Its objective is to find a process in
+the runqueue list and then assign the CPU to it. It is invoked, directly
+or in a lazy (deferred) way from many different places in the kernel. A lazy
+invocation does not call the function by its name, but gives the kernel a
+hint by setting a flag TIF_NEED_RESCHED. The flag is a message to the kernel
+that the scheduler should be invoked as soon as possible because another
+process deserves to run. The flag should not be modified directly.
+
+Following are some places that notify the kernel to schedule which can be
+classified based on the type of operations:
+
+  - Blocking operations: Suspends the current task and directly call into
+    the scheduler to find something else to do. Some blocking operations are:
+
+      - mutex_lock()
+      - wait_event()
+      - do_exit()
+      - preempt_schedule_irq()
+
+  - Co-operative or voluntary preemptions: Allows another task to run at that
+    point subject to preemption model. Voluntary preemption model can be
+    set through the kernel config option: CONFIG_PREEMPT_VOLUNTARY. The
+    operations are:
+
+      - cond_resched()
+      - cond_resched_lock()
+      - yield()
+      - preempt_enable()
+
+  - Involuntary preemption: Marks TIF_NEED_RESCHED and wait for action
+    depending on preemption model. Involuntary preemption operations are:
+
+      - scheduler_tick()
+      - wake_up_process()
+
+Calling functions mentioned above leads to a call to __schedule(). Note
+that preemption must be disabled before it is called and enabled after
+the call using preempt_disable and preempt_enable functions family.
+
+
+The steps during invocation are:
+--------------------------------
+1. Disable preemption to avoid another task preempting the scheduling
+   thread itself.
+2. Retrieve the runqueue of current processor and its lock is obtained to
+   allow only one thread to modify the runqueue at a time.
+3. The state of the previously executed task when the schedule()
+   was called is examined. If it is not runnable and has not been
+   preempted in kernel mode, it is removed from the runqueue. If the
+   previous task has non-blocked pending signals, its state is set to
+   TASK_RUNNING and left in the runqueue.
+4. Scheduler classes are iterated and the corresponding class hook to
+   pick the next suitable task to be scheduled on the CPU is called.
+   Since most tasks are handled by the sched_fair class, a shortcut to this
+   class is implemented in the beginning of the function.
+5. TIF_NEED_RESCHED and architecture specific need_resched flags are cleared.
+6. If the scheduler class picks a different task from what was running
+   before, a context switch is performed by calling context_switch().
+   Internally, context_switch() switches to the new task's memory map and
+   swaps the register state and stack. If scheduler class picked the same
+   task as the previous task, no task switch is performed and the current
+   task keeps running.
+7. Balance callback list is processed. Each scheduling class can migrate tasks
+   between CPUs to balance load. These load balancing operations are queued
+   on a Balance callback list which get executed when balance_callback() is
+   called.
+8. The runqueue is unlocked and preemption is re-enabled. In case
+   preemption was requested during the time in which it was disabled,
+   schedule() is run again right away.
+
+Scheduler State Transition
+==========================
+
+A very high level scheduler state transition flow with a few states can
+be depicted as follows. ::
+
+                                       *
+                                       |
+                                       | task
+                                       | forks
+                                       v
+                        +------------------------------+
+                        |           TASK_NEW           |
+                        |        (Ready to run)        |
+                        +------------------------------+
+                                       |
+                                       |
+                                       v
+                     +------------------------------------+
+                     |            TASK_RUNNING            |
+   +---------------> |           (Ready to run)           | <--+
+   |                 +------------------------------------+    |
+   |                   |                                       |
+   |                   | schedule() calls context_switch()     | task is preempted
+   |                   v                                       |
+   |                 +------------------------------------+    |
+   |                 |            TASK_RUNNING            |    |
+   |                 |             (Running)              | ---+
+   | event occurred  +------------------------------------+
+   |                   |
+   |                   | task needs to wait for event
+   |                   v
+   |                 +------------------------------------+
+   |                 |         TASK_INTERRUPTIBLE         |
+   |                 |        TASK_UNINTERRUPTIBLE        |
+   +-----------------|           TASK_WAKEKILL            |
+                     +------------------------------------+
+                                       |
+                                       | task exits via do_exit()
+                                       v
+                        +------------------------------+
+                        |          TASK_DEAD           |
+                        |         EXIT_ZOMBIE          |
+                        +------------------------------+
+
+
+Scheduler provides trace events tracing all major events of the scheduler.
+The trace events are defined in ::
+
+  include/trace/events/sched.h
+
+Using these trace events it is possible to model the scheduler state transition
+in an automata model. The following journal paper discusses such modeling:
+
+Daniel B. de Oliveira, Rômulo S. de Oliveira, Tommaso Cucinotta, **A thread
+synchronization model for the PREEMPT_RT Linux kernel**, *Journal of Systems
+Architecture*, Volume 107, 2020, 101729, ISSN 1383-7621,
+https://doi.org/10.1016/j.sysarc.2020.101729.
+
+To model the scheduler efficiently the system was divided in to generators
+and specifications. Some of the generators used were "need_resched",
+"sleepable" and "runnable", "thread_context" and "scheduling context".
+The specifications are the necessary and sufficient conditions to call
+the scheduler. New trace events were added to specify the generators
+and specifications. In case a kernel event referred to more than one
+event, extra fields of the kernel event was used to distinguish between
+automation events. The final model was generated from parallel composition
+of all generators and specifications which composed of 34 events,
+12 generators and 33 specifications. This resulted in 9017 states, and
+20103 transitions.
diff --git a/Documentation/scheduler/sched-data-structs.rst b/Documentation/scheduler/sched-data-structs.rst
new file mode 100644
index 000000000000..b8d968c70bfc
--- /dev/null
+++ b/Documentation/scheduler/sched-data-structs.rst
@@ -0,0 +1,176 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=========================
+Scheduler Data Structures
+=========================
+
+The main parts of the Linux scheduler are:
+
+Runqueue
+~~~~~~~~
+
+struct rq is the central data structure of process scheduling. It keeps track
+of tasks that are in a runnable state assigned for a particular processor.
+Each CPU has its own run queue and stored in a per CPU array::
+
+    DEFINE_PER_CPU(struct rq, runqueues);
+
+Access to the queue requires locking and lock acquire operations must be
+ordered by ascending runqueue. Macros for accessing and locking the runqueue
+are provided in::
+
+    kernel/sched/sched.h
+
+The runqueue contains scheduling class specific queues and several scheduling
+statistics.
+
+Scheduling entity
+~~~~~~~~~~~~~~~~~
+Scheduler uses scheduling entities which contain sufficient information to
+actually accomplish the scheduling job of a task or a task-group. The
+scheduling entity may be a group of tasks or a single task. Every task is
+associated with a sched_entity structure. CFS adds support for nesting of
+tasks and task groups. Each scheduling entity may be run from its parents
+runqueue. The scheduler traverses the sched_entity hierarchy to pick the
+next task to run on the CPU. The entity gets picked up from the cfs_rq on
+which it is queued and its time slice is divided among all the tasks on its my_q.
+
+Scheduler classes
+~~~~~~~~~~~~~~~~~
+It is an extensible hierarchy of scheduler modules. The modules encapsulate
+scheduling policy details. They are called from the core code which is
+independent. Scheduling classes are implemented through the sched_class
+structure. The five scheduling classes are stop_sched_class, dl_sched_class,
+rt_sched_class, fair_sched_class and idle_sched_class. The important methods
+of scheduler class are:
+
+  - sched_class::enqueue_task()
+  - sched_class::dequeue_task()
+    These functions are used to put and remove tasks from the runqueue
+    respectively to change a property of a task. This is referred to as
+    change pattern. Change is defined as the following sequence of calls:
+
+      - dequeue_task()
+      - put_prev_task()
+      - change a property
+      - enqueue_task()
+      - set_next_task()
+
+    The enqueue_task function takes the runqueue, the task which needs to
+    be enqueued/dequeued and a bit mask of flags as parameters. The main
+    purpose of the flags is to describe why the enqueue or dequeue is being
+    called. The different flags used are described in ::
+
+        kernel/sched/sched.h
+
+    Some places where the enqueue_task and dequeue_task are called for
+    changing task properties are:
+
+      - When migrating a task from one CPU's runqueue to another.
+      - When changing a tasks CPU affinity.
+      - When changing the priority of a task.
+      - When changing the nice value of the task.
+      - When changing the scheduling policy and/or RT priority of a thread.
+
+  - sched_class::pick_next_task()
+    Called by the scheduler to pick the next best task to run. The scheduler
+    iterates through the corresponding functions of the scheduler classes
+    in priority order to pick up the next best task to run. Since tasks
+    belonging to the idle class and fair class are frequent, the scheduler
+    optimizes the picking of next task to call the pick_next_task_fair()
+    if the previous task was of the similar scheduling class.
+
+  - sched_class::put_prev_task()
+    Called by the scheduler when a running task is being taken off a CPU.
+    The behavior of this function depends on individual scheduling classes.
+    In CFS class this function is used to put the currently running task back
+    into the CFS RB tree. When a task is running it is dequeued from the tree.
+    This is to prevent redundant enqueue's and dequeue's for updating its
+    vruntime. vruntime of tasks on the tree needs to be updated by update_curr()
+    to keep the tree in sync. In SCHED_DEADLINE and RT classes additional tree
+    is maintained to push tasks from the current CPU to another CPU where the
+    task can preempt and start executing. Task will be added to this queue
+    if it is present on the scheduling class rq and the task has affinity
+    to more than one CPU. set_next_task() is called on the task returned from
+    this function.
+
+  - sched_class::set_next_task()
+    Pairs with the put_prev_task(), this function is called when the next
+    task is set to run on the CPU. This function is called in all the places
+    where put_prev_task is called to complete the 'change pattern'. In case
+    of CFS scheduling class, it will set current scheduling entity to the
+    picked task and accounts bandwidth usage on the cfs_rq. In addition it
+    will also remove the current entity from the CFS runqueue for the vruntime
+    update optimization, opposite to what was done in put_prev_task.
+    For the SCHED_DEADLINE and RT classes it will remove the task from the
+    tree of pushable tasks trigger the balance callback to push another task
+    which is non running on the current CPU for execution on another CPU.
+
+      - dequeue the picked task from the tree of pushable tasks.
+      - update the load average in case the previous task belonged to another
+        class.
+      - queues the function to push tasks from current runqueue to other CPUs
+        which can preempt and start execution. Balance callback list is used.
+
+  - sched_class::task_tick()
+    Called from scheduler_tick(), hrtick() and sched_tick_remote() to update
+    the current task statistics and load averages. Also restarting the high
+    resolution tick timer is done if high resolution timers are enabled.
+    scheduler_tick() runs at 1/HZ and is called from the timer interrupt
+    handler of the Kernel internal timers.
+    hrtick() is called from high resolution timers to deliver an accurate
+    preemption tick as the regular scheduler tick that runs at 1/HZ can be
+    too coarse when nice levels are used.
+    sched_tick_remote() gets called by the offloaded residual 1Hz scheduler
+    tick. In order to reduce interruptions to bare metal tasks, it is possible
+    to outsource these scheduler ticks to the global workqueue so that a
+    housekeeping CPU handles those remotely.
+
+  - sched_class::select_task_rq()
+    Called by scheduler to get the CPU to assign a task to and migrating
+    tasks between CPUs. Flags describe the reason the function was called.
+    Called by try_to_wake_up() with SD_BALANCE_WAKE flag which wakes up a
+    sleeping task.
+    Called by wake_up_new_task() with SD_BALANCE_FORK flag which wakes up a
+    newly forked task.
+    Called by sched_exec() with SD_BALANCE_EXEC which is called from execv
+    syscall.
+    SCHED_DEADLINE class decides the CPU on which the task should be woken
+    up based on the deadline. RT class decides based on the RT priority. Fair
+    scheduling class balances load by selecting the idlest CPU in the
+    idlest group, or under certain conditions an idle sibling CPU if the
+    domain has SD_WAKE_AFFINE set.
+
+  - sched_class::balance()
+    Called by pick_next_task() from scheduler to enable scheduling classes
+    to pull tasks from runqueues of other CPUs for balancing task execution
+    between the CPUs.
+
+  - sched_class::task_fork()
+    Called from sched_fork() of scheduler which assigns a task to a CPU.
+    Fair scheduling class updates runqueue clock, runtime statistics and
+    vruntime for the scheduling entity.
+
+  - sched_class::yield_task()
+    Called from SYSCALL sched_yield to yield the CPU to other tasks.
+    SCHED_DEADLINE class forces the runtime of the task to zero using a special
+    flag and dequeues the task from its trees. RT class requeues the task
+    entities to the end of the run list. Fair scheduling class implements
+    the buddy mechanism. This allows skipping onto the next highest priority
+    scheduling entity at every level in the CFS tree, unless doing so would
+    introduce gross unfairness in CPU time distribution.
+
+  - sched_class::check_preempt_curr()
+    Check whether the task that woke up should preempt the currently
+    running task. Called by scheduler
+
+      - when moving queued task to new runqueue
+      - ttwu()
+      - when waking up newly created task for the first time.
+
+    SCHED_DEADLINE class compares the deadlines of the tasks and calls
+    scheduler function resched_curr() if the preemption is needed. In case
+    the deadlines are equal, migratability of the tasks is used a criteria
+    for preemption.
+    RT class behaves the same except it uses RT priority for comparison.
+    Fair class sets the buddy hints before calling resched_curr() to preempt.
diff --git a/Documentation/scheduler/scheduler-api.rst b/Documentation/scheduler/scheduler-api.rst
new file mode 100644
index 000000000000..6b9cb03d0d86
--- /dev/null
+++ b/Documentation/scheduler/scheduler-api.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================
+Scheduler related functions
+=============================
+
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: scheduler_tick
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: try_to_wake_up
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: do_task_dead
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: preempt_schedule_irq
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: prepare_task_switch
+
+.. kernel-doc:: kernel/sched/core.c
+	:functions: finish_task_switch
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d95dc3f4644..99a3b5ab4039 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3976,9 +3976,13 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
-/*
+/**
+ * scheduler_tick - sched tick timer handler
+ *
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
+ *
+ * Return: 0.
  */
 void scheduler_tick(void)
 {
@@ -4367,8 +4371,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
-/*
- * __schedule() is the main scheduler function.
+/**
+ * __schedule() - the main scheduler function.
+ *
+ * @preempt: preemption enabled/disabled
  *
  * The main means of driving the scheduler and thus entering this function are:
  *
@@ -4533,6 +4539,12 @@ static void __sched notrace __schedule(bool preempt)
 	balance_callback(rq);
 }
 
+/**
+ * do_task_dead - handle task exit
+ *
+ * Changes the task state to TASK_DEAD and calls
+ * schedule to pick next task to run.
+ */
 void __noreturn do_task_dead(void)
 {
 	/* Causes final put_task_struct in finish_task_switch(): */
@@ -4764,7 +4776,8 @@ EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
 #endif /* CONFIG_PREEMPTION */
 
-/*
+/**
+ * preempt_schedule_irq - schedule from irq context
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.
  * Note, that this is called and return with irqs disabled. This will
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..11b2125df8b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -902,16 +902,25 @@ struct rq {
 	 */
 	unsigned int		nr_running;
 #ifdef CONFIG_NUMA_BALANCING
+    /* Number of tasks running that care about their NUMA placement */
 	unsigned int		nr_numa_running;
+	/* Number of tasks that are optimally NUMA placed */
 	unsigned int		nr_preferred_running;
+	/*
+	 * Per runqueue variable to check if NUMA-balance is active on the
+	 * run-queue
+	 */
 	unsigned int		numa_migrate_on;
 #endif
 #ifdef CONFIG_NO_HZ_COMMON
 #ifdef CONFIG_SMP
+    /* Tick stamp for decay of blocked load */
 	unsigned long		last_blocked_load_update_tick;
+	/* Idle CPU has blocked load */
 	unsigned int		has_blocked_load;
 	call_single_data_t	nohz_csd;
 #endif /* CONFIG_SMP */
+    /* CPU is going idle with tick stopped */
 	unsigned int		nohz_tick_stopped;
 	atomic_t		nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
@@ -927,14 +936,17 @@ struct rq {
 	unsigned int		uclamp_flags;
 #define UCLAMP_FLAG_IDLE 0x01
 #endif
-
+	/* Fair scheduling class runqueue */
 	struct cfs_rq		cfs;
+	/* RT scheduling class runqueue */
 	struct rt_rq		rt;
+	/* Deadline scheduing class runqueue */
 	struct dl_rq		dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
+	/* Reference to add child before its parent in leaf_cfs_rq_list */
 	struct list_head	*tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -946,19 +958,28 @@ struct rq {
 	 */
 	unsigned long		nr_uninterruptible;
 
+	/* Currently running task of this rq */
 	struct task_struct __rcu	*curr;
+	/* Idle task of this rq */
 	struct task_struct	*idle;
+	/* Stop task of this rq */
 	struct task_struct	*stop;
 	unsigned long		next_balance;
 	struct mm_struct	*prev_mm;
 
+	/* RQCF clock_update_flags bits */
 	unsigned int		clock_update_flags;
+	/* sched_clock() value for the queue */
 	u64			clock;
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
 	u64			clock_pelt;
 	unsigned long		lost_idle_time;
 
+	/*
+	 * Account the idle time that we could have spend running if it were
+	 * not for IO
+	 */
 	atomic_t		nr_iowait;
 
 #ifdef CONFIG_MEMBARRIER
@@ -967,9 +988,18 @@ struct rq {
 
 #ifdef CONFIG_SMP
 	struct root_domain		*rd;
+	/* A domain heirarchy of CPU groups to balance process load among them */
 	struct sched_domain __rcu	*sd;
 
+	/*
+	 * Information about CPUs heterogeneity used for CPU performance
+	 * scaling
+	 */
 	unsigned long		cpu_capacity;
+	/*
+	 * Original capacity of a CPU before being altered by rt tasks
+	 * and/or IRQ
+	 */
 	unsigned long		cpu_capacity_orig;
 
 	struct callback_head	*balance_callback;
@@ -977,6 +1007,11 @@ struct rq {
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
+	/*
+	 * Set whenever the current running task has a utilization greater
+	 * than 80% of rq->cpu_capacity. A non-zero value in this field
+	 * enables misfit load balancing
+	 */
 	unsigned long		misfit_task_load;
 
 	/* For active balancing */
@@ -988,16 +1023,42 @@ struct rq {
 	int			cpu;
 	int			online;
 
+	/*
+	 * An MRU list used for load balancing, sorted (except
+	 * woken tasks) starting from recently given CPU time tasks
+	 * toward tasks with max wait time in a run-queue
+	 */
 	struct list_head cfs_tasks;
 
+	/*
+	 * Track the utilization of RT tasks for a more accurate
+	 * view of the utilization of the CPU when overloaded by CFS and
+	 * RT tasks
+	 */
 	struct sched_avg	avg_rt;
+
+	/*
+	 * Track the utilization of DL tasks as CFS tasks can be preempted
+	 * by DL tasks and the CFS's utilization might no longer describe
+	 * the real utilization level
+	 */
 	struct sched_avg	avg_dl;
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
+	/*
+	 * Track the utilization of interrupt to give a more accurate
+	 * level of utilization of CPU taking into account the time spent
+	 * under interrupt context when rq's clock is updated
+	 */
 	struct sched_avg	avg_irq;
 #endif
 #ifdef CONFIG_SCHED_THERMAL_PRESSURE
+	/*
+	 * Tracks thermal pressure which is the reduction in maximum
+	 * possible capacity due to thermal events
+	 */
 	struct sched_avg	avg_thermal;
 #endif
+	/* Time stamp at which idle load balance started for this rq */
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
  2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
@ 2020-09-02 16:26 ` John Mathew
  2020-09-09  8:37 ` [RFC PATCH v8 0/3] Add scheduler overview documentation Lukas Bulwahn
  3 siblings, 0 replies; 7+ messages in thread
From: John Mathew @ 2020-09-02 16:26 UTC (permalink / raw)
  To: linux-doc
  Cc: linux-kernel, corbet, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tsbogend,
	lukas.bulwahn, x86, linux-mips, tglx, willy, valentin.schneider,
	John Mathew, Mostafa Chamanara, Oleg Tsymbal

Add documentation for introduction to
 -context-switch
 -x86 context-switch
 -MIPS context switch

Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Co-developed-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Signed-off-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
Co-developed-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
Signed-off-by: John Mathew <john.mathew@unikie.com>
---
 Documentation/scheduler/arch-specific.rst     |   2 +
 Documentation/scheduler/context-switching.rst | 126 ++++++++++++++++++
 Documentation/scheduler/index.rst             |   1 +
 .../scheduler/mips-context-switch.rst         |  89 +++++++++++++
 .../scheduler/x86-context-switch.rst          |  55 ++++++++
 5 files changed, 273 insertions(+)
 create mode 100644 Documentation/scheduler/context-switching.rst
 create mode 100644 Documentation/scheduler/mips-context-switch.rst
 create mode 100644 Documentation/scheduler/x86-context-switch.rst

diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
index 3e5af3a0695e..65dc393b605f 100644
--- a/Documentation/scheduler/arch-specific.rst
+++ b/Documentation/scheduler/arch-specific.rst
@@ -10,3 +10,5 @@ Architecture Specific Scheduler Implementation Differences
 .. toctree::
    :maxdepth: 2
 
+   x86-context-switch
+   mips-context-switch
diff --git a/Documentation/scheduler/context-switching.rst b/Documentation/scheduler/context-switching.rst
new file mode 100644
index 000000000000..04f97bff08e1
--- /dev/null
+++ b/Documentation/scheduler/context-switching.rst
@@ -0,0 +1,126 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==========================
+Process context switching
+==========================
+
+Context Switching
+-----------------
+
+Context switching, the switching from a running task to another,
+is done by the context_switch() function defined in kernel/sched/core.c.
+It is called by __schedule() when a new process has been selected to run.
+The execution flow is as follows:
+
+ - prepare_task_switch() performs necessary kernel preparations for the
+   context switch and then calls prepare_arch_switch() for architecture
+   specific context switch preparation. This call must be paired with a
+   subsequent finish_task_switch() after the context switch. The various
+   steps are:
+
+    - Prepare kcov for context switch. Context switch does switch_mm() to the
+      next task's mm, then switch_to() that new task. This means vmalloc'd
+      regions which had previously been faulted in can transiently disappear in
+      the context of the prev task. Functions instrumented by KCOV may try to
+      access a vmalloc'd kcov_area during this window, and result in a recursive
+      fault. This is avoided by setting a new flag: KCOV_IN_CTXSW in kcov_mode
+      prior to switching the mm, and cleared once the new task is live.
+    - Update sched_info statistics for both the prev and next tasks.
+    - Handle perf subsystem context switch from previous task to next.
+      The various steps are:
+
+        - Remove perf events for the task being context-switched out.
+        - Stop each perf event and update the event value in event->count.
+        - Call the context switch callback for PMU with flag indicating
+          schedule out.
+        - Create a PERF_RECORD_MISC_SWITCH_OUT perf event.
+        - Context switch the perf event contexts between the current and next tasks.
+        - Schedule out current cgroup events if cgroup perf events exist on the
+          CPU.
+
+    - Set TIF_NOTIFY_RESUME flag on the current thread for the Restartable
+      sequence mechanism. Restartable sequences allow user-space to perform
+      update operations on per-cpu data without requiring heavy-weight atomic
+      operations.
+    - Fire preempt notifiers. A task can request the scheduler to notify it
+      whenever it is preempted or scheduled back in. This allows the task to
+      swap any special-purpose registers like the FPU or Intel's VT registers.
+    - Claim the next task as running to prevent load balancing run on it.
+
+ - arch_start_context_switch() batches the reload of page tables and other
+   process state with the actual context switch code for paravirtualized
+   guests.
+
+ - Transfer the real and anonymous address spaces between the switching tasks.
+   Four possible transfer types are:
+
+    - kernel task switching to another kernel task
+    - user task switching to a kernel task
+    - kernel task switching to user task
+    - user task switching to user task
+
+    For a kernel task switching to kernel task enter_lazy_tlb() is called
+    which is an architecture specific implementation to handle a context
+    without an mm. Architectures implement lazy tricks to minimize TLB
+    flushes here. The active address space from the previous task is
+    borrowed (transferred) to the next task.
+
+    For a user task switching to kernel task it will have a real address
+    space and so its anonymous users counter is incremented. This makes
+    sure that the address space will not get freed even after the previous
+    task exits.
+
+    For a user task switching to user task the architecture specific
+    switch_mm_irqs_off() or switch_mm() functions are called.  The main
+    functionality of these calls is to switch the address space between
+    the user space processes.  This includes switching the page table pointers
+    either via retrieved valid ASID for the process or page mapping in the TLB.
+
+    For a kernel task switching to a user task, switch_mm_irqs_off()
+    replaces the address space of prev kernel task(last active_mm) with the
+    next (next mm) from the user task. The context_switch() function saves the
+    active_mm to the runqueue’s prev_mm field to drop this mm later in
+    the finish_task_switch().
+
+ - prepare_lock_switch() releases lockdep of the runqueue lock to handle
+   the special case of the scheduler context switch where the runqueue lock
+   will be released by the next task.
+
+ - Architecture specific implementation of switch_to() switches the
+   register state and the stack. This involves saving and restoring stack
+   information and the processor registers and any other
+   architecture-specific state that must be managed and restored on a
+   per-process basis.
+
+ - finish_task_switch() performs the final steps of the context switch:
+
+    - Emit a warning if the preempt count is corrupted and set the preempt count
+      to FORK_PREEMPT_COUNT.
+    - Reset the pointer to the memory descriptor used by prev which was set in
+      context_switch().
+    - Store the state of the previous task to handle the possibility of a DEAD
+      task.
+    - Do virtual CPU time accounting for the previous task.
+    - Handle perf subsystem context switch from previous task to current:
+
+       - Add perf events for the current task.
+       - Schedule in current cgroup events if cgroup perf events exist on the
+         CPU.
+       - Context switch the perf event contexts between the prev and current
+         tasks.
+       - Clear the PERF_RECORD_MISC_SWITCH_OUT perf event.
+       - Call the context switch callback for PMU with flag indicating
+         schedule in.
+
+    - Free the task for load balancing run on it.
+    - Unlock the rq lock.
+    - Clear the KCOV_IN_CTXSW in kcov_mode which was set in prepare_task_switch
+      now that the new task is live.
+    - Fire preempt notifiers to notify about task scheduled back in.
+    - If the prev task state indicated that it was dead, the corresponding
+      scheduler class task_dead hook is called. Function-return probe
+      instances associated with the task are removed and put back on the
+      free list. Stack for the task is freed and drop the RCU references.
+    - Evaluate the need for No idle tick due to the context switch and do the
+      idle tick if needed.
+
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index e3b1d4fc1604..fc7ee056f3bb 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -20,6 +20,7 @@ specific implementation differences.
     sched-data-structs
     cfs-overview
     sched-design-CFS
+    context-switching
     sched-features
     arch-specific
     sched-debugging
diff --git a/Documentation/scheduler/mips-context-switch.rst b/Documentation/scheduler/mips-context-switch.rst
new file mode 100644
index 000000000000..42a3454a06f0
--- /dev/null
+++ b/Documentation/scheduler/mips-context-switch.rst
@@ -0,0 +1,89 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============================================
+MIPS Architecture And Scheduler implementation
+==============================================
+
+Multi-threading in MIPS CPUs
+-----------------------------
+The MIPS architecture defines four coprocessors.
+
+ - CP0: supports virtual memory system and exception handling.
+ - CP1: reserved for the floating point coprocessor, the FPU.
+ - CP2: available for specific implementations.
+ - CP3: reserved for floating point operations in the release 1
+   implementation of MIPS64.
+
+MIPS32 and MIPS64 architectures provide support for optional components
+known as Modules or Application Specific Extensions. The MT module
+enables the architecture to support multi-threaded implementations.
+This includes support for virtual processors and lightweight thread
+contexts. Implementation of MT features depends on the individual MIPS
+cores. The virtual processing element (VPE) maintains a complete copy
+of the processor state as seen by the software system which includes
+interrupts, register set, and MMU. This enables a single processor to
+appear to an SMP operating system like two separate cores if it has
+2 VPE's. For example two separate OSes can run on each VPE such as Linux
+and and an RTOS.
+
+A lighter version of VPE enables threading at the user/application
+software level.  It is called Thread Context (TC). TC is the hardware
+state necessary to support a thread of execution. This includes a set
+of general purpose registers (GPRs), a program counter (PC), and some
+multiplier and coprocessor state.  TCs have common execution unit.
+MIPS ISA provides instructions to utilize TC.
+
+The Quality of service block of the MT module allows the allocation of
+processor cycles to threads, and sets relative thread priorities. This
+enables 2 thread prioritization mechanisms. The user can prioritize one
+thread over the other as well as allocate a specific ratio of the cycles
+to specific threads. These mechanisms allocate bandwidth to a set
+of threads effectively. QoS block improves system level determinism
+and predictability. Qos block can be replaced by more application
+specific blocks.
+
+MIPS Context Switch
+-------------------
+
+Context switch behavior specific to MIPS begins in the way the switch_to()
+macro is implemented. The main steps in the MIPS implementation of the macro
+are:
+
+ - Handle the FPU affinity management feature. This feature is enabled
+   by the config option CONFIG_MIPS_MT_FPAFF at build time. The macro checks
+   if the FPU was used in the most recent time slice. In case FPU was not
+   used, the restriction of having to run on a CPU with FPU is removed.
+ - Disable the FPU and clear the bit indicating the FPU was used in this
+   quantum for the task for the previous task.
+ - If FPU is enabled in the next task, check FCSR for any unmasked
+   exceptions pending, clear them and send a signal.
+ - If MIPS DSP modules is enabled, save the DSP context of the previous
+   task and restore the dsp context of the next task.
+ - If coprocessor 2 is present set the access allowed field of the
+   coprocessor 2.
+ - If coprocessor 2 access allowed field was set in previous task, clear it.
+ - Clear the the access allowed field of the coprocessor 2.
+ - Clear the llbit on MIPS release 6 such that instruction eretnc can be
+   used unconditionally when returning to userland in entry.S.
+   LLbit is used to specify operation for instructions that provide atomic
+   read-modify-write. LLbit is set when a linked load occurs and is tested
+   by the conditional store.  It is cleared, during other CPU operation,
+   when a store to the location would no longer be atomic. In particular,
+   it is cleared by exception return instructions.  eretnc instruction
+   enables to return from interrupt, exception, or error trap without
+   clearing the LLbit.
+ - Clear the global variable ll_bit used by MIPS exception handler.
+ - Write the thread pointer to the MIPS userlocal register if the CPU
+   supports this feature. This register is not interpreted by hardware and
+   can be used to share data between privileged and unprivileged software.
+ - If hardware watchpoint feature is enabled during build, the watchpoint
+   registers are restored from the next task.
+ - Finally the MIPS processor specific implementation of the resume()
+   function is called. It restores the registers of the next task including
+   the stack pointer. The implementation is in assembly in the following
+   architecutre specific files ::
+
+        arch/mips/kernel/r4k_switch.S
+        arch/mips/kernel/r2300_switch.S
+        arch/mips/kernel/octeon_switch.S
+
diff --git a/Documentation/scheduler/x86-context-switch.rst b/Documentation/scheduler/x86-context-switch.rst
new file mode 100644
index 000000000000..96941d9435fc
--- /dev/null
+++ b/Documentation/scheduler/x86-context-switch.rst
@@ -0,0 +1,55 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+X86 Context Switch
+------------------
+
+The x86 architecture context switching logic is as follows.
+After the switching of MM in the scheduler, context_switch() calls the x86
+implementation of switch_to() macro. For x86 arch it is located at ::
+
+    arch/x86/include/asm/switch_to.h
+
+The macro calls the the assembly implementation of __switch_to_asm() in the
+assembly files ::
+
+    arch/x86/entry/entry_64.S
+    arch/x86/entry/entry_32.S
+
+The main steps of the assembly function __switch_to_asm() are:
+
+ - store the callee saved registers to the old stack which will be switched
+   away from.
+ - swap the stack pointers between the old and the new task.
+ - move the stack canary value to the current CPU's interrupt stack
+ - if return trampoline is enabled, overwrite all entries in the RSB on
+   exiting a guest, to prevent malicious branch target predictions from
+   affecting the host kernel.
+ - restore all registers from the new stack previously pushed in reverse
+   order.
+ - jump to a C implementation of __switch_to(). The sources are located in::
+
+      arch/x86/kernel/process_64.c
+      arch/x86/kernel/process_32.c
+
+
+The main steps of the C function __switch_to() which is effectively
+the new task running are as follows:
+
+ - retrieve the thread_struct and fpu struct from the next and previous tasks.
+ - get the current CPU tss_struct.
+ - save the current FPU state while on the old task.
+ - store the FS and GS segment registers before changing the thread local
+   storage.
+ - reload the GDT for the new tasks TLS.
+
+   Following is effectively arch_end_context_switch().
+ - save the ES and DS segments of the previous task and load the same from
+   the nest task.
+ - load the FS and GS segment registers.
+ - update the current task of the CPU.
+ - update the top of stack pointer for the CPU for entry trampoline.
+ - initialize FPU state for next task.
+ - set sp0 to point to the entry trampoline stack.
+ - call _switch_to_xtra() to handle debug registers, I/O
+   bitmaps and speculation mitigation.
+ - write the task's CLOSid/RMID to IA32_PQR_MSR.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v8 0/3] Add scheduler overview documentation
  2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
                   ` (2 preceding siblings ...)
  2020-09-02 16:26 ` [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch John Mathew
@ 2020-09-09  8:37 ` Lukas Bulwahn
  3 siblings, 0 replies; 7+ messages in thread
From: Lukas Bulwahn @ 2020-09-09  8:37 UTC (permalink / raw)
  To: John Mathew
  Cc: linux-doc, linux-kernel, corbet, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tsbogend, lukas.bulwahn, x86, linux-mips, tglx, willy,
	valentin.schneider



On Wed, 2 Sep 2020, John Mathew wrote:

> This patch series updates the scheduler documentation to add more topics
> wrt to scheduler overview. New sections are added to provide a brief
> overview of the kernel structs used by the scheduler, scheduler invocation,
> and context switch. Previous version of the patch was reviewed at:
> https://lore.kernel.org/lkml/20200527084421.4673-1-John.Mathew@unikie.com/
>

John, here is some first feedback to get the ball rolling:

I tried to apply your patches on v5.9-rc4, and I got those warnings:

Applying: docs: scheduler: Restructure scheduler documentation.
.git/rebase-apply/patch:30: new blank line at EOF.
+
.git/rebase-apply/patch:137: new blank line at EOF.
+
warning: 2 lines add whitespace errors.
Applying: docs: scheduler: Add scheduler overview documentation
.git/rebase-apply/patch:73: new blank line at EOF.
+
warning: 1 line adds whitespace errors.
Applying: docs: scheduler: Add introduction to scheduler context-switch
.git/rebase-apply/patch:153: new blank line at EOF.
+
.git/rebase-apply/patch:260: new blank line at EOF.
+
warning: 2 lines add whitespace errors.


You might want to look into this. I also checked that the patch also 
applies on linux-next, i.e., next-20200908; so, it does not clash in an 
obvious way with other changes at the moment.

I did run checkpatch.pl and it warned about:
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?

No action required here. 


Documentation generation (make htmldocs) shows these two new warnings with 
your patches applied to v5.9-rc4:

  ./kernel/sched/core.c:17: WARNING: Definition list ends without a blank 
line; unexpected unindent.
  ./kernel/sched/core.c:21: WARNING: Unexpected indentation.


You might want to put those minor fixes on your remaining TODO list for 
this patchset as well.

I will continue to comment with more editorial points in the next hours 
and days, while proof-reading your additions to the documentation.


Lukas

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation.
  2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
@ 2020-09-09 10:28   ` Lukas Bulwahn
  0 siblings, 0 replies; 7+ messages in thread
From: Lukas Bulwahn @ 2020-09-09 10:28 UTC (permalink / raw)
  To: John Mathew
  Cc: linux-doc, linux-kernel, corbet, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tsbogend, lukas.bulwahn, x86, linux-mips, tglx, willy,
	valentin.schneider



On Wed, 2 Sep 2020, John Mathew wrote:

> Add new sections to enable addition of new documentation on
> the scheduler. Existing documentation is moved under the related
> new sections. The sections are
>   - overview
>   - sched-features
>   - arch-specific.rst
>   - sched-debugging.rst
> 
> Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Signed-off-by: John Mathew <john.mathew@unikie.com>
> ---
>  Documentation/scheduler/arch-specific.rst   | 12 +++++++++
>  Documentation/scheduler/index.rst           | 30 +++++++++++----------
>  Documentation/scheduler/overview.rst        |  5 ++++
>  Documentation/scheduler/sched-debugging.rst | 14 ++++++++++
>  Documentation/scheduler/sched-features.rst  | 25 +++++++++++++++++
>  5 files changed, 72 insertions(+), 14 deletions(-)
>  create mode 100644 Documentation/scheduler/arch-specific.rst
>  create mode 100644 Documentation/scheduler/overview.rst
>  create mode 100644 Documentation/scheduler/sched-debugging.rst
>  create mode 100644 Documentation/scheduler/sched-features.rst
> 
> diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
> new file mode 100644
> index 000000000000..3e5af3a0695e
> --- /dev/null
> +++ b/Documentation/scheduler/arch-specific.rst
> @@ -0,0 +1,12 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +Architecture Specific Scheduler Implementation Differences
> +==========================================================

That is a terribly long title, how about Architecture Specifics?

I am wondering if this should be on the toplevel documentation structure
directly under Linux Scheduler.

I think the x86 and MIPS context switch documentation could be placed
under Process context switching in a section Architecture Specifics.

> +
> +.. class:: toc-title
> +
> +	   Table of contents
> +
> +.. toctree::
> +   :maxdepth: 2
> +
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 88900aabdbf7..6e88a070c503 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -1,24 +1,26 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
>  ===============
>  Linux Scheduler
>  ===============
>  
> -.. toctree::
> -    :maxdepth: 1
> +This documentation outlines the Linux kernel scheduler with its concepts,
> +details about the scheduler design and its data structures and architecture
> +specific implementation differences.
> +
>  
> +.. class:: toc-title
> +
> +    Table of contents
> +
> +.. toctree::
> +    :maxdepth: 2
>  
> -    completion
> -    sched-arch
> -    sched-bwc
> -    sched-deadline
> +    overview
>      sched-design-CFS
> -    sched-domains
> -    sched-capacity
> -    sched-energy
> -    sched-nice-design
> -    sched-rt-group
> -    sched-stats
> -
> -    text_files
> +    sched-features
> +    arch-specific
> +    sched-debugging
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
> new file mode 100644
> index 000000000000..a1d2d26629eb
> --- /dev/null
> +++ b/Documentation/scheduler/overview.rst
> @@ -0,0 +1,5 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +====================
> +Scheduler overview

s/Scheduler overview/Scheduler Overview/

for some more consistent capitalisation.

> +====================
> \ No newline at end of file

That could be the cause for the git am errors.

> diff --git a/Documentation/scheduler/sched-debugging.rst b/Documentation/scheduler/sched-debugging.rst
> new file mode 100644
> index 000000000000..e332069f99d6
> --- /dev/null
> +++ b/Documentation/scheduler/sched-debugging.rst
> @@ -0,0 +1,14 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +Scheduler Debugging Interface
> +==============================
> +
> +.. class:: toc-title
> +
> +	   Table of contents
> +
> +.. toctree::
> +   :maxdepth: 2
> +
> +   sched-stats
> +   text_files
> diff --git a/Documentation/scheduler/sched-features.rst b/Documentation/scheduler/sched-features.rst
> new file mode 100644
> index 000000000000..8eb90e86e489
> --- /dev/null
> +++ b/Documentation/scheduler/sched-features.rst
> @@ -0,0 +1,25 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +Scheduler Features
> +===================
> +
> +.. class:: toc-title
> +
> +	Table of contents
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   completion
> +   sched-arch
> +   sched-bwc
> +   sched-deadline
> +   sched-domains
> +   sched-capacity
> +   sched-energy
> +   sched-nice-design
> +   sched-rt-group
> +   sched-stats
> +
> +   text_files

I guess it is fine to place everything here for now, but more clean-up 
would probably move those to the appropriate documentation structure, 
right?


> +
> -- 
> 2.17.1
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview  documentation
  2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
@ 2020-09-09 11:23   ` Lukas Bulwahn
  0 siblings, 0 replies; 7+ messages in thread
From: Lukas Bulwahn @ 2020-09-09 11:23 UTC (permalink / raw)
  To: John Mathew
  Cc: linux-doc, linux-kernel, corbet, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tsbogend, lukas.bulwahn, x86, linux-mips, tglx, willy,
	valentin.schneider, Mostafa Chamanara, Oleg Tsymbal

[-- Attachment #1: Type: text/plain, Size: 23115 bytes --]



On Wed, 2 Sep 2020, John Mathew wrote:

> Add documentation for
>  -scheduler overview
>  -scheduler state transtion
>  -CFS overview
>  -scheduler data structs
> 
> Add rst for scheduler APIs and modify sched/core.c
> to add kernel-doc comments.
> 
> Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Co-developed-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
> Signed-off-by: Mostafa Chamanara <mostafa.chamanara@basemark.com>
> Co-developed-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
> Signed-off-by: Oleg Tsymbal <oleg.tsymbal@unikie.com>
> Signed-off-by: John Mathew <john.mathew@unikie.com>
> ---
>  Documentation/scheduler/cfs-overview.rst      |  59 ++++
>  Documentation/scheduler/index.rst             |   3 +
>  Documentation/scheduler/overview.rst          | 294 +++++++++++++++++-
>  .../scheduler/sched-data-structs.rst          | 176 +++++++++++
>  Documentation/scheduler/scheduler-api.rst     |  24 ++
>  kernel/sched/core.c                           |  21 +-
>  kernel/sched/sched.h                          |  63 +++-
>  7 files changed, 634 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/scheduler/cfs-overview.rst
>  create mode 100644 Documentation/scheduler/sched-data-structs.rst
>  create mode 100644 Documentation/scheduler/scheduler-api.rst
> 
> diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> new file mode 100644
> index 000000000000..1524c24da897
> --- /dev/null
> +++ b/Documentation/scheduler/cfs-overview.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +=============
> +CFS Overview
> +=============
> +
> +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> +Scheduler (CFS) implemented as a scheduling module. A brief overview of the
> +CFS design is provided in :doc:`sched-design-CFS`
> +
> +In addition there have been many improvements to the CFS, a few of which are

This can be shortened to:

In addition there have been many improvements to the CFS:

> +
> +Tracking available capacity
> +---------------------------

Capitalize title for local consistency with the sections below.

This below is not a full sentence:

> +Scale CPU capacity mechanism for CFS so it knows how much CPU capacity is left

The "it" here refers to what?

> +for its use after higher priority sched classes (RT, DL), IRQs and
> +'Thermal Pressure' have reduced the 'original' CPU capacity.

Why are putting thermal pressure and orginal in quotes?


> +Thermal pressure on a CPU means the maximum possible capacity is
> +unavailable due to thermal events.
> +
> +NUMA balancing
> +--------------

Capitalize.

Again, this below is not a full sentence:

> +Attempt to migrate tasks to the NUMA Node where the frequently accessed memory

why is Node capitalized here?

> +pages belongs. The scheduler gets information about memory placement through the

belongs? You mean is closest placed to, right?

s/through the paging mechanism/through paging/

> +paging mechanism. Scheduler periodically scans the virtual memory of the tasks

Maybe add: This works as follows:

1. The scheduler scans ...
s/Scheduler/The scheduler/

> +and make them inaccessible by changing the memory protection. The flag

s/make/makes/

> +MM_CP_PROT_NUMA indicates this purpose. When the task attempts to access

I think the detail on the flag is too much here for the overview.

2. When the task attempts to access the memory, this triggers a page fault 
and the scheduler reacts with recording some statistics on the use for the 
specific NUMA nodes.

3. On a periodic basis, the scheduler then migrates the task to the 
preffered node, i.e., the node that encountered the most memory faults.

> +the memory again a page fault occurs. Scheduler traps the fault and increments
> +the counters in a task specific array corresponding to the NUMA node id.
> +There array is divided in to four regions: faults_memory, faults_cpu,
> +faults_memory_buffer and faults_cpu_buffer, where faults_memory is the
> +exponential decaying average of faults on a per-node basis. The 'preferred
> +node' is found by looping through the array and finding the node with the
> +highest number of faults. Migration to the preferred node is done periodically
> +by either swapping two tasks tasks between their respective CPUs or
> +just moving a task to its preferred node CPU. It the migration or move fails
> +it will be retried.
> +
> +Energy Aware Scheduling
> +-----------------------
> +For asymmetric CPU capacity topologies, an Energy Model is used to figure out
> +which of the CPU candidates is the most energy-efficient. Capacity is the
> +amount of work which a CPU can perform at its highest frequency which is
> +calculated by the Per-Entity Load Tracking (PELT) mechanism.
> +EAS is described at :doc:`sched-energy`
> +
> +Capacity Aware Scheduling
> +--------------------------
> +Migrate a task to a CPU which meets its compute demand. In asymmetric CPU
> +capacity topologies CFS scheduler frequently updates the 'Misfit' status of

s/CFS scheduler/, the CFS scheduler/

> +tasks and migrate them to CPU's of higher capacity. Also during wakeups the

the a?

> +a CPU with sufficient capacity is found for executing the task. CAS is

I guess it is better to use active here, rather than passive. Who finds 
the CPU?

Do not use an abbreviation here.

> +described at :doc:`sched-capacity`
> +
> +
> +
> +
> +
> +
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 6e88a070c503..e3b1d4fc1604 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -17,10 +17,13 @@ specific implementation differences.
>      :maxdepth: 2
>  
>      overview
> +    sched-data-structs
> +    cfs-overview
>      sched-design-CFS
>      sched-features
>      arch-specific
>      sched-debugging
> +    scheduler-api
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
> index a1d2d26629eb..f2fb8f419919 100644
> --- a/Documentation/scheduler/overview.rst
> +++ b/Documentation/scheduler/overview.rst
> @@ -2,4 +2,296 @@
>  
>  ====================
>  Scheduler overview
> -====================
> \ No newline at end of file
> +====================
> +
> +Linux kernel implements priority-based scheduling. More than one process are

s/Linux kernel/The Linux kernel/

> +allowed to run at any given time and each process is allowed to run as if it
> +were the only process on the system. The process scheduler coordinates which
> +process runs when. In that context, it has the following tasks:
> +
> +  - share CPUs equally among all currently running processes.

equally? That is not true, right?

> +  - pick appropriate process to run next if required, considering scheduling
> +    class/policy and process priorities.
> +  - balance processes between multiple CPUs in SMP systems.
> +
> +The scheduler attempts to be responsive for I/O bound processes and efficient
> +for CPU bound processes. The scheduler uses different scheduling policies
> +for real time and normal processes based on their respective policy
> +enumerations. Scheduler adds support for each policy through scheduling class

... for each policy through a scheduling class and a dedicated 
implementation for each scheduling class.

> +implementations for each. The five scheduling classes which scheduler provides
> +are:

This can be shortened to "The five scheduling classes are:"

> +
> +  - stop_sched_class:
> +    It is a per-cpu maximum priority CPU monopolization mechanism. It is
> +    exposed as a SCHED_FIFO task ('migration/X') with static priority of 99
> +    in the user space. This is done to make it compatible with user space and
> +    thus to avoid growing the ABI. It is used by one CPU to stop another
> +    in order to run a specific function, so it is only available on SMP
> +    systems. This class is used by the scheduler for task migration between
> +    CPUs.
> +
> +  - dl_sched_class:
> +    Implements the SCHED_DEADLINE scheduling policy. It has static priority

Here, you use "scheduling policy".

> +    of -1 in kernel space. This policy schedules each task according to the
> +    task's deadline. The task with the earliest deadline will be served first.
> +
> +  - rt_sched_class:
> +    Implements the SCHED_RR and SCHED_FIFO policies. Real time static

... and here only "policy". Be consistent.

> +    priorities range from 1(low)..99 in the user space. (priority is inverted
> +    in kernel space). It is the only scheduling class that makes use of the
> +    static priority of the task. SCHED_FIFO is a simple scheduling algorithm
> +    without time slicing. A SCHED_FIFO thread runs until either it is blocked
> +    by an I/O request, it is preempted by a higher priority thread, or it
> +    calls sched_yield(). SCHED_RR is a simple enhancement of SCHED_FIFO where
> +    a thread is allowed to run only for a maximum time quantum.
> +
> +  - fair_sched_class:
> +    Implements the SCHED_NORMAL SCHED_BATCH and SCHED_IDLE  policies. Static

double spacing before policies.

> +    priority is always 0 in the user space. A dynamic priority based on
> +    'nice' value is used to schedule these tasks. This priority increases each
> +    time the the task  is scheduled to run but denied to run by scheduler.

... the the... and then spacing.

> +    This ensures fair scheduling between these tasks. Nice value is an
> +    attribute which can be set by the user to influence scheduler to favour
> +    a particular task. SCHED_BATCH is similar to SCHED_NORMAL with the
> +    difference that the policy causes the scheduler to assume that the task
> +    is CPU-intensive. SCHED_IDLE policy also has static priority 0. Nice
> +    value has no effect on this policy. Weight mapping is not done, instead
> +    weight is set at a constant minimal weight WEIGHT_IDLEPRIO. Used to
> +    run tasks at extremely low priority.
> +
> +  - idle_sched_class:
> +    Priority for idle task is irrelevant. This class is not related to
> +    SCHED_IDLE policy. Idle tasks run when there are no other runnable tasks
> +    on a CPU. The execute the idle loop which is responsible to put a CPU
> +    in one of its idle states.
> +

This last sentence above is totally broken; I cannot parse it.

> +
> +Process Management
> +==================
> +
> +Each process in the system is represented by struct task_struct. When a
> +process/thread is created, the kernel allocates a new task_struct for it.

Each process or each thread?

> +The kernel then stores this task_struct in an RCU list. Macro next_task()
> +allows a process to obtain its next task and for_each_process() macro enables
> +traversal of the list.
> +

This is too much detail at this point of the overview.

> +Frequently used fields of the task struct are:
> +
> + - state: The running state of the task. The possible states are:
> +
> +    - TASK_RUNNING: The task is currently running or in a run queue waiting
> +      to run.
> +    - TASK_INTERRUPTIBLE: The task is sleeping waiting for some event to occur.
> +      This task can be interrupted by signals. On waking up the task transitions
> +      to TASK_RUNNING.
> +    - TASK_UNINTERRUPTIBLE: Similar to TASK_INTERRUPTIBLE but does not wake
> +      up on signals. Needs an explicit wake-up call to be woken up. Contributes
> +      to loadavg.
> +    - __TASK_TRACED: Task is being traced by another task like a debugger.
> +    - __TASK_STOPPED: Task execution has stopped and not eligible to run.
> +      SIGSTOP, SIGTSTP etc causes this state.  The task can be continued by
> +      the signal SIGCONT.
> +    - TASK_PARKED: State to support kthread parking/unparking.
> +    - TASK_DEAD: If a task dies, then it sets TASK_DEAD in tsk->state and calls
> +      schedule one last time. The task will be never ran again.
> +    - TASK_WAKEKILL: It works like TASK_UNINTERRUPTIBLE with the bonus that it
> +      can respond to fatal signals.
> +    - TASK_WAKING: To handle concurrent waking of the same task for SMP.
> +      Indicates that someone is already waking the task.
> +    - TASK_NOLOAD: To be used along with TASK_UNINTERRUPTIBLE to indicate
> +      an idle task which does not contribute to loadavg.
> +    - TASK_NEW: Set during fork(), to guarantee that no one will run the task,
> +      a signal or any other wake event cannot wake it up and insert it on
> +      the runqueue.
> +
> + - exit_state : The exiting state of the task. The possible states are:
> +
> +    - EXIT_ZOMBIE: The task is terminated and waiting for parent to collect
> +      the exit information of the task.
> +    - EXIT_DEAD: After collecting the exit information the task is put to
> +      this state and removed from the system.
> +
> + - static_prio: Used by the fair scheduling class to encode the nice level.
> +   It does not have any effect on the SCHED_DEADLINE, SCHED_FIFO or SCHED_RR
> +   policy tasks.
> +
> + - prio: The value of this field is used to:
> +
> +    - distinguish scheduling classes.
> +    - in the RR/FIFO static priority scheduler.
> +
> + - normal_prio: Expected priority of a task. The value of static_prio
> +   and normal_prio are the same for non-real-time processes. For real time
> +   processes value of prio is used.
> +
> + - rt_priority: Field used to set priority of real time tasks. Not used by the
> +   rt_sched_class.
> +
> + - sched_class: Pointer to sched_class structure of the policy that the task
> +   belongs to.
> +
> + - sched_entity: Pointer to sched_entity CFS structure.
> +
> + - policy: scheduling policy of the task. See above.
> +
> + - nr_cpus_allowed: Hamming weight of the bitmask retrieved from cpumask pointer.
> +
> +New tasks are created using the fork() system call which is described
> +at manpage `FORK(2)` or the clone system call described at manpage `CLONE(2)`.

Is there a better way to refer to a manpage here? Maybe an URL?

> +Users can create threads within a process to achieve parallelism. Threads
> +share address space, open files and other resources of the process. Threads
> +are created like normal tasks with their unique task_struct, but clone()
> +is provided with flags that enable the sharing of resources such as address
> +space ::
> +
> +	clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
> +
> +The scheduler schedules task_structs so from scheduler perspective there is
> +no difference between threads and processes. Threads are created using
> +the system call pthread_create described at its manpage `PTHREAD_CREATE(3)`
> +POSIX threads creation is described at its manpage `PTHREADS(7)`
> +
> +The Scheduler Entry Point
> +=========================
> +
> +The main scheduler entry point is an architecture independent schedule()
> +function defined in kernel/sched/core.c. Its objective is to find a process in
> +the runqueue list and then assign the CPU to it. It is invoked, directly
> +or in a lazy (deferred) way from many different places in the kernel. A lazy
> +invocation does not call the function by its name, but gives the kernel a
> +hint by setting a flag TIF_NEED_RESCHED. The flag is a message to the kernel

s/a flag/the flag/
> +that the scheduler should be invoked as soon as possible because another
> +process deserves to run. The flag should not be modified directly.

Well, then let us know how to do that correctly.

> +
> +Following are some places that notify the kernel to schedule which can be
> +classified based on the type of operations:
> +

I cannot follow the jump from the previous explanation to this list now. 
You lost me here.

> +  - Blocking operations: Suspends the current task and directly call into
> +    the scheduler to find something else to do. Some blocking operations are:
> +
> +      - mutex_lock()
> +      - wait_event()
> +      - do_exit()
> +      - preempt_schedule_irq()
> +
> +  - Co-operative or voluntary preemptions: Allows another task to run at that
> +    point subject to preemption model. Voluntary preemption model can be
> +    set through the kernel config option: CONFIG_PREEMPT_VOLUNTARY. The
> +    operations are:
> +
> +      - cond_resched()
> +      - cond_resched_lock()
> +      - yield()
> +      - preempt_enable()
> +
> +  - Involuntary preemption: Marks TIF_NEED_RESCHED and wait for action
> +    depending on preemption model. Involuntary preemption operations are:
> +
> +      - scheduler_tick()
> +      - wake_up_process()
> +
> +Calling functions mentioned above leads to a call to __schedule(). Note
> +that preemption must be disabled before it is called and enabled after
> +the call using preempt_disable and preempt_enable functions family.
> +
> +
> +The steps during invocation are:
> +--------------------------------

I would not put the "half" sentence as subsection.

> +1. Disable preemption to avoid another task preempting the scheduling
> +   thread itself.
> +2. Retrieve the runqueue of current processor and its lock is obtained to
> +   allow only one thread to modify the runqueue at a time.

Okay, 1. and 2. are written in imperative.

> +3. The state of the previously executed task when the schedule()
> +   was called is examined. If it is not runnable and has not been
> +   preempted in kernel mode, it is removed from the runqueue. If the
> +   previous task has non-blocked pending signals, its state is set to
> +   TASK_RUNNING and left in the runqueue.

Now, passive?

> +4. Scheduler classes are iterated and the corresponding class hook to
> +   pick the next suitable task to be scheduled on the CPU is called.
> +   Since most tasks are handled by the sched_fair class, a shortcut to this
> +   class is implemented in the beginning of the function.

Now passive.

> +5. TIF_NEED_RESCHED and architecture specific need_resched flags are cleared.

Now passive, again.

> +6. If the scheduler class picks a different task from what was running
> +   before, a context switch is performed by calling context_switch().
> +   Internally, context_switch() switches to the new task's memory map and
> +   swaps the register state and stack. If scheduler class picked the same
> +   task as the previous task, no task switch is performed and the current
> +   task keeps running.

Passive.

> +7. Balance callback list is processed. Each scheduling class can migrate tasks
> +   between CPUs to balance load. These load balancing operations are queued
> +   on a Balance callback list which get executed when balance_callback() is
> +   called.

Passive.

> +8. The runqueue is unlocked and preemption is re-enabled. In case
> +   preemption was requested during the time in which it was disabled,
> +   schedule() is run again right away.
> +

Passive.

It should be consistent and I think writing it imperative is MUCH better, 
e.g., Process balance callback list, Unlock runqueue, etc.

> +Scheduler State Transition
> +==========================
> +
> +A very high level scheduler state transition flow with a few states can
> +be depicted as follows. ::
> +
> +                                       *
> +                                       |
> +                                       | task
> +                                       | forks
> +                                       v
> +                        +------------------------------+
> +                        |           TASK_NEW           |
> +                        |        (Ready to run)        |
> +                        +------------------------------+
> +                                       |
> +                                       |
> +                                       v
> +                     +------------------------------------+
> +                     |            TASK_RUNNING            |
> +   +---------------> |           (Ready to run)           | <--+
> +   |                 +------------------------------------+    |
> +   |                   |                                       |
> +   |                   | schedule() calls context_switch()     | task is preempted
> +   |                   v                                       |
> +   |                 +------------------------------------+    |
> +   |                 |            TASK_RUNNING            |    |
> +   |                 |             (Running)              | ---+
> +   | event occurred  +------------------------------------+
> +   |                   |
> +   |                   | task needs to wait for event
> +   |                   v
> +   |                 +------------------------------------+
> +   |                 |         TASK_INTERRUPTIBLE         |
> +   |                 |        TASK_UNINTERRUPTIBLE        |
> +   +-----------------|           TASK_WAKEKILL            |
> +                     +------------------------------------+
> +                                       |
> +                                       | task exits via do_exit()
> +                                       v
> +                        +------------------------------+
> +                        |          TASK_DEAD           |
> +                        |         EXIT_ZOMBIE          |
> +                        +------------------------------+
> +
> +

I hope we can refine this high-level description to Daniel's model.

> +Scheduler provides trace events tracing all major events of the scheduler.
> +The trace events are defined in ::
> +
> +  include/trace/events/sched.h
> +

John, can you explain the trace events that would occur for the 
transitions above in your high-level state transition?

> +Using these trace events it is possible to model the scheduler state transition
> +in an automata model. The following journal paper discusses such modeling:
> +
> +Daniel B. de Oliveira, Rômulo S. de Oliveira, Tommaso Cucinotta, **A thread
> +synchronization model for the PREEMPT_RT Linux kernel**, *Journal of Systems
> +Architecture*, Volume 107, 2020, 101729, ISSN 1383-7621,
> +https://doi.org/10.1016/j.sysarc.2020.101729.
> +
> +To model the scheduler efficiently the system was divided in to generators
> +and specifications. Some of the generators used were "need_resched",
> +"sleepable" and "runnable", "thread_context" and "scheduling context".
> +The specifications are the necessary and sufficient conditions to call
> +the scheduler. New trace events were added to specify the generators
> +and specifications. In case a kernel event referred to more than one
> +event, extra fields of the kernel event was used to distinguish between
> +automation events. The final model was generated from parallel composition
> +of all generators and specifications which composed of 34 events,
> +12 generators and 33 specifications. This resulted in 9017 states, and
> +20103 transitions.

That is how far I got with my first review round.

It reads nicely so far; I think a bit of stylistic improvement is needed 
but you did not make me tired within few minutes (so it is readable) and I 
think I learned something about the scheduler :)

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-09-09 11:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-02 16:26 [RFC PATCH v8 0/3] Add scheduler overview documentation John Mathew
2020-09-02 16:26 ` [RFC PATCH v8 1/3] docs: scheduler: Restructure scheduler documentation John Mathew
2020-09-09 10:28   ` Lukas Bulwahn
2020-09-02 16:26 ` [RFC PATCH v8 2/3] docs: scheduler: Add scheduler overview documentation John Mathew
2020-09-09 11:23   ` Lukas Bulwahn
2020-09-02 16:26 ` [RFC PATCH v8 3/3] docs: scheduler: Add introduction to scheduler context-switch John Mathew
2020-09-09  8:37 ` [RFC PATCH v8 0/3] Add scheduler overview documentation Lukas Bulwahn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).