* [PATCH] Documentation: Add usecases, design and interface for core scheduling
@ 2021-05-26 17:56 Joel Fernandes (Google)
  2021-05-26 21:43 ` Chris Hyser
  2021-05-26 22:52 ` Jonathan Corbet
  0 siblings, 2 replies; 5+ messages in thread
From: Joel Fernandes (Google) @ 2021-05-26 17:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Chris Hyser, Josh Don, mingo, peterz, Jonathan Corbet, linux-doc

Now that core scheduling is merged, update the documentation.

Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Co-developed-by: Josh Don <joshdon@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Cc: mingo@kernel.org
Cc: peterz@infradead.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 211 ++++++++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 212 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..585edf16183b
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,211 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core, or may prefer different cores if they
+do share hardware resource needs). This document only describes the security
+usecase.
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different Hyper
+Threads of the same core. MDS and L1TF are examples of such attacks.  The only
+full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
+scheduling is a scheduler feature that can mitigate some (not all) cross-HT
+attacks. It allows HT to be turned on safely by ensuring that tasks in a
+user-designated trusted group can share a core. This increase in core sharing
+can also improve performance; however, it is not guaranteed that performance
+will always improve, though that is seen to be the case with a number of real
+world workloads. In theory, core scheduling aims to perform at least as good as
+when Hyper Threading is disabled. In practice, this is mostly the case though
+not always: as synchronizing scheduling decisions across 2 or more CPUs in a
+core involves additional overhead - especially when the system is lightly
+loaded. When ``total_threads <= N_CPUS/2``, the extra overhead may cause core
+scheduling to perform more poorly compared to SMT-disabled, where N_CPUS is the
+total number of CPUs. Always measure the performance of your workloads.
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that can be co-scheduled
+on the same core. The core scheduler uses this information to make sure that
+tasks that are not in the same group never run simultaneously on a core, while
+doing its best to satisfy the system's scheduling requirements.
+
+Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
+This interface provides support for the creation of core scheduling groups, as
+well as admission and removal of tasks from created groups.
+
+::
+
+    #include <sys/prctl.h>
+
+    int prctl(int option, unsigned long arg2, unsigned long arg3,
+            unsigned long arg4, unsigned long arg5);
+
+option:
+    ``PR_SCHED_CORE``
+
+arg2:
+    Command for operation, must be one of:
+    - ``PR_SCHED_CORE_GET``          0  -- get core_sched cookie of ``pid``.
+    - ``PR_SCHED_CORE_CREATE``       1  -- create a new unique cookie for ``pid``.
+    - ``PR_SCHED_CORE_SHARE_TO``     2  -- push core_sched cookie to ``pid``.
+    - ``PR_SCHED_CORE_SHARE_FROM``   3  -- pull core_sched cookie from ``pid``.
+
+arg3:
+    ``pid`` of the task for which the operation applies.
+
+arg4:
+    ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
+    For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
+    will be performed for all tasks in the task group of ``pid``.
+
+arg5:
+    userspace pointer to an unsigned long for storing the cookie returned by
+    ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.
+
+Cookie Transferral
+~~~~~~~~~~~~~~~~~~
+Transferring a cookie between the current and other tasks is possible using
+PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
+specified task or to share a cookie with a task. In combination this allows a
+simple helper program to pull a cookie from a task in an existing core
+scheduling group and share it with already running tasks.
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, as it trusts everything and everything
+trusts it.
+
+During a schedule() event on any sibling of a core, the highest priority task on
+the sibling's core is picked and assigned to the sibling calling schedule(), if
+the sibling has the task enqueued. For the rest of the siblings in the core,
+the highest priority task with the same cookie is selected if there is one
+runnable in their individual run queues. If a task with the same cookie is not
+available, the idle task is selected. The idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+siblings for whom a new task was selected. Siblings on receiving the IPI will
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state. I.e., it may
+have tasks on its own runqueue to run, however it will still have to run idle.
+More on this in the next section.
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core.  However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task.  If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+          HT1 (attack)            HT2 (victim)
+   A      idle -> user space      user space -> idle
+   B      idle -> user space      guest -> idle
+   C      idle -> guest           user space -> idle
+   D      idle -> guest           guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or VMEXIT
+in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them a tag that is the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. The stunning of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+Limitations of core-scheduling
+------------------------------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks
+run concurrently, or the kernel could be running concurrently with a task not
+trusted by the kernel.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware
+delays in receiving the IPI on some architectures (on x86, this has not been
+observed). This may cause an attacker task to start running on a CPU before its
+siblings receive the IPI. Even though the cache is flushed on entry to user
+mode, victim tasks on siblings may populate the cache and micro-architectural
+buffers after the attacker starts to run, and this could leak data.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then guest-to-guest attacks would go away.
+Alternatively, a system admin policy could consider guest-to-guest attacks as
+a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+3. Protecting the kernel (IRQ, syscall, VMEXIT)
+###############################################
+Unfortunately, core scheduling does not protect kernel contexts running on
+sibling hyperthreads from one another. Prototypes of mitigations have been posted
+to LKML to solve this, but it is debatable whether such windows are practically
+exploitable, and whether the performance overhead of the prototypes is worth
+it (not to mention, the added code complexity).
+
+Other Use cases
+---------------
+The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
+  that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+  together could also be realized using core scheduling. One example is vCPUs of
+  a VM.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..f12cda55538b 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
    tsx_async_abort
    multihit.rst
    special-register-buffer-data-sampling.rst
+   core-scheduling.rst
-- 
2.31.1.818.g46aad6cb9e-goog



* Re: [PATCH] Documentation: Add usecases, design and interface for core scheduling
  2021-05-26 17:56 [PATCH] Documentation: Add usecases, design and interface for core scheduling Joel Fernandes (Google)
@ 2021-05-26 21:43 ` Chris Hyser
  2021-06-01 20:24   ` Joel Fernandes
  2021-05-26 22:52 ` Jonathan Corbet
  1 sibling, 1 reply; 5+ messages in thread
From: Chris Hyser @ 2021-05-26 21:43 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel
  Cc: Josh Don, mingo, peterz, Jonathan Corbet, linux-doc

On 5/26/21 1:56 PM, Joel Fernandes (Google) wrote:
> Now that core scheduling is merged, update the documentation.
> 
> Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
> Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
> Co-developed-by: Josh Don <joshdon@google.com>
> Signed-off-by: Josh Don <joshdon@google.com>
> Cc: mingo@kernel.org
> Cc: peterz@infradead.org
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> ---
>   .../admin-guide/hw-vuln/core-scheduling.rst   | 211 ++++++++++++++++++
>   Documentation/admin-guide/hw-vuln/index.rst   |   1 +
>   2 files changed, 212 insertions(+)
>   create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
> 
> diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> new file mode 100644
> index 000000000000..585edf16183b
> --- /dev/null
> +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> @@ -0,0 +1,211 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Core Scheduling
> +***************
> +Core scheduling support allows userspace to define groups of tasks that can
> +share a core. These groups can be specified either for security usecases (one
> +group of tasks don't trust another), or for performance usecases (some
> +workloads may benefit from running on the same core as they don't need the same
> +hardware resources of the shared core, or may prefer different cores if they
> +do share hardware resource needs). This document only describes the security
> +usecase.
> +
> +Security usecase
> +----------------
> +A cross-HT attack involves the attacker and victim running on different Hyper
> +Threads of the same core. MDS and L1TF are examples of such attacks.  The only
> +full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
> +scheduling is a scheduler feature that can mitigate some (not all) cross-HT
> +attacks. It allows HT to be turned on safely by ensuring that tasks in a
> +user-designated trusted group can share a core. This increase in core sharing
> +can also improve performance, however it is not guaranteed that performance
> +will always improve, though that is seen to be the case with a number of real
> +world workloads. In theory, core scheduling aims to perform at least as good as
> +when Hyper Threading is disabled. In practice, this is mostly the case though
> +not always: as synchronizing scheduling decisions across 2 or more CPUs in a
> +core involves additional overhead - especially when the system is lightly
> +loaded. When ``total_threads <= N_CPUS/2``, the extra overhead may cause core
> +scheduling to perform more poorly compared to SMT-disabled, where N_CPUS is the
> +total number of CPUs. Please measure the performance of your workloads always.
> +
> +Usage
> +-----
> +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
> +Using this feature, userspace defines groups of tasks that can be co-scheduled
> +on the same core. The core scheduler uses this information to make sure that
> +tasks that are not in the same group never run simultaneously on a core, while
> +doing its best to satisfy the system's scheduling requirements.
> +
> +Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
> +This interface provides support for the creation of core scheduling groups, as
> +well as admission and removal of tasks from created groups.
> +
> +::
> +
> +    #include <sys/prctl.h>
> +
> +    int prctl(int option, unsigned long arg2, unsigned long arg3,
> +            unsigned long arg4, unsigned long arg5);
> +
> +option:
> +    ``PR_SCHED_CORE``
> +
> +arg2:
> +    Command for operation, must be one off:
> +    - ``PR_SCHED_CORE_GET              0  -- get core_sched cookie of ``pid``.
> +    - ``PR_SCHED_CORE_CREATE           1  -- create a new unique cookie for ``pid``.
> +    - ``PR_SCHED_CORE_SHARE_TO         2  -- push core_sched cookie to ``pid``.
> +    - ``PR_SCHED_CORE_SHARE_FROM       3  -- pull core_sched cookie from ``pid``.
> +
> +arg3:
> +    ``pid`` of the task for which the operation applies.
> +
> +arg4:
> +    ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
> +    For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
> +    will be performed for all tasks in the task group of ``pid``.
> +
> +arg5:
> +    userspace pointer to an unsigned long for storing the cookie returned by
> +    ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

Thanks Joel.

In terms of using the prctl() interface to achieve what was once done with cgroups, we might want to add some text 
somewhere in here along the lines of say:

-----------

The simplest way to build hierarchies of threads/processes which share a cookie and thus a core is to rely on the fact 
that the core-sched cookie is inherited across forks/clones and execs, thus setting a cookie for the 'initial' 
script/executable/daemon will place every spawned child in the same core-sched group. The prctl() API is useful for 
verification or making more specific or elaborate changes. Clearing a cookie can be done with PR_SCHED_CORE_SHARE_* 
involving a task w/o a cookie presumably owned by root or other secure user.
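
For concreteness, an untested sketch of such a helper is below. It assumes the PR_SCHED_CORE* values from the uapi
<linux/prctl.h> that came in with the feature (with fallback defines for older headers), and passes 0 (PIDTYPE_PID) as
the pid_type argument so each operation applies to just that one task; error handling is minimal:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE             62
#define PR_SCHED_CORE_SHARE_TO     2
#define PR_SCHED_CORE_SHARE_FROM   3
#endif

int main(int argc, char **argv)
{
        pid_t src, dst;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <src_pid> <dst_pid>\n", argv[0]);
                return 1;
        }
        src = atoi(argv[1]);    /* task already in the core-sched group */
        dst = atoi(argv[2]);    /* running task to pull into that group */

        /* inherit src's cookie into this helper ... */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, src, 0, 0)) {
                perror("PR_SCHED_CORE_SHARE_FROM");
                return 1;
        }
        /* ... then push the inherited cookie onto dst */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, dst, 0, 0)) {
                perror("PR_SCHED_CORE_SHARE_TO");
                return 1;
        }
        return 0;
}

Clearing would then just be the same SHARE_FROM/SHARE_TO dance with a cookie-less task as the source.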



> +
> +Cookie Transferral
> +~~~~~~~~~~~~~~~~~~
> +Transferring a cookie between the current and other tasks is possible using
> +PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
> +specified task or a share a cookie with a task. In combination this allows a
> +simple helper program to pull a cookie from a task in an existing core
> +scheduling group and share it with already running tasks.
> +
> +Design/Implementation
> +---------------------
> +Each task that is tagged is assigned a cookie internally in the kernel. As
> +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
> +each other and share a core.
> +
> +The basic idea is that, every schedule event tries to select tasks for all the
> +siblings of a core such that all the selected tasks running on a core are
> +trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
> +The idle task is considered special, as it trusts everything and everything
> +trusts it.
> +
> +During a schedule() event on any sibling of a core, the highest priority task on
> +the sibling's core is picked and assigned to the sibling calling schedule(), if
> +the sibling has the task enqueued. For rest of the siblings in the core,
> +highest priority task with the same cookie is selected if there is one runnable
> +in their individual run queues. If a task with same cookie is not available,
> +the idle task is selected.  Idle task is globally trusted.
> +
> +Once a task has been selected for all the siblings in the core, an IPI is sent to
> +siblings for whom a new task was selected. Siblings on receiving the IPI will
> +switch to the new task immediately. If an idle task is selected for a sibling,
> +then the sibling is considered to be in a `forced idle` state. I.e., it may
> +have tasks on its on runqueue to run, however it will still have to run idle.
> +More on this in the next section.
> +
> +Forced-idling of tasks
> +----------------------
> +The scheduler tries its best to find tasks that trust each other such that all
> +tasks selected to be scheduled are of the highest priority in a core.  However,
> +it is possible that some runqueues had tasks that were incompatible with the
> +highest priority ones in the core. Favoring security over fairness, one or more
> +siblings could be forced to select a lower priority task if the highest
> +priority task is not trusted with respect to the core wide highest priority
> +task.  If a sibling does not have a trusted task to run, it will be forced idle
> +by the scheduler (idle thread is scheduled to run).
> +
> +When the highest priority task is selected to run, a reschedule-IPI is sent to
> +the sibling to force it into idle. This results in 4 cases which need to be
> +considered depending on whether a VM or a regular usermode process was running
> +on either HT::
> +
> +          HT1 (attack)            HT2 (victim)
> +   A      idle -> user space      user space -> idle
> +   B      idle -> user space      guest -> idle
> +   C      idle -> guest           user space -> idle
> +   D      idle -> guest           guest -> idle
> +
> +Note that for better performance, we do not wait for the destination CPU
> +(victim) to enter idle mode. This is because the sending of the IPI would bring
> +the destination CPU immediately into kernel mode from user space, or VMEXIT
> +in the case of guests. At best, this would only leak some scheduler metadata
> +which may not be worth protecting. It is also possible that the IPI is received
> +too late on some architectures, but this has not been observed in the case of
> +x86.
> +
> +Trust model
> +-----------
> +Core scheduling maintains trust relationships amongst groups of tasks by
> +assigning them a tag that is the same cookie value.
> +When a system with core scheduling boots, all tasks are considered to trust
> +each other. This is because the core scheduler does not have information about
> +trust relationships until userspace uses the above mentioned interfaces, to
> +communicate them. In other words, all tasks have a default cookie value of 0.
> +and are considered system-wide trusted. The stunning of siblings running
> +cookie-0 tasks is also avoided.
> +
> +Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
> +within such groups are considered to trust each other, but do not trust those
> +outside. Tasks outside the group also don't trust tasks within.
> +
> +Limitations of core-scheduling
> +------------------------------
> +Core scheduling tries to guarantee that only trusted tasks run concurrently on a
> +core. But there could be small window of time during which untrusted tasks run
> +concurrently or kernel could be running concurrently with a task not trusted by
> +kernel.
> +
> +1. IPI processing delays
> +########################
> +Core scheduling selects only trusted tasks to run together. IPI is used to notify
> +the siblings to switch to the new task. But there could be hardware delays in
> +receiving of the IPI on some arch (on x86, this has not been observed). This may
> +cause an attacker task to start running on a CPU before its siblings receive the
> +IPI. Even though cache is flushed on entry to user mode, victim tasks on siblings
> +may populate data in the cache and micro architectural buffers after the attacker
> +starts to run and this is a possibility for data leak.
> +
> +Open cross-HT issues that core scheduling does not solve
> +--------------------------------------------------------
> +1. For MDS
> +##########
> +Core scheduling cannot protect against MDS attacks between an HT running in
> +user mode and another running in kernel mode. Even though both HTs run tasks
> +which trust each other, kernel memory is still considered untrusted. Such
> +attacks are possible for any combination of sibling CPU modes (host or guest mode).
> +
> +2. For L1TF
> +###########
> +Core scheduling cannot protect against an L1TF guest attacker exploiting a
> +guest or host victim. This is because the guest attacker can craft invalid
> +PTEs which are not inverted due to a vulnerable guest kernel. The only
> +solution is to disable EPT (Extended Page Tables).
> +
> +For both MDS and L1TF, if the guest vCPU is configured to not trust each
> +other (by tagging separately), then the guest to guest attacks would go away.
> +Or it could be a system admin policy which considers guest to guest attacks as
> +a guest problem.
> +
> +Another approach to resolve these would be to make every untrusted task on the
> +system to not trust every other untrusted task. While this could reduce
> +parallelism of the untrusted tasks, it would still solve the above issues while
> +allowing system processes (trusted tasks) to share a core.
> +
> +3. Protecting the kernel (IRQ, syscall, VMEXIT)
> +###############################################
> +Unfortunately, core scheduling does not protect kernel contexts running on
> +sibling hyperthreads from one another. Prototypes of mitigations have been posted
> +to LKML to solve this, but it is debatable whether such windows are practically
> +exploitable, and whether the performance overhead of the prototypes are worth
> +it (not to mention, the added code complexity).
> +
> +Other Use cases
> +---------------
> +The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
> +with SMT enabled. There are other use cases where this feature could be used:
> +
> +- Isolating tasks that needs a whole core: Examples include realtime tasks, tasks
> +  that uses SIMD instructions etc.
> +- Gang scheduling: Requirements for a group of tasks that needs to be scheduled
> +  together could also be realized using core scheduling. One example is vCPUs of
> +  a VM.
> diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
> index ca4dbdd9016d..f12cda55538b 100644
> --- a/Documentation/admin-guide/hw-vuln/index.rst
> +++ b/Documentation/admin-guide/hw-vuln/index.rst
> @@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
>      tsx_async_abort
>      multihit.rst
>      special-register-buffer-data-sampling.rst
> +   core-scheduling.rst
> 


* Re: [PATCH] Documentation: Add usecases, design and interface for core scheduling
  2021-05-26 17:56 [PATCH] Documentation: Add usecases, design and interface for core scheduling Joel Fernandes (Google)
  2021-05-26 21:43 ` Chris Hyser
@ 2021-05-26 22:52 ` Jonathan Corbet
  2021-06-01 20:46   ` Joel Fernandes
  1 sibling, 1 reply; 5+ messages in thread
From: Jonathan Corbet @ 2021-05-26 22:52 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel
  Cc: Joel Fernandes (Google), Chris Hyser, Josh Don, mingo, peterz, linux-doc

"Joel Fernandes (Google)" <joel@joelfernandes.org> writes:

> Now that core scheduling is merged, update the documentation.

Yay documentation!

A couple of nits...

> Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
> Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
> Co-developed-by: Josh Don <joshdon@google.com>
> Signed-off-by: Josh Don <joshdon@google.com>
> Cc: mingo@kernel.org
> Cc: peterz@infradead.org
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> ---
>  .../admin-guide/hw-vuln/core-scheduling.rst   | 211 ++++++++++++++++++
>  Documentation/admin-guide/hw-vuln/index.rst   |   1 +

As I understand it, there are use cases for core scheduling that go well
beyond dancing around hardware vulnerabilities.  So do we really want to
bury the documentation for this feature there?  To me it seems like the
user-space API manual might be a better place, but perhaps I'm missing
something.

>  2 files changed, 212 insertions(+)
>  create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
>
> diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> new file mode 100644
> index 000000000000..585edf16183b
> --- /dev/null
> +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> @@ -0,0 +1,211 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Core Scheduling
> +***************

We have a nicely laid-out set of conventions for subsection headings,
described in Documentation/doc-guide/sphinx.rst; it would be nice if
this document would conform to that.

> +Core scheduling support allows userspace to define groups of tasks that can
> +share a core. These groups can be specified either for security usecases (one
> +group of tasks don't trust another), or for performance usecases (some
> +workloads may benefit from running on the same core as they don't need the same
> +hardware resources of the shared core, or may prefer different cores if they
> +do share hardware resource needs). This document only describes the security
> +usecase.
> +
> +Security usecase
> +----------------
> +A cross-HT attack involves the attacker and victim running on different Hyper
> +Threads of the same core. MDS and L1TF are examples of such attacks.  The only
> +full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
> +scheduling is a scheduler feature that can mitigate some (not all) cross-HT
> +attacks. It allows HT to be turned on safely by ensuring that tasks in a

by ensuring that *only* tasks in a trusted group ... right?

> +user-designated trusted group can share a core. This increase in core sharing
> +can also improve performance, however it is not guaranteed that performance
> +will always improve, though that is seen to be the case with a number of real
> +world workloads. In theory, core scheduling aims to perform at least as good as

s/good/well/

> +when Hyper Threading is disabled. In practice, this is mostly the case though
> +not always: as synchronizing scheduling decisions across 2 or more CPUs in a
> +core involves additional overhead - especially when the system is lightly
> +loaded. When ``total_threads <= N_CPUS/2``, the extra overhead may cause core
> +scheduling to perform more poorly compared to SMT-disabled, where N_CPUS is the
> +total number of CPUs. Please measure the performance of your workloads always.
> +
> +Usage
> +-----
> +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.

The use of ``literal text`` markup isn't necessary here, and is known to
irritate some people.

> +Using this feature, userspace defines groups of tasks that can be co-scheduled
> +on the same core. The core scheduler uses this information to make sure that
> +tasks that are not in the same group never run simultaneously on a core, while
> +doing its best to satisfy the system's scheduling requirements.
> +
> +Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
> +This interface provides support for the creation of core scheduling groups, as
> +well as admission and removal of tasks from created groups.
> +
> +::

I'd just say "from created groups::" and leave off the separate "::" line.

> +
> +    #include <sys/prctl.h>
> +
> +    int prctl(int option, unsigned long arg2, unsigned long arg3,
> +            unsigned long arg4, unsigned long arg5);
> +
> +option:
> +    ``PR_SCHED_CORE``

Did you want that to be in the literal block?  If you don't indent it
that won't work.  If you *do* want it, you really don't need the literal
markup. 

> +
> +arg2:
> +    Command for operation, must be one off:
> +    - ``PR_SCHED_CORE_GET              0  -- get core_sched cookie of ``pid``.
> +    - ``PR_SCHED_CORE_CREATE           1  -- create a new unique cookie for ``pid``.
> +    - ``PR_SCHED_CORE_SHARE_TO         2  -- push core_sched cookie to ``pid``.
> +    - ``PR_SCHED_CORE_SHARE_FROM       3  -- pull core_sched cookie from ``pid``.
> +
> +arg3:
> +    ``pid`` of the task for which the operation applies.
> +
> +arg4:
> +    ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
> +    For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
> +    will be performed for all tasks in the task group of ``pid``.
> +
> +arg5:
> +    userspace pointer to an unsigned long for storing the cookie returned by
> +    ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.
> +
> +Cookie Transferral
> +~~~~~~~~~~~~~~~~~~
> +Transferring a cookie between the current and other tasks is possible using
> +PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
> +specified task or a share a cookie with a task. In combination this allows a
> +simple helper program to pull a cookie from a task in an existing core
> +scheduling group and share it with already running tasks.

There must be some sort of security model here, right?  You can't just
steal somebody else's cookies, even if they are the yummy chocolate-chip
variety.  It would be good to say what the policy is.

> +Design/Implementation
> +---------------------
> +Each task that is tagged is assigned a cookie internally in the kernel. As
> +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
> +each other and share a core.
> +
> +The basic idea is that, every schedule event tries to select tasks for all the
> +siblings of a core such that all the selected tasks running on a core are
> +trusted (same cookie) at any point in time. Kernel threads are assumed trusted.

...and kernel threads trust random user tasks too?  Interesting.

> +The idle task is considered special, as it trusts everything and everything
> +trusts it.
> +
> +During a schedule() event on any sibling of a core, the highest priority task on
> +the sibling's core is picked and assigned to the sibling calling schedule(), if
> +the sibling has the task enqueued. For rest of the siblings in the core,
> +highest priority task with the same cookie is selected if there is one runnable
> +in their individual run queues. If a task with same cookie is not available,
> +the idle task is selected.  Idle task is globally trusted.
> +
> +Once a task has been selected for all the siblings in the core, an IPI is sent to
> +siblings for whom a new task was selected. Siblings on receiving the IPI will
> +switch to the new task immediately. If an idle task is selected for a sibling,
> +then the sibling is considered to be in a `forced idle` state. I.e., it may
> +have tasks on its on runqueue to run, however it will still have to run idle.
> +More on this in the next section.
> +
> +Forced-idling of tasks
> +----------------------

You're idling *CPUs*, not tasks, right?

> +The scheduler tries its best to find tasks that trust each other such that all
> +tasks selected to be scheduled are of the highest priority in a core.  However,
> +it is possible that some runqueues had tasks that were incompatible with the
> +highest priority ones in the core. Favoring security over fairness, one or more
> +siblings could be forced to select a lower priority task if the highest
> +priority task is not trusted with respect to the core wide highest priority
> +task.  If a sibling does not have a trusted task to run, it will be forced idle
> +by the scheduler (idle thread is scheduled to run).
> +
> +When the highest priority task is selected to run, a reschedule-IPI is sent to
> +the sibling to force it into idle. This results in 4 cases which need to be
> +considered depending on whether a VM or a regular usermode process was running
> +on either HT::
> +
> +          HT1 (attack)            HT2 (victim)
> +   A      idle -> user space      user space -> idle
> +   B      idle -> user space      guest -> idle
> +   C      idle -> guest           user space -> idle
> +   D      idle -> guest           guest -> idle
> +
> +Note that for better performance, we do not wait for the destination CPU
> +(victim) to enter idle mode. This is because the sending of the IPI would bring
> +the destination CPU immediately into kernel mode from user space, or VMEXIT
> +in the case of guests. At best, this would only leak some scheduler metadata
> +which may not be worth protecting. It is also possible that the IPI is received
> +too late on some architectures, but this has not been observed in the case of
> +x86.
> +
> +Trust model
> +-----------
> +Core scheduling maintains trust relationships amongst groups of tasks by
> +assigning them a tag that is the same cookie value.
> +When a system with core scheduling boots, all tasks are considered to trust
> +each other. This is because the core scheduler does not have information about
> +trust relationships until userspace uses the above mentioned interfaces, to
> +communicate them. In other words, all tasks have a default cookie value of 0.
> +and are considered system-wide trusted. The stunning of siblings running

"stunning"?  Is this idling or are you doing something more violent here?

> +cookie-0 tasks is also avoided.

[...]

Thanks,

jon


* Re: [PATCH] Documentation: Add usecases, design and interface for core scheduling
  2021-05-26 21:43 ` Chris Hyser
@ 2021-06-01 20:24   ` Joel Fernandes
  0 siblings, 0 replies; 5+ messages in thread
From: Joel Fernandes @ 2021-06-01 20:24 UTC (permalink / raw)
  To: Chris Hyser
  Cc: linux-kernel, Josh Don, mingo, peterz, Jonathan Corbet, linux-doc

Apologies for the late reply, holidays and all.

On Wed, May 26, 2021 at 05:43:01PM -0400, Chris Hyser wrote:
[..]
> > +Usage
> > +-----
> > +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
> > +Using this feature, userspace defines groups of tasks that can be co-scheduled
> > +on the same core. The core scheduler uses this information to make sure that
> > +tasks that are not in the same group never run simultaneously on a core, while
> > +doing its best to satisfy the system's scheduling requirements.
> > +
> > +Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
> > +This interface provides support for the creation of core scheduling groups, as
> > +well as admission and removal of tasks from created groups.
> > +
> > +::
> > +
> > +    #include <sys/prctl.h>
> > +
> > +    int prctl(int option, unsigned long arg2, unsigned long arg3,
> > +            unsigned long arg4, unsigned long arg5);
> > +
> > +option:
> > +    ``PR_SCHED_CORE``
> > +
> > +arg2:
> > +    Command for operation, must be one off:
> > +    - ``PR_SCHED_CORE_GET              0  -- get core_sched cookie of ``pid``.
> > +    - ``PR_SCHED_CORE_CREATE           1  -- create a new unique cookie for ``pid``.
> > +    - ``PR_SCHED_CORE_SHARE_TO         2  -- push core_sched cookie to ``pid``.
> > +    - ``PR_SCHED_CORE_SHARE_FROM       3  -- pull core_sched cookie from ``pid``.
> > +
> > +arg3:
> > +    ``pid`` of the task for which the operation applies.
> > +
> > +arg4:
> > +    ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
> > +    For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
> > +    will be performed for all tasks in the task group of ``pid``.
> > +
> > +arg5:
> > +    userspace pointer to an unsigned long for storing the cookie returned by
> > +    ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.
> 
> Thanks Joel.

Np, thanks.

> In terms of using the prctl() interface to achieve what was once done with
> cgroups, we might want to add some text somewhere in here along the lines of
> say:

Sure.

> 
> -----------
> 
> The simplest way to build hierarchies of threads/processes which share a
> cookie and thus a core is to rely on the fact that the core-sched cookie is
> inherited across forks/clones and execs, thus setting a cookie for the
> 'initial' script/executable/daemon will place every spawned child in the
> same core-sched group. The prctl() API is useful for verification or making
> more specific or elaborate changes.

Just a question:  What kind of verification and why?

> Clearing a cookie can be done with
> PR_SCHED_CORE_SHARE_* involving a task w/o a cookie presumably owned by root
> or other secure user.

I would drop this part from the description tbh, since it seems like rather a
corner case. It seems odd to have to clear a cookie once it is set, but if
you can provide me a usecase for clearing, then I can add that in. We don't
clear the cookie in our ChromeOS usecases.

thanks,

 - Joel


> 
> 
> 
> > +
> > +Cookie Transferral
> > +~~~~~~~~~~~~~~~~~~
> > +Transferring a cookie between the current and other tasks is possible using
> > +PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
> > +specified task or a share a cookie with a task. In combination this allows a
> > +simple helper program to pull a cookie from a task in an existing core
> > +scheduling group and share it with already running tasks.
> > +
> > +Design/Implementation
> > +---------------------
> > +Each task that is tagged is assigned a cookie internally in the kernel. As
> > +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
> > +each other and share a core.
> > +
> > +The basic idea is that, every schedule event tries to select tasks for all the
> > +siblings of a core such that all the selected tasks running on a core are
> > +trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
> > +The idle task is considered special, as it trusts everything and everything
> > +trusts it.
> > +
> > +During a schedule() event on any sibling of a core, the highest priority task on
> > +the sibling's core is picked and assigned to the sibling calling schedule(), if
> > +the sibling has the task enqueued. For rest of the siblings in the core,
> > +highest priority task with the same cookie is selected if there is one runnable
> > +in their individual run queues. If a task with same cookie is not available,
> > +the idle task is selected.  Idle task is globally trusted.
> > +
> > +Once a task has been selected for all the siblings in the core, an IPI is sent to
> > +siblings for whom a new task was selected. Siblings on receiving the IPI will
> > +switch to the new task immediately. If an idle task is selected for a sibling,
> > +then the sibling is considered to be in a `forced idle` state. I.e., it may
> > +have tasks on its on runqueue to run, however it will still have to run idle.
> > +More on this in the next section.
> > +
> > +Forced-idling of tasks
> > +----------------------
> > +The scheduler tries its best to find tasks that trust each other such that all
> > +tasks selected to be scheduled are of the highest priority in a core.  However,
> > +it is possible that some runqueues had tasks that were incompatible with the
> > +highest priority ones in the core. Favoring security over fairness, one or more
> > +siblings could be forced to select a lower priority task if the highest
> > +priority task is not trusted with respect to the core wide highest priority
> > +task.  If a sibling does not have a trusted task to run, it will be forced idle
> > +by the scheduler (idle thread is scheduled to run).
> > +
> > +When the highest priority task is selected to run, a reschedule-IPI is sent to
> > +the sibling to force it into idle. This results in 4 cases which need to be
> > +considered depending on whether a VM or a regular usermode process was running
> > +on either HT::
> > +
> > +          HT1 (attack)            HT2 (victim)
> > +   A      idle -> user space      user space -> idle
> > +   B      idle -> user space      guest -> idle
> > +   C      idle -> guest           user space -> idle
> > +   D      idle -> guest           guest -> idle
> > +
> > +Note that for better performance, we do not wait for the destination CPU
> > +(victim) to enter idle mode. This is because the sending of the IPI would bring
> > +the destination CPU immediately into kernel mode from user space, or VMEXIT
> > +in the case of guests. At best, this would only leak some scheduler metadata
> > +which may not be worth protecting. It is also possible that the IPI is received
> > +too late on some architectures, but this has not been observed in the case of
> > +x86.
> > +
> > +Trust model
> > +-----------
> > +Core scheduling maintains trust relationships amongst groups of tasks by
> > +assigning them a tag that is the same cookie value.
> > +When a system with core scheduling boots, all tasks are considered to trust
> > +each other. This is because the core scheduler does not have information about
> > +trust relationships until userspace uses the above mentioned interfaces, to
> > +communicate them. In other words, all tasks have a default cookie value of 0.
> > +and are considered system-wide trusted. The stunning of siblings running
> > +cookie-0 tasks is also avoided.
> > +
> > +Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
> > +within such groups are considered to trust each other, but do not trust those
> > +outside. Tasks outside the group also don't trust tasks within.
> > +
> > +Limitations of core-scheduling
> > +------------------------------
> > +Core scheduling tries to guarantee that only trusted tasks run concurrently on a
> > +core. But there could be small window of time during which untrusted tasks run
> > +concurrently or kernel could be running concurrently with a task not trusted by
> > +kernel.
> > +
> > +1. IPI processing delays
> > +########################
> > +Core scheduling selects only trusted tasks to run together. IPI is used to notify
> > +the siblings to switch to the new task. But there could be hardware delays in
> > +receiving of the IPI on some arch (on x86, this has not been observed). This may
> > +cause an attacker task to start running on a CPU before its siblings receive the
> > +IPI. Even though cache is flushed on entry to user mode, victim tasks on siblings
> > +may populate data in the cache and micro architectural buffers after the attacker
> > +starts to run and this is a possibility for data leak.
> > +
> > +Open cross-HT issues that core scheduling does not solve
> > +--------------------------------------------------------
> > +1. For MDS
> > +##########
> > +Core scheduling cannot protect against MDS attacks between an HT running in
> > +user mode and another running in kernel mode. Even though both HTs run tasks
> > +which trust each other, kernel memory is still considered untrusted. Such
> > +attacks are possible for any combination of sibling CPU modes (host or guest mode).
> > +
> > +2. For L1TF
> > +###########
> > +Core scheduling cannot protect against an L1TF guest attacker exploiting a
> > +guest or host victim. This is because the guest attacker can craft invalid
> > +PTEs which are not inverted due to a vulnerable guest kernel. The only
> > +solution is to disable EPT (Extended Page Tables).
> > +
> > +For both MDS and L1TF, if the guest vCPU is configured to not trust each
> > +other (by tagging separately), then the guest to guest attacks would go away.
> > +Or it could be a system admin policy which considers guest to guest attacks as
> > +a guest problem.
> > +
> > +Another approach to resolve these would be to make every untrusted task on the
> > +system to not trust every other untrusted task. While this could reduce
> > +parallelism of the untrusted tasks, it would still solve the above issues while
> > +allowing system processes (trusted tasks) to share a core.
> > +
> > +3. Protecting the kernel (IRQ, syscall, VMEXIT)
> > +###############################################
> > +Unfortunately, core scheduling does not protect kernel contexts running on
> > +sibling hyperthreads from one another. Prototypes of mitigations have been posted
> > +to LKML to solve this, but it is debatable whether such windows are practically
> > +exploitable, and whether the performance overhead of the prototypes are worth
> > +it (not to mention, the added code complexity).
> > +
> > +Other Use cases
> > +---------------
> > +The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
> > +with SMT enabled. There are other use cases where this feature could be used:
> > +
> > +- Isolating tasks that needs a whole core: Examples include realtime tasks, tasks
> > +  that uses SIMD instructions etc.
> > +- Gang scheduling: Requirements for a group of tasks that needs to be scheduled
> > +  together could also be realized using core scheduling. One example is vCPUs of
> > +  a VM.
> > diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
> > index ca4dbdd9016d..f12cda55538b 100644
> > --- a/Documentation/admin-guide/hw-vuln/index.rst
> > +++ b/Documentation/admin-guide/hw-vuln/index.rst
> > @@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
> >      tsx_async_abort
> >      multihit.rst
> >      special-register-buffer-data-sampling.rst
> > +   core-scheduling.rst
> > 


* Re: [PATCH] Documentation: Add usecases, design and interface for core scheduling
  2021-05-26 22:52 ` Jonathan Corbet
@ 2021-06-01 20:46   ` Joel Fernandes
  0 siblings, 0 replies; 5+ messages in thread
From: Joel Fernandes @ 2021-06-01 20:46 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linux-kernel, Chris Hyser, Josh Don, mingo, peterz, linux-doc

Hi Jon,

Apologies for late reply, memorial day holidays and all...

On Wed, May 26, 2021 at 04:52:10PM -0600, Jonathan Corbet wrote:
> "Joel Fernandes (Google)" <joel@joelfernandes.org> writes:
> 
> > Now that core scheduling is merged, update the documentation.
> 
> Yay documentation!

What can I say, it is important and necessary as much as it is boring ;-)

> A couple of nits...
> 
> > Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
> > Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
> > Co-developed-by: Josh Don <joshdon@google.com>
> > Signed-off-by: Josh Don <joshdon@google.com>
> > Cc: mingo@kernel.org
> > Cc: peterz@infradead.org
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > ---
> >  .../admin-guide/hw-vuln/core-scheduling.rst   | 211 ++++++++++++++++++
> >  Documentation/admin-guide/hw-vuln/index.rst   |   1 +
> 
> As I understand it, there are use cases for core scheduling that go well
> beyond dancing around hardware vulnerabilities.  So do we really want to
> bury the documentation for this feature there?  To me it seems like the
> user-space API manual might be a better place, but perhaps I'm missing
> something.

True. But I would say the "main usecase" is security. So perhaps it is better
to house it here, with a slight reference to other usecases - if that's ok
with you.

> >  2 files changed, 212 insertions(+)
> >  create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
> >
> > diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> > new file mode 100644
> > index 000000000000..585edf16183b
> > --- /dev/null
> > +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> > @@ -0,0 +1,211 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +Core Scheduling
> > +***************
> 
> We have a nicely laid-out set of conventions for subsection headings,
> described in Documentation/doc-guide/sphinx.rst; it would be nice if
> this document would conform to that.

Ok, I will go through that. Sorry and thanks.

> > +Core scheduling support allows userspace to define groups of tasks that can
> > +share a core. These groups can be specified either for security usecases (one
> > +group of tasks don't trust another), or for performance usecases (some
> > +workloads may benefit from running on the same core as they don't need the same
> > +hardware resources of the shared core, or may prefer different cores if they
> > +do share hardware resource needs). This document only describes the security
> > +usecase.
> > +
> > +Security usecase
> > +----------------
> > +A cross-HT attack involves the attacker and victim running on different Hyper
> > +Threads of the same core. MDS and L1TF are examples of such attacks.  The only
> > +full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
> > +scheduling is a scheduler feature that can mitigate some (not all) cross-HT
> > +attacks. It allows HT to be turned on safely by ensuring that tasks in a
> 
> by ensuring that *only* tasks in a trusted group ... right?

Yes, ok.

> > +user-designated trusted group can share a core. This increase in core sharing
> > +can also improve performance, however it is not guaranteed that performance
> > +will always improve, though that is seen to be the case with a number of real
> > +world workloads. In theory, core scheduling aims to perform at least as good as
> 
> s/good/well/

Ok.

> > +when Hyper Threading is disabled. In practice, this is mostly the case though
> > +not always: as synchronizing scheduling decisions across 2 or more CPUs in a
> > +core involves additional overhead - especially when the system is lightly
> > +loaded. When ``total_threads <= N_CPUS/2``, the extra overhead may cause core
> > +scheduling to perform more poorly compared to SMT-disabled, where N_CPUS is the
> > +total number of CPUs. Please measure the performance of your workloads always.
> > +
> > +Usage
> > +-----
> > +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
> 
> The use of ``literal text`` markup isn't necessary here, and is known to
> irritate some people.

Ok.

> > +Using this feature, userspace defines groups of tasks that can be co-scheduled
> > +on the same core. The core scheduler uses this information to make sure that
> > +tasks that are not in the same group never run simultaneously on a core, while
> > +doing its best to satisfy the system's scheduling requirements.
> > +
> > +Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
> > +This interface provides support for the creation of core scheduling groups, as
> > +well as admission and removal of tasks from created groups.
> > +
> > +::
> 
> I'd just say "from created groups::" and leave off the separate "::" line.

Ok sure.

> > +
> > +    #include <sys/prctl.h>
> > +
> > +    int prctl(int option, unsigned long arg2, unsigned long arg3,
> > +            unsigned long arg4, unsigned long arg5);
> > +
> > +option:
> > +    ``PR_SCHED_CORE``
> 
> Did you want that to be in the literal block?  If you don't indent it
> that won't work.  If you *do* want it, you really don't need the literal
> markup. 

makes sense, I did want it literal. Will drop quotes.

> > +
> > +arg2:
> > +    Command for operation, must be one off:
> > +    - ``PR_SCHED_CORE_GET              0  -- get core_sched cookie of ``pid``.
> > +    - ``PR_SCHED_CORE_CREATE           1  -- create a new unique cookie for ``pid``.
> > +    - ``PR_SCHED_CORE_SHARE_TO         2  -- push core_sched cookie to ``pid``.
> > +    - ``PR_SCHED_CORE_SHARE_FROM       3  -- pull core_sched cookie from ``pid``.
> > +
> > +arg3:
> > +    ``pid`` of the task for which the operation applies.
> > +
> > +arg4:
> > +    ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
> > +    For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
> > +    will be performed for all tasks in the task group of ``pid``.
> > +
> > +arg5:
> > +    userspace pointer to an unsigned long for storing the cookie returned by
> > +    ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.
> > +
> > +Cookie Transferral
> > +~~~~~~~~~~~~~~~~~~
> > +Transferring a cookie between the current and other tasks is possible using
> > +PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
> > +specified task or a share a cookie with a task. In combination this allows a
> > +simple helper program to pull a cookie from a task in an existing core
> > +scheduling group and share it with already running tasks.
> 
> There must be some sort of security model here, right?  You can't just
> steal somebody else's cookies, even if they are the yummy chocolate-chip
> variety.  It would be good to say what the policy is.

Yeah. It is enforced by these ptrace checks in the code. I will add some info
about it:

        /*
         * Check if this process has the right to modify the specified
         * process. Use the regular "ptrace_may_access()" checks.
         */
        if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                err = -EPERM;
                goto out;
        }
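
From the userspace side this just shows up as the prctl() failing. A rough,
untested sketch of what a caller would see (assuming the PR_SCHED_CORE* uapi
definitions and a hypothetical share_cookie_to() helper):

        #include <errno.h>
        #include <stdio.h>
        #include <sys/types.h>
        #include <sys/prctl.h>
        #include <linux/prctl.h>   /* PR_SCHED_CORE* on new enough headers */

        /* try to push our cookie to 'target' (pid_type 0 == PIDTYPE_PID) */
        static int share_cookie_to(pid_t target)
        {
                if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, target, 0, 0) == -1) {
                        if (errno == EPERM)
                                fprintf(stderr, "no ptrace access to %d\n", (int)target);
                        else
                                perror("PR_SCHED_CORE_SHARE_TO");
                        return -1;
                }
                return 0;
        }

So roughly: you can only create or share a cookie for tasks you could also
ptrace.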

> > +Design/Implementation
> > +---------------------
> > +Each task that is tagged is assigned a cookie internally in the kernel. As
> > +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
> > +each other and share a core.
> > +
> > +The basic idea is that, every schedule event tries to select tasks for all the
> > +siblings of a core such that all the selected tasks running on a core are
> > +trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
> 
> ...and kernel threads trust random user tasks too?  Interesting.

Not if those untrusted random user tasks are assigned a cookie.

> > +The idle task is considered special, as it trusts everything and everything
> > +trusts it.
> > +
> > +During a schedule() event on any sibling of a core, the highest priority task on
> > +the sibling's core is picked and assigned to the sibling calling schedule(), if
> > +the sibling has the task enqueued. For rest of the siblings in the core,
> > +highest priority task with the same cookie is selected if there is one runnable
> > +in their individual run queues. If a task with same cookie is not available,
> > +the idle task is selected.  Idle task is globally trusted.
> > +
> > +Once a task has been selected for all the siblings in the core, an IPI is sent to
> > +siblings for whom a new task was selected. Siblings on receiving the IPI will
> > +switch to the new task immediately. If an idle task is selected for a sibling,
> > +then the sibling is considered to be in a `forced idle` state. I.e., it may
> > +have tasks on its on runqueue to run, however it will still have to run idle.
> > +More on this in the next section.
> > +
> > +Forced-idling of tasks
> > +----------------------
> 
> You're idling *CPUs*, not tasks, right?

You are quite right, I'll correct the wording, thanks.

> > +The scheduler tries its best to find tasks that trust each other such that all
> > +tasks selected to be scheduled are of the highest priority in a core.  However,
> > +it is possible that some runqueues had tasks that were incompatible with the
> > +highest priority ones in the core. Favoring security over fairness, one or more
> > +siblings could be forced to select a lower priority task if the highest
> > +priority task is not trusted with respect to the core wide highest priority
> > +task.  If a sibling does not have a trusted task to run, it will be forced idle
> > +by the scheduler (idle thread is scheduled to run).
> > +
> > +When the highest priority task is selected to run, a reschedule-IPI is sent to
> > +the sibling to force it into idle. This results in 4 cases which need to be
> > +considered depending on whether a VM or a regular usermode process was running
> > +on either HT::
> > +
> > +          HT1 (attack)            HT2 (victim)
> > +   A      idle -> user space      user space -> idle
> > +   B      idle -> user space      guest -> idle
> > +   C      idle -> guest           user space -> idle
> > +   D      idle -> guest           guest -> idle
> > +
> > +Note that for better performance, we do not wait for the destination CPU
> > +(victim) to enter idle mode. This is because the sending of the IPI would bring
> > +the destination CPU immediately into kernel mode from user space, or VMEXIT
> > +in the case of guests. At best, this would only leak some scheduler metadata
> > +which may not be worth protecting. It is also possible that the IPI is received
> > +too late on some architectures, but this has not been observed in the case of
> > +x86.
> > +
> > +Trust model
> > +-----------
> > +Core scheduling maintains trust relationships amongst groups of tasks by
> > +assigning them a tag that is the same cookie value.
> > +When a system with core scheduling boots, all tasks are considered to trust
> > +each other. This is because the core scheduler does not have information about
> > +trust relationships until userspace uses the above mentioned interfaces, to
> > +communicate them. In other words, all tasks have a default cookie value of 0.
> > +and are considered system-wide trusted. The stunning of siblings running
> 
> "stunning"?  Is this idling or are you doing something more violent here?

Yes, idling would be easier to understand. "Stunning" is a term used in the
security circles to mean forced idling on incompatible CPUs. I will just
change it to "forced idling".

Will spin this patch soon with the corrections, thanks Jon!

-Joel

> 
> > +cookie-0 tasks is also avoided.
> 
> [...]
> 
> Thanks,
> 
> jon

