All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
@ 2022-03-15 15:31 Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
                   ` (15 more replies)
  0 siblings, 16 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, add task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 1) for details. Its attached at the end of this message
in .txt format as well.

Note: the prctl interface is independent of nohz_full=.

The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/

- qemu-task-isolation.patch: activate task isolation from CPU execution loop
- rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
- util-linux-chisol.patch: add chisol tool to util-linux.

---

v12
- Set TIF_TASK_ISOL only when necessary         (Frederic Weisbecker)
- Switch from raw_cpu_read to __this_cpu_read   (Frederic Weisbecker)
- Add missing kvm-entry.h change                (Frederic Weisbecker)

v11
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace                                   (Frederic Weisbecker)
- Fold patch to add task_isol_exit hooks into
  "sync vmstats on return to userspace" patch.          (Frederic Weisbecker)
- Fix typo on prctl_task_isol_cfg_get declaration      (Oscar Shiang)
- Add preempt notifiers

v10
- no changes (resending series without changelog corruption).

v9
- Clarify inheritance is propagated to all descendents (Frederic Weisbecker)
- Fix inheritance of one-shot mode across exec/fork    (Frederic Weisbecker)
- Unify naming on "task_isol_..."                      (Frederic Weisbecker)
- Introduce CONFIG_TASK_ISOLATION                      (Frederic Weisbecker)

v8
- Document the possibility for ISOL_F_QUIESCE_ONE,
  to configure individual features                    (Frederic Weisbecker).
- Fix PR_ISOL_CFG_GET typo in documentation           (Frederic Weisbecker).
- Rebased against linux-2.6.git.

v7
 - no changes (resending series without changelog corruption).

v6
 - Move oneshot mode enablement to configuration time (Frederic Weisbecker).
 - Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
 - Update docs and samples regarding oneshot mode     (Frederic Weisbecker).
 - Update docs and samples regarding more extensibility of
   CFG_SET of ISOL_F_QUIESCE                          (Frederic Weisbecker).
 - prctl_task_isolation_activate_get should copy active_mask
   to address in arg2.
 - modify exit_to_user_mode_loop to cover exceptions
   and interrupts.
 - split exit hooks into its own patch

v5
  - Add changelogs to individual patches                (Peter Zijlstra).
  - Add documentation to patchset intro                 (Peter Zijlstra).

v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and use drop
   "internal configuration" prctls            (Frederic Weisbecker).
 - Add summary of terms to documentation      (Frederic Weisbecker).
 - Examples for compute and one-shot modes    (Thomas G/Christoph L).

v3:

 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
 - Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.
 - Report configuration/activation state at
   /proc/pid/task_isolation.
 - Condense dirty information of per-CPU vmstats counters
   in a bool.
 - In-kernel KVM support.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

v11 can be found at:

https://lore.kernel.org/all/20220204173537.429902988@fedora.localdomain/

---

 Documentation/userspace-api/task_isolation.rst |  379 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/s390/include/asm/thread_info.h            |    2
 arch/x86/include/asm/thread_info.h             |    2
 fs/proc/base.c                                 |   68 ++++++++++
 include/linux/entry-common.h                   |    2
 include/linux/entry-kvm.h                      |    2
 include/linux/sched.h                          |    5
 include/linux/task_isolation.h                 |  136 ++++++++++++++++++++
 include/linux/vmstat.h                         |   25 +++
 include/uapi/linux/prctl.h                     |   47 +++++++
 init/Kconfig                                   |   16 ++
 init/init_task.c                               |    3
 kernel/Makefile                                |    2
 kernel/entry/common.c                          |    4
 kernel/entry/kvm.c                             |    4
 kernel/exit.c                                  |    2
 kernel/fork.c                                  |   23 +++
 kernel/sys.c                                   |   16 ++
 kernel/task_isolation.c                        |  424 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c                                    |  173 +++++++++++++++++++++-----
 samples/Kconfig                                |    7 +
 samples/Makefile                               |    1
 samples/task_isolation/Makefile                |   11 +
 samples/task_isolation/task_isol.c             |   92 ++++++++++++++
 samples/task_isolation/task_isol.h             |    9 +
 samples/task_isolation/task_isol_computation.c |   89 +++++++++++++
 samples/task_isolation/task_isol_oneshot.c     |  104 +++++++++++++++
 samples/task_isolation/task_isol_userloop.c    |   54 ++++++++
 28 files changed, 1664 insertions(+), 38 deletions(-)

---

Task isolation prctl interface
******************************

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth
networking applications with user-space drivers can fall into the
category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel the start and end of
the latency sensitive application section (with configurable system
behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:

   * PR_ISOL_FEAT_GET: Retrieve supported features.

   * PR_ISOL_CFG_GET: Retrieve task isolation configuration.

   * PR_ISOL_CFG_SET: Set task isolation configuration.

   * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.

   * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.

Summary of terms:

* feature:

     A distinct attribute or aspect of task isolation. Examples of
     features could be logging, new operating modes (eg: syscalls
     disallowed), userspace notifications, etc. The only feature
     currently available is quiescing.

* configuration:

     A specific choice from a given set of possible choices that
     dictate how the particular feature in question should behave.

* activation state:

     The activation state (whether active/inactive) of the task
     isolation features (features must be configured before being
     activated).

Inheritance of the isolation parameters and state, across fork(2) and
clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.

At a high-level, task isolation is divided in two steps:

1. Configuration.

2. Activation.

Section "Userspace support" describes how to use task isolation.

In terms of the interface, the sequence of steps to activate task
isolation are:

1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).

2. Configure task isolation features
   (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).

3. Activate or deactivate task isolation features
   (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).

This interface is based on ideas and code from the task isolation
patchset from Alex Belits: https://lwn.net/Articles/816298/

Note: if the need arises to configure an individual quiesce feature
with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
features per syscall currently).


Feature description
===================

   * "ISOL_F_QUIESCE"

   This feature allows quiescing selected kernel activities on return
   from system calls.


Interface description
=====================

**PR_ISOL_FEAT**:

   Returns the supported features and feature capabilities, as a
   bitmask:

      prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);

   The 'feat' argument specifies whether to return supported features
   (if zero), or feature capabilities (if not zero). Possible values
   for 'feat' are:

   * "0":

        Return the bitmask of supported features, in the location
        pointed  to  by  "(int *)arg3". The buffer should allow space
        for 8 bytes.

   * "ISOL_F_QUIESCE":

        Return a structure containing which kernel activities are
        supported for quiescing, in the location pointed to by "(int
        *)arg3":

           struct task_isol_quiesce_extensions {
                   __u64 flags;
                   __u64 supported_quiesce_bits;
                   __u64 pad[6];
           };

        Where:

        *flags*: Additional flags (should be zero).

        *supported_quiesce_bits*: Bitmask indicating
           which features are supported for quiescing.

        *pad*: Additional space for future enhancements.

   Features and its capabilities are defined at
   include/uapi/linux/task_isolation.h.

**PR_ISOL_CFG_GET**:

   Retrieve task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Return configuration of task isolation features. The 'arg3'
        argument specifies whether to return configured features (if
        zero), or individual feature configuration (if not zero), as
        follows.

        * "0":

             Return the bitmask of configured features, in the
             location pointed  to  by  "(int *)arg4". The buffer
             should allow space for 8 bytes.

        * "ISOL_F_QUIESCE":

             If arg4 is QUIESCE_CONTROL, return the control structure
             for quiescing of background kernel activities, in the
             location pointed to by "(int *)arg5":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 quiesce_oneshot_mask;
                       __u64 pad[5];
                };

             See PR_ISOL_CFG_SET description for meaning of fields.

   * "I_CFG_INHERIT":

        Retrieve inheritance configuration across fork/clone.

        Return the structure which configures inheritance across
        fork/clone, in the location pointed to by "(int *)arg4":

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        See PR_ISOL_CFG_SET description for meaning of fields.

**PR_ISOL_CFG_SET**:

   Set task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Set configuration of task isolation features. 'arg3' specifies
        the feature. Possible values are:

        * "ISOL_F_QUIESCE":

             If arg4 is QUIESCE_CONTROL, set the control structure for
             quiescing of background kernel activities, from the
             location pointed to by "(int *)arg5":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 quiesce_oneshot_mask;
                       __u64 pad[5];
                };

             Where:

             *flags*: Additional flags (should be zero).

             *quiesce_mask*: A bitmask containing which kernel
             activities to quiesce.

             *quiesce_oneshot_mask*: A bitmask indicating which kernel
             activities should behave in oneshot mode, that is,
             quiescing will happen on return from
             prctl(PR_ISOL_ACTIVATE_SET), but not on return of
             subsequent system calls. The corresponding bit(s) must
             also be set at quiesce_mask.

             *pad*: Additional space for future enhancements.

             For quiesce_mask (and quiesce_oneshot_mask), possible bit
             sets are:

             * "ISOL_F_QUIESCE_VMSTATS"

             VM statistics are maintained in per-CPU counters to
             improve performance. When a CPU modifies a VM statistic,
             this modification is kept in the per-CPU counter. Certain
             activities require a global count, which involves
             requesting each CPU to flush its local counters to the
             global VM counters.

             This flush is implemented via a workqueue item, which
             might schedule a workqueue on isolated CPUs.

             To avoid this interruption, task isolation can be
             configured to, upon return from system calls, synchronize
             the per-CPU counters to global counters, thus avoiding
             the interruption.

   * "I_CFG_INHERIT":

        Set inheritance configuration when a new task is created via
        fork and clone.

        The "(int *)arg4" argument is a pointer to:

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        inherit_mask is a bitmask that specifies which part of task
        isolation should be inherited:

        * Bit ISOL_INHERIT_CONF: Inherit task isolation
          configuration. This is the state written via
          prctl(PR_ISOL_CFG_SET, ...).

        * Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
          (requires ISOL_INHERIT_CONF to be set). The new task should
          behave, after fork/clone, in the same manner as the parent
          task after it executed:

             prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);

        Note: the inheritance propagates to all the descendants and
        not just the immediate children, unless the inheritance is
        explicitly reconfigured by some children.

**PR_ISOL_ACTIVATE_GET**:

   Retrieve task isolation activation state.

   The general format is:

      prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);

   'pmask' specifies the location of a feature mask, where the current
   active mask will be copied. See PR_ISOL_ACTIVATE_SET for
   description of individual bits.

**PR_ISOL_ACTIVATE_SET**:

   Set task isolation activation state (activates/deactivates task
   isolation).

   The general format is:

      prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);

   The 'pmask' argument specifies the location of an 8 byte mask
   containing which features should be activated. Features whose bits
   are cleared will be deactivated. The possible bits for this mask
   are:

      * "ISOL_F_QUIESCE":

      Activate quiescing of background kernel activities. Quiescing
      happens on return to userspace from this system call, and on
      return from subsequent system calls (unless quiesce_oneshot_mask
      has been set at PR_ISOL_CFG_SET time).

   Quiescing can be adjusted (while active) by
   prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).


Userspace support
*****************

Task isolation is divided in two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose.

This results in three combinations:

1. Both configuration and activation performed by the latency
sensitive application. Allows fine grained control of what task
isolation features are enabled and when (see samples section below).

2. Only activation can be performed by the latency sensitive app (and
configuration performed by chisol). This allows the admin/user to
control task isolation parameters, and applications have to be
modified only once.

3. Configuration and activation performed by an external tool. This
allows unmodified applications to take advantage of task isolation.
Activation is performed by the "-a" option of chisol.


Examples
********

The "samples/task_isolation/" directory contains 3 examples:

* task_isol_userloop.c:

     Example of program with a loop on userspace scenario.

* task_isol_computation.c:

     Example of program that enters task isolated mode, performs an
     amount of computation, exits task isolated mode, and writes the
     computation to disk.

* task_isol_oneshot.c:

     Example of program that enables one-shot mode for quiescing,
     enters a processing loop, then upon an external event performs a
     number of syscalls to handle that event.

This is a snippet of code to activate task isolation if it has been
previously configured (by chisol for example):

   #include <sys/prctl.h>
   #include <linux/types.h>

   #ifdef PR_ISOL_CFG_GET
   unsigned long long fmask;

   ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
   if (ret != -1 && fmask != 0) {
           ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
           if (ret == -1) {
                   perror("prctl PR_ISOL_ACTIVATE_SET");
                   return ret;
           }
   }
   #endif






^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 01/13] s390: add support for TIF_TASK_ISOL
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 02/13] x86: " Marcelo Tosatti
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, linux-s390, Marcelo Tosatti

Add TIF_TASK_ISOL handling for s390.

Cc: linux-s390@vger.kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/arch/s390/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/thread_info.h
+++ linux-2.6/arch/s390/include/asm/thread_info.h
@@ -73,6 +73,7 @@ void arch_setup_new_exec(void);
 #define TIF_ISOLATE_BP		8	/* Run process with isolated BP */
 #define TIF_ISOLATE_BP_GUEST	9	/* Run KVM guests with isolated BP */
 #define TIF_PER_TRAP		10	/* Need to handle PER trap on exit to usermode */
+#define TIF_TASK_ISOL		11	/* task isolation work pending */
 
 #define TIF_31BIT		16	/* 32bit process */
 #define TIF_MEMDIE		17	/* is terminating due to OOM killer */
@@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
 #define _TIF_ISOLATE_BP		BIT(TIF_ISOLATE_BP)
 #define _TIF_ISOLATE_BP_GUEST	BIT(TIF_ISOLATE_BP_GUEST)
 #define _TIF_PER_TRAP		BIT(TIF_PER_TRAP)
+#define _TIF_TASK_ISOL		BIT(TIF_TASK_ISOL)
 
 #define _TIF_31BIT		BIT(TIF_31BIT)
 #define _TIF_SINGLE_STEP	BIT(TIF_SINGLE_STEP)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 02/13] x86: add support for TIF_TASK_ISOL
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 03/13] add basic task isolation prctl interface Marcelo Tosatti
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, x86, Marcelo Tosatti

Add TIF_TASK_ISOL handling for x86.

Cc: x86@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/arch/x86/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/thread_info.h
+++ linux-2.6/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
 #define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_TASK_ISOL		6	/* task isolation work pending */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_TASK_ISOL		(1 << TIF_TASK_ISOL)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 03/13] add basic task isolation prctl interface
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 02/13] x86: " Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-04-25 22:23   ` Thomas Gleixner
  2022-03-15 15:31 ` [patch v12 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Add basic prctl task isolation interface, which allows
informing the kernel that application is executing 
latency sensitive code (where interruptions are undesired).

Interface is described by task_isolation.rst (added by
next patch).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v11:
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace                                   (Frederic W. Weisbecker)
- Fix typo on prctl_task_isoln_cfg_get declaration      (Oscar Shiang)

v6:
 - Move oneshot mode enablement to configuration time (Frederic Weisbecker).
 - Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
 - prctl_task_isolation_activate_get should copy active_mask 
   to address in arg2.
v5:
 - No changes
v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and drop
   "internal configuration" prctls            (Frederic Weisbecker).

v3:
 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
-  Remove "system default" notion (chisol should
   make it obsolete).

v2:
- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).
- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

 include/linux/sched.h          |    5 
 include/linux/task_isolation.h |   91 ++++++++++
 include/uapi/linux/prctl.h     |   43 +++++
 init/init_task.c               |    3 
 kernel/Makefile                |    2 
 kernel/fork.c                  |   22 ++
 kernel/sys.c                   |   16 +
 kernel/task_isolation.c        |  350 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 530 insertions(+), 2 deletions(-)

Index: linux-2.6/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/prctl.h
+++ linux-2.6/include/uapi/linux/prctl.h
@@ -275,4 +275,51 @@ struct prctl_mm_map {
 #define PR_SET_VMA		0x53564d41
 # define PR_SET_VMA_ANON_NAME		0
 
+#define PR_ISOL_FEAT_GET		63
+#define PR_ISOL_CFG_GET			64
+#define PR_ISOL_CFG_SET			65
+
+/* arg2 to CFG_GET/CFG_SET */
+# define I_CFG_FEAT			1
+# define I_CFG_INHERIT			2
+
+#define PR_ISOL_ACTIVATE_GET		66
+#define PR_ISOL_ACTIVATE_SET		67
+
+
+/* Supported features */
+# define ISOL_F_QUIESCE			(1UL << 0)
+
+# define ISOL_F_QUIESCE_MULTIPLE	(1UL << 0)
+#  define ISOL_F_QUIESCE_VMSTATS	(1UL << 0)
+
+struct task_isol_quiesce_extensions {
+	__u64 flags;
+	__u64 supported_quiesce_bits;
+	__u64 pad[6];
+};
+
+/*
+ * This structure provides control over
+ * inheritance of task isolation across
+ * clone and fork.
+ */
+struct task_isol_inherit_control {
+	__u8	inherit_mask;
+	__u8	flags;
+	__u8	pad[6];
+};
+
+# define ISOL_INHERIT_CONF		(1UL << 0)
+# define ISOL_INHERIT_ACTIVE		(1UL << 1)
+
+struct task_isol_quiesce_control {
+	__u64 flags;
+	__u64 quiesce_mask;
+	__u64 quiesce_oneshot_mask;
+	__u64 pad[5];
+};
+
+# define QUIESCE_CONTROL		(1UL << 0)
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_TASK_ISOLATION) += task_isolation.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -59,6 +59,7 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
 #include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
@@ -2606,6 +2607,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 	case PR_SET_VMA:
 		error = prctl_set_vma(arg2, arg3, arg4, arg5);
 		break;
+	case PR_ISOL_FEAT_GET:
+		error = prctl_task_isol_feat_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CFG_GET:
+		error = prctl_task_isol_cfg_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CFG_SET:
+		error = prctl_task_isol_cfg_set(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_GET:
+		error = prctl_task_isol_activate_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_SET:
+		error = prctl_task_isol_activate_set(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct task_isol_info;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1492,6 +1493,10 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+	struct task_isol_info		*task_isol_info;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	.task_isol_info	= NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/task_isolation.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -748,6 +749,7 @@ void __put_task_struct(struct task_struc
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	task_isol_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk, true);
@@ -1557,6 +1559,15 @@ out:
 	return error;
 }
 
+static int copy_task_isol(struct task_struct *tsk)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (current->task_isol_info)
+		return __copy_task_isol(tsk);
+#endif
+	return 0;
+}
+
 static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 {
 	struct sighand_struct *sig;
@@ -2129,7 +2140,9 @@ static __latent_entropy struct task_stru
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
-
+#ifdef CONFIG_TASK_ISOLATION
+	p->task_isol_info = NULL;
+#endif
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
@@ -2173,6 +2186,9 @@ static __latent_entropy struct task_stru
 	retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
+	retval = copy_task_isol(p);
+	if (retval)
+		goto bad_fork_cleanup_thread;
 
 	stackleak_task_init(p);
 
@@ -2181,7 +2197,7 @@ static __latent_entropy struct task_stru
 				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
-			goto bad_fork_cleanup_thread;
+			goto bad_fork_cleanup_task_isol;
 		}
 	}
 
@@ -2410,6 +2426,8 @@ bad_fork_put_pidfd:
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
+bad_fork_cleanup_task_isol:
+	task_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);
 bad_fork_cleanup_io:
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/task_isolation.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_TASK_ISOLATION
+
+struct task_isol_info {
+	/* Which features have been configured */
+	u64 conf_mask;
+	/* Which features are active */
+	u64 active_mask;
+	/* Quiesce mask */
+	u64 quiesce_mask;
+
+	/* Oneshot mask */
+	u64 oneshot_mask;
+
+	u8 inherit_mask;
+};
+
+extern void __task_isol_free(struct task_struct *tsk);
+
+static inline void task_isol_free(struct task_struct *tsk)
+{
+	if (tsk->task_isol_info)
+		__task_isol_free(tsk);
+}
+
+int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
+			    unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_cfg_set(unsigned long arg2, unsigned long arg3,
+			    unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_activate_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_activate_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+
+int __copy_task_isol(struct task_struct *tsk);
+
+#else
+
+static inline void task_isol_free(struct task_struct *tsk)
+{
+}
+
+static inline int prctl_task_isol_feat_get(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_cfg_get(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_cfg_set(unsigned long arg2,
+					  unsigned long arg3,
+					  unsigned long arg4,
+					  unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_activate_get(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_activate_set(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_TASK_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/task_isolation.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@belits.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ *   Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/sched/task.h>
+
+void __task_isol_free(struct task_struct *tsk)
+{
+	if (!tsk->task_isol_info)
+		return;
+	kfree(tsk->task_isol_info);
+	tsk->task_isol_info = NULL;
+}
+
+static struct task_isol_info *task_isol_alloc_context(void)
+{
+	struct task_isol_info *info;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (unlikely(!info))
+		return ERR_PTR(-ENOMEM);
+
+	return info;
+}
+
+int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	switch (arg2) {
+	case 0: {
+		u64 supported_fmask = ISOL_F_QUIESCE;
+
+		ret = 0;
+		if (copy_to_user(addr, &supported_fmask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_extensions *q_ext;
+
+		q_ext = kzalloc(sizeof(struct task_isol_quiesce_extensions),
+			 GFP_KERNEL);
+		if (!q_ext)
+			return -ENOMEM;
+
+		q_ext->supported_quiesce_bits = ISOL_F_QUIESCE_VMSTATS;
+
+		ret = 0;
+		if (copy_to_user(addr, q_ext, sizeof(*q_ext)))
+			ret = -EFAULT;
+		kfree(q_ext);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_get(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	struct task_isol_inherit_control *i_ctrl;
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	if (!current->task_isol_info)
+		return -EINVAL;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	i_ctrl->inherit_mask = current->task_isol_info->inherit_mask;
+
+	ret = 0;
+	if (copy_to_user(addr, i_ctrl, sizeof(*i_ctrl)))
+		ret = -EFAULT;
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_get(unsigned long arg3, unsigned long arg4,
+			unsigned long arg5)
+{
+	int ret = 0;
+
+	switch (arg3) {
+	case 0: {
+		void __user *addr = (void __user *)arg4;
+		u64 cfg_mask = 0;
+
+		if (current->task_isol_info)
+			cfg_mask = current->task_isol_info->conf_mask;
+
+		if (copy_to_user(addr, &cfg_mask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_control *i_qctrl;
+		void __user *addr = (void __user *)arg5;
+
+		if (arg4 != QUIESCE_CONTROL)
+			return -EINVAL;
+
+		i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+				  GFP_KERNEL);
+		if (!i_qctrl)
+			return -ENOMEM;
+
+		if (current->task_isol_info)
+			i_qctrl->quiesce_mask = current->task_isol_info->quiesce_mask;
+
+		if (copy_to_user(addr, i_qctrl, sizeof(*i_qctrl)))
+			ret = -EFAULT;
+
+		kfree(i_qctrl);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		return cfg_feat_get(arg3, arg4, arg5);
+	case I_CFG_INHERIT:
+		return cfg_inherit_get(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_set(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	int ret = 0;
+	struct task_isol_inherit_control *i_ctrl;
+	const void __user *addr = (const void __user *)arg3;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_ctrl, addr, sizeof(*i_ctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_ctrl->inherit_mask & ~(ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE))
+		goto out_free;
+
+	if (i_ctrl->inherit_mask & ISOL_INHERIT_ACTIVE)
+		if (!(i_ctrl->inherit_mask & ISOL_INHERIT_CONF))
+			goto out_free;
+
+	if (!current->task_isol_info) {
+		struct task_isol_info *task_isol_info;
+
+		task_isol_info = task_isol_alloc_context();
+		if (IS_ERR(task_isol_info)) {
+			ret = PTR_ERR(task_isol_info);
+			goto out_free;
+		}
+		current->task_isol_info = task_isol_info;
+	}
+
+	ret = 0;
+	current->task_isol_info->inherit_mask = i_ctrl->inherit_mask;
+
+out_free:
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_quiesce_set(unsigned long arg4, unsigned long arg5)
+{
+	struct task_isol_info *info;
+	struct task_isol_quiesce_control *i_qctrl;
+	int ret = 0;
+	const void __user *addr = (const void __user *)arg5;
+
+	if (arg4 != QUIESCE_CONTROL)
+		return -EINVAL;
+
+	i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+			 GFP_KERNEL);
+	if (!i_qctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_qctrl, addr, sizeof(*i_qctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_qctrl->flags != 0)
+		goto out_free;
+
+	if (i_qctrl->quiesce_mask != ISOL_F_QUIESCE_VMSTATS &&
+	    i_qctrl->quiesce_mask != 0)
+		goto out_free;
+
+	if ((~i_qctrl->quiesce_mask & i_qctrl->quiesce_oneshot_mask) != 0)
+		goto out_free;
+
+	/* current->task_isol_info is only allocated/freed from task
+	 * context.
+	 */
+	if (!current->task_isol_info) {
+		info = task_isol_alloc_context();
+		if (IS_ERR(info)) {
+			ret = PTR_ERR(info);
+			goto out_free;
+		}
+		current->task_isol_info = info;
+	}
+
+	info = current->task_isol_info;
+
+	info->quiesce_mask = i_qctrl->quiesce_mask;
+	info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
+	info->conf_mask |= ISOL_F_QUIESCE;
+	ret = 0;
+
+out_free:
+	kfree(i_qctrl);
+
+	return ret;
+}
+
+int prctl_task_isol_cfg_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		switch (arg3) {
+		case ISOL_F_QUIESCE:
+			return cfg_feat_quiesce_set(arg4, arg5);
+		default:
+			break;
+		}
+		break;
+	case I_CFG_INHERIT:
+		return cfg_inherit_set(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int __copy_task_isol(struct task_struct *tsk)
+{
+	struct task_isol_info *info, *new_info;
+
+	info = current->task_isol_info;
+	if (!(info->inherit_mask & (ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE)))
+		return 0;
+
+	new_info = task_isol_alloc_context();
+	if (IS_ERR(new_info))
+		return PTR_ERR(new_info);
+
+	new_info->inherit_mask = info->inherit_mask;
+
+	if (info->inherit_mask & ISOL_INHERIT_CONF) {
+		new_info->quiesce_mask = info->quiesce_mask;
+		new_info->conf_mask = info->conf_mask;
+		new_info->oneshot_mask = info->oneshot_mask;
+	}
+
+	if (info->inherit_mask & ISOL_INHERIT_ACTIVE)
+		new_info->active_mask = info->active_mask;
+
+	tsk->task_isol_info = new_info;
+
+	return 0;
+}
+
+int prctl_task_isol_activate_set(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	struct task_isol_info *info;
+	u64 active_mask;
+	const void __user *addr_mask = (const void __user *)arg2;
+
+	ret = -EFAULT;
+	if (copy_from_user(&active_mask, addr_mask, sizeof(u64)))
+		goto out;
+
+	ret = -EINVAL;
+	if (active_mask != ISOL_F_QUIESCE && active_mask != 0)
+		return ret;
+
+	info = current->task_isol_info;
+	if (!info)
+		return ret;
+
+	info->active_mask = active_mask;
+	ret = 0;
+
+out:
+	return ret;
+}
+
+int prctl_task_isol_activate_get(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	struct task_isol_info *task_isol_info;
+	void __user *addr_mask = (void __user *)arg2;
+
+	task_isol_info = current->task_isol_info;
+	if (!task_isol_info)
+		return -EINVAL;
+
+	if (copy_to_user(addr_mask, &task_isol_info->active_mask, sizeof(u64)))
+		return -EFAULT;
+
+	return 0;
+}
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -675,6 +675,22 @@ config CPU_ISOLATION
 
 	  Say Y if unsure.
 
+config TASK_ISOLATION
+	bool "Task isolation prctl()"
+	depends on GENERIC_ENTRY
+	default n
+	help
+	  Certain types of applications benefit from running uninterrupted by
+	  background OS activities. Realtime systems and high-bandwidth networking
+	  applications with user-space drivers can fall into the category.
+
+	  To create an OS noise free environment for the application, this option
+	  enables the task isolation prctl interface, which allows userspace to
+	  inform the kernel the start and end of the latency sensitive application
+	  section (with configurable system behaviour for that section).
+
+	  Say N if unsure.
+
 source "kernel/rcu/Kconfig"
 
 config BUILD_BIN2C



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 04/13] add prctl task isolation prctl docs and samples
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 03/13] add basic task isolation prctl interface Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-04-26  0:15   ` Thomas Gleixner
  2022-03-15 15:31 ` [patch v12 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Add documentation and userspace sample code for prctl
task isolation interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v8:
- Document the possibility for ISOL_F_QUIESCE_ONE, to configure
individual features                                              (Frederic Weisbecker).
- Fix PR_ISOL_CFG_GET typo in documentation                      (Frederic Weisbecker).
- Rebased against linux-2.6.git.

v7:
-  No changes
v6:
 - Update docs and samples regarding oneshot mode (Frederic Weisbecker).
 - Update docs and samples regarding more extensibility of
   CFG_SET of ISOL_F_QUIESCE                   (Frederic Weisbecker).

v5:
 - Fix documentation typos		      (Frederic Weisbecker).
 - Fix oneshot example comment

v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and use drop
   "internal configuration" prctls            (Frederic Weisbecker).
 - Add summary of terms to documentation      (Frederic Weisbecker).
 - Examples for compute and one-shot modes    (Thomas G/Christoph L).

v3:
 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
-  Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).
- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

 Documentation/userspace-api/task_isolation.rst |  370 +++++++++++++++++++++++++
 samples/Kconfig                                |    7 
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |   11 
 samples/task_isolation/task_isol.c             |   92 ++++++
 samples/task_isolation/task_isol.h             |    9 
 samples/task_isolation/task_isol_computation.c |   89 ++++++
 samples/task_isolation/task_isol_oneshot.c     |  104 +++++++
 samples/task_isolation/task_isol_userloop.c    |   54 +++
 9 files changed, 737 insertions(+)

Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,379 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Task isolation prctl interface
+===============================
+
+Certain types of applications benefit from running uninterrupted by
+background OS activities. Realtime systems and high-bandwidth networking
+applications with user-space drivers can fall into the category.
+
+To create an OS noise free environment for the application, this
+interface allows userspace to inform the kernel the start and
+end of the latency sensitive application section (with configurable
+system behaviour for that section).
+
+Note: the prctl interface is independent of nohz_full=.
+
+The prctl options are:
+
+
+        - PR_ISOL_FEAT_GET: Retrieve supported features.
+        - PR_ISOL_CFG_GET: Retrieve task isolation configuration.
+        - PR_ISOL_CFG_SET: Set task isolation configuration.
+        - PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
+        - PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
+
+Summary of terms:
+
+
+- feature:
+
+        A distinct attribute or aspect of task isolation. Examples of
+        features could be logging, new operating modes (eg: syscalls disallowed),
+        userspace notifications, etc. The only feature currently available is quiescing.
+
+- configuration:
+
+        A specific choice from a given set
+        of possible choices that dictate how the particular feature
+        in question should behave.
+
+- activation state:
+
+        The activation state (whether active/inactive) of the task
+        isolation features (features must be configured before
+        being activated).
+
+Inheritance of the isolation parameters and state, across
+fork(2) and clone(2), can be changed via
+PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
+
+
+At a high-level, task isolation is divided in two steps:
+
+1. Configuration.
+2. Activation.
+
+Section "Userspace support" describes how to use
+task isolation.
+
+In terms of the interface, the sequence of steps to activate
+task isolation are:
+
+1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
+2. Configure task isolation features (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
+3. Activate or deactivate task isolation features (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
+
+This interface is based on ideas and code from the
+task isolation patchset from Alex Belits:
+https://lwn.net/Articles/816298/
+
+Note: if the need arises to configure an individual quiesce feature
+with its own extensible structure, please add ISOL_F_QUIESCE_ONE
+to PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on
+multiple features per syscall currently).
+
+--------------------
+Feature description
+--------------------
+
+        - ``ISOL_F_QUIESCE``
+
+        This feature allows quiescing selected kernel activities on
+        return from system calls.
+
+---------------------
+Interface description
+---------------------
+
+**PR_ISOL_FEAT**:
+
+        Returns the supported features and feature
+        capabilities, as a bitmask::
+
+                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies whether to return
+        supported features (if zero), or feature capabilities
+        (if not zero). Possible values for 'feat' are:
+
+
+        - ``0``:
+               Return the bitmask of supported features, in the location
+               pointed  to  by  ``(int *)arg3``. The buffer should allow space
+               for 8 bytes.
+
+        - ``ISOL_F_QUIESCE``:
+
+               Return a structure containing which kernel
+               activities are supported for quiescing, in the location
+               pointed to by ``(int *)arg3``::
+
+                        struct task_isol_quiesce_extensions {
+                                __u64 flags;
+                                __u64 supported_quiesce_bits;
+                                __u64 pad[6];
+                        };
+
+               Where:
+
+               *flags*: Additional flags (should be zero).
+
+               *supported_quiesce_bits*: Bitmask indicating
+                which features are supported for quiescing.
+
+               *pad*: Additional space for future enhancements.
+
+
+        Features and its capabilities are defined at
+        include/uapi/linux/task_isolation.h.
+
+**PR_ISOL_CFG_GET**:
+
+        Retrieve task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Return configuration of task isolation features. The 'arg3' argument specifies
+                whether to return configured features (if zero), or individual
+                feature configuration (if not zero), as follows.
+
+                - ``0``:
+
+                        Return the bitmask of configured features, in the location
+                        pointed  to  by  ``(int *)arg4``. The buffer should allow space
+                        for 8 bytes.
+
+                - ``ISOL_F_QUIESCE``:
+
+                        If arg4 is QUIESCE_CONTROL, return the control structure for
+                        quiescing of background kernel activities, in the location
+                        pointed to by ``(int *)arg5``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 quiesce_oneshot_mask;
+                                __u64 pad[5];
+                         };
+
+                        See PR_ISOL_CFG_SET description for meaning of fields.
+
+        - ``I_CFG_INHERIT``:
+
+                Retrieve inheritance configuration across fork/clone.
+
+                Return the structure which configures inheritance
+                across fork/clone, in the location pointed to
+                by ``(int *)arg4``::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                See PR_ISOL_CFG_SET description for meaning of fields.
+
+**PR_ISOL_CFG_SET**:
+
+        Set task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Set configuration of task isolation features. 'arg3' specifies
+                the feature. Possible values are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                        If arg4 is QUIESCE_CONTROL, set the control structure
+                        for quiescing of background kernel activities, from
+                        the location pointed to by ``(int *)arg5``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 quiesce_oneshot_mask;
+                                __u64 pad[5];
+                         };
+
+                        Where:
+
+                        *flags*: Additional flags (should be zero).
+
+                        *quiesce_mask*: A bitmask containing which kernel
+                        activities to quiesce.
+
+                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
+                        activities should behave in oneshot mode, that is, quiescing
+                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
+                        on return of subsequent system calls. The corresponding bit(s)
+                        must also be set at quiesce_mask.
+
+                        *pad*: Additional space for future enhancements.
+
+                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
+
+                        - ``ISOL_F_QUIESCE_VMSTATS``
+
+                        VM statistics are maintained in per-CPU counters to
+                        improve performance. When a CPU modifies a VM statistic,
+                        this modification is kept in the per-CPU counter.
+                        Certain activities require a global count, which
+                        involves requesting each CPU to flush its local counters
+                        to the global VM counters.
+
+                        This flush is implemented via a workqueue item, which
+                        might schedule a workqueue on isolated CPUs.
+
+                        To avoid this interruption, task isolation can be
+                        configured to, upon return from system calls, synchronize
+                        the per-CPU counters to global counters, thus avoiding
+                        the interruption.
+
+        - ``I_CFG_INHERIT``:
+                Set inheritance configuration when a new task
+                is created via fork and clone.
+
+                The ``(int *)arg4`` argument is a pointer to::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                inherit_mask is a bitmask that specifies which part
+                of task isolation should be inherited:
+
+                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
+                  This is the state written via prctl(PR_ISOL_CFG_SET, ...).
+
+                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
+                  (requires ISOL_INHERIT_CONF to be set). The new task
+                  should behave, after fork/clone, in the same manner
+                  as the parent task after it executed:
+
+                        prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);
+
+                Note: the inheritance propagates to all the descendants and not
+                just the immediate children, unless the inheritance is explicitly
+                reconfigured by some children.
+
+**PR_ISOL_ACTIVATE_GET**:
+
+        Retrieve task isolation activation state.
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);
+
+        'pmask' specifies the location of a feature mask, where
+        the current active mask will be copied. See PR_ISOL_ACTIVATE_SET
+        for description of individual bits.
+
+
+**PR_ISOL_ACTIVATE_SET**:
+
+        Set task isolation activation state (activates/deactivates
+        task isolation).
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);
+
+
+        The 'pmask' argument specifies the location of an 8 byte mask
+        containing which features should be activated. Features whose
+        bits are cleared will be deactivated. The possible
+        bits for this mask are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                Activate quiescing of background kernel activities.
+                Quiescing happens on return to userspace from this
+                system call, and on return from subsequent
+                system calls (unless quiesce_oneshot_mask has been set at
+                PR_ISOL_CFG_SET time).
+
+        Quiescing can be adjusted (while active) by
+        prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).
+
+
+==================
+Userspace support
+==================
+
+Task isolation is divided in two main steps: configuration and activation.
+
+Each step can be performed by an external tool or the latency sensitive
+application itself. util-linux contains the "chisol" tool for this
+purpose.
+
+This results in three combinations:
+
+1. Both configuration and activation performed by the
+latency sensitive application.
+Allows fine grained control of what task isolation
+features are enabled and when (see samples section below).
+
+2. Only activation can be performed by the latency sensitive app
+(and configuration performed by chisol).
+This allows the admin/user to control task isolation parameters,
+and applications have to be modified only once.
+
+3. Configuration and activation performed by an external tool.
+This allows unmodified applications to take advantage of
+task isolation. Activation is performed by the "-a" option
+of chisol.
+
+========
+Examples
+========
+
+The ``samples/task_isolation/`` directory contains 3 examples:
+
+* task_isol_userloop.c:
+
+        Example of program with a loop on userspace scenario.
+
+* task_isol_computation.c:
+
+        Example of program that enters task isolated mode,
+        performs an amount of computation, exits task
+        isolated mode, and writes the computation to disk.
+
+* task_isol_oneshot.c:
+
+        Example of program that enables one-shot
+        mode for quiescing, enters a processing loop, then upon an external
+        event performs a number of syscalls to handle that event.
+
+This is a snippet of code to activate task isolation if
+it has been previously configured (by chisol for example)::
+
+        #include <sys/prctl.h>
+        #include <linux/types.h>
+
+        #ifdef PR_ISOL_CFG_GET
+        unsigned long long fmask;
+
+        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+        if (ret != -1 && fmask != 0) {
+                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
+                if (ret == -1) {
+                        perror("prctl PR_ISOL_ACTIVATE_SET");
+                        return ret;
+                }
+        }
+        #endif
+
Index: linux-2.6/samples/task_isolation/task_isol.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include <errno.h>
+#include "task_isol.h"
+
+#ifdef PR_ISOL_FEAT_GET
+int task_isol_setup(int oneshot)
+{
+	int ret;
+	int errnosv;
+	unsigned long long fmask;
+	struct task_isol_quiesce_extensions qext;
+	struct task_isol_quiesce_control qctrl;
+
+	/* Retrieve supported task isolation features */
+	ret = prctl(PR_ISOL_FEAT_GET, 0, &fmask, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT");
+		return ret;
+	}
+	printf("supported features bitmask: 0x%llx\n", fmask);
+
+	/* Retrieve supported ISOL_F_QUIESCE bits */
+	ret = prctl(PR_ISOL_FEAT_GET, ISOL_F_QUIESCE, &qext, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
+		return ret;
+	}
+	printf("supported ISOL_F_QUIESCE bits: 0x%llx\n",
+		qext.supported_quiesce_bits);
+
+	fmask = 0;
+	ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+	errnosv = errno;
+	if (ret != -1 && fmask != 0) {
+		printf("Task isolation parameters already configured!\n");
+		return ret;
+	}
+	if (ret == -1 && errnosv != ENODATA) {
+		perror("prctl PR_ISOL_GET");
+		return ret;
+	}
+	memset(&qctrl, 0, sizeof(struct task_isol_quiesce_control));
+	qctrl.quiesce_mask = ISOL_F_QUIESCE_VMSTATS;
+	if (oneshot)
+		qctrl.quiesce_oneshot_mask = ISOL_F_QUIESCE_VMSTATS;
+
+	ret = prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,
+		    QUIESCE_CONTROL, &qctrl);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CFG_SET");
+		return ret;
+	}
+	return ISOL_F_QUIESCE;
+}
+
+int task_isol_activate_set(unsigned long long mask)
+{
+	int ret;
+
+	ret = prctl(PR_ISOL_ACTIVATE_SET, &mask, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_ACTIVATE_SET");
+		return -1;
+	}
+
+	return 0;
+}
+
+#else
+
+int task_isol_setup(void)
+{
+	return 0;
+}
+
+int task_isol_activate_set(unsigned long long mask)
+{
+	return 0;
+}
+#endif
+
+
Index: linux-2.6/samples/task_isolation/task_isol.h
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TASK_ISOL_H
+#define __TASK_ISOL_H
+
+int task_isol_setup(int oneshot);
+
+int task_isol_activate_set(unsigned long long mask);
+
+#endif /* __TASK_ISOL_H */
Index: linux-2.6/samples/task_isolation/task_isol_userloop.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_userloop.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(0);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+	/* enable quiescing on system call return, oneshot */
+	ret = task_isol_activate_set(mask);
+	if (ret)
+		return EXIT_FAILURE;
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+	/* busy loop */
+	while (ret < NR_LOOPS)  {
+		memset(buf, 0, 4096);
+		ret = ret+1;
+		if (!(ret % NR_PRINT))
+			printf("loops=%d of %d\n", ret, NR_LOOPS);
+	}
+
+
+	ret = task_isol_activate_set(mask & ~ISOL_F_QUIESCE);
+	if (ret)
+		return EXIT_FAILURE;
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/Kconfig
===================================================================
--- linux-2.6.orig/samples/Kconfig
+++ linux-2.6/samples/Kconfig
@@ -250,6 +250,13 @@ config SAMPLE_CORESIGHT_SYSCFG
 	  This demonstrates how a user may create their own CoreSight
 	  configurations and easily load them into the system at runtime.
 
+config SAMPLE_TASK_ISOLATION
+	bool "task isolation sample"
+	depends on CC_CAN_LINK && HEADERS_INSTALL
+	help
+	  Build example userspace program to use prctl task isolation
+	  interface.
+
 endif # SAMPLES
 
 config HAVE_SAMPLE_FTRACE_DIRECT
Index: linux-2.6/samples/Makefile
===================================================================
--- linux-2.6.orig/samples/Makefile
+++ linux-2.6/samples/Makefile
@@ -33,3 +33,4 @@ subdir-$(CONFIG_SAMPLE_WATCHDOG)	+= watc
 subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
 obj-$(CONFIG_SAMPLE_CORESIGHT_SYSCFG)	+= coresight/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION)	+= task_isolation
Index: linux-2.6/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+userprogs-always-y += task_isol_userloop task_isol_computation task_isol_oneshot
+task_isol_userloop-objs := task_isol.o task_isol_userloop.o
+task_isol_computation-objs := task_isol.o task_isol_computation.o
+task_isol_oneshot-objs := task_isol.o task_isol_oneshot.o
+
+userccflags += -I usr/include
+
+
+#$(CC) $^ -o $@
Index: linux-2.6/samples/task_isolation/task_isol_computation.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_computation.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface with a loop:
+ *
+ *	do {
+ *		enable quiescing of kernel activities
+ *		perform computation
+ *		disable quiescing of kernel activities
+ *		write computation results to disk
+ *	} while (condition);
+ *
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd, write_loops;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	fd = open("/tmp/comp_output.data", O_RDWR|O_CREAT);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(0);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+	write_loops = 0;
+	do {
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+		/* enable quiescing on system call return */
+		ret = task_isol_activate_set(mask);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/* busy loop */
+		while (ret < NR_LOOPS)  {
+			memset(buf, 0xf, 4096);
+			ret = ret+1;
+			if (!(ret % NR_PRINT))
+				printf("wloop=%d loops=%d of %d\n", write_loops,
+					ret, NR_LOOPS);
+		}
+		/* disable quiescing on system call return */
+		ret = task_isol_activate_set(mask & ~ISOL_F_QUIESCE);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/*
+		 * write computed data to disk, this would be
+		 * multiple writes on a real application, so
+		 * disabling quiescing is advantageous
+		 */
+		ret = write(fd, buf, 4096);
+		if (ret == -1) {
+			perror("write");
+			return EXIT_FAILURE;
+		}
+
+		write_loops += 1;
+	} while (write_loops < 5);
+
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/task_isolation/task_isol_oneshot.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_oneshot.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface using
+ * oneshot mode for quiescing.
+ *
+ *
+ *      enable oneshot quiescing of kernel activities
+ *	do {
+ *		process data (no system calls)
+ *		if (event) {
+ *			process event with syscalls
+ *			enable oneshot quiescing of kernel activities
+ *		}
+ *	} while (!exit_condition);
+ *
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd, cnt;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	fd = open("/dev/zero", O_RDONLY);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(1);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+
+	/* enable quiescing on system call return, oneshot */
+	ret = task_isol_activate_set(mask);
+	if (ret)
+		return EXIT_FAILURE;
+	/* busy loop */
+	cnt = 0;
+	while (cnt < NR_LOOPS)  {
+		memset(buf, 0xf, 4096);
+		cnt = cnt+1;
+		if (!(cnt % NR_PRINT)) {
+			int i, r;
+
+			/* this could be considered handling an external
+			 * event: with one-shot mode, system calls
+			 * after prctl(PR_SET_ACTIVATE) will not incur
+			 * the penalty of quiescing
+			 */
+			printf("loops=%d of %d\n", cnt, NR_LOOPS);
+			for (i = 0; i < 100; i++) {
+				r = read(fd, buf, 4096);
+				if (r == -1) {
+					perror("read");
+					return EXIT_FAILURE;
+				}
+			}
+
+			ret = munlock(buf, 4096);
+			if (ret) {
+				perror("munlock");
+				return EXIT_FAILURE;
+			}
+
+			ret = mlock(buf, 4096);
+			if (ret) {
+				perror("mlock");
+				return EXIT_FAILURE;
+			}
+
+			/* enable quiescing on system call return */
+			ret = task_isol_activate_set(mask);
+			if (ret)
+				return EXIT_FAILURE;
+		}
+	}
+
+	return EXIT_SUCCESS;
+}
+



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 05/13] task isolation: sync vmstats on return to userspace
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-04-25 23:06   ` Thomas Gleixner
  2022-04-27  6:56   ` Thomas Gleixner
  2022-03-15 15:31 ` [patch v12 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, use the task isolation prctl interface to quiesce 
deferred actions when returning to userspace.

This patch adds hooks to fork and exit code paths.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v12: set TIF_TASK_ISOL only when necessary (Frederic)
v11: fold patch to add task_isol_exit hooks (Frederic)
     Use _TIF_TASK_ISOL bit on thread flags (Frederic)
     
v6: modify exit_to_user_mode_loop to cover exceptions and interrupts
v5: no changes
v4: add oneshot mode support

 include/linux/task_isolation.h |   16 ++++++++++++++++
 include/linux/vmstat.h         |    8 ++++++++
 kernel/entry/common.c          |   15 +++++++++++----
 kernel/task_isolation.c        |   21 +++++++++++++++++++++
 mm/vmstat.c                    |   21 +++++++++++++++++++++
 5 files changed, 77 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -27,6 +27,13 @@ static inline void task_isol_free(struct
 		__task_isol_free(tsk);
 }
 
+void __task_isol_exit(struct task_struct *tsk);
+static inline void task_isol_exit(struct task_struct *tsk)
+{
+	if (tsk->task_isol_info)
+		__task_isol_exit(tsk);
+}
+
 int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
 			     unsigned long arg4, unsigned long arg5);
 int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
@@ -40,12 +47,22 @@ int prctl_task_isol_activate_set(unsigne
 
 int __copy_task_isol(struct task_struct *tsk);
 
+void task_isol_exit_to_user_mode(void);
+
 #else
 
+static inline void task_isol_exit_to_user_mode(void)
+{
+}
+
 static inline void task_isol_free(struct task_struct *tsk)
 {
 }
 
+static inline void task_isol_exit(struct task_struct *tsk)
+{
+}
+
 static inline int prctl_task_isol_feat_get(unsigned long arg2,
 					   unsigned long arg3,
 					   unsigned long arg4,
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+#if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
+void sync_vmstat(void);
+#else
+static inline void sync_vmstat(void)
+{
+}
+#endif
+
 struct reclaim_stat {
 	unsigned nr_dirty;
 	unsigned nr_unqueued_dirty;
Index: linux-2.6/kernel/entry/common.c
===================================================================
--- linux-2.6.orig/kernel/entry/common.c
+++ linux-2.6/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/task_isolation.h>
 
 #include "common.h"
 
@@ -174,6 +175,9 @@ static unsigned long exit_to_user_mode_l
 		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(regs);
 
+		if (ti_work & _TIF_TASK_ISOL)
+			task_isol_exit_to_user_mode();
+
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -18,6 +18,12 @@
 #include <linux/sysfs.h>
 #include <linux/init.h>
 #include <linux/sched/task.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+
+void __task_isol_exit(struct task_struct *tsk)
+{
+}
 
 void __task_isol_free(struct task_struct *tsk)
 {
@@ -251,6 +257,11 @@ static int cfg_feat_quiesce_set(unsigned
 	info->quiesce_mask = i_qctrl->quiesce_mask;
 	info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
 	info->conf_mask |= ISOL_F_QUIESCE;
+
+	if ((info->active_mask & ISOL_F_QUIESCE) &&
+	    (info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS))
+		set_thread_flag(TIF_TASK_ISOL);
+
 	ret = 0;
 
 out_free:
@@ -303,6 +314,7 @@ int __copy_task_isol(struct task_struct
 		new_info->active_mask = info->active_mask;
 
 	tsk->task_isol_info = new_info;
+	set_ti_thread_flag(task_thread_info(tsk), TIF_TASK_ISOL);
 
 	return 0;
 }
@@ -330,6 +342,10 @@ int prctl_task_isol_activate_set(unsigne
 	info->active_mask = active_mask;
 	ret = 0;
 
+	if ((info->active_mask & ISOL_F_QUIESCE) &&
+	    (info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS))
+		set_thread_flag(TIF_TASK_ISOL);
+
 out:
 	return ret;
 }
@@ -349,3 +365,24 @@ int prctl_task_isol_activate_get(unsigne
 
 	return 0;
 }
+
+void task_isol_exit_to_user_mode(void)
+{
+	struct task_isol_info *i;
+
+	clear_thread_flag(TIF_TASK_ISOL);
+
+	i = current->task_isol_info;
+	if (!i)
+		return;
+
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return;
+
+	if (i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS) {
+		sync_vmstat();
+		if (i->oneshot_mask & ISOL_F_QUIESCE_VMSTATS)
+			i->quiesce_mask &= ~ISOL_F_QUIESCE_VMSTATS;
+	}
+}
+EXPORT_SYMBOL_GPL(task_isol_exit_to_user_mode);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -2018,6 +2018,29 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void sync_vmstat(void)
+{
+	int cpu;
+
+	cpu = get_cpu();
+
+	refresh_cpu_vm_stats(false);
+	put_cpu();
+
+	/*
+	 * If task is migrated to another CPU between put_cpu
+	 * and cancel_delayed_work_sync, the code below might
+	 * cancel vmstat_update work for a different cpu
+	 * (than the one from which the vmstats were flushed).
+	 *
+	 * However, vmstat shepherd will re-enable it later,
+	 * so its harmless.
+	 */
+	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+#endif
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;
Index: linux-2.6/include/linux/entry-common.h
===================================================================
--- linux-2.6.orig/include/linux/entry-common.h
+++ linux-2.6/include/linux/entry-common.h
@@ -60,7 +60,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_TASK_ISOL | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/compat.h>
 #include <linux/io_uring.h>
 #include <linux/kprobes.h>
+#include <linux/task_isolation.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -759,6 +760,7 @@ void __noreturn do_exit(long code)
 	validate_creds_for_do_exit(tsk);
 
 	io_uring_files_cancel();
+	task_isol_exit(tsk);
 	exit_signals(tsk);  /* sets PF_EXITING */
 
 	/* sync mm's RSS info before statistics gathering */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -2427,6 +2427,7 @@ bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
 bad_fork_cleanup_task_isol:
+	task_isol_exit(p);
 	task_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 06/13] procfs: add per-pid task isolation state
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-04-25 23:27   ` Thomas Gleixner
  2022-03-15 15:31 ` [patch v12 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

Add /proc/pid/task_isolation file, to query the state of
task isolation configuration.

---
 fs/proc/base.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -97,6 +97,8 @@
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
 #include <linux/cn_proc.h>
+#include <linux/prctl.h>
+#include <linux/task_isolation.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -665,6 +667,69 @@ static int proc_pid_syscall(struct seq_f
 }
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
+#ifdef CONFIG_TASK_ISOLATION
+
+struct qoptions {
+	unsigned long mask;
+	char *name;
+};
+
+static struct qoptions iopts[] = {
+	{ISOL_F_QUIESCE, "quiesce"},
+};
+#define ILEN (sizeof(iopts) / sizeof(struct qoptions))
+
+static struct qoptions qopts[] = {
+	{ISOL_F_QUIESCE_VMSTATS, "vmstat_sync"},
+};
+#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
+
+static void show_isolation_state(struct seq_file *m,
+				 struct qoptions *iopt,
+				 int mask,
+				 const char *hdr)
+{
+	int i;
+
+	seq_printf(m, hdr);
+	for (i = 0; i < ILEN; i++) {
+		if (mask & iopt->mask)
+			seq_printf(m, "%s ", iopt->name);
+		iopt++;
+	}
+	if (mask == 0)
+		seq_printf(m, "none ");
+	seq_printf(m, "\n");
+}
+
+int proc_pid_task_isolation(struct seq_file *m, struct pid_namespace *ns,
+			    struct pid *pid, struct task_struct *t)
+{
+	int active_mask, quiesce_mask, conf_mask;
+	struct task_isol_info *task_isol_info;
+	struct inode *inode = m->private;
+	struct task_struct *task = get_proc_task(inode);
+
+	task_isol_info = t->task_isol_info;
+	if (!task_isol_info)
+		active_mask = quiesce_mask = conf_mask = 0;
+	else {
+		active_mask = task_isol_info->active_mask;
+		quiesce_mask = task_isol_info->quiesce_mask;
+		conf_mask = task_isol_info->conf_mask;
+	}
+
+	show_isolation_state(m, iopts, conf_mask, "Configured state: ");
+	show_isolation_state(m, iopts, active_mask, "Active state: ");
+	show_isolation_state(m, qopts, quiesce_mask, "Quiescing: ");
+
+	put_task_struct(task);
+
+	return 0;
+}
+
+#endif /* CONFIG_TASK_ISOLATION */
+
 /************************************************************************/
 /*                       Here the fs part begins                        */
 /************************************************************************/
@@ -3286,6 +3351,9 @@ static const struct pid_entry tgid_base_
 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
 	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	ONE("task_isolation", S_IRUSR, proc_pid_task_isolation),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 07/13] task isolation: sync vmstats conditional on changes
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-03-17 14:51   ` Frederic Weisbecker
  2022-04-27  8:03   ` Thomas Gleixner
  2022-03-15 15:31 ` [patch v12 08/13] task isolation: enable return to userspace processing Marcelo Tosatti
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Rather than syncing VM-stats on every return to userspace
(or VM-entry), keep track of changes through a per-CPU bool.

This improves performance when enabling task isolated
for vcpu VMs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v11:
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace     

 include/linux/vmstat.h |   13 ++++++++++++-
 mm/vmstat.c            |   29 ++++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -22,7 +22,18 @@ int sysctl_vm_numa_stat_handler(struct c
 #endif
 
 #if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
-void sync_vmstat(void);
+DECLARE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+extern struct static_key vmstat_sync_enabled;
+
+void __sync_vmstat(void);
+static inline void sync_vmstat(void)
+{
+	if (static_key_false(&vmstat_sync_enabled))
+		__sync_vmstat();
+}
+
+void init_sync_vmstat(void);
 #else
 static inline void sync_vmstat(void)
 {
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -334,6 +334,31 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+struct static_key vmstat_sync_enabled;
+DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void mark_vmstat_dirty(void)
+{
+	if (!static_key_false(&vmstat_sync_enabled))
+		return;
+
+	raw_cpu_write(vmstat_dirty, true);
+	set_thread_flag(TIF_TASK_ISOL);
+}
+
+void init_sync_vmstat(void)
+{
+	raw_cpu_write(vmstat_dirty, true);
+	set_thread_flag(TIF_TASK_ISOL);
+}
+EXPORT_SYMBOL_GPL(vmstat_dirty);
+#else
+static inline void mark_vmstat_dirty(void)
+{
+}
+#endif
+
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -366,6 +391,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -404,6 +430,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -602,6 +629,7 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	mark_vmstat_dirty();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -670,6 +698,7 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	mark_vmstat_dirty();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -1087,6 +1116,7 @@ static void fill_contig_page_info(struct
 			info->free_blocks_suitable += blocks <<
 						(order - suitable_order);
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -1443,6 +1473,7 @@ static void walk_zones_in_node(struct se
 		if (!nolock)
 			spin_unlock_irqrestore(&zone->lock, flags);
 	}
+	mark_vmstat_dirty();
 }
 #endif
 
@@ -1512,6 +1543,7 @@ static void pagetypeinfo_showfree_print(
 		}
 		seq_putc(m, '\n');
 	}
+	mark_vmstat_dirty();
 }
 
 /* Print out the free pages at each order for each migatetype */
@@ -1932,6 +1964,7 @@ static void vmstat_update(struct work_st
 				this_cpu_ptr(&vmstat_work),
 				round_jiffies_relative(sysctl_stat_interval));
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -2019,13 +2052,14 @@ static void vmstat_shepherd(struct work_
 }
 
 #ifdef CONFIG_TASK_ISOLATION
-void sync_vmstat(void)
+void __sync_vmstat(void)
 {
 	int cpu;
 
 	cpu = get_cpu();
 
 	refresh_cpu_vm_stats(false);
+	raw_cpu_write(vmstat_dirty, false);
 	put_cpu();
 
 	/*



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 08/13] task isolation: enable return to userspace processing
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Enable processing of pending task isolation work if per-CPU vmstats
are out of sync with global vmstats.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

v11:
   Fold patch to add task_isol_exit hooks into
  "sync vmstats on return to userspace" patch.          (Frederic W. Weisbecker)

 include/linux/task_isolation.h |   22 +++++++++++++++++++++-
 kernel/task_isolation.c        |   13 +++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -10,7 +10,6 @@
  */
 
 #include <linux/sched.h>
-#include <linux/task_isolation.h>
 #include <linux/prctl.h>
 #include <linux/slab.h>
 #include <linux/kobject.h>
@@ -20,9 +19,17 @@
 #include <linux/sched/task.h>
 #include <linux/mm.h>
 #include <linux/vmstat.h>
+#include <linux/task_isolation.h>
 
 void __task_isol_exit(struct task_struct *tsk)
 {
+	struct task_isol_info *i;
+
+	i = tsk->task_isol_info;
+	if (!i)
+		return;
+
+	static_key_slow_dec(&vmstat_sync_enabled);
 }
 
 void __task_isol_free(struct task_struct *tsk)
@@ -41,6 +48,12 @@ static struct task_isol_info *task_isol_
 	if (unlikely(!info))
 		return ERR_PTR(-ENOMEM);
 
+	preempt_disable();
+	init_sync_vmstat();
+	preempt_enable();
+
+	static_key_slow_inc(&vmstat_sync_enabled);
+
 	return info;
 }
 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 08/13] task isolation: enable return to userspace processing Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-03-16  2:41   ` Oscar Shiang
  2022-04-27  7:11   ` Thomas Gleixner
  2022-03-15 15:31 ` [patch v12 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

If a thread has task isolation activated, is preempted by thread B,
which marks vmstat information dirty, and is preempted back in,
one might return to userspace with vmstat dirty information on the 
CPU in question.

To address this problem, add a preempt notifier that transfers vmstat dirty
information to TIF_TASK_ISOL thread flag.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

v12: 
 - switch from raw_cpu_read to __this_cpu_read (Frederic)

---
 include/linux/task_isolation.h |    2 ++
 include/linux/vmstat.h         |    6 ++++++
 kernel/task_isolation.c        |   23 +++++++++++++++++++++++
 mm/vmstat.c                    |    7 +++++++
 4 files changed, 38 insertions(+)

Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -19,6 +19,7 @@
 #include <linux/sched/task.h>
 #include <linux/mm.h>
 #include <linux/vmstat.h>
+#include <linux/preempt.h>
 #include <linux/task_isolation.h>
 
 void __task_isol_exit(struct task_struct *tsk)
@@ -30,6 +31,9 @@ void __task_isol_exit(struct task_struct
 		return;
 
 	static_key_slow_dec(&vmstat_sync_enabled);
+
+	preempt_notifier_unregister(&i->preempt_notifier);
+	preempt_notifier_dec();
 }
 
 void __task_isol_free(struct task_struct *tsk)
@@ -40,6 +44,21 @@ void __task_isol_free(struct task_struct
 	tsk->task_isol_info = NULL;
 }
 
+static void task_isol_sched_in(struct preempt_notifier *pn, int cpu)
+{
+	vmstat_dirty_to_thread_flag();
+}
+
+static void task_isol_sched_out(struct preempt_notifier *pn,
+				struct task_struct *next)
+{
+}
+
+static __read_mostly struct preempt_ops task_isol_preempt_ops = {
+	.sched_in	= task_isol_sched_in,
+	.sched_out	= task_isol_sched_out,
+};
+
 static struct task_isol_info *task_isol_alloc_context(void)
 {
 	struct task_isol_info *info;
@@ -48,6 +67,10 @@ static struct task_isol_info *task_isol_
 	if (unlikely(!info))
 		return ERR_PTR(-ENOMEM);
 
+	preempt_notifier_inc();
+	preempt_notifier_init(&info->preempt_notifier, &task_isol_preempt_ops);
+	preempt_notifier_register(&info->preempt_notifier);
+
 	preempt_disable();
 	init_sync_vmstat();
 	preempt_enable();
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -17,6 +17,8 @@ struct task_isol_info {
 	u64 oneshot_mask;
 
 	u8 inherit_mask;
+
+	struct preempt_notifier preempt_notifier;
 };
 
 extern void __task_isol_free(struct task_struct *tsk);
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -34,10 +34,16 @@ static inline void sync_vmstat(void)
 }
 
 void init_sync_vmstat(void);
+
+void vmstat_dirty_to_thread_flag(void);
 #else
 static inline void sync_vmstat(void)
 {
 }
+
+static inline void vmstat_dirty_to_thread_flag(void)
+{
+}
 #endif
 
 struct reclaim_stat {
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -353,6 +353,13 @@ void init_sync_vmstat(void)
 	set_thread_flag(TIF_TASK_ISOL);
 }
 EXPORT_SYMBOL_GPL(vmstat_dirty);
+
+void vmstat_dirty_to_thread_flag(void)
+{
+	if (__this_cpu_read(vmstat_dirty) == true)
+		set_thread_flag(TIF_TASK_ISOL);
+}
+EXPORT_SYMBOL_GPL(vmstat_dirty_to_thread_flag);
 #else
 static inline void mark_vmstat_dirty(void)
 {



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 10/13] KVM: x86: process isolation work from VM-entry code path
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 11/13] mm: vmstat: move need_update Marcelo Tosatti
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

VM-entry code path is an entry point similar to userspace return
when task isolation is concerned.

Call isolation_exit_to_user_mode before VM-enter.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v12
- Add missing kvm-entry.h change			(Frederic)

v11
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace                                   (Frederic W. Weisbecker)

 include/linux/entry-kvm.h |    4 +++-
 kernel/entry/kvm.c        |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/entry/kvm.c
===================================================================
--- linux-2.6.orig/kernel/entry/kvm.c
+++ linux-2.6/kernel/entry/kvm.c
@@ -2,6 +2,7 @@
 
 #include <linux/entry-kvm.h>
 #include <linux/kvm_host.h>
+#include <linux/task_isolation.h>
 
 static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 {
@@ -22,6 +23,9 @@ static int xfer_to_guest_mode_work(struc
 		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(NULL);
 
+		if (ti_work & _TIF_TASK_ISOL)
+			task_isol_exit_to_user_mode();
+
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)
 			return ret;
Index: linux-2.6/include/linux/entry-kvm.h
===================================================================
--- linux-2.6.orig/include/linux/entry-kvm.h
+++ linux-2.6/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
 
 #define XFER_TO_GUEST_MODE_WORK						\
 	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
-	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+	 _TIF_NOTIFY_RESUME | _TIF_TASK_ISOL | ARCH_XFER_TO_GUEST_MODE_WORK)
 
 struct kvm_vcpu;
 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 11/13] mm: vmstat: move need_update
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-03-15 15:31 ` [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Move need_update() function up in vmstat.c, needed by next patch.
No code changes.

Remove a duplicate comment while at it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   58 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1894,6 +1894,35 @@ static const struct seq_operations vmsta
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	pg_data_t *last_pgdat = NULL;
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		struct per_cpu_nodestat *n;
+
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 */
+		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
+			return true;
+
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
+		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
+			return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1975,35 +2004,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 11/13] mm: vmstat: move need_update Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-04-27  7:23   ` Thomas Gleixner
  2022-03-15 15:31 ` [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

It is not necessary to queue work item to run refresh_vm_stats 
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes sosreport hang (which uses vmstat_refresh) with 
spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 44 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1924,6 +1924,31 @@ static bool need_update(int cpu)
 }
 
 #ifdef CONFIG_PROC_FS
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pages *pcp;
+
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		if (!pcp->count)
+			continue;
+
+		if (!pcp->expire)
+			continue;
+
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+#endif
+
+	return false;
+}
+
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
@@ -1933,8 +1958,12 @@ int vmstat_refresh(struct ctl_table *tab
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1948,9 +1977,19 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, refresh_vm_stats);
+		if (need_update(cpu) || need_drain_remote_zones(cpu))
+			schedule_work_on(cpu, work);
+	}
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(works, cpu));
+	cpus_read_unlock();
+	free_percpu(works);
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (11 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
@ 2022-03-15 15:31 ` Marcelo Tosatti
  2022-04-27  7:45   ` Thomas Gleixner
  2022-03-17 15:08 ` [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Frederic Weisbecker
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-03-15 15:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

This avoids processing of TIF_TASK_ISOL, when returning to userspace,
for tasks which do not have task isolation configured.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -5,6 +5,8 @@
 
 #ifdef CONFIG_TASK_ISOLATION
 
+#include <uapi/linux/prctl.h>
+
 struct task_isol_info {
 	/* Which features have been configured */
 	u64 conf_mask;
@@ -51,6 +53,24 @@ int __copy_task_isol(struct task_struct
 
 void task_isol_exit_to_user_mode(void);
 
+static inline bool task_isol_quiesce_activated(struct task_struct *tsk,
+					       u64 quiesce_mask)
+{
+	struct task_isol_info *i;
+
+	i = tsk->task_isol_info;
+	if (!i)
+		return false;
+
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return false;
+
+	if ((i->quiesce_mask & quiesce_mask) == quiesce_mask)
+		return true;
+
+	return false;
+}
+
 #else
 
 static inline void task_isol_exit_to_user_mode(void)
@@ -105,6 +125,12 @@ static inline int prctl_task_isol_activa
 	return -EOPNOTSUPP;
 }
 
+static inline bool task_isol_quiesce_activated(struct task_struct *tsk,
+					       u64 quiesce_mask)
+{
+	return false;
+}
+
 #endif /* CONFIG_TASK_ISOLATION */
 
 #endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/task_isolation.h>
 
 #include "internal.h"
 
@@ -344,7 +345,8 @@ static inline void mark_vmstat_dirty(voi
 		return;
 
 	raw_cpu_write(vmstat_dirty, true);
-	set_thread_flag(TIF_TASK_ISOL);
+	if (task_isol_quiesce_activated(current, ISOL_F_QUIESCE_VMSTATS))
+		set_thread_flag(TIF_TASK_ISOL);
 }
 
 void init_sync_vmstat(void)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-03-15 15:31 ` [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
@ 2022-03-16  2:41   ` Oscar Shiang
  2022-04-27  7:11   ` Thomas Gleixner
  1 sibling, 0 replies; 44+ messages in thread
From: Oscar Shiang @ 2022-03-16  2:41 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira

On Mar 15, 2022, at 11:31 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> If a thread has task isolation activated, is preempted by thread B,
> which marks vmstat information dirty, and is preempted back in,
> one might return to userspace with vmstat dirty information on the
> CPU in question.
> 
> To address this problem, add a preempt notifier that transfers vmstat dirty
> information to TIF_TASK_ISOL thread flag.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> 
> v12:
>  - switch from raw_cpu_read to __this_cpu_read (Frederic)
> 
> ---
>  include/linux/task_isolation.h |    2 ++
>  include/linux/vmstat.h         |    6 ++++++
>  kernel/task_isolation.c        |   23 +++++++++++++++++++++++
>  mm/vmstat.c                    |    7 +++++++
>  4 files changed, 38 insertions(+)
> 
> Index: linux-2.6/kernel/task_isolation.c
> ===================================================================
> --- linux-2.6.orig/kernel/task_isolation.c
> +++ linux-2.6/kernel/task_isolation.c
> @@ -19,6 +19,7 @@
>  #include <linux/sched/task.h>
>  #include <linux/mm.h>
>  #include <linux/vmstat.h>
> +#include <linux/preempt.h>
>  #include <linux/task_isolation.h>
> 
>  void __task_isol_exit(struct task_struct *tsk)
> @@ -30,6 +31,9 @@ void __task_isol_exit(struct task_struct
>  		return;
> 
>  	static_key_slow_dec(&vmstat_sync_enabled);
> +
> +	preempt_notifier_unregister(&i->preempt_notifier);
> +	preempt_notifier_dec();
>  }
> 
>  void __task_isol_free(struct task_struct *tsk)
> @@ -40,6 +44,21 @@ void __task_isol_free(struct task_struct
>  	tsk->task_isol_info = NULL;
>  }
> 
> +static void task_isol_sched_in(struct preempt_notifier *pn, int cpu)
> +{
> +	vmstat_dirty_to_thread_flag();
> +}
> +
> +static void task_isol_sched_out(struct preempt_notifier *pn,
> +				struct task_struct *next)
> +{
> +}
> +
> +static __read_mostly struct preempt_ops task_isol_preempt_ops = {
> +	.sched_in	= task_isol_sched_in,
> +	.sched_out	= task_isol_sched_out,
> +};
> +
>  static struct task_isol_info *task_isol_alloc_context(void)
>  {
>  	struct task_isol_info *info;
> @@ -48,6 +67,10 @@ static struct task_isol_info *task_isol_
>  	if (unlikely(!info))
>  		return ERR_PTR(-ENOMEM);
> 
> +	preempt_notifier_inc();
> +	preempt_notifier_init(&info->preempt_notifier, &task_isol_preempt_ops);
> +	preempt_notifier_register(&info->preempt_notifier);
> +
>  	preempt_disable();
>  	init_sync_vmstat();
>  	preempt_enable();
> Index: linux-2.6/include/linux/task_isolation.h
> ===================================================================
> --- linux-2.6.orig/include/linux/task_isolation.h
> +++ linux-2.6/include/linux/task_isolation.h
> @@ -17,6 +17,8 @@ struct task_isol_info {
>  	u64 oneshot_mask;
> 
>  	u8 inherit_mask;
> +
> +	struct preempt_notifier preempt_notifier;

Since preempt_notifier is visible only when CONFIG_KVM is enabled,
I think we can add KVM to the dependencies of CONFIG_TASK_ISOLATION.

Or we could encounter missing definition error while building without
CONFIG_KVM.

Thanks,
Oscar

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 07/13] task isolation: sync vmstats conditional on changes
  2022-03-15 15:31 ` [patch v12 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2022-03-17 14:51   ` Frederic Weisbecker
  2022-04-27  8:03   ` Thomas Gleixner
  1 sibling, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2022-03-17 14:51 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Tue, Mar 15, 2022 at 12:31:39PM -0300, Marcelo Tosatti wrote:
> Rather than syncing VM-stats on every return to userspace
> (or VM-entry), keep track of changes through a per-CPU bool.
> 
> This improves performance when enabling task isolated
> for vcpu VMs.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> v11:
> - Add TIF_TASK_ISOL bit to thread info flags and use it
>   to decide whether to perform task isolation work on
>   return to userspace     
> 
>  include/linux/vmstat.h |   13 ++++++++++++-
>  mm/vmstat.c            |   29 ++++++++++++++++++++++++++++-
>  2 files changed, 40 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h
> +++ linux-2.6/include/linux/vmstat.h
> @@ -22,7 +22,18 @@ int sysctl_vm_numa_stat_handler(struct c
>  #endif
>  
>  #if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
> -void sync_vmstat(void);
> +DECLARE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +extern struct static_key vmstat_sync_enabled;
> +
> +void __sync_vmstat(void);
> +static inline void sync_vmstat(void)
> +{
> +	if (static_key_false(&vmstat_sync_enabled))
> +		__sync_vmstat();
> +}
> +
> +void init_sync_vmstat(void);
>  #else
>  static inline void sync_vmstat(void)
>  {
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -334,6 +334,31 @@ void set_pgdat_percpu_threshold(pg_data_
>  	}
>  }
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +struct static_key vmstat_sync_enabled;
> +DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +static inline void mark_vmstat_dirty(void)
> +{
> +	if (!static_key_false(&vmstat_sync_enabled))
> +		return;
> +
> +	raw_cpu_write(vmstat_dirty, true);
> +	set_thread_flag(TIF_TASK_ISOL);
> +}
> +
> +void init_sync_vmstat(void)
> +{
> +	raw_cpu_write(vmstat_dirty, true);
> +	set_thread_flag(TIF_TASK_ISOL);
> +}
> +EXPORT_SYMBOL_GPL(vmstat_dirty);
> +#else
> +static inline void mark_vmstat_dirty(void)
> +{
> +}
> +#endif
> +
>  /*
>   * For use when we know that interrupts are disabled,
>   * or when we know that preemption is disabled and that
> @@ -366,6 +391,7 @@ void __mod_zone_page_state(struct zone *
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	mark_vmstat_dirty();
>  
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		preempt_enable();
> @@ -404,6 +430,7 @@ void __mod_node_page_state(struct pglist
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	mark_vmstat_dirty();
>  
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		preempt_enable();
> @@ -602,6 +629,7 @@ static inline void mod_zone_state(struct
>  
>  	if (z)
>  		zone_page_state_add(z, zone, item);
> +	mark_vmstat_dirty();
>  }
>  
>  void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> @@ -670,6 +698,7 @@ static inline void mod_node_state(struct
>  
>  	if (z)
>  		node_page_state_add(z, pgdat, item);
> +	mark_vmstat_dirty();
>  }
>  
>  void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> @@ -1087,6 +1116,7 @@ static void fill_contig_page_info(struct
>  			info->free_blocks_suitable += blocks <<
>  						(order - suitable_order);
>  	}
> +	mark_vmstat_dirty();
>  }
>  
>  /*
> @@ -1443,6 +1473,7 @@ static void walk_zones_in_node(struct se
>  		if (!nolock)
>  			spin_unlock_irqrestore(&zone->lock, flags);
>  	}
> +	mark_vmstat_dirty();
>  }
>  #endif
>  
> @@ -1512,6 +1543,7 @@ static void pagetypeinfo_showfree_print(
>  		}
>  		seq_putc(m, '\n');
>  	}
> +	mark_vmstat_dirty();
>  }
>  
>  /* Print out the free pages at each order for each migatetype */
> @@ -1932,6 +1964,7 @@ static void vmstat_update(struct work_st
>  				this_cpu_ptr(&vmstat_work),
>  				round_jiffies_relative(sysctl_stat_interval));
>  	}
> +	mark_vmstat_dirty();
>  }
>  
>  /*
> @@ -2019,13 +2052,14 @@ static void vmstat_shepherd(struct work_
>  }
>  
>  #ifdef CONFIG_TASK_ISOLATION
> -void sync_vmstat(void)
> +void __sync_vmstat(void)
>  {
>  	int cpu;
>  
>  	cpu = get_cpu();
>  
>  	refresh_cpu_vm_stats(false);
> +	raw_cpu_write(vmstat_dirty, false);

I still see a few raw_cpu_write() here, especially that one should
clearly become __this_cpu_write().

>  	put_cpu();
>  
>  	/*
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (12 preceding siblings ...)
  2022-03-15 15:31 ` [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
@ 2022-03-17 15:08 ` Frederic Weisbecker
  2022-04-25 16:29   ` Marcelo Tosatti
  2022-04-27  9:19 ` Christoph Lameter
  2022-05-04 17:01 ` Tim Chen
  15 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2022-03-17 15:08 UTC (permalink / raw)
  To: Marcelo Tosatti, Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang

On Tue, Mar 15, 2022 at 12:31:32PM -0300, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
> 
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
> 
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
> 
> To fix this, add task isolation prctl interface to quiesce
> deferred actions when returning to userspace.
> 
> The patchset is based on ideas and code from the
> task isolation patchset from Alex Belits:
> https://lwn.net/Articles/816298/
> 
> Please refer to Documentation/userspace-api/task_isolation.rst
> (patch 1) for details. Its attached at the end of this message
> in .txt format as well.
> 
> Note: the prctl interface is independent of nohz_full=.
> 
> The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> 
> - qemu-task-isolation.patch: activate task isolation from CPU execution loop
> - rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
> - util-linux-chisol.patch: add chisol tool to util-linux.

I still see a few details to sort out but overall the whole thing looks good:

  Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

Perhaps it's time to apply this patchset on some branch and iterate from there.

Thomas, Peter, what do you think?

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-03-17 15:08 ` [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Frederic Weisbecker
@ 2022-04-25 16:29   ` Marcelo Tosatti
  2022-04-25 21:12     ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-04-25 16:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Christoph Lameter, Juri Lelli,
	Alex Belits, Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang

On Thu, Mar 17, 2022 at 04:08:04PM +0100, Frederic Weisbecker wrote:
> On Tue, Mar 15, 2022 at 12:31:32PM -0300, Marcelo Tosatti wrote:
> > The logic to disable vmstat worker thread, when entering
> > nohz full, does not cover all scenarios. For example, it is possible
> > for the following to happen:
> > 
> > 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> > 2) app runs mlock, which increases counters for mlock'ed pages.
> > 3) start -RT loop
> > 
> > Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> > the mlock, vmstat shepherd can restart vmstat worker thread on
> > the CPU in question.
> > 
> > To fix this, add task isolation prctl interface to quiesce
> > deferred actions when returning to userspace.
> > 
> > The patchset is based on ideas and code from the
> > task isolation patchset from Alex Belits:
> > https://lwn.net/Articles/816298/
> > 
> > Please refer to Documentation/userspace-api/task_isolation.rst
> > (patch 1) for details. Its attached at the end of this message
> > in .txt format as well.
> > 
> > Note: the prctl interface is independent of nohz_full=.
> > 
> > The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> > 
> > - qemu-task-isolation.patch: activate task isolation from CPU execution loop
> > - rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
> > - util-linux-chisol.patch: add chisol tool to util-linux.
> 
> I still see a few details to sort out but overall the whole thing looks good:
> 
>   Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
> 
> Perhaps it's time to apply this patchset on some branch and iterate from there.
> 
> Thomas, Peter, what do you think?
> 
> Thanks!

Ping ?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-04-25 16:29   ` Marcelo Tosatti
@ 2022-04-25 21:12     ` Thomas Gleixner
  2022-05-03 18:57       ` Marcelo Tosatti
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-25 21:12 UTC (permalink / raw)
  To: Marcelo Tosatti, Frederic Weisbecker
  Cc: Peter Zijlstra, linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang

On Mon, Apr 25 2022 at 13:29, Marcelo Tosatti wrote:
> On Thu, Mar 17, 2022 at 04:08:04PM +0100, Frederic Weisbecker wrote:
>> 
>> I still see a few details to sort out but overall the whole thing looks good:

From a cursery inspection there is more than a few details to sort out.

>>   Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
>> 
>> Perhaps it's time to apply this patchset on some branch and iterate from there.
>> 
>> Thomas, Peter, what do you think?
>> 
>> Thanks!
>
> Ping ?

This does not apply against 5.18-rc1, which was released on April
3rd. Oh, well. You are really new to kernel development, right?

Don't bother to resend before I finished reviewing the pile.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 03/13] add basic task isolation prctl interface
  2022-03-15 15:31 ` [patch v12 03/13] add basic task isolation prctl interface Marcelo Tosatti
@ 2022-04-25 22:23   ` Thomas Gleixner
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-25 22:23 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

Marcelo,

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> +
> +/* Supported features */
> +# define ISOL_F_QUIESCE			(1UL << 0)
> +
> +# define ISOL_F_QUIESCE_MULTIPLE	(1UL << 0)

There is no user of this.

> +#  define ISOL_F_QUIESCE_VMSTATS	(1UL << 0)

TBH, I find this mightily confusing. Can we please make it clear that
the isolation features above, i.e. ISOL_F_QUIESCE (and whatever comes
next), and the feature specific defines are clearly seperated?

The attempt of seperating this by spaces between # and 'define' is more
than lame.

/* Supported isolation features */
#define ISOL_FEAT_QUIESCE		(1UL << 0)
#define ISOL_FEAT_SUPPORTED		(ISOL_FEAT_QUIESCE)

/* Subfeatures of ISOL_FEAT_QUIESCE */
#define ISOL_QUIESCE_VMSTAT		(1UL << 0)
#define ISOL_QUIESCE_SUPPORTED		(ISOL_QUIESCE_VMSTAT)

This is clearly visible seperated and the SUPPORTED variants are making
it trivial to add new features without rumaging through the code to fix
it up.

> +struct task_isol_info {
> +	/* Which features have been configured */
> +	u64 conf_mask;
> +	/* Which features are active */
> +	u64 active_mask;
> +	/* Quiesce mask */
> +	u64 quiesce_mask;
> +
> +	/* Oneshot mask */
> +	u64 oneshot_mask;

Repeating the member names in the comments is useful because?

> +	u8 inherit_mask;

inherit_mask is really a misnomer. It's a convoluted boolean, but not a
mask, especially not in the context of the other masks in that
structure. But sure, slapping mask on everything is conveniant and
spares to think about proper naming conventions.

> +int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
> +				  unsigned long arg4, unsigned long arg5)
> +{
> +	int ret;
> +	void __user *addr = (void __user *) arg3;
> +
> +	switch (arg2) {
> +	case 0: {

Can we please have a proper constant for that case? 

> +		u64 supported_fmask = ISOL_F_QUIESCE;

According to the above this becomes

		u64 supported_fmask = ISOL_FEAT_SUPPORTED;

which makes the code future proof and self explaining.

> +		ret = 0;
> +		if (copy_to_user(addr, &supported_fmask, sizeof(u64)))
> +			ret = -EFAULT;
> +
> +		return ret;
> +	}
> +	case ISOL_F_QUIESCE: {

This is a bitmask.

Q: How is that supposed to work on 32bit kernels when the number of used
   feature bits reaches 32?

A: Not at all

Q: Why is this mask muck 64bit?

A: Just because it looks good and is so future proof because 64bit...

You are designing a user space interface which has to be supported
forever.

> +		struct task_isol_quiesce_extensions *q_ext;
> +
> +		q_ext = kzalloc(sizeof(struct task_isol_quiesce_extensions),
> +			 GFP_KERNEL);
> +		if (!q_ext)
> +			return -ENOMEM;
> +
> +		q_ext->supported_quiesce_bits = ISOL_F_QUIESCE_VMSTATS;

Same as above, just with ISOL_QUIESCE_SUPPORTED.

> +		ret = 0;
> +		if (copy_to_user(addr, q_ext, sizeof(*q_ext)))
> +			ret = -EFAULT;
> +		kfree(q_ext);
> +		return ret;
> +	}
> +	default:
> +		break;
> +	}
> +	return -EINVAL;
> +}
> +
> +static int cfg_inherit_get(unsigned long arg3, unsigned long arg4,
> +			   unsigned long arg5)
> +{
> +	struct task_isol_inherit_control *i_ctrl;
> +	int ret;
> +	void __user *addr = (void __user *) arg3;
> +
> +	if (!current->task_isol_info)
> +		return -EINVAL;
> +
> +	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
> +			 GFP_KERNEL);

        kzalloc(sizeof(*i_ctrl), GFP_KERNEL);

> +static int cfg_feat_get(unsigned long arg3, unsigned long arg4,
> +			unsigned long arg5)
> +{
> +	int ret = 0;
> +
> +	switch (arg3) {
> +	case 0: {

Proper constant define please.

> +		void __user *addr = (void __user *)arg4;
> +		u64 cfg_mask = 0;
> +
> +		if (current->task_isol_info)
> +			cfg_mask = current->task_isol_info->conf_mask;
> +
> +		if (copy_to_user(addr, &cfg_mask, sizeof(u64)))
> +			ret = -EFAULT;
> +
> +		return ret;
> +	}
> +	case ISOL_F_QUIESCE: {
> +		struct task_isol_quiesce_control *i_qctrl;
> +		void __user *addr = (void __user *)arg5;
> +
> +		if (arg4 != QUIESCE_CONTROL)
> +			return -EINVAL;

The point of arg4 here is?

> +		i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
> +				  GFP_KERNEL);
> +		if (!i_qctrl)
> +			return -ENOMEM;
> +
> +		if (current->task_isol_info)
> +			i_qctrl->quiesce_mask = current->task_isol_info->quiesce_mask;
> +
> +		if (copy_to_user(addr, i_qctrl, sizeof(*i_qctrl)))
> +			ret = -EFAULT;
> +
> +		kfree(i_qctrl);
> +		return ret;
> +	}
> +	default:
> +		break;
> +	}
> +	return -EINVAL;

You can spare all the

        default:
                break;
        }
        return -EINVAL;

instances by returning -EINVAL in the default case directly.

> +static int cfg_inherit_set(unsigned long arg3, unsigned long arg4,
> +			   unsigned long arg5)
> +{
> +	int ret = 0;
> +	struct task_isol_inherit_control *i_ctrl;
> +	const void __user *addr = (const void __user *)arg3;
> +
> +	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
> +			 GFP_KERNEL);

See above.

> +	if (!i_ctrl)
> +		return -ENOMEM;
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(i_ctrl, addr, sizeof(*i_ctrl)))
> +		goto out_free;
> +
> +	ret = -EINVAL;
> +	if (i_ctrl->inherit_mask & ~(ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE))
> +		goto out_free;
> +
> +	if (i_ctrl->inherit_mask & ISOL_INHERIT_ACTIVE)
> +		if (!(i_ctrl->inherit_mask & ISOL_INHERIT_CONF))
> +			goto out_free;

This lacks curly braces on the first condition because it's not followed
by a single line statement.

Aside of that, why is this a wholesale inherit decision and not feature
granular? A big void of information in changelog and comments....

> +	if (!current->task_isol_info) {
> +		struct task_isol_info *task_isol_info;
> +
> +		task_isol_info = task_isol_alloc_context();
> +		if (IS_ERR(task_isol_info)) {
> +			ret = PTR_ERR(task_isol_info);
> +			goto out_free;
> +		}
> +		current->task_isol_info = task_isol_info;
> +	}
> +
> +	ret = 0;
> +	current->task_isol_info->inherit_mask = i_ctrl->inherit_mask;
> +
> +out_free:
> +	kfree(i_ctrl);
> +
> +	return ret;
> +}
> +
> +static int cfg_feat_quiesce_set(unsigned long arg4, unsigned long arg5)
> +{
> +	struct task_isol_info *info;
> +	struct task_isol_quiesce_control *i_qctrl;
> +	int ret = 0;
> +	const void __user *addr = (const void __user *)arg5;

Can you please use a consistent coding style for variable declarations
all over the place? These visually randomly ordered declarations are
annoying as hell.

	const void __user *addr = (const void __user *)arg5;
	struct task_isol_quiesce_control *i_qctrl;
	struct task_isol_info *info;
	int ret = 0;

is visually consistent and the fastest to parse.

> +	if (arg4 != QUIESCE_CONTROL)
> +		return -EINVAL;

Again. What's the purpose of this argument? This is:

       prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,....);

So why does this need yet another different instance of (1UL << 0) as
4th argument?

This is confusing at best and smells badly of overengineering. But I
might be missing something due to the void of information provided.

> +	i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
> +			 GFP_KERNEL);

See above.

> +	if (!i_qctrl)
> +		return -ENOMEM;
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(i_qctrl, addr, sizeof(*i_qctrl)))
> +		goto out_free;
> +
> +	ret = -EINVAL;
> +	if (i_qctrl->flags != 0)
> +		goto out_free;
> +
> +	if (i_qctrl->quiesce_mask != ISOL_F_QUIESCE_VMSTATS &&
> +	    i_qctrl->quiesce_mask != 0)
> +		goto out_free;

Yet another place, where extending the QUIESCE subfeatures requires code
changes.

	if (i_qctrl->quiesce_mask & ~ISOL_QUIESCE_SUPPORTED)
        	goto out_free;

would be too simple, self explaining and extensible, right?

> +
> +	if ((~i_qctrl->quiesce_mask & i_qctrl->quiesce_oneshot_mask) != 0)
> +		goto out_free;

	if (i_qctrl->quiesce_oneshot_mask & ~i_qctrl_quiesce_mask)
        	goto out_free;

Yes, it's the same. But now put both conditions in context:
         
	if (i_qctrl->quiesce_mask & ~ISOL_QUIESCE_SUPPORTED)
        	goto out_free;

        if (i_qctrl->quiesce_oneshot_mask & ~i_qctrl_quiesce_mask)
        	goto out_free;

Can you see the pattern? Readable code matters.

> +
> +	/* current->task_isol_info is only allocated/freed from task
> +	 * context.
> +	 */
> +	if (!current->task_isol_info) {
> +		info = task_isol_alloc_context();
> +		if (IS_ERR(info)) {
> +			ret = PTR_ERR(info);
> +			goto out_free;
> +		}
> +		current->task_isol_info = info;
> +	}

This is copied verbatim from cfg_feat_inherit_set() with the add on of a
malformatted useless comment. The concept of helper functions exists for
a reason.

> +	info = current->task_isol_info;
> +
> +	info->quiesce_mask = i_qctrl->quiesce_mask;
> +	info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
> +	info->conf_mask |= ISOL_F_QUIESCE;

So this marks the QUIESCE feature configured whether there is a bit set
in info->quiesce_mask or not. What's the point when quiesce_mask is 0?

I kinda understand what you are trying to achieve, but the outcome is
inconsistent garbage. Why?

   1) What's the point of activating a feature which has no subfeature
      configured?

   2) What's the point of inheriting a feature which has no subfeature
      configured?

   3) Once you activated a feature, then clearing all subfeatures keeps
      the feature itself activated.

   4) Once you activated a feature, then setting an additional
      subfeature, activates it immediately.

Aside of being inconsistent, how is that useful and explainable to mere
mortals, aka. application programmers? And no, your attempt of
documentation does not address that either.

The point is that due to your design choice of feature/subfeatures,
which is again not explained at all (that's what changelogs and comments
are for), this inconsistency is part of the design, unless you refuse to
configure a feature after it has been activated.

I really have to ask the question, whether this envisioned granularity
makes sense at all. How many unique feature/subfeature pairs are going
to be there? I'm all for extensible and future proof interfaces, but
there has to be at least some useful argument why this needs to be a
64x64 bitmap array...

And no, I'm not going to read back on a gazillion of mail threads to
find that information. This information has to be in the changelog and
comments. Not a fundamentaly new requirement, right?

> +int prctl_task_isol_activate_set(unsigned long arg2, unsigned long arg3,
> +				      unsigned long arg4, unsigned long arg5)
> +{
> +	int ret;
> +	struct task_isol_info *info;
> +	u64 active_mask;
> +	const void __user *addr_mask = (const void __user *)arg2;
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(&active_mask, addr_mask, sizeof(u64)))
> +		goto out;
> +
> +	ret = -EINVAL;
> +	if (active_mask != ISOL_F_QUIESCE && active_mask != 0)

Yet another place which needs to be changed when the isolation feature
set is expanded and yet another unparseable condition.

	if (active_mask & ~ ISOL_FEAT_SUPPORTED)
        	return -EINVAL;

Of course, it does not matter whether a feature has been configured or
not. So what's the point of this conf_mask aside of having it available
for the more than questionable inheritance control.

> +		return ret;

Sigh. What's the purpose of ret here? First you do:

	ret = -EFAULT;
	if (copy_from_user(&active_mask, addr_mask, sizeof(u64)))
		goto out;

where @out you just return ret.

Then you preset ret with -EINVAL for the mask check to return that in
the error case. And so forth... ret is not required at all in this
function.

Consistent and comprehensible code is too much to ask for, right?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 05/13] task isolation: sync vmstats on return to userspace
  2022-03-15 15:31 ` [patch v12 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2022-04-25 23:06   ` Thomas Gleixner
  2022-04-27  6:56   ` Thomas Gleixner
  1 sibling, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-25 23:06 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
>
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
>
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
>
> To fix this, use the task isolation prctl interface to quiesce 
> deferred actions when returning to userspace.
>
> This patch adds hooks to fork and exit code paths.

git grep 'This patch' Documentation/process/

> +void __task_isol_exit(struct task_struct *tsk);
> +static inline void task_isol_exit(struct task_struct *tsk)

I assume the amount of new lines per patch is restricted somehow, right?

Glueing the __task_isol_exit() declaration to the definition of
task_isol_exit() is just annoyingly disturbing the reading flow.

New lines exist for a reason.

> +{
> +	if (tsk->task_isol_info)
> +		__task_isol_exit(tsk);
> +}
>  #else

but ...

> +static inline void task_isol_exit_to_user_mode(void)
> +{
> +}
> +
>  static inline void task_isol_free(struct task_struct *tsk)
>  {
>  }
>  
> +static inline void task_isol_exit(struct task_struct *tsk)
> +{
> +}
> +

here you use plenty of them where it does not matter at all....
What's wrong with:

   static inline void task_isol_exit_to_user_mode(void) { }
   static inline void task_isol_free(struct task_struct *tsk) { }
   static inline void task_isol_exit(struct task_struct *tsk) { }

and spending at least one of the saved newlines for separating the
above:

+ void __task_isol_exit(struct task_struct *tsk);
+ 
+ static inline void task_isol_exit(struct task_struct *tsk)

Hmm?

> @@ -251,6 +257,11 @@ static int cfg_feat_quiesce_set(unsigned
>  	info->quiesce_mask = i_qctrl->quiesce_mask;
>  	info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
>  	info->conf_mask |= ISOL_F_QUIESCE;
> +
> +	if ((info->active_mask & ISOL_F_QUIESCE) &&
> +	    (info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS))
> +		set_thread_flag(TIF_TASK_ISOL);

Yet more hard coded special purpose muck. Plus the proof of the
inconsistency I described before...

> +void task_isol_exit_to_user_mode(void)
> +{
> +	struct task_isol_info *i;

*i is really a descriptive variable name. Is this supposed to be
submitted to the convoluted C-code contest?

Dammit, we are not short of characters here and 'i' is generally used as
iterator variable which is hardly of type struct task_isol_info *.

> +	clear_thread_flag(TIF_TASK_ISOL);

What? See below....

> +	i = current->task_isol_info;
> +	if (!i)
> +		return;

That really makes sense. Why can a task which has TIF_TASK_ISOL set,
have current->task_isol_info != NULL?

I'm all for defensive programming, but if you really want to check this
then this should be:

	isol_info = current->task_isol_info;
	if (WARN_ON_ONCE(!isol_info))
		return;
No?

> +	if (i->active_mask != ISOL_F_QUIESCE)
> +		return;

Yay, more future proof hard coding!

> +	if (i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS) {
> +		sync_vmstat();
> +		if (i->oneshot_mask & ISOL_F_QUIESCE_VMSTATS)
> +			i->quiesce_mask &= ~ISOL_F_QUIESCE_VMSTATS;

The point of this exercise is?

To clear quiesce_mask because this code path cannot be reached anymore
due to TIF_TASK_ISOL being cleared above.

Of course the active vs. no subfeature configured inconsistency is
preserved here for consistency reasons. At least something which is
consistent.

>  /**
>   * arch_check_user_regs - Architecture specific sanity check for user mode regs
> Index: linux-2.6/kernel/exit.c
> ===================================================================
> --- linux-2.6.orig/kernel/exit.c
> +++ linux-2.6/kernel/exit.c
> @@ -64,6 +64,7 @@
>  #include <linux/compat.h>
>  #include <linux/io_uring.h>
>  #include <linux/kprobes.h>
> +#include <linux/task_isolation.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/unistd.h>
> @@ -759,6 +760,7 @@ void __noreturn do_exit(long code)
>  	validate_creds_for_do_exit(tsk);
>  
>  	io_uring_files_cancel();
> +	task_isol_exit(tsk);

The purpose of this is?

> +static inline void task_isol_exit(struct task_struct *tsk)
> +{
> +	if (tsk->task_isol_info)
> +		__task_isol_exit(tsk);
> +}

and

>+ void __task_isol_exit(struct task_struct *tsk)
>+ {
>+ }

Makes a lot of sense and is thoroughly explained in the changelog and
comments....

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 06/13] procfs: add per-pid task isolation state
  2022-03-15 15:31 ` [patch v12 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
@ 2022-04-25 23:27   ` Thomas Gleixner
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-25 23:27 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> Add /proc/pid/task_isolation file, to query the state of
> task isolation configuration.
>
> ---

Lacks a Signed-off-by...

>  fs/proc/base.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++

> +#ifdef CONFIG_TASK_ISOLATION
> +
> +struct qoptions {
> +	unsigned long mask;
> +	char *name;
> +};
> +
> +static struct qoptions iopts[] = {
> +	{ISOL_F_QUIESCE, "quiesce"},
> +};
> +#define ILEN (sizeof(iopts) / sizeof(struct qoptions))

Reinventing ARRAY_SIZE() just because this isolation muck is special?

> +static struct qoptions qopts[] = {
> +	{ISOL_F_QUIESCE_VMSTATS, "vmstat_sync"},
> +};
> +#define QLEN (sizeof(qopts) / sizeof(struct qoptions))

Ditto.

> +static void show_isolation_state(struct seq_file *m,
> +				 struct qoptions *iopt,
> +				 int mask,
> +				 const char *hdr)
> +{
> +	int i;
> +
> +	seq_printf(m, hdr);
> +	for (i = 0; i < ILEN; i++) {
> +		if (mask & iopt->mask)
> +			seq_printf(m, "%s ", iopt->name);
> +		iopt++;
> +	}
> +	if (mask == 0)
> +		seq_printf(m, "none ");
> +	seq_printf(m, "\n");
> +}
> +
> +int proc_pid_task_isolation(struct seq_file *m, struct pid_namespace *ns,
> +			    struct pid *pid, struct task_struct *t)

This is required to be global without a prototype because?

> +{
> +	int active_mask, quiesce_mask, conf_mask;
> +	struct task_isol_info *task_isol_info;
> +	struct inode *inode = m->private;
> +	struct task_struct *task = get_proc_task(inode);
> +
> +	task_isol_info = t->task_isol_info;
> +	if (!task_isol_info)
> +		active_mask = quiesce_mask = conf_mask = 0;
> +	else {
> +		active_mask = task_isol_info->active_mask;
> +		quiesce_mask = task_isol_info->quiesce_mask;
> +		conf_mask = task_isol_info->conf_mask;
> +	}
> +
> +	show_isolation_state(m, iopts, conf_mask, "Configured state: ");
> +	show_isolation_state(m, iopts, active_mask, "Active state: ");
> +	show_isolation_state(m, qopts, quiesce_mask, "Quiescing: ");

And once you have 10 features with 10 subfeature masks supported, all of
this ends up in fs/proc/base.c just because all of this nonsense is
required to be disconnected from the actual task isolation code, right?

Just because a lot of crap has been dumped over time into that file does
not justify to mindlessly dump more crap into it.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 04/13] add prctl task isolation prctl docs and samples
  2022-03-15 15:31 ` [patch v12 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2022-04-26  0:15   ` Thomas Gleixner
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-26  0:15 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> +++ linux-2.6/samples/task_isolation/task_isol.c

> +#ifdef PR_ISOL_FEAT_GET

This ifdef is there because the kernel on which this sample is compiled
does not support PR_ISOL_FEAT_GET? Try again...

> +int task_isol_setup(int oneshot)
> +{
> +	int ret;
> +	int errnosv;
> +	unsigned long long fmask;
> +	struct task_isol_quiesce_extensions qext;
> +	struct task_isol_quiesce_control qctrl;
> +
> +	/* Retrieve supported task isolation features */
> +	ret = prctl(PR_ISOL_FEAT_GET, 0, &fmask, 0, 0);
> +	if (ret == -1) {
> +		perror("prctl PR_ISOL_FEAT");
> +		return ret;
> +	}
> +	printf("supported features bitmask: 0x%llx\n", fmask);
> +
> +	/* Retrieve supported ISOL_F_QUIESCE bits */
> +	ret = prctl(PR_ISOL_FEAT_GET, ISOL_F_QUIESCE, &qext, 0, 0);

It makes a lot of sense to query ISOL_F_QUIESCE if the supported
features bitmask has not set it, right?

> +	if (ret == -1) {
> +		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
> +		return ret;
> +	}
> +	printf("supported ISOL_F_QUIESCE bits: 0x%llx\n",
> +		qext.supported_quiesce_bits);
> +
> +	fmask = 0;
> +	ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
> +	errnosv = errno;
> +	if (ret != -1 && fmask != 0) {
> +		printf("Task isolation parameters already configured!\n");
> +		return ret;
> +	}

Really useful because if that code is executed after a fork/clone then
it fails, not in that particular case, but this is _NOT_ a test case,
this is a sample to demonstrate usage.

> +	if (ret == -1 && errnosv != ENODATA) {

How exactly ends this prctl() up returning ENODATA?

> +		perror("prctl PR_ISOL_GET");
> +		return ret;
> +	}
> +	memset(&qctrl, 0, sizeof(struct task_isol_quiesce_control));
> +	qctrl.quiesce_mask = ISOL_F_QUIESCE_VMSTATS;
> +	if (oneshot)
> +		qctrl.quiesce_oneshot_mask = ISOL_F_QUIESCE_VMSTATS;
> +
> +	ret = prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,
> +		    QUIESCE_CONTROL, &qctrl);
> +	if (ret == -1) {
> +		perror("prctl PR_ISOL_CFG_SET");
> +		return ret;
> +	}
> +	return ISOL_F_QUIESCE;

Very consistent return value:

     int task_isol_setup(int oneshot)

which just works because the whole definition of the ISOL_F_* feature
space is bogus and inconsistent hackery, but if that ever goes up to bit
31bit+ then all of this is just crap.

> +}
> +
> +int task_isol_activate_set(unsigned long long mask)

While you here make sure that @mask is properly sized. Btw. uint64_t
exists for a reason...

> +int main(void)
> +{
> +	int ret;
> +	void *buf = malloc(4096);
> +	unsigned long mask;

Works by chance...

> +	memset(buf, 1, 4096);
> +	ret = mlock(buf, 4096);
> +	if (ret) {
> +		perror("mlock");
> +		return EXIT_FAILURE;
> +	}
> +
> +	ret = task_isol_setup(0);
> +	if (ret == -1)
> +		return EXIT_FAILURE;
> +
> +	mask = ret;
> +	/* enable quiescing on system call return, oneshot */
> +	ret = task_isol_activate_set(mask);
> +	if (ret)
> +		return EXIT_FAILURE;
> +
> +#define NR_LOOPS 999999999
> +#define NR_PRINT 100000000
> +	/* busy loop */

Really readable code.... Not.

> +	while (ret < NR_LOOPS)  {
> +		memset(buf, 0, 4096);
> +		ret = ret+1;

The kernel has a well define coding style which is not optional for
samples.

> +int main(void)
> +{
> +	write_loops = 0;
> +	do {
> +#define NR_LOOPS 999999999
> +#define NR_PRINT 100000000

Groan.

> +		/* enable quiescing on system call return */
> +		ret = task_isol_activate_set(mask);
> +		if (ret)
> +			return EXIT_FAILURE;
> +
> +		/* busy loop */
> +		while (ret < NR_LOOPS)  {
> +			memset(buf, 0xf, 4096);
> +			ret = ret+1;
> +			if (!(ret % NR_PRINT))
> +				printf("wloop=%d loops=%d of %d\n", write_loops,
> +					ret, NR_LOOPS);

This is really a brilliant example of design fail at the conceptual level:

     task_isol_activate_set()
       set_thread_flag(TIF_TASK_ISOL);
       exit_to_user_mode()
          if (thread_flag(TIF_TASK_ISOL)) {
             handle_isol_muck() {
               clear_thread_flag(TIF_TASK_ISOL);
               ....
             }
     printf()
       sys_write()....
       exit_to_user_mode()
         ....
         
         --->  which might coincidentaly quiesce stuff or not just
               because something might have set TIF_TASK_ISOL or not.

Are you serious that setting TIF_TASK_ISOL from each of these envisioned
facilities which need quiescing is a maintainable approach?

That's a recipe for disaster and a guarantee for hard to diagnose
problems which ends up with a flood of non-sensical patches sprinkling
set_thread_flag(TIF_TASK_ISOL) all over the place just to cure the
symptoms.

Sure you can claim that this did not blow up in your face so far, but
that's a useless argument because _one_ out of the proposed 64 x 64 is
perhaps maintainable, but not anything beyond that.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 05/13] task isolation: sync vmstats on return to userspace
  2022-03-15 15:31 ` [patch v12 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
  2022-04-25 23:06   ` Thomas Gleixner
@ 2022-04-27  6:56   ` Thomas Gleixner
  1 sibling, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-27  6:56 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> --- linux-2.6.orig/kernel/entry/common.c
> +++ linux-2.6/kernel/entry/common.c
>  
> @@ -174,6 +175,9 @@ static unsigned long exit_to_user_mode_l
>  		if (ti_work & _TIF_NOTIFY_RESUME)
>  			tracehook_notify_resume(regs);
>  
> +		if (ti_work & _TIF_TASK_ISOL)
> +			task_isol_exit_to_user_mode();
> +
>  		/* Architecture specific TIF work */
>  		arch_exit_to_user_mode_work(regs, ti_work);

> --- linux-2.6.orig/include/linux/entry-common.h
> +++ linux-2.6/include/linux/entry-common.h
> @@ -60,7 +60,7 @@
>  #define EXIT_TO_USER_MODE_WORK						\
>  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
>  	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
> -	 ARCH_EXIT_TO_USER_MODE_WORK)
> +	 _TIF_TASK_ISOL | ARCH_EXIT_TO_USER_MODE_WORK)

How is this supposed to compile when _TIF_TASK_ISOL is not defined by an
architecture?

Hint: Search for _TIF_PATCH_PENDING in the very same header file.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-03-15 15:31 ` [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
  2022-03-16  2:41   ` Oscar Shiang
@ 2022-04-27  7:11   ` Thomas Gleixner
  2022-04-27 12:09     ` Thomas Gleixner
  1 sibling, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-27  7:11 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> If a thread has task isolation activated, is preempted by thread B,
> which marks vmstat information dirty, and is preempted back in,
> one might return to userspace with vmstat dirty information on the 
> CPU in question.
>
> To address this problem, add a preempt notifier that transfers vmstat dirty
> information to TIF_TASK_ISOL thread flag.

How does this compile with CONFIG_KVM=n?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2022-03-15 15:31 ` [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
@ 2022-04-27  7:23   ` Thomas Gleixner
  2022-05-03 19:17     ` Marcelo Tosatti
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-27  7:23 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
>  	/*
>  	 * The regular update, every sysctl_stat_interval, may come later
> @@ -1948,9 +1977,19 @@ int vmstat_refresh(struct ctl_table *tab
>  	 * transiently negative values, report an error here if any of
>  	 * the stats is negative, so we know to go looking for imbalance.
>  	 */
> -	err = schedule_on_each_cpu(refresh_vm_stats);
> -	if (err)
> -		return err;
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		struct work_struct *work = per_cpu_ptr(works, cpu);
> +
> +		INIT_WORK(work, refresh_vm_stats);
> +		if (need_update(cpu) || need_drain_remote_zones(cpu))

Of course that makes sense in general, but now you have two ways of
deciding whether updating this is required.

   1) The above

   2) The per CPU boolean which tells whether vmstats are dirty or not.

Can we have a third method perhaps?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled
  2022-03-15 15:31 ` [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
@ 2022-04-27  7:45   ` Thomas Gleixner
  2022-05-03 19:12     ` Marcelo Tosatti
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-27  7:45 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:

$Subject does not qualify as a parseable sentence.

> This avoids processing of TIF_TASK_ISOL, when returning to userspace,
> for tasks which do not have task isolation configured.

That's how kernel development works, right:

   1) Add half baken stuff
   2) Apply duct tape on top

You know exactly, that this is _not_ the way it works.

This whole thing is half thought out tinkerware with [ill|un]defined
semantics.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 07/13] task isolation: sync vmstats conditional on changes
  2022-03-15 15:31 ` [patch v12 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
  2022-03-17 14:51   ` Frederic Weisbecker
@ 2022-04-27  8:03   ` Thomas Gleixner
  1 sibling, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-27  8:03 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +struct static_key vmstat_sync_enabled;

jump_label.h:

"The use of 'struct static_key' directly, is now DEPRECATED."

> +DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +static inline void mark_vmstat_dirty(void)
> +{
> +	if (!static_key_false(&vmstat_sync_enabled))
> +		return;
> +
> +	raw_cpu_write(vmstat_dirty, true);

What's the justification for raw_cpu_write()?

>  
> @@ -1512,6 +1543,7 @@ static void pagetypeinfo_showfree_print(
>  		}
>  		seq_putc(m, '\n');
>  	}
> +	mark_vmstat_dirty();

Why does a function which just dumps information via /proc/pagetypeinfo
make vmstats dirty?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (13 preceding siblings ...)
  2022-03-17 15:08 ` [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Frederic Weisbecker
@ 2022-04-27  9:19 ` Christoph Lameter
  2022-05-03 18:57   ` Marcelo Tosatti
  2022-05-04 17:01 ` Tim Chen
  15 siblings, 1 reply; 44+ messages in thread
From: Christoph Lameter @ 2022-04-27  9:19 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, linux-rdma

Ok I actually have started an opensource project that may make use of the
onshot interface. This is a bridging tool between two RDMA protocols
called ib2roce. See https://gentwo.org/christoph/2022-bridging-rdma.pdf

The relevant code can be found at
https://github.com/clameter/rdma-core/tree/ib2roce/ib2roce. In
particular look at the ib2roce.c source code. This is still
under development.

The ib2roce briding can run in a busy loop mode (-k option) where it spins
on ibv_poll_cq() which is an RDMA call to handle incoming packets without
kernel interaction. See busyloop() in ib2roce.c

Currently I have configured the system to use CONFIG_NOHZ_FULL. With that
I am able to reliably forward packets at a rate that saturates 100G
Ethernet / EDR Infiniband from a single spinning thread.

Without CONFIG_NOHZ_FULL any slight disturbance causes the forwarding to
fall behind which will lead to dramatic packet loss since we are looking
here at a potential data rate of 12.5Gbyte/sec and about 12.5Mbyte per
msec. If the kernel interrupts the forwarding by say 10 msecs then we are
falling behind by 125MB which would have to be buffered and processing by
additional codes. That complexity makes it processing packets much slower
which could cause the forwarding to slow down so that a recovery is not
possible should the data continue to arrive at line rate.

Isolation of the threads was done through the following kernel parameters:

nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt

And systemd was configured with the following affinites:

system.conf:CPUAffinity=0-7,16-23

This means that the second socket will be generally free of tasks and
kernel threads.

The NUMA configuration:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 94798 MB
node 0 free: 92000 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 96765 MB
node 1 free: 96082 MB

node distances:
node   0   1
  0:  10  21
  1:  21  10


I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
provided by this patch instead of the NOHZ_FULL.

What kind of metric could I be using to show the difference in idleness of
the quality of the cpu isolation?

The ib2roce tool already has a CLI mode where one can monitor the
latencies that the busyloop experiences. See the latency calculations in
busyloop() and the CLI command "core". Stats can be reset via the "zap"
command.

I can see the usefulness of the oneshot mode but (I am very very sorry) I
still think that this patchset overdoes what is needed and I fail to
understand what the point of inheritance, per syscall quiescint etc is.
Those cause needless overhead in syscall handling and increase the
complexity of managing a busyloop. Special handling when the scheduler
switches a task? If tasks are being switched that requires them to be low
latency and undisturbed then something went very very wrong with the
system configuration and the only thing I would suggest is to issue some
kernel warning that this is not the way one should configure the system.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-04-27  7:11   ` Thomas Gleixner
@ 2022-04-27 12:09     ` Thomas Gleixner
  2022-05-04 16:32       ` Marcelo Tosatti
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2022-04-27 12:09 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang,
	Marcelo Tosatti

On Wed, Apr 27 2022 at 09:11, Thomas Gleixner wrote:
> On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
>> If a thread has task isolation activated, is preempted by thread B,
>> which marks vmstat information dirty, and is preempted back in,
>> one might return to userspace with vmstat dirty information on the 
>> CPU in question.
>>
>> To address this problem, add a preempt notifier that transfers vmstat dirty
>> information to TIF_TASK_ISOL thread flag.
>
> How does this compile with CONFIG_KVM=n?

Aside of that, the existance of this preempt notifier alone tells me
that this is either a design fail or has no design in the first place.

The state of vmstat does not matter at all at the point where a task is
scheduled in. It matters when an isolated task goes out to user space or
enters a VM.

We already have something similar in the exit to user path:

   tick_nohz_user_enter_prepare()

So you can do something like the below and have:

static inline void task_isol_exit_to_user_prepare(void)
{
        if (unlikely(current_needs_isol_exit_to_user())
        	__task_isol_exit_to_user_prepare();
}

where current_needs_isol_exit_to_user() is a simple check of either an
existing mechanism like

         task->syscall_work & SYSCALL_WORK_TASK_ISOL_EXIT

or of some new task isolation specific member of task_struct which is
placed so it is cache hot at that point:

        task->isol_work & SYSCALL_TASK_ISOL_EXIT

which is going to be almost zero overhead for any non isolated task.

It's trivial enough to encode the real stuff into task->isol_work and
I'm pretty sure, that a 32bit member is sufficient for that. There is
absolutely no need for a potential 64x64 bit feature matrix.

Thanks,

        tglx
---
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -142,6 +142,12 @@ void noinstr exit_to_user_mode(void)
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 
+static void exit_to_user_update_work(void)
+{
+	tick_nohz_user_enter_prepare();
+	task_isol_exit_to_user_prepare();
+}
+
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
@@ -178,8 +184,7 @@ static unsigned long exit_to_user_mode_l
 		 */
 		local_irq_disable_exit_to_user();
 
-		/* Check if any of the above work has queued a deferred wakeup */
-		tick_nohz_user_enter_prepare();
+		exit_to_user_update_work();
 
 		ti_work = read_thread_flags();
 	}
@@ -194,8 +199,7 @@ static void exit_to_user_mode_prepare(st
 
 	lockdep_assert_irqs_disabled();
 
-	/* Flush pending rcuog wakeup before the last need_resched() check */
-	tick_nohz_user_enter_prepare();
+	exit_to_user_update_work();
 
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-04-27  9:19 ` Christoph Lameter
@ 2022-05-03 18:57   ` Marcelo Tosatti
  2022-05-04 13:20     ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-03 18:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, linux-rdma

On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
> Ok I actually have started an opensource project that may make use of the
> onshot interface. This is a bridging tool between two RDMA protocols
> called ib2roce. See https://gentwo.org/christoph/2022-bridging-rdma.pdf
> 
> The relevant code can be found at
> https://github.com/clameter/rdma-core/tree/ib2roce/ib2roce. In
> particular look at the ib2roce.c source code. This is still
> under development.
> 
> The ib2roce briding can run in a busy loop mode (-k option) where it spins
> on ibv_poll_cq() which is an RDMA call to handle incoming packets without
> kernel interaction. See busyloop() in ib2roce.c
> 
> Currently I have configured the system to use CONFIG_NOHZ_FULL. With that
> I am able to reliably forward packets at a rate that saturates 100G
> Ethernet / EDR Infiniband from a single spinning thread.
> 
> Without CONFIG_NOHZ_FULL any slight disturbance causes the forwarding to
> fall behind which will lead to dramatic packet loss since we are looking
> here at a potential data rate of 12.5Gbyte/sec and about 12.5Mbyte per
> msec. If the kernel interrupts the forwarding by say 10 msecs then we are
> falling behind by 125MB which would have to be buffered and processing by
> additional codes. That complexity makes it processing packets much slower
> which could cause the forwarding to slow down so that a recovery is not
> possible should the data continue to arrive at line rate.

Right.

> Isolation of the threads was done through the following kernel parameters:
> 
> nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
> numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt
> 
> And systemd was configured with the following affinites:
> 
> system.conf:CPUAffinity=0-7,16-23
> 
> This means that the second socket will be generally free of tasks and
> kernel threads.
> 
> The NUMA configuration:
> 
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 94798 MB
> node 0 free: 92000 MB
> node 1 cpus: 8 9 10 11 12 13 14 15
> node 1 size: 96765 MB
> node 1 free: 96082 MB
> 
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> 
> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
> provided by this patch instead of the NOHZ_FULL.
> 
> What kind of metric could I be using to show the difference in idleness of
> the quality of the cpu isolation?

Interruption length and frequencies:

-------|xxxxx|---------------|xxx|---------
	 5us 		      3us

which is what should be reported by oslat ?

> 
> The ib2roce tool already has a CLI mode where one can monitor the
> latencies that the busyloop experiences. See the latency calculations in
> busyloop() and the CLI command "core". Stats can be reset via the "zap"
> command.
> 
> I can see the usefulness of the oneshot mode but (I am very very sorry) 

Its in there...

> I
> still think that this patchset overdoes what is needed and I fail to
> understand what the point of inheritance, per syscall quiescint etc is.

Inheritance is an attempt to support unmodified binaries like so:

1) configure task isolation parameters (eg sync per-CPU vmstat to global
stats on system call returns).
2) enable inheritance (so that task isolation configuration and
activation states are copied across to child processes).
3) enable task isolation.
4) execv(binary, params)

Per syscall quiescint ? Not sure what you mean here.

> Those cause needless overhead in syscall handling and increase the
> complexity of managing a busyloop. 

Inheritance seems like a useful feature to us. Isnt it? (to be able to
configure and activate task isolation for unmodified binaries).

> Special handling when the scheduler
> switches a task? If tasks are being switched that requires them to be low
> latency and undisturbed then something went very very wrong with the
> system configuration and the only thing I would suggest is to issue some
> kernel warning that this is not the way one should configure the system.

Trying to provide mechanisms, not policy? 

Or from another POV: if the user desires, we can display the warning.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-04-25 21:12     ` Thomas Gleixner
@ 2022-05-03 18:57       ` Marcelo Tosatti
  0 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-03 18:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Peter Zijlstra, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Christoph Lameter, Juri Lelli,
	Alex Belits, Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang

On Mon, Apr 25, 2022 at 11:12:27PM +0200, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 13:29, Marcelo Tosatti wrote:
> > On Thu, Mar 17, 2022 at 04:08:04PM +0100, Frederic Weisbecker wrote:
> >> 
> >> I still see a few details to sort out but overall the whole thing looks good:
> 
> >From a cursery inspection there is more than a few details to sort out.
> 
> >>   Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
> >> 
> >> Perhaps it's time to apply this patchset on some branch and iterate from there.
> >> 
> >> Thomas, Peter, what do you think?
> >> 
> >> Thanks!
> >
> > Ping ?
> 
> This does not apply against 5.18-rc1, which was released on April
> 3rd. Oh, well. You are really new to kernel development, right?

I can resend.

> Don't bother to resend before I finished reviewing the pile.

Sure, thanks!!!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled
  2022-04-27  7:45   ` Thomas Gleixner
@ 2022-05-03 19:12     ` Marcelo Tosatti
  2022-05-04 13:03       ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-03 19:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang

On Wed, Apr 27, 2022 at 09:45:54AM +0200, Thomas Gleixner wrote:
> On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> 
> $Subject does not qualify as a parseable sentence.
> 
> > This avoids processing of TIF_TASK_ISOL, when returning to userspace,
> > for tasks which do not have task isolation configured.
> 
> That's how kernel development works, right:
> 
>    1) Add half baken stuff
>    2) Apply duct tape on top
> 
> You know exactly, that this is _not_ the way it works.
> 
> This whole thing is half thought out tinkerware with [ill|un]defined
> semantics.

It seems to be inline with the remaining TIF_ bits:

                if (ti_work & _TIF_NOTIFY_RESUME)
                        tracehook_notify_resume(regs);

+               if (ti_work & _TIF_TASK_ISOL)
+                       task_isol_exit_to_user_mode();
+


And there is even:

--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED       3       /* rescheduling necessary */
 #define TIF_SINGLESTEP         4       /* reenable singlestep on user return*/
 #define TIF_SSBD               5       /* Speculative store bypass disable */
+#define TIF_TASK_ISOL          6       /* task isolation work pending */
 #define TIF_SPEC_IB            9       /* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH     10      /* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY 11      /* notify kernel of userspace return */

So the purpose of TIF_TASK_ISOL is to condense in a single bit the 
question: "is there task isolation work pending?"

By looking at the code, we see the sites where this bit is set are:

1) Task isolation configuration.

@@ -251,6 +257,11 @@ static int cfg_feat_quiesce_set(unsigned
        info->quiesce_mask = i_qctrl->quiesce_mask;
        info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
        info->conf_mask |= ISOL_F_QUIESCE;
+
+       if ((info->active_mask & ISOL_F_QUIESCE) &&
+           (info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS))
+               set_thread_flag(TIF_TASK_ISOL);
+
        ret = 0;

2) copy_process (clone / fork):

@@ -303,6 +314,7 @@ int __copy_task_isol(struct task_struct
                new_info->active_mask = info->active_mask;

        tsk->task_isol_info = new_info;
+       set_ti_thread_flag(task_thread_info(tsk), TIF_TASK_ISOL);

        return 0;
 }

3) task isolation activation:

@@ -330,6 +342,10 @@ int prctl_task_isol_activate_set(unsigne
        info->active_mask = active_mask;
        ret = 0;

+       if ((info->active_mask & ISOL_F_QUIESCE) &&
+           (info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS))
+               set_thread_flag(TIF_TASK_ISOL);
+
 out:
        return ret;

Would you prefer an explanation, in words, when these bits are set, when
they are cleared? 

So the meaning is:

+#define TIF_TASK_ISOL          6       /* task isolation work pending */

And the definition is "task isolation work" depends on what task
isolation features are implemented.

(honestly, seem pretty clear to me, but willing to improve...).


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2022-04-27  7:23   ` Thomas Gleixner
@ 2022-05-03 19:17     ` Marcelo Tosatti
  0 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-03 19:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang

On Wed, Apr 27, 2022 at 09:23:06AM +0200, Thomas Gleixner wrote:
> On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> >  	/*
> >  	 * The regular update, every sysctl_stat_interval, may come later
> > @@ -1948,9 +1977,19 @@ int vmstat_refresh(struct ctl_table *tab
> >  	 * transiently negative values, report an error here if any of
> >  	 * the stats is negative, so we know to go looking for imbalance.
> >  	 */
> > -	err = schedule_on_each_cpu(refresh_vm_stats);
> > -	if (err)
> > -		return err;
> > +	cpus_read_lock();
> > +	for_each_online_cpu(cpu) {
> > +		struct work_struct *work = per_cpu_ptr(works, cpu);
> > +
> > +		INIT_WORK(work, refresh_vm_stats);
> > +		if (need_update(cpu) || need_drain_remote_zones(cpu))
> 
> Of course that makes sense in general, but now you have two ways of
> deciding whether updating this is required.
> 
>    1) The above
> 
>    2) The per CPU boolean which tells whether vmstats are dirty or not.
> 
> Can we have a third method perhaps?

Ok, will think of a third method to increase clarity :-)

By the fourth method it will be clear for sure!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled
  2022-05-03 19:12     ` Marcelo Tosatti
@ 2022-05-04 13:03       ` Thomas Gleixner
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-05-04 13:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang

On Tue, May 03 2022 at 16:12, Marcelo Tosatti wrote:
> On Wed, Apr 27, 2022 at 09:45:54AM +0200, Thomas Gleixner wrote:
> It seems to be inline with the remaining TIF_ bits:
>
>                 if (ti_work & _TIF_NOTIFY_RESUME)
>                         tracehook_notify_resume(regs);
>
> +               if (ti_work & _TIF_TASK_ISOL)
> +                       task_isol_exit_to_user_mode();
> +
>
>
> And there is even:

I know that the bit is defined, but that does still not make an argument.

> By looking at the code, we see the sites where this bit is set are:

Again. I'm able to read the patches myself.

> Would you prefer an explanation, in words, when these bits are set, when
> they are cleared?

No. The point is that contrary to TIF_NOTIFY_RESUME and other TIF bits,
this is going to end up being sprinkled all over the place.

With the current vmstat quiesce that's limited, but it's bound to
increase and spread simply because the whole thing has no semantics and
it's all headed to be adhoc cure for the isolation itch of the day.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-05-03 18:57   ` Marcelo Tosatti
@ 2022-05-04 13:20     ` Thomas Gleixner
  2022-05-04 18:56       ` Marcelo Tosatti
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2022-05-04 13:20 UTC (permalink / raw)
  To: Marcelo Tosatti, Christoph Lameter
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Daniel Bristot de Oliveira, Oscar Shiang, linux-rdma

On Tue, May 03 2022 at 15:57, Marcelo Tosatti wrote:
> On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
>> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
>> provided by this patch instead of the NOHZ_FULL.
>> 
>> What kind of metric could I be using to show the difference in idleness of
>> the quality of the cpu isolation?
>
> Interruption length and frequencies:
>
> -------|xxxxx|---------------|xxx|---------
> 	 5us 		      3us
>
> which is what should be reported by oslat ?

How is oslat helpful there? That's running artifical workload benchmarks
which are not necessarily representing the actual
idle->interrupt->idle... timing sequence of the real world usecase.

> Inheritance is an attempt to support unmodified binaries like so:
>
> 1) configure task isolation parameters (eg sync per-CPU vmstat to global
> stats on system call returns).
> 2) enable inheritance (so that task isolation configuration and
> activation states are copied across to child processes).
> 3) enable task isolation.
> 4) execv(binary, params)

What for? If an application has isolation requirements, then the
specific requirements are part of the application design and not of some
arbitrary wrapper. Can we please focus on the initial problem of
providing a sensible isolation mechanism with well defined semantics?

Inheritance is an orthogonal problem and there is no reason to have this
initially.

>> Special handling when the scheduler
>> switches a task? If tasks are being switched that requires them to be low
>> latency and undisturbed then something went very very wrong with the
>> system configuration and the only thing I would suggest is to issue some
>> kernel warning that this is not the way one should configure the system.
>
> Trying to provide mechanisms, not policy? 

This preemption notifier is not a mechanism, it's simply mindless
hackery as I told you already.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-04-27 12:09     ` Thomas Gleixner
@ 2022-05-04 16:32       ` Marcelo Tosatti
  2022-05-04 17:39         ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-04 16:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang

On Wed, Apr 27, 2022 at 02:09:16PM +0200, Thomas Gleixner wrote:
> On Wed, Apr 27 2022 at 09:11, Thomas Gleixner wrote:
> > On Tue, Mar 15 2022 at 12:31, Marcelo Tosatti wrote:
> >> If a thread has task isolation activated, is preempted by thread B,
> >> which marks vmstat information dirty, and is preempted back in,
> >> one might return to userspace with vmstat dirty information on the 
> >> CPU in question.
> >>
> >> To address this problem, add a preempt notifier that transfers vmstat dirty
> >> information to TIF_TASK_ISOL thread flag.
> >
> > How does this compile with CONFIG_KVM=n?
> 
> Aside of that, the existance of this preempt notifier alone tells me
> that this is either a design fail or has no design in the first place.
> 
> The state of vmstat does not matter at all at the point where a task is
> scheduled in. It matters when an isolated task goes out to user space or
> enters a VM.

If the following happens, with two threads with names that mean whether
a thread has task isolation enabled or not:

Thread-no-task-isol, Thread-task-isol.

Events:

not-runnable  		Thread-task-isol
runnable      		Thread-task-no-isol
marks vmstat dirty	Thread-task-no-isol (writes to some per-CPU vmstat
counter)
not-runnable		Thread-task-no-isol
runnable		Thread-task-isol

Then we have to transfer the "vmstat dirty" information from per-CPU 
bool to per-thread TIF_TASK_ISOL bit (so that the
task_isolation_process_work thing executes on return to userspace).

> We already have something similar in the exit to user path:
> 
>    tick_nohz_user_enter_prepare()
> 
> So you can do something like the below and have:
> 
> static inline void task_isol_exit_to_user_prepare(void)
> {
>         if (unlikely(current_needs_isol_exit_to_user())
>         	__task_isol_exit_to_user_prepare();
> }
> 
> where current_needs_isol_exit_to_user() is a simple check of either an
> existing mechanism like
> 
>          task->syscall_work & SYSCALL_WORK_TASK_ISOL_EXIT
> 
> or of some new task isolation specific member of task_struct which is
> placed so it is cache hot at that point:
> 
>         task->isol_work & SYSCALL_TASK_ISOL_EXIT
> 
> which is going to be almost zero overhead for any non isolated task.

Sure, but who sets SYSCALL_TASK_ISOL_EXIT or SYSCALL_TASK_ISOL_EXIT ?

> It's trivial enough to encode the real stuff into task->isol_work and
> I'm pretty sure, that a 32bit member is sufficient for that. There is
> absolutely no need for a potential 64x64 bit feature matrix.

Well, OK, the meaning of TIF_TASK_ISOL thread flag is ambiguous:

1) We set it when quiescing vmstat feature of task isolation.
2) We set it when switching between tasks A and B, B has 
task isolation configured and activated, and per-CPU vmstat information 
was dirty.
3) We clear it on return to userspace:

	if (test_bit(TIF_TASK_ISOL, thread->flags)) {
		clear_bit(TIF_TASK_ISOL, thread->flags))
		process_task_isol_work();
	}

So you prefer to separate:

Use TIF_TASK_ISOL for "task isolation configured and activated,
quiesce vmstat work on return to userspace" only, and then have
the "is vmstat per-CPU data dirty?" information held on 
task->syscall_work or task->isol_work ? (that will be probably be two
cachelines).

You'd still need the preempt notifier, though (unless i am missing
something).

Happy with either case.

Thanks for the review!

> Thanks,
> 
>         tglx
> ---
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -142,6 +142,12 @@ void noinstr exit_to_user_mode(void)
>  /* Workaround to allow gradual conversion of architecture code */
>  void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
>  
> +static void exit_to_user_update_work(void)
> +{
> +	tick_nohz_user_enter_prepare();
> +	task_isol_exit_to_user_prepare();
> +}
> +
>  static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  					    unsigned long ti_work)
>  {
> @@ -178,8 +184,7 @@ static unsigned long exit_to_user_mode_l
>  		 */
>  		local_irq_disable_exit_to_user();
>  
> -		/* Check if any of the above work has queued a deferred wakeup */
> -		tick_nohz_user_enter_prepare();
> +		exit_to_user_update_work();
>  
>  		ti_work = read_thread_flags();
>  	}
> @@ -194,8 +199,7 @@ static void exit_to_user_mode_prepare(st
>  
>  	lockdep_assert_irqs_disabled();
>  
> -	/* Flush pending rcuog wakeup before the last need_resched() check */
> -	tick_nohz_user_enter_prepare();
> +	exit_to_user_update_work();
>  
>  	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (14 preceding siblings ...)
  2022-04-27  9:19 ` Christoph Lameter
@ 2022-05-04 17:01 ` Tim Chen
  2022-05-04 20:08   ` Marcelo Tosatti
  15 siblings, 1 reply; 44+ messages in thread
From: Tim Chen @ 2022-05-04 17:01 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Tue, 2022-03-15 at 12:31 -0300, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
> 
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
> 
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
> 
> To fix this, add task isolation prctl interface to quiesce
> deferred actions when returning to userspace.
> 
> The patchset is based on ideas and code from the
> task isolation patchset from Alex Belits:
> https://lwn.net/Articles/816298/
> 
> Please refer to Documentation/userspace-api/task_isolation.rst
> (patch 1) for details. Its attached at the end of this message

Patch 1 doesn't seem to be the documentation patch but rather is
in patch 4.

> 
> Task isolation prctl interface
> ******************************
> 
> Certain types of applications benefit from running uninterrupted by
> background OS activities. Realtime systems and high-bandwidth
> networking applications with user-space drivers can fall into the
> category.
> 
> To create an OS noise free environment for the application, this
> interface allows userspace to inform the kernel the start and end of
> the latency sensitive application section (with configurable system
> behaviour for that section).
> 
> Note: the prctl interface is independent of nohz_full=.
> 
> The prctl options are:
> 
>    * PR_ISOL_FEAT_GET: Retrieve supported features.
> 
>    * PR_ISOL_CFG_GET: Retrieve task isolation configuration.
> 
>    * PR_ISOL_CFG_SET: Set task isolation configuration.
> 
>    * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
> 
>    * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
> 
> Summary of terms:
> 
> * feature:
> 
>      A distinct attribute or aspect of task isolation. Examples of
>      features could be logging, new operating modes (eg: syscalls
>      disallowed), userspace notifications, etc. The only feature
>      currently available is quiescing.
> 
> * configuration:
> 
>      A specific choice from a given set of possible choices that
>      dictate how the particular feature in question should behave.
> 
> * activation state:
> 
>      The activation state (whether active/inactive) of the task
>      isolation features (features must be configured before being
>      activated).
> 
> Inheritance of the isolation parameters and state, across fork(2) and
> clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
> 
> At a high-level, task isolation is divided in two steps:
> 
> 1. Configuration.
> 
> 2. Activation.
> 
> Section "Userspace support" describes how to use task isolation.
> 
> In terms of the interface, the sequence of steps to activate task
> isolation are:
> 
> 1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
> 
> 2. Configure task isolation features
>    (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
> 
> 3. Activate or deactivate task isolation features
>    (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
> 
> This interface is based on ideas and code from the task isolation
> patchset from Alex Belits: https://lwn.net/Articles/816298/
> 
> Note: if the need arises to configure an individual quiesce feature
> with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
> PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
> features per syscall currently).
> 
> 
> Feature description
> ===================
> 
>    * "ISOL_F_QUIESCE"
> 
>    This feature allows quiescing selected kernel activities on return
>    from system calls.
> 
> 
> Interface description
> =====================
> 
> **PR_ISOL_FEAT**:
> 
>    Returns the supported features and feature capabilities, as a
>    bitmask:
> 
>       prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> 
>    The 'feat' argument specifies whether to return supported features
>    (if zero), or feature capabilities (if not zero). Possible values
>    for 'feat' are:
> 
>    * "0":
> 
>         Return the bitmask of supported features, in the location
>         pointed  to  by  "(int *)arg3". The buffer should allow space
>         for 8 bytes.
> 
>    * "ISOL_F_QUIESCE":
> 
>         Return a structure containing which kernel activities are
>         supported for quiescing, in the location pointed to by "(int
>         *)arg3":
> 
>            struct task_isol_quiesce_extensions {
>                    __u64 flags;
>                    __u64 supported_quiesce_bits;
>                    __u64 pad[6];
>            };
> 
>         Where:
> 
>         *flags*: Additional flags (should be zero).
> 
>         *supported_quiesce_bits*: Bitmask indicating
>            which features are supported for quiescing.
> 
>         *pad*: Additional space for future enhancements.
> 
>    Features and its capabilities are defined at
>    include/uapi/linux/task_isolation.h.
> 
> **PR_ISOL_CFG_GET**:
> 
>    Retrieve task isolation configuration. The general format is:
> 
>       prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
> 
>    The 'what' argument specifies what to configure. Possible values
>    are:
> 
>    * "I_CFG_FEAT":
> 
>         Return configuration of task isolation features. The 'arg3'
>         argument specifies whether to return configured features (if
>         zero), or individual feature configuration (if not zero), as
>         follows.
> 
>         * "0":
> 
>              Return the bitmask of configured features, in the
>              location pointed  to  by  "(int *)arg4". The buffer
>              should allow space for 8 bytes.
> 
>         * "ISOL_F_QUIESCE":
> 
>              If arg4 is QUIESCE_CONTROL, return the control structure
>              for quiescing of background kernel activities, in the
>              location pointed to by "(int *)arg5":
> 
>                 struct task_isol_quiesce_control {
>                        __u64 flags;
>                        __u64 quiesce_mask;
>                        __u64 quiesce_oneshot_mask;
>                        __u64 pad[5];
>                 };
> 
>              See PR_ISOL_CFG_SET description for meaning of fields.
> 
>    * "I_CFG_INHERIT":
> 
>         Retrieve inheritance configuration across fork/clone.
> 
>         Return the structure which configures inheritance across
>         fork/clone, in the location pointed to by "(int *)arg4":
> 
>            struct task_isol_inherit_control {
>                    __u8    inherit_mask;
>                    __u8    pad[7];
>            };
> 
>         See PR_ISOL_CFG_SET description for meaning of fields.
> 
> **PR_ISOL_CFG_SET**:
> 
>    Set task isolation configuration. The general format is:
> 
>       prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> 
>    The 'what' argument specifies what to configure. Possible values
>    are:
> 
>    * "I_CFG_FEAT":
> 
>         Set configuration of task isolation features. 'arg3' specifies
>         the feature. Possible values are:
> 
>         * "ISOL_F_QUIESCE":

Is it really necessary for such fine grain control for which kernel
activity to quiesce?  

For most user, all they care about is their
task is not disturbed by kernel activities and not be bothered about
setting which particular activities to quiesce.  And in your patches there
is only ISOL_F_QUIESCE_VMSTATS and nothing else.  I think you could
probably skip the QUIESCE control for now and add it when there's really
a true need for fine grain control.  This will make the interface simpler
for user applications.

Tim






^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-05-04 16:32       ` Marcelo Tosatti
@ 2022-05-04 17:39         ` Thomas Gleixner
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-05-04 17:39 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang

On Wed, May 04 2022 at 13:32, Marcelo Tosatti wrote:
> On Wed, Apr 27, 2022 at 02:09:16PM +0200, Thomas Gleixner wrote:
>> Aside of that, the existance of this preempt notifier alone tells me
>> that this is either a design fail or has no design in the first place.
>> 
>> The state of vmstat does not matter at all at the point where a task is
>> scheduled in. It matters when an isolated task goes out to user space or
>> enters a VM.
>
> If the following happens, with two threads with names that mean whether
> a thread has task isolation enabled or not:
>
> Thread-no-task-isol, Thread-task-isol.
>
> Events:
>
> not-runnable  		Thread-task-isol
> runnable      		Thread-task-no-isol
> marks vmstat dirty	Thread-task-no-isol (writes to some per-CPU vmstat
> counter)
> not-runnable		Thread-task-no-isol
> runnable		Thread-task-isol
>
> Then we have to transfer the "vmstat dirty" information from per-CPU 
> bool to per-thread TIF_TASK_ISOL bit (so that the
> task_isolation_process_work thing executes on return to userspace).

That's absolute nonsense.

sched_out()      isolated task
vmstat_dirty()
  this_cpu_or(isolwork, VMSTAT);
sched_in()       isolated task

return_to_user()
  local_irq_disable();
  exit_to_user_update_work()
    task_isol_exit_to_user_prepare()
      if (!isolated_task())
          return;
      if (this_cpu_read(isolwork) & current->isol_work_mask)
      	  set_thread_flag(TIF_ISOL);

  exit_to_user_mode_loop()
     do {
        local_irq_enable();
        handle_TIF_bits();
        local_irq_disable();
        exit_to_user_update_work();
        work = read_thread_flags();
     } while (work & EXIT_WORK);
          
Solves the problem nicely with a minimal overhead for non-isolated
tasks.

Plus some of these isolwork bits could even be handled _after_ returning
from exit_do_user_mode_loop() if they are good to be done in irq
diasbled context.

> Sure, but who sets SYSCALL_TASK_ISOL_EXIT or SYSCALL_TASK_ISOL_EXIT ?

It's set once by the prctl() when an isolation feature is enabled for a
task and it's cleared by the prctl() when the last isolation feature is
disabled for the task.

That's then used in:

static inline bool isolated_task()
{
       return current->XXXX_work & TASK_ISOL_EXIT;
}

IOW, the return to user path has

     - _ONE_ extra cache hot conditional for non-isolated tasks.

     - _ONE_ central place to transform the per cpu isolation muck into
       the TIF flag.

See? No sprinkling of TIF bits, no preempt notifiers, nothing.

> Use TIF_TASK_ISOL for "task isolation configured and activated,
> quiesce vmstat work on return to userspace" only, and then have
> the "is vmstat per-CPU data dirty?" information held on 
> task->syscall_work or task->isol_work ? (that will be probably be two
> cachelines).

See above.

> You'd still need the preempt notifier, though (unless i am missing
> something).

Yes, see above.

Using a preempt notifier isa design fail because it tags information at
a place where this information is absolutely irrelevant and subject to
change.

Aside of that this information is not a task property. vmmstat_is_dirty
is a per CPU property. The only point where this per CPU property is
relevant for a task is when the task is isolated and goes out to user
space or enters a VM.

Trying to carry this information in a task flag is fundamentaly wrong
for obvious reasons and causes pointless overhead and complexity for
absolutely no value.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-05-04 13:20     ` Thomas Gleixner
@ 2022-05-04 18:56       ` Marcelo Tosatti
  2022-05-04 20:15         ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-04 18:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Lameter, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Frederic Weisbecker, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang, linux-rdma

On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
> On Tue, May 03 2022 at 15:57, Marcelo Tosatti wrote:
> > On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
> >> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
> >> provided by this patch instead of the NOHZ_FULL.
> >> 
> >> What kind of metric could I be using to show the difference in idleness of
> >> the quality of the cpu isolation?
> >
> > Interruption length and frequencies:
> >
> > -------|xxxxx|---------------|xxx|---------
> > 	 5us 		      3us
> >
> > which is what should be reported by oslat ?
> 
> How is oslat helpful there? That's running artifical workload benchmarks
> which are not necessarily representing the actual
> idle->interrupt->idle... timing sequence of the real world usecase.

Ok, so what is happening today on production on some telco installations 
(considering virtualized RAN usecase) is this:

1) Basic testing: Verify the hardware, software and its configuration
(cpu isolation parameters etc) are able to achieve the desired maximum
interruption length/frequencies through a synthetic benchmark like 
cyclictest and oslat, for a duration that is considered sufficient.

One might also use the actual application in a synthetic configuration
(for example FlexRAN with synthetic data).

2) From the above, assume real world usecase is able to achieve the
desired maximum interruption length/frequency.

Of course, that is sub-optimal. The rt-trace-bcc.py scripts instruments
certain functions in the kernel (say smp_call_function family), 
allowing one to check if certain interruptions have happened
on production. Example:

$ sudo ./rt-trace-bcc.py -c 36-39
[There can be some warnings dumped; we can ignore them]
Enabled hook point: process_one_work
Enabled hook point: __queue_work
Enabled hook point: __queue_delayed_work
Enabled hook point: generic_exec_single
Enabled hook point: smp_call_function_many_cond
Enabled hook point: irq_work_queue
Enabled hook point: irq_work_queue_on
TIME(s)            COMM                 CPU  PID      MSG
0.009599293        rcuc/8               8    75       irq_work_queue_on (target=36, func=nohz_full_kick_func)
0.009603039        rcuc/8               8    75       irq_work_queue_on (target=37, func=nohz_full_kick_func)
0.009604047        rcuc/8               8    75       irq_work_queue_on (target=38, func=nohz_full_kick_func)
0.009604848        rcuc/8               8    75       irq_work_queue_on (target=39, func=nohz_full_kick_func)
0.103600589        rcuc/8               8    75       irq_work_queue_on (target=36, func=nohz_full_kick_func)
...

Currently it does not record the length of each interruption, but it
could (or you can do that from its output).

Note however that the sum of the interruptions is not the entire
overhead caused by the interruptions: there might also cachelines thrown 
away which are only going to be "counted" when the latency sensitive
app executes.

But you could say the overhead is _at least_ the sum of interruptions +
cache effects unaccounted for.

Also note that for idle->interrupt->idle type scenarios (the vRAN
usecase above currently does not idle at all, but there is interest
from the field for that to happen for power saving reasons), you'd
also sum return from idle.

> > Inheritance is an attempt to support unmodified binaries like so:
> >
> > 1) configure task isolation parameters (eg sync per-CPU vmstat to global
> > stats on system call returns).
> > 2) enable inheritance (so that task isolation configuration and
> > activation states are copied across to child processes).
> > 3) enable task isolation.
> > 4) execv(binary, params)
> 
> What for? If an application has isolation requirements, then the
> specific requirements are part of the application design and not of some
> arbitrary wrapper. 

To be able to configure and active task isolation for an unmodified
binary, which seems a useful feature. However, have no problem of not 
supporting unmodified binaries (would have then to change the
applications).

There are 3 types of application arrangements:

==================
Userspace support
==================

Task isolation is divided in two main steps: configuration and activation.

Each step can be performed by an external tool or the latency sensitive
application itself. util-linux contains the "chisol" tool for this
purpose.

This results in three combinations:

1. Both configuration and activation performed by the
latency sensitive application.
Allows fine grained control of what task isolation
features are enabled and when (see samples section below).

2. Only activation can be performed by the latency sensitive app
(and configuration performed by chisol).
This allows the admin/user to control task isolation parameters,
and applications have to be modified only once.

3. Configuration and activation performed by an external tool.
This allows unmodified applications to take advantage of
task isolation. Activation is performed by the "-a" option
of chisol.

---

Some features might not be supportable (or have awkward behavior) on a
given combination. For example, if a feature such as "warn if
sched_out/sched_in is ever performed if task isolation is
configured/activated", then you'll get those warnings 
for combination 3 (which is the case of unmodified binaries above).

> Inheritance is an orthogonal problem and there is no reason to have this
> initially.

No problem, will drop it.

> Can we please focus on the initial problem of
> providing a sensible isolation mechanism with well defined semantics?

Case 2, however, was implicitly suggested by you (or at least i
understood that):

"Summary: The problem to be solved cannot be restricted to

    self_defined_important_task(OWN_WORLD);

Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.

Up to the point where the kernel puts the line of last defence, policy
is defined by the user/admin via mechanims provided by the kernel.

Emphasis on "mechanims provided by the kernel", aka. user API.

Just in case, I hope that I don't have to explain what level of scrunity
and thought this requires." 

The idea, as i understood was that certain task isolation features (or
they parameters) might have to be changed at runtime (which depends on
the task isolation features themselves, and the plan is to create
an extensible interface). So for case 2, all you'd have to do is to 
modify the application only once and allow the admin to configure 
the features. From the documentation:

This is a snippet of code to activate task isolation if
it has been previously configured (by chisol for example)::

        #include <sys/prctl.h>
        #include <linux/types.h>

        #ifdef PR_ISOL_CFG_GET
        unsigned long long fmask;

        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
        if (ret != -1 && fmask != 0) {
                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
                if (ret == -1) {
                        perror("prctl PR_ISOL_ACTIVATE_SET");
                        return ret;
                }
        }
        #endif

This seemed pretty useful to me (which is possible if the features 
being discussed do not require further modifications on the part of the
application). For example, a new task isolation feature can be enabled 
without having to modify the application.

Again, maybe that was misunderstood (and i'm OK with dropping this 
and forcing both configuration and activation to be performed
inside the app), no problem.

> >> Special handling when the scheduler
> >> switches a task? If tasks are being switched that requires them to be low
> >> latency and undisturbed then something went very very wrong with the
> >> system configuration and the only thing I would suggest is to issue some
> >> kernel warning that this is not the way one should configure the system.
> >
> > Trying to provide mechanisms, not policy? 
> 
> This preemption notifier is not a mechanism, it's simply mindless
> hackery as I told you already.

Sure, if there is another way of checking "if per-CPU vmstats require
syncing" that is cheap (which seems you suggested on the other email),
can drop preempt notifiers.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-05-04 17:01 ` Tim Chen
@ 2022-05-04 20:08   ` Marcelo Tosatti
  0 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-04 20:08 UTC (permalink / raw)
  To: Tim Chen, Thomas Gleixner
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira, Oscar Shiang

On Wed, May 04, 2022 at 10:01:47AM -0700, Tim Chen wrote:
> On Tue, 2022-03-15 at 12:31 -0300, Marcelo Tosatti wrote:
> > The logic to disable vmstat worker thread, when entering
> > nohz full, does not cover all scenarios. For example, it is possible
> > for the following to happen:
> > 
> > 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> > 2) app runs mlock, which increases counters for mlock'ed pages.
> > 3) start -RT loop
> > 
> > Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> > the mlock, vmstat shepherd can restart vmstat worker thread on
> > the CPU in question.
> > 
> > To fix this, add task isolation prctl interface to quiesce
> > deferred actions when returning to userspace.
> > 
> > The patchset is based on ideas and code from the
> > task isolation patchset from Alex Belits:
> > https://lwn.net/Articles/816298/
> > 
> > Please refer to Documentation/userspace-api/task_isolation.rst
> > (patch 1) for details. Its attached at the end of this message
> 
> Patch 1 doesn't seem to be the documentation patch but rather is
> in patch 4.

Hi Tim,

Yes, one might consider that awkward (but that order made sense when 
writing the patches).

Is there a different order that makes more sense?

> > 
> > Task isolation prctl interface
> > ******************************
> > 
> > Certain types of applications benefit from running uninterrupted by
> > background OS activities. Realtime systems and high-bandwidth
> > networking applications with user-space drivers can fall into the
> > category.
> > 
> > To create an OS noise free environment for the application, this
> > interface allows userspace to inform the kernel the start and end of
> > the latency sensitive application section (with configurable system
> > behaviour for that section).
> > 
> > Note: the prctl interface is independent of nohz_full=.
> > 
> > The prctl options are:
> > 
> >    * PR_ISOL_FEAT_GET: Retrieve supported features.
> > 
> >    * PR_ISOL_CFG_GET: Retrieve task isolation configuration.
> > 
> >    * PR_ISOL_CFG_SET: Set task isolation configuration.
> > 
> >    * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
> > 
> >    * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
> > 
> > Summary of terms:
> > 
> > * feature:
> > 
> >      A distinct attribute or aspect of task isolation. Examples of
> >      features could be logging, new operating modes (eg: syscalls
> >      disallowed), userspace notifications, etc. The only feature
> >      currently available is quiescing.
> > 
> > * configuration:
> > 
> >      A specific choice from a given set of possible choices that
> >      dictate how the particular feature in question should behave.
> > 
> > * activation state:
> > 
> >      The activation state (whether active/inactive) of the task
> >      isolation features (features must be configured before being
> >      activated).
> > 
> > Inheritance of the isolation parameters and state, across fork(2) and
> > clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
> > 
> > At a high-level, task isolation is divided in two steps:
> > 
> > 1. Configuration.
> > 
> > 2. Activation.
> > 
> > Section "Userspace support" describes how to use task isolation.
> > 
> > In terms of the interface, the sequence of steps to activate task
> > isolation are:
> > 
> > 1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
> > 
> > 2. Configure task isolation features
> >    (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
> > 
> > 3. Activate or deactivate task isolation features
> >    (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
> > 
> > This interface is based on ideas and code from the task isolation
> > patchset from Alex Belits: https://lwn.net/Articles/816298/
> > 
> > Note: if the need arises to configure an individual quiesce feature
> > with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
> > PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
> > features per syscall currently).
> > 
> > 
> > Feature description
> > ===================
> > 
> >    * "ISOL_F_QUIESCE"
> > 
> >    This feature allows quiescing selected kernel activities on return
> >    from system calls.
> > 
> > 
> > Interface description
> > =====================
> > 
> > **PR_ISOL_FEAT**:
> > 
> >    Returns the supported features and feature capabilities, as a
> >    bitmask:
> > 
> >       prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> > 
> >    The 'feat' argument specifies whether to return supported features
> >    (if zero), or feature capabilities (if not zero). Possible values
> >    for 'feat' are:
> > 
> >    * "0":
> > 
> >         Return the bitmask of supported features, in the location
> >         pointed  to  by  "(int *)arg3". The buffer should allow space
> >         for 8 bytes.
> > 
> >    * "ISOL_F_QUIESCE":
> > 
> >         Return a structure containing which kernel activities are
> >         supported for quiescing, in the location pointed to by "(int
> >         *)arg3":
> > 
> >            struct task_isol_quiesce_extensions {
> >                    __u64 flags;
> >                    __u64 supported_quiesce_bits;
> >                    __u64 pad[6];
> >            };
> > 
> >         Where:
> > 
> >         *flags*: Additional flags (should be zero).
> > 
> >         *supported_quiesce_bits*: Bitmask indicating
> >            which features are supported for quiescing.
> > 
> >         *pad*: Additional space for future enhancements.
> > 
> >    Features and its capabilities are defined at
> >    include/uapi/linux/task_isolation.h.
> > 
> > **PR_ISOL_CFG_GET**:
> > 
> >    Retrieve task isolation configuration. The general format is:
> > 
> >       prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
> > 
> >    The 'what' argument specifies what to configure. Possible values
> >    are:
> > 
> >    * "I_CFG_FEAT":
> > 
> >         Return configuration of task isolation features. The 'arg3'
> >         argument specifies whether to return configured features (if
> >         zero), or individual feature configuration (if not zero), as
> >         follows.
> > 
> >         * "0":
> > 
> >              Return the bitmask of configured features, in the
> >              location pointed  to  by  "(int *)arg4". The buffer
> >              should allow space for 8 bytes.
> > 
> >         * "ISOL_F_QUIESCE":
> > 
> >              If arg4 is QUIESCE_CONTROL, return the control structure
> >              for quiescing of background kernel activities, in the
> >              location pointed to by "(int *)arg5":
> > 
> >                 struct task_isol_quiesce_control {
> >                        __u64 flags;
> >                        __u64 quiesce_mask;
> >                        __u64 quiesce_oneshot_mask;
> >                        __u64 pad[5];
> >                 };
> > 
> >              See PR_ISOL_CFG_SET description for meaning of fields.
> > 
> >    * "I_CFG_INHERIT":
> > 
> >         Retrieve inheritance configuration across fork/clone.
> > 
> >         Return the structure which configures inheritance across
> >         fork/clone, in the location pointed to by "(int *)arg4":
> > 
> >            struct task_isol_inherit_control {
> >                    __u8    inherit_mask;
> >                    __u8    pad[7];
> >            };
> > 
> >         See PR_ISOL_CFG_SET description for meaning of fields.
> > 
> > **PR_ISOL_CFG_SET**:
> > 
> >    Set task isolation configuration. The general format is:
> > 
> >       prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> > 
> >    The 'what' argument specifies what to configure. Possible values
> >    are:
> > 
> >    * "I_CFG_FEAT":
> > 
> >         Set configuration of task isolation features. 'arg3' specifies
> >         the feature. Possible values are:
> > 
> >         * "ISOL_F_QUIESCE":
> 
> Is it really necessary for such fine grain control for which kernel
> activity to quiesce?  
> 
> For most user, all they care about is their
> task is not disturbed by kernel activities and not be bothered about
> setting which particular activities to quiesce.  And in your patches there
> is only ISOL_F_QUIESCE_VMSTATS and nothing else.  I think you could
> probably skip the QUIESCE control for now and add it when there's really
> a true need for fine grain control.  This will make the interface simpler
> for user applications.
> 
> Tim

Yes, this could be done (for example can maintain the current scheme,
add a way to query all supported features, enable them at chisol).
Then the application only has to be modified with:

This is a snippet of code to activate task isolation if
it has been previously configured (by chisol for example)::

        #include <sys/prctl.h>
        #include <linux/types.h>

        #ifdef PR_ISOL_CFG_GET
        unsigned long long fmask;

        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
        if (ret != -1 && fmask != 0) {
                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
                if (ret == -1) {
                        perror("prctl PR_ISOL_ACTIVATE_SET");
                        return ret;
                }
        }
        #endif

Or not modified at all (well not initially since support for executing
unmodified binaries will be dropped).









^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-05-04 18:56       ` Marcelo Tosatti
@ 2022-05-04 20:15         ` Thomas Gleixner
  2022-05-05 16:52           ` Marcelo Tosatti
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2022-05-04 20:15 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Frederic Weisbecker, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang, linux-rdma

On Wed, May 04 2022 at 15:56, Marcelo Tosatti wrote:
> On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
>> Can we please focus on the initial problem of
>> providing a sensible isolation mechanism with well defined semantics?
>
> Case 2, however, was implicitly suggested by you (or at least i
> understood that):
>
> "Summary: The problem to be solved cannot be restricted to
>
>     self_defined_important_task(OWN_WORLD);
>
> Policy is not a binary on/off problem. It's manifold across all levels
> of the stack and only a kernel problem when it comes down to the last
> line of defence.
>
> Up to the point where the kernel puts the line of last defence, policy
> is defined by the user/admin via mechanims provided by the kernel.
>
> Emphasis on "mechanims provided by the kernel", aka. user API.
>
> Just in case, I hope that I don't have to explain what level of scrunity
> and thought this requires."

Correct. This reasoning is still valid and I haven't changed my opinion
on that since then.

My main objections against the proposed solution back then were the all
or nothing approach and the implicit hard coded policies.

> The idea, as i understood was that certain task isolation features (or
> they parameters) might have to be changed at runtime (which depends on
> the task isolation features themselves, and the plan is to create
> an extensible interface).

Again. I'm not against useful controls to select the isolation an
application requires. I'm neither against extensible interfaces.

But I'm against overengineered implementations which lack any form of
sensible design and have ill defined semantics at the user ABI.

Designing user space ABI is _hard_ and needs a lot of thoughts. It's not
done with throwing something 'extensible' at the kernel and hope it
sticks. As I showed you in the review, the ABI is inconsistent in
itself, it has ill defined semantics and lacks any form of justification
of the approach taken.

Can we please take a step back and:

  1) Define what is trying to be solved and what are the pieces known
     today which need to be controlled in order to achieve the desired
     isolation properties.

  2) Describe the usage scenarios and the resulting constraints.

  3) Describe the requirements for features on top, e.g. inheritance
     or external control.

Once we have that, we can have a discussion about the desired control
granularity and how to support the extra features in a consistent and
well defined way.

A good and extensible UABI design comes with well defined functionality
for the start and an obvious and maintainable extension path. The most
important part is the well defined functionality.

There have been enough examples in the past how well received approaches
are, which lack the well defined part. Linus really loves to get a pull
request for something which cannot be described what it does, but could
be used for cool things in the future.

> So for case 2, all you'd have to do is to modify the application only
> once and allow the admin to configure the features.

That's still an orthogonal problem, which can be solved once a sensible
mechanism to control the isolation and handle it at the transition
points is in place. You surely want to consider it when designing the
UABI, but it's not required to create the real isolation mechanism in
the first place.

Problem decomposition is not an entirely new concept, really.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-05-04 20:15         ` Thomas Gleixner
@ 2022-05-05 16:52           ` Marcelo Tosatti
  2022-06-01 16:14             ` Marcelo Tosatti
  0 siblings, 1 reply; 44+ messages in thread
From: Marcelo Tosatti @ 2022-05-05 16:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Lameter, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Frederic Weisbecker, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang, linux-rdma


Hi Thomas,

On Wed, May 04, 2022 at 10:15:14PM +0200, Thomas Gleixner wrote:
> On Wed, May 04 2022 at 15:56, Marcelo Tosatti wrote:
> > On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
> >> Can we please focus on the initial problem of
> >> providing a sensible isolation mechanism with well defined semantics?
> >
> > Case 2, however, was implicitly suggested by you (or at least i
> > understood that):
> >
> > "Summary: The problem to be solved cannot be restricted to
> >
> >     self_defined_important_task(OWN_WORLD);
> >
> > Policy is not a binary on/off problem. It's manifold across all levels
> > of the stack and only a kernel problem when it comes down to the last
> > line of defence.
> >
> > Up to the point where the kernel puts the line of last defence, policy
> > is defined by the user/admin via mechanims provided by the kernel.
> >
> > Emphasis on "mechanims provided by the kernel", aka. user API.
> >
> > Just in case, I hope that I don't have to explain what level of scrunity
> > and thought this requires."
> 
> Correct. This reasoning is still valid and I haven't changed my opinion
> on that since then.
> 
> My main objections against the proposed solution back then were the all
> or nothing approach and the implicit hard coded policies.
> 
> > The idea, as i understood was that certain task isolation features (or
> > they parameters) might have to be changed at runtime (which depends on
> > the task isolation features themselves, and the plan is to create
> > an extensible interface).
> 
> Again. I'm not against useful controls to select the isolation an
> application requires. I'm neither against extensible interfaces.
> 
> But I'm against overengineered implementations which lack any form of
> sensible design and have ill defined semantics at the user ABI.
> 
> Designing user space ABI is _hard_ and needs a lot of thoughts. It's not
> done with throwing something 'extensible' at the kernel and hope it
> sticks. As I showed you in the review, the ABI is inconsistent in
> itself, it has ill defined semantics and lacks any form of justification
> of the approach taken.
> 
> Can we please take a step back and:
> 
>   1) Define what is trying to be solved

Avoid interruptions to application code execution on isolated CPUs.

Different use-cases might accept different length/frequencies
of interruptions (including no interruptions).

>      and what are the pieces known
>      today which need to be controlled in order to achieve the desired
>      isolation properties.

I hope you don't mean the current CPU isolation features which have to
be enabled, but only the ones which are not enabled today:

"Isolation of the threads was done through the following kernel parameters:

nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt

And systemd was configured with the following affinites:

system.conf:CPUAffinity=0-7,16-23

This means that the second socket will be generally free of tasks and   
kernel threads."

So here are some features which could be written on top of the proposed
task isolation via prctl:

1) 

Enable or disable the following optional behaviour

A.
if (cpu->isolated_avoid_queue_work)
	return -EBUSY;

queue_work_on(cpu, workfn);

(for the functions that can handle errors gracefully).

B.
if (cpu->isolated_avoid_function_ipi)
	return -EBUSY;

smp_call_function_single(cpu, fn);
(for the functions that can handle errors gracefully).
Those that can't handle errors gracefully should be changed 
to either handle errors or to remote work.

Not certain if this should be on per-case basis: say
"avoid action1|avoid action2|avoid action3|..." (bit per
action) and a "ALL" control, where actionZ is an action
that triggers an IPI or remote work (then you would check
for whether to fail not at smp_call_function_single 
time but before the action starts).

Also, one might use something such as stalld (that schedules 
tasks in/out for a short amount of time every given time window),
which might be acceptable for his workload, so he'd disable
cpu->isolated_avoid_queue_work (or expose this on per-case basis,
unsure which is better).

As for IPIs, whether to block a function call to an isolated
CPU depends on whether that function call (and its frequency) 
will cause the latency sensitive application to violate its "latency" 
requirements.

Perhaps "ALL - action1, action2, action3" is useful.

=======================================

2)

In general, avoiding (or uncaching on return to userspace) 
a CPU from caching per-CPU data (which might require an 
IPI to invalidate later on) (see point [1] below for more thoughts
on this issue).


For example, for KVM:

/*
 * MMU notifier 'invalidate_range_start' hook.
 */
void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
                                       unsigned long end, bool may_block)
{
        DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
        struct gfn_to_pfn_cache *gpc;
        bool wake_vcpus = false;
	...
	called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);

	which will
	smp_call_function_many(cpus, ack_flush, NULL, wait);
...


====================================================

3) Enabling a kernel warning when a task switch happens on a CPU
which runs a task isolated thread?

From Christoph:

Special handling when the scheduler
switches a task? If tasks are being switched that requires them to be low
latency and undisturbed then something went very very wrong with the
system configuration and the only thing I would suggest is to issue some
kernel warning that this is not the way one should configure the system.

====================================================

4) Sending a signal whenever an application is interrupted
(hum, this could be done via BPF).

Those are the ones i can think of at the moment. 
Not sure what other people can think of.

>   2) Describe the usage scenarios and the resulting constraints.

Well the constraints should be in the form

	"In a given window of time, there should be no more than N
	 CPU interruptions of length L each."

	(should be more complicated due to cache effects, but choosing
	 a lower N and L one is able to correct that)

I believe?

Also some memory bandwidth must be available to the application
(or data/code in shared caches).
Which depends on what other CPUs in the system are doing, the
cache hierarchy, the application, etc.

[1]: There is also a question of whether to focus only on 
applications that do not perform system calls on their latency 
sensitive path, and applications that perform system calls. 

Because some CPU interruptions can't be avoided if the application 
is in the kernel: for example instruction cache flushes due to 
static_key rewrites or kernel TLB flushes (well they could be avoided 
with more infrastructure, but there is no such infrastructure at
the moment).

>   3) Describe the requirements for features on top, e.g. inheritance
>      or external control.

1) Be able to use unmodified applications (as long as the features
to be enabled are compatible with such usage, for example "killing 
/ sending signal to application if task is interrupted" is obviously
incompatible with unmodified applications).

2) External control: be able to modify what task isolation features are
enabled externally (not within the application itself). The latency
sensitive application should inform the kernel the beginning of 
the latency sensitive section (at this time, the task isolation 
features configured externally will be activated).

3) One-shot mode: be able to quiesce certain kernel activities
only on the first time a syscall is made (because the overhead
of subsequent quiescing, for the subsequent system calls, is
undesired).

> Once we have that, we can have a discussion about the desired control
> granularity and how to support the extra features in a consistent and
> well defined way.
> 
> A good and extensible UABI design comes with well defined functionality
> for the start and an obvious and maintainable extension path. The most
> important part is the well defined functionality.
> 
> There have been enough examples in the past how well received approaches
> are, which lack the well defined part. Linus really loves to get a pull
> request for something which cannot be described what it does, but could
> be used for cool things in the future.
> 
> > So for case 2, all you'd have to do is to modify the application only
> > once and allow the admin to configure the features.
> 
> That's still an orthogonal problem, which can be solved once a sensible
> mechanism to control the isolation and handle it at the transition
> points is in place. You surely want to consider it when designing the
> UABI, but it's not required to create the real isolation mechanism in
> the first place.

Ok, can drop all of that for smaller patches with the handling 
of transition points only (then later add oneshot mode, inheritance,
external control).

But might wait for discussion of requirements that you raise 
first.

> Problem decomposition is not an entirely new concept, really.

Sure, thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
  2022-05-05 16:52           ` Marcelo Tosatti
@ 2022-06-01 16:14             ` Marcelo Tosatti
  0 siblings, 0 replies; 44+ messages in thread
From: Marcelo Tosatti @ 2022-06-01 16:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Lameter, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Frederic Weisbecker, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu,
	Daniel Bristot de Oliveira, Oscar Shiang, linux-rdma

On Thu, May 05, 2022 at 01:52:35PM -0300, Marcelo Tosatti wrote:
> 
> Hi Thomas,
> 
> On Wed, May 04, 2022 at 10:15:14PM +0200, Thomas Gleixner wrote:
> > On Wed, May 04 2022 at 15:56, Marcelo Tosatti wrote:
> > > On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
> > >> Can we please focus on the initial problem of
> > >> providing a sensible isolation mechanism with well defined semantics?
> > >
> > > Case 2, however, was implicitly suggested by you (or at least i
> > > understood that):
> > >
> > > "Summary: The problem to be solved cannot be restricted to
> > >
> > >     self_defined_important_task(OWN_WORLD);
> > >
> > > Policy is not a binary on/off problem. It's manifold across all levels
> > > of the stack and only a kernel problem when it comes down to the last
> > > line of defence.
> > >
> > > Up to the point where the kernel puts the line of last defence, policy
> > > is defined by the user/admin via mechanims provided by the kernel.
> > >
> > > Emphasis on "mechanims provided by the kernel", aka. user API.
> > >
> > > Just in case, I hope that I don't have to explain what level of scrunity
> > > and thought this requires."
> > 
> > Correct. This reasoning is still valid and I haven't changed my opinion
> > on that since then.
> > 
> > My main objections against the proposed solution back then were the all
> > or nothing approach and the implicit hard coded policies.
> > 
> > > The idea, as i understood was that certain task isolation features (or
> > > they parameters) might have to be changed at runtime (which depends on
> > > the task isolation features themselves, and the plan is to create
> > > an extensible interface).
> > 
> > Again. I'm not against useful controls to select the isolation an
> > application requires. I'm neither against extensible interfaces.
> > 
> > But I'm against overengineered implementations which lack any form of
> > sensible design and have ill defined semantics at the user ABI.
> > 
> > Designing user space ABI is _hard_ and needs a lot of thoughts. It's not
> > done with throwing something 'extensible' at the kernel and hope it
> > sticks. As I showed you in the review, the ABI is inconsistent in
> > itself, it has ill defined semantics and lacks any form of justification
> > of the approach taken.
> > 
> > Can we please take a step back and:
> > 
> >   1) Define what is trying to be solved
> 
> Avoid interruptions to application code execution on isolated CPUs.
> 
> Different use-cases might accept different length/frequencies
> of interruptions (including no interruptions).
> 
> >      and what are the pieces known
> >      today which need to be controlled in order to achieve the desired
> >      isolation properties.
> 
> I hope you don't mean the current CPU isolation features which have to
> be enabled, but only the ones which are not enabled today:
> 
> "Isolation of the threads was done through the following kernel parameters:
> 
> nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
> numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt
> 
> And systemd was configured with the following affinites:
> 
> system.conf:CPUAffinity=0-7,16-23
> 
> This means that the second socket will be generally free of tasks and   
> kernel threads."
> 
> So here are some features which could be written on top of the proposed
> task isolation via prctl:
> 
> 1) 
> 
> Enable or disable the following optional behaviour
> 
> A.
> if (cpu->isolated_avoid_queue_work)
> 	return -EBUSY;
> 
> queue_work_on(cpu, workfn);
> 
> (for the functions that can handle errors gracefully).
> 
> B.
> if (cpu->isolated_avoid_function_ipi)
> 	return -EBUSY;
> 
> smp_call_function_single(cpu, fn);
> (for the functions that can handle errors gracefully).
> Those that can't handle errors gracefully should be changed 
> to either handle errors or to remote work.
> 
> Not certain if this should be on per-case basis: say
> "avoid action1|avoid action2|avoid action3|..." (bit per
> action) and a "ALL" control, where actionZ is an action
> that triggers an IPI or remote work (then you would check
> for whether to fail not at smp_call_function_single 
> time but before the action starts).
> 
> Also, one might use something such as stalld (that schedules 
> tasks in/out for a short amount of time every given time window),
> which might be acceptable for his workload, so he'd disable
> cpu->isolated_avoid_queue_work (or expose this on per-case basis,
> unsure which is better).
> 
> As for IPIs, whether to block a function call to an isolated
> CPU depends on whether that function call (and its frequency) 
> will cause the latency sensitive application to violate its "latency" 
> requirements.
> 
> Perhaps "ALL - action1, action2, action3" is useful.
> 
> =======================================
> 
> 2)
> 
> In general, avoiding (or uncaching on return to userspace) 
> a CPU from caching per-CPU data (which might require an 
> IPI to invalidate later on) (see point [1] below for more thoughts
> on this issue).
> 
> 
> For example, for KVM:
> 
> /*
>  * MMU notifier 'invalidate_range_start' hook.
>  */
> void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
>                                        unsigned long end, bool may_block)
> {
>         DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
>         struct gfn_to_pfn_cache *gpc;
>         bool wake_vcpus = false;
> 	...
> 	called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
> 
> 	which will
> 	smp_call_function_many(cpus, ack_flush, NULL, wait);
> ...
> 
> 
> ====================================================
> 
> 3) Enabling a kernel warning when a task switch happens on a CPU
> which runs a task isolated thread?
> 
> From Christoph:
> 
> Special handling when the scheduler
> switches a task? If tasks are being switched that requires them to be low
> latency and undisturbed then something went very very wrong with the
> system configuration and the only thing I would suggest is to issue some
> kernel warning that this is not the way one should configure the system.
> 
> ====================================================
> 
> 4) Sending a signal whenever an application is interrupted
> (hum, this could be done via BPF).
> 
> Those are the ones i can think of at the moment. 
> Not sure what other people can think of.
> 
> >   2) Describe the usage scenarios and the resulting constraints.
> 
> Well the constraints should be in the form
> 
> 	"In a given window of time, there should be no more than N
> 	 CPU interruptions of length L each."
> 
> 	(should be more complicated due to cache effects, but choosing
> 	 a lower N and L one is able to correct that)
> 
> I believe?
> 
> Also some memory bandwidth must be available to the application
> (or data/code in shared caches).
> Which depends on what other CPUs in the system are doing, the
> cache hierarchy, the application, etc.
> 
> [1]: There is also a question of whether to focus only on 
> applications that do not perform system calls on their latency 
> sensitive path, and applications that perform system calls. 
> 
> Because some CPU interruptions can't be avoided if the application 
> is in the kernel: for example instruction cache flushes due to 
> static_key rewrites or kernel TLB flushes (well they could be avoided 
> with more infrastructure, but there is no such infrastructure at
> the moment).
> 
> >   3) Describe the requirements for features on top, e.g. inheritance
> >      or external control.
> 
> 1) Be able to use unmodified applications (as long as the features
> to be enabled are compatible with such usage, for example "killing 
> / sending signal to application if task is interrupted" is obviously
> incompatible with unmodified applications).
> 
> 2) External control: be able to modify what task isolation features are
> enabled externally (not within the application itself). The latency
> sensitive application should inform the kernel the beginning of 
> the latency sensitive section (at this time, the task isolation 
> features configured externally will be activated).
> 
> 3) One-shot mode: be able to quiesce certain kernel activities
> only on the first time a syscall is made (because the overhead
> of subsequent quiescing, for the subsequent system calls, is
> undesired).
> 
> > Once we have that, we can have a discussion about the desired control
> > granularity and how to support the extra features in a consistent and
> > well defined way.
> > 
> > A good and extensible UABI design comes with well defined functionality
> > for the start and an obvious and maintainable extension path. The most
> > important part is the well defined functionality.
> > 
> > There have been enough examples in the past how well received approaches
> > are, which lack the well defined part. Linus really loves to get a pull
> > request for something which cannot be described what it does, but could
> > be used for cool things in the future.
> > 
> > > So for case 2, all you'd have to do is to modify the application only
> > > once and allow the admin to configure the features.
> > 
> > That's still an orthogonal problem, which can be solved once a sensible
> > mechanism to control the isolation and handle it at the transition
> > points is in place. You surely want to consider it when designing the
> > UABI, but it's not required to create the real isolation mechanism in
> > the first place.
> 
> Ok, can drop all of that for smaller patches with the handling 
> of transition points only (then later add oneshot mode, inheritance,
> external control).
> 
> But might wait for discussion of requirements that you raise 
> first.
> 
> > Problem decomposition is not an entirely new concept, really.
> 
> Sure, thanks.

Actually, hope that the patches from Aaron:

[RFC PATCH v3] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too

https://lore.kernel.org/all/alpine.DEB.2.22.394.2204250919400.2367@gentwo.de/T/

Can enable syncing of vmstat on return to userspace, for nohz_full CPUs.

Then the remaining items such as 

> if (cpu->isolated_avoid_queue_work)
>       return -EBUSY;

Can be enabled with a different (more flexible) interface such as writes
to filesystem (or task attribute that is transferred to per-CPU variable
on task initialization and remove from per-CPU variables when task
dies)



^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2022-06-01 16:14 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 02/13] x86: " Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 03/13] add basic task isolation prctl interface Marcelo Tosatti
2022-04-25 22:23   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
2022-04-26  0:15   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2022-04-25 23:06   ` Thomas Gleixner
2022-04-27  6:56   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
2022-04-25 23:27   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
2022-03-17 14:51   ` Frederic Weisbecker
2022-04-27  8:03   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 08/13] task isolation: enable return to userspace processing Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
2022-03-16  2:41   ` Oscar Shiang
2022-04-27  7:11   ` Thomas Gleixner
2022-04-27 12:09     ` Thomas Gleixner
2022-05-04 16:32       ` Marcelo Tosatti
2022-05-04 17:39         ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 11/13] mm: vmstat: move need_update Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
2022-04-27  7:23   ` Thomas Gleixner
2022-05-03 19:17     ` Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
2022-04-27  7:45   ` Thomas Gleixner
2022-05-03 19:12     ` Marcelo Tosatti
2022-05-04 13:03       ` Thomas Gleixner
2022-03-17 15:08 ` [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Frederic Weisbecker
2022-04-25 16:29   ` Marcelo Tosatti
2022-04-25 21:12     ` Thomas Gleixner
2022-05-03 18:57       ` Marcelo Tosatti
2022-04-27  9:19 ` Christoph Lameter
2022-05-03 18:57   ` Marcelo Tosatti
2022-05-04 13:20     ` Thomas Gleixner
2022-05-04 18:56       ` Marcelo Tosatti
2022-05-04 20:15         ` Thomas Gleixner
2022-05-05 16:52           ` Marcelo Tosatti
2022-06-01 16:14             ` Marcelo Tosatti
2022-05-04 17:01 ` Tim Chen
2022-05-04 20:08   ` Marcelo Tosatti

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.