All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch v11 00/13] extensible prctl task isolation interface and vmstat sync
@ 2022-02-04 17:35 Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
                   ` (14 more replies)
  0 siblings, 15 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, add task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 1) for details. Its attached at the end of this message
in .txt format as well.

Note: the prctl interface is independent of nohz_full=.

The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/

- qemu-task-isolation.patch: activate task isolation from CPU execution loop
- rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
- util-linux-chisol.patch: add chisol tool to util-linux.

---

v11
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace                                   (Frederic Weisbecker)
- Fold patch to add task_isol_exit hooks into
  "sync vmstats on return to userspace" patch.          (Frederic Weisbecker)
- Fix typo on prctl_task_isol_cfg_get declaration      (Oscar Shiang)
- Add preempt notifiers

v10
- no changes (resending series without changelog corruption).

v9
- Clarify inheritance is propagated to all descendents (Frederic Weisbecker)
- Fix inheritance of one-shot mode across exec/fork    (Frederic Weisbecker)
- Unify naming on "task_isol_..."                      (Frederic Weisbecker)
- Introduce CONFIG_TASK_ISOLATION                      (Frederic Weisbecker)

v8
- Document the possibility for ISOL_F_QUIESCE_ONE,
  to configure individual features                    (Frederic Weisbecker).
- Fix PR_ISOL_CFG_GET typo in documentation           (Frederic Weisbecker).
- Rebased against linux-2.6.git.

v7
 - no changes (resending series without changelog corruption).

v6
 - Move oneshot mode enablement to configuration time (Frederic Weisbecker).
 - Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
 - Update docs and samples regarding oneshot mode     (Frederic Weisbecker).
 - Update docs and samples regarding more extensibility of
   CFG_SET of ISOL_F_QUIESCE                          (Frederic Weisbecker).
 - prctl_task_isolation_activate_get should copy active_mask
   to address in arg2.
 - modify exit_to_user_mode_loop to cover exceptions
   and interrupts.
 - split exit hooks into its own patch

v5
  - Add changelogs to individual patches                (Peter Zijlstra).
  - Add documentation to patchset intro                 (Peter Zijlstra).

v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and use drop
   "internal configuration" prctls            (Frederic Weisbecker).
 - Add summary of terms to documentation      (Frederic Weisbecker).
 - Examples for compute and one-shot modes    (Thomas G/Christoph L).

v3:

 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
 - Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.
 - Report configuration/activation state at
   /proc/pid/task_isolation.
 - Condense dirty information of per-CPU vmstats counters
   in a bool.
 - In-kernel KVM support.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).


v10 can be found at:

https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/

---

 Documentation/userspace-api/task_isolation.rst |  379 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/s390/include/asm/thread_info.h            |    2 
 arch/x86/include/asm/thread_info.h             |    2 
 fs/proc/base.c                                 |   68 ++++++++++
 include/linux/entry-common.h                   |    2 
 include/linux/sched.h                          |    5 
 include/linux/task_isolation.h                 |  136 +++++++++++++++++++++
 include/linux/vmstat.h                         |   25 +++
 include/uapi/linux/prctl.h                     |   47 +++++++
 init/Kconfig                                   |   16 ++
 init/init_task.c                               |    3 
 kernel/Makefile                                |    2 
 kernel/entry/common.c                          |    4 
 kernel/entry/kvm.c                             |    4 
 kernel/exit.c                                  |    2 
 kernel/fork.c                                  |   23 +++
 kernel/sys.c                                   |   16 ++
 kernel/task_isolation.c                        |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c                                    |  173 +++++++++++++++++++++-----
 samples/Kconfig                                |    7 +
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |   11 +
 samples/task_isolation/task_isol.c             |   92 ++++++++++++++
 samples/task_isolation/task_isol.h             |    9 +
 samples/task_isolation/task_isol_computation.c |   89 +++++++++++++
 samples/task_isolation/task_isol_oneshot.c     |  104 ++++++++++++++++
 samples/task_isolation/task_isol_userloop.c    |   54 ++++++++
 27 files changed, 1658 insertions(+), 37 deletions(-)

---

Task isolation prctl interface
******************************

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth
networking applications with user-space drivers can fall into the
category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel the start and end of
the latency sensitive application section (with configurable system
behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:

   * PR_ISOL_FEAT_GET: Retrieve supported features.

   * PR_ISOL_CFG_GET: Retrieve task isolation configuration.

   * PR_ISOL_CFG_SET: Set task isolation configuration.

   * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.

   * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.

Summary of terms:

* feature:

     A distinct attribute or aspect of task isolation. Examples of
     features could be logging, new operating modes (eg: syscalls
     disallowed), userspace notifications, etc. The only feature
     currently available is quiescing.

* configuration:

     A specific choice from a given set of possible choices that
     dictate how the particular feature in question should behave.

* activation state:

     The activation state (whether active/inactive) of the task
     isolation features (features must be configured before being
     activated).

Inheritance of the isolation parameters and state, across fork(2) and
clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.

At a high-level, task isolation is divided in two steps:

1. Configuration.

2. Activation.

Section "Userspace support" describes how to use task isolation.

In terms of the interface, the sequence of steps to activate task
isolation are:

1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).

2. Configure task isolation features
   (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).

3. Activate or deactivate task isolation features
   (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).

This interface is based on ideas and code from the task isolation
patchset from Alex Belits: https://lwn.net/Articles/816298/

Note: if the need arises to configure an individual quiesce feature
with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
features per syscall currently).


Feature description
===================

   * "ISOL_F_QUIESCE"

   This feature allows quiescing selected kernel activities on return
   from system calls.


Interface description
=====================

**PR_ISOL_FEAT**:

   Returns the supported features and feature capabilities, as a
   bitmask:

      prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);

   The 'feat' argument specifies whether to return supported features
   (if zero), or feature capabilities (if not zero). Possible values
   for 'feat' are:

   * "0":

        Return the bitmask of supported features, in the location
        pointed  to  by  "(int *)arg3". The buffer should allow space
        for 8 bytes.

   * "ISOL_F_QUIESCE":

        Return a structure containing which kernel activities are
        supported for quiescing, in the location pointed to by "(int
        *)arg3":

           struct task_isol_quiesce_extensions {
                   __u64 flags;
                   __u64 supported_quiesce_bits;
                   __u64 pad[6];
           };

        Where:

        *flags*: Additional flags (should be zero).

        *supported_quiesce_bits*: Bitmask indicating
           which features are supported for quiescing.

        *pad*: Additional space for future enhancements.

   Features and its capabilities are defined at
   include/uapi/linux/task_isolation.h.

**PR_ISOL_CFG_GET**:

   Retrieve task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Return configuration of task isolation features. The 'arg3'
        argument specifies whether to return configured features (if
        zero), or individual feature configuration (if not zero), as
        follows.

        * "0":

             Return the bitmask of configured features, in the
             location pointed  to  by  "(int *)arg4". The buffer
             should allow space for 8 bytes.

        * "ISOL_F_QUIESCE":

             If arg4 is QUIESCE_CONTROL, return the control structure
             for quiescing of background kernel activities, in the
             location pointed to by "(int *)arg5":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 quiesce_oneshot_mask;
                       __u64 pad[5];
                };

             See PR_ISOL_CFG_SET description for meaning of fields.

   * "I_CFG_INHERIT":

        Retrieve inheritance configuration across fork/clone.

        Return the structure which configures inheritance across
        fork/clone, in the location pointed to by "(int *)arg4":

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        See PR_ISOL_CFG_SET description for meaning of fields.

**PR_ISOL_CFG_SET**:

   Set task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Set configuration of task isolation features. 'arg3' specifies
        the feature. Possible values are:

        * "ISOL_F_QUIESCE":

             If arg4 is QUIESCE_CONTROL, set the control structure for
             quiescing of background kernel activities, from the
             location pointed to by "(int *)arg5":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 quiesce_oneshot_mask;
                       __u64 pad[5];
                };

             Where:

             *flags*: Additional flags (should be zero).

             *quiesce_mask*: A bitmask containing which kernel
             activities to quiesce.

             *quiesce_oneshot_mask*: A bitmask indicating which kernel
             activities should behave in oneshot mode, that is,
             quiescing will happen on return from
             prctl(PR_ISOL_ACTIVATE_SET), but not on return of
             subsequent system calls. The corresponding bit(s) must
             also be set at quiesce_mask.

             *pad*: Additional space for future enhancements.

             For quiesce_mask (and quiesce_oneshot_mask), possible bit
             sets are:

             * "ISOL_F_QUIESCE_VMSTATS"

             VM statistics are maintained in per-CPU counters to
             improve performance. When a CPU modifies a VM statistic,
             this modification is kept in the per-CPU counter. Certain
             activities require a global count, which involves
             requesting each CPU to flush its local counters to the
             global VM counters.

             This flush is implemented via a workqueue item, which
             might schedule a workqueue on isolated CPUs.

             To avoid this interruption, task isolation can be
             configured to, upon return from system calls, synchronize
             the per-CPU counters to global counters, thus avoiding
             the interruption.

   * "I_CFG_INHERIT":

        Set inheritance configuration when a new task is created via
        fork and clone.

        The "(int *)arg4" argument is a pointer to:

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        inherit_mask is a bitmask that specifies which part of task
        isolation should be inherited:

        * Bit ISOL_INHERIT_CONF: Inherit task isolation
          configuration. This is the state written via
          prctl(PR_ISOL_CFG_SET, ...).

        * Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
          (requires ISOL_INHERIT_CONF to be set). The new task should
          behave, after fork/clone, in the same manner as the parent
          task after it executed:

             prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);

        Note: the inheritance propagates to all the descendants and
        not just the immediate children, unless the inheritance is
        explicitly reconfigured by some children.

**PR_ISOL_ACTIVATE_GET**:

   Retrieve task isolation activation state.

   The general format is:

      prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);

   'pmask' specifies the location of a feature mask, where the current
   active mask will be copied. See PR_ISOL_ACTIVATE_SET for
   description of individual bits.

**PR_ISOL_ACTIVATE_SET**:

   Set task isolation activation state (activates/deactivates task
   isolation).

   The general format is:

      prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);

   The 'pmask' argument specifies the location of an 8 byte mask
   containing which features should be activated. Features whose bits
   are cleared will be deactivated. The possible bits for this mask
   are:

      * "ISOL_F_QUIESCE":

      Activate quiescing of background kernel activities. Quiescing
      happens on return to userspace from this system call, and on
      return from subsequent system calls (unless quiesce_oneshot_mask
      has been set at PR_ISOL_CFG_SET time).

   Quiescing can be adjusted (while active) by
   prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).


Userspace support
*****************

Task isolation is divided in two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose.

This results in three combinations:

1. Both configuration and activation performed by the latency
sensitive application. Allows fine grained control of what task
isolation features are enabled and when (see samples section below).

2. Only activation can be performed by the latency sensitive app (and
configuration performed by chisol). This allows the admin/user to
control task isolation parameters, and applications have to be
modified only once.

3. Configuration and activation performed by an external tool. This
allows unmodified applications to take advantage of task isolation.
Activation is performed by the "-a" option of chisol.


Examples
********

The "samples/task_isolation/" directory contains 3 examples:

* task_isol_userloop.c:

     Example of program with a loop on userspace scenario.

* task_isol_computation.c:

     Example of program that enters task isolated mode, performs an
     amount of computation, exits task isolated mode, and writes the
     computation to disk.

* task_isol_oneshot.c:

     Example of program that enables one-shot mode for quiescing,
     enters a processing loop, then upon an external event performs a
     number of syscalls to handle that event.

This is a snippet of code to activate task isolation if it has been
previously configured (by chisol for example):

   #include <sys/prctl.h>
   #include <linux/types.h>

   #ifdef PR_ISOL_CFG_GET
   unsigned long long fmask;

   ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
   if (ret != -1 && fmask != 0) {
           ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
           if (ret == -1) {
                   perror("prctl PR_ISOL_ACTIVATE_SET");
                   return ret;
           }
   }
   #endif





^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 01/13] s390: add support for TIF_TASK_ISOL
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-07 11:40   ` Sven Schnelle
  2022-02-04 17:35 ` [patch v11 02/13] x86: " Marcelo Tosatti
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, linux-s390, Marcelo Tosatti

Add TIF_TASK_ISOL handling for s390.

Cc: linux-s390@vger.kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/arch/s390/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/thread_info.h
+++ linux-2.6/arch/s390/include/asm/thread_info.h
@@ -73,6 +73,7 @@ void arch_setup_new_exec(void);
 #define TIF_ISOLATE_BP		8	/* Run process with isolated BP */
 #define TIF_ISOLATE_BP_GUEST	9	/* Run KVM guests with isolated BP */
 #define TIF_PER_TRAP		10	/* Need to handle PER trap on exit to usermode */
+#define TIF_TASK_ISOL		11	/* task isolation work pending */
 
 #define TIF_31BIT		16	/* 32bit process */
 #define TIF_MEMDIE		17	/* is terminating due to OOM killer */
@@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
 #define _TIF_ISOLATE_BP		BIT(TIF_ISOLATE_BP)
 #define _TIF_ISOLATE_BP_GUEST	BIT(TIF_ISOLATE_BP_GUEST)
 #define _TIF_PER_TRAP		BIT(TIF_PER_TRAP)
+#define _TIF_TASK_ISOL		BIT(TIF_TASK_ISOL)
 
 #define _TIF_31BIT		BIT(TIF_31BIT)
 #define _TIF_SINGLE_STEP	BIT(TIF_SINGLE_STEP)



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 02/13] x86: add support for TIF_TASK_ISOL
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 03/13] add basic task isolation prctl interface Marcelo Tosatti
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, x86, Marcelo Tosatti

Add TIF_TASK_ISOL handling for x86.

Cc: x86@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/arch/x86/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/thread_info.h
+++ linux-2.6/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
 #define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_TASK_ISOL		6	/* task isolation work pending */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_TASK_ISOL		(1 << TIF_TASK_ISOL)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 03/13] add basic task isolation prctl interface
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 02/13] x86: " Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Add basic prctl task isolation interface, which allows
informing the kernel that application is executing 
latency sensitive code (where interruptions are undesired).

Interface is described by task_isolation.rst (added by
next patch).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v11:
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace                                   (Frederic W. Weisbecker)
- Fix typo on prctl_task_isoln_cfg_get declaration      (Oscar Shiang)

v6:
 - Move oneshot mode enablement to configuration time (Frederic Weisbecker).
 - Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
 - prctl_task_isolation_activate_get should copy active_mask 
   to address in arg2.
v5:
 - No changes
v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and drop
   "internal configuration" prctls            (Frederic Weisbecker).

v3:
 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
-  Remove "system default" notion (chisol should
   make it obsolete).

v2:
- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).
- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

 include/linux/sched.h          |    5 
 include/linux/task_isolation.h |   91 ++++++++++
 include/uapi/linux/prctl.h     |   43 +++++
 init/init_task.c               |    3 
 kernel/Makefile                |    2 
 kernel/fork.c                  |   22 ++
 kernel/sys.c                   |   16 +
 kernel/task_isolation.c        |  350 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 530 insertions(+), 2 deletions(-)

Index: linux-2.6/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/prctl.h
+++ linux-2.6/include/uapi/linux/prctl.h
@@ -275,4 +275,51 @@ struct prctl_mm_map {
 #define PR_SET_VMA		0x53564d41
 # define PR_SET_VMA_ANON_NAME		0
 
+#define PR_ISOL_FEAT_GET		63
+#define PR_ISOL_CFG_GET			64
+#define PR_ISOL_CFG_SET			65
+
+/* arg2 to CFG_GET/CFG_SET */
+# define I_CFG_FEAT			1
+# define I_CFG_INHERIT			2
+
+#define PR_ISOL_ACTIVATE_GET		66
+#define PR_ISOL_ACTIVATE_SET		67
+
+
+/* Supported features */
+# define ISOL_F_QUIESCE			(1UL << 0)
+
+# define ISOL_F_QUIESCE_MULTIPLE	(1UL << 0)
+#  define ISOL_F_QUIESCE_VMSTATS	(1UL << 0)
+
+struct task_isol_quiesce_extensions {
+	__u64 flags;
+	__u64 supported_quiesce_bits;
+	__u64 pad[6];
+};
+
+/*
+ * This structure provides control over
+ * inheritance of task isolation across
+ * clone and fork.
+ */
+struct task_isol_inherit_control {
+	__u8	inherit_mask;
+	__u8	flags;
+	__u8	pad[6];
+};
+
+# define ISOL_INHERIT_CONF		(1UL << 0)
+# define ISOL_INHERIT_ACTIVE		(1UL << 1)
+
+struct task_isol_quiesce_control {
+	__u64 flags;
+	__u64 quiesce_mask;
+	__u64 quiesce_oneshot_mask;
+	__u64 pad[5];
+};
+
+# define QUIESCE_CONTROL		(1UL << 0)
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_TASK_ISOLATION) += task_isolation.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -58,6 +58,7 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
 #include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
@@ -2593,6 +2594,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 	case PR_SET_VMA:
 		error = prctl_set_vma(arg2, arg3, arg4, arg5);
 		break;
+	case PR_ISOL_FEAT_GET:
+		error = prctl_task_isol_feat_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CFG_GET:
+		error = prctl_task_isol_cfg_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CFG_SET:
+		error = prctl_task_isol_cfg_set(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_GET:
+		error = prctl_task_isol_activate_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_SET:
+		error = prctl_task_isol_activate_set(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct task_isol_info;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1492,6 +1493,10 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+	struct task_isol_info		*task_isol_info;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	.task_isol_info	= NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/task_isolation.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -748,6 +749,7 @@ void __put_task_struct(struct task_struc
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	task_isol_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk, true);
@@ -1557,6 +1559,15 @@ out:
 	return error;
 }
 
+static int copy_task_isol(struct task_struct *tsk)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (current->task_isol_info)
+		return __copy_task_isol(tsk);
+#endif
+	return 0;
+}
+
 static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 {
 	struct sighand_struct *sig;
@@ -2129,7 +2140,9 @@ static __latent_entropy struct task_stru
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
-
+#ifdef CONFIG_TASK_ISOLATION
+	p->task_isol_info = NULL;
+#endif
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
@@ -2173,6 +2186,9 @@ static __latent_entropy struct task_stru
 	retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
+	retval = copy_task_isol(p);
+	if (retval)
+		goto bad_fork_cleanup_thread;
 
 	stackleak_task_init(p);
 
@@ -2181,7 +2197,7 @@ static __latent_entropy struct task_stru
 				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
-			goto bad_fork_cleanup_thread;
+			goto bad_fork_cleanup_task_isol;
 		}
 	}
 
@@ -2400,6 +2416,8 @@ bad_fork_put_pidfd:
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
+bad_fork_cleanup_task_isol:
+	task_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);
 bad_fork_cleanup_io:
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/task_isolation.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_TASK_ISOLATION
+
+struct task_isol_info {
+	/* Which features have been configured */
+	u64 conf_mask;
+	/* Which features are active */
+	u64 active_mask;
+	/* Quiesce mask */
+	u64 quiesce_mask;
+
+	/* Oneshot mask */
+	u64 oneshot_mask;
+
+	u8 inherit_mask;
+};
+
+extern void __task_isol_free(struct task_struct *tsk);
+
+static inline void task_isol_free(struct task_struct *tsk)
+{
+	if (tsk->task_isol_info)
+		__task_isol_free(tsk);
+}
+
+int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
+			    unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_cfg_set(unsigned long arg2, unsigned long arg3,
+			    unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_activate_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isol_activate_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+
+int __copy_task_isol(struct task_struct *tsk);
+
+#else
+
+static inline void task_isol_free(struct task_struct *tsk)
+{
+}
+
+static inline int prctl_task_isol_feat_get(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_cfg_get(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_cfg_set(unsigned long arg2,
+					  unsigned long arg3,
+					  unsigned long arg4,
+					  unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_activate_get(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isol_activate_set(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_TASK_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/task_isolation.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@belits.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ *   Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/sched/task.h>
+
+void __task_isol_free(struct task_struct *tsk)
+{
+	if (!tsk->task_isol_info)
+		return;
+	kfree(tsk->task_isol_info);
+	tsk->task_isol_info = NULL;
+}
+
+static struct task_isol_info *task_isol_alloc_context(void)
+{
+	struct task_isol_info *info;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (unlikely(!info))
+		return ERR_PTR(-ENOMEM);
+
+	return info;
+}
+
+int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	switch (arg2) {
+	case 0: {
+		u64 supported_fmask = ISOL_F_QUIESCE;
+
+		ret = 0;
+		if (copy_to_user(addr, &supported_fmask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_extensions *q_ext;
+
+		q_ext = kzalloc(sizeof(struct task_isol_quiesce_extensions),
+			 GFP_KERNEL);
+		if (!q_ext)
+			return -ENOMEM;
+
+		q_ext->supported_quiesce_bits = ISOL_F_QUIESCE_VMSTATS;
+
+		ret = 0;
+		if (copy_to_user(addr, q_ext, sizeof(*q_ext)))
+			ret = -EFAULT;
+		kfree(q_ext);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_get(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	struct task_isol_inherit_control *i_ctrl;
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	if (!current->task_isol_info)
+		return -EINVAL;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	i_ctrl->inherit_mask = current->task_isol_info->inherit_mask;
+
+	ret = 0;
+	if (copy_to_user(addr, i_ctrl, sizeof(*i_ctrl)))
+		ret = -EFAULT;
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_get(unsigned long arg3, unsigned long arg4,
+			unsigned long arg5)
+{
+	int ret = 0;
+
+	switch (arg3) {
+	case 0: {
+		void __user *addr = (void __user *)arg4;
+		u64 cfg_mask = 0;
+
+		if (current->task_isol_info)
+			cfg_mask = current->task_isol_info->conf_mask;
+
+		if (copy_to_user(addr, &cfg_mask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_control *i_qctrl;
+		void __user *addr = (void __user *)arg5;
+
+		if (arg4 != QUIESCE_CONTROL)
+			return -EINVAL;
+
+		i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+				  GFP_KERNEL);
+		if (!i_qctrl)
+			return -ENOMEM;
+
+		if (current->task_isol_info)
+			i_qctrl->quiesce_mask = current->task_isol_info->quiesce_mask;
+
+		if (copy_to_user(addr, i_qctrl, sizeof(*i_qctrl)))
+			ret = -EFAULT;
+
+		kfree(i_qctrl);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		return cfg_feat_get(arg3, arg4, arg5);
+	case I_CFG_INHERIT:
+		return cfg_inherit_get(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_set(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	int ret = 0;
+	struct task_isol_inherit_control *i_ctrl;
+	const void __user *addr = (const void __user *)arg3;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_ctrl, addr, sizeof(*i_ctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_ctrl->inherit_mask & ~(ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE))
+		goto out_free;
+
+	if (i_ctrl->inherit_mask & ISOL_INHERIT_ACTIVE)
+		if (!(i_ctrl->inherit_mask & ISOL_INHERIT_CONF))
+			goto out_free;
+
+	if (!current->task_isol_info) {
+		struct task_isol_info *task_isol_info;
+
+		task_isol_info = task_isol_alloc_context();
+		if (IS_ERR(task_isol_info)) {
+			ret = PTR_ERR(task_isol_info);
+			goto out_free;
+		}
+		current->task_isol_info = task_isol_info;
+	}
+
+	ret = 0;
+	current->task_isol_info->inherit_mask = i_ctrl->inherit_mask;
+
+out_free:
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_quiesce_set(unsigned long arg4, unsigned long arg5)
+{
+	struct task_isol_info *task_isol_info;
+	struct task_isol_quiesce_control *i_qctrl;
+	int ret = 0;
+	const void __user *addr = (const void __user *)arg5;
+
+	if (arg4 != QUIESCE_CONTROL)
+		return -EINVAL;
+
+	i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+			 GFP_KERNEL);
+	if (!i_qctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_qctrl, addr, sizeof(*i_qctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_qctrl->flags != 0)
+		goto out_free;
+
+	if (i_qctrl->quiesce_mask != ISOL_F_QUIESCE_VMSTATS &&
+	    i_qctrl->quiesce_mask != 0)
+		goto out_free;
+
+	if ((~i_qctrl->quiesce_mask & i_qctrl->quiesce_oneshot_mask) != 0)
+		goto out_free;
+
+	/* current->task_isol_info is only allocated/freed from task
+	 * context.
+	 */
+	if (!current->task_isol_info) {
+		task_isol_info = task_isol_alloc_context();
+		if (IS_ERR(task_isol_info)) {
+			ret = PTR_ERR(task_isol_info);
+			goto out_free;
+		}
+		current->task_isol_info = task_isol_info;
+	}
+
+	task_isol_info = current->task_isol_info;
+
+	task_isol_info->quiesce_mask = i_qctrl->quiesce_mask;
+	task_isol_info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
+	task_isol_info->conf_mask |= ISOL_F_QUIESCE;
+	ret = 0;
+
+out_free:
+	kfree(i_qctrl);
+
+	return ret;
+}
+
+int prctl_task_isol_cfg_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		switch (arg3) {
+		case ISOL_F_QUIESCE:
+			return cfg_feat_quiesce_set(arg4, arg5);
+		default:
+			break;
+		}
+		break;
+	case I_CFG_INHERIT:
+		return cfg_inherit_set(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int __copy_task_isol(struct task_struct *tsk)
+{
+	struct task_isol_info *info, *new_info;
+
+	info = current->task_isol_info;
+	if (!(info->inherit_mask & (ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE)))
+		return 0;
+
+	new_info = task_isol_alloc_context();
+	if (IS_ERR(new_info))
+		return PTR_ERR(new_info);
+
+	new_info->inherit_mask = info->inherit_mask;
+
+	if (info->inherit_mask & ISOL_INHERIT_CONF) {
+		new_info->quiesce_mask = info->quiesce_mask;
+		new_info->conf_mask = info->conf_mask;
+		new_info->oneshot_mask = info->oneshot_mask;
+	}
+
+	if (info->inherit_mask & ISOL_INHERIT_ACTIVE)
+		new_info->active_mask = info->active_mask;
+
+	tsk->task_isol_info = new_info;
+
+	return 0;
+}
+
+int prctl_task_isol_activate_set(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	struct task_isol_info *task_isol_info;
+	u64 active_mask;
+	const void __user *addr_mask = (const void __user *)arg2;
+
+	ret = -EFAULT;
+	if (copy_from_user(&active_mask, addr_mask, sizeof(u64)))
+		goto out;
+
+	ret = -EINVAL;
+	if (active_mask != ISOL_F_QUIESCE && active_mask != 0)
+		return ret;
+
+	task_isol_info = current->task_isol_info;
+	if (!task_isol_info)
+		return ret;
+
+	task_isol_info->active_mask = active_mask;
+	ret = 0;
+
+out:
+	return ret;
+}
+
+int prctl_task_isol_activate_get(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	struct task_isol_info *task_isol_info;
+	void __user *addr_mask = (void __user *)arg2;
+
+	task_isol_info = current->task_isol_info;
+	if (!task_isol_info)
+		return -EINVAL;
+
+	if (copy_to_user(addr_mask, &task_isol_info->active_mask, sizeof(u64)))
+		return -EFAULT;
+
+	return 0;
+}
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -675,6 +675,22 @@ config CPU_ISOLATION
 
 	  Say Y if unsure.
 
+config TASK_ISOLATION
+	bool "Task isolation prctl()"
+	depends on GENERIC_ENTRY
+	default n
+	help
+	  Certain types of applications benefit from running uninterrupted by
+	  background OS activities. Realtime systems and high-bandwidth networking
+	  applications with user-space drivers can fall into the category.
+
+	  To create an OS noise free environment for the application, this option
+	  enables the task isolation prctl interface, which allows userspace to
+	  inform the kernel the start and end of the latency sensitive application
+	  section (with configurable system behaviour for that section).
+
+	  Say N if unsure.
+
 source "kernel/rcu/Kconfig"
 
 config BUILD_BIN2C



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 04/13] add prctl task isolation prctl docs and samples
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 03/13] add basic task isolation prctl interface Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Add documentation and userspace sample code for prctl
task isolation interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v8:
- Document the possibility for ISOL_F_QUIESCE_ONE, to configure
individual features                                              (Frederic Weisbecker).
- Fix PR_ISOL_CFG_GET typo in documentation                      (Frederic Weisbecker).
- Rebased against linux-2.6.git.

v7:
-  No changes
v6:
 - Update docs and samples regarding oneshot mode (Frederic Weisbecker).
 - Update docs and samples regarding more extensibility of
   CFG_SET of ISOL_F_QUIESCE                   (Frederic Weisbecker).

v5:
 - Fix documentation typos		      (Frederic Weisbecker).
 - Fix oneshot example comment

v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and use drop
   "internal configuration" prctls            (Frederic Weisbecker).
 - Add summary of terms to documentation      (Frederic Weisbecker).
 - Examples for compute and one-shot modes    (Thomas G/Christoph L).

v3:
 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
-  Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).
- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

 Documentation/userspace-api/task_isolation.rst |  370 +++++++++++++++++++++++++
 samples/Kconfig                                |    7 
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |   11 
 samples/task_isolation/task_isol.c             |   92 ++++++
 samples/task_isolation/task_isol.h             |    9 
 samples/task_isolation/task_isol_computation.c |   89 ++++++
 samples/task_isolation/task_isol_oneshot.c     |  104 +++++++
 samples/task_isolation/task_isol_userloop.c    |   54 +++
 9 files changed, 737 insertions(+)

Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,379 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Task isolation prctl interface
+===============================
+
+Certain types of applications benefit from running uninterrupted by
+background OS activities. Realtime systems and high-bandwidth networking
+applications with user-space drivers can fall into the category.
+
+To create an OS noise free environment for the application, this
+interface allows userspace to inform the kernel the start and
+end of the latency sensitive application section (with configurable
+system behaviour for that section).
+
+Note: the prctl interface is independent of nohz_full=.
+
+The prctl options are:
+
+
+        - PR_ISOL_FEAT_GET: Retrieve supported features.
+        - PR_ISOL_CFG_GET: Retrieve task isolation configuration.
+        - PR_ISOL_CFG_SET: Set task isolation configuration.
+        - PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
+        - PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
+
+Summary of terms:
+
+
+- feature:
+
+        A distinct attribute or aspect of task isolation. Examples of
+        features could be logging, new operating modes (eg: syscalls disallowed),
+        userspace notifications, etc. The only feature currently available is quiescing.
+
+- configuration:
+
+        A specific choice from a given set
+        of possible choices that dictate how the particular feature
+        in question should behave.
+
+- activation state:
+
+        The activation state (whether active/inactive) of the task
+        isolation features (features must be configured before
+        being activated).
+
+Inheritance of the isolation parameters and state, across
+fork(2) and clone(2), can be changed via
+PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
+
+
+At a high-level, task isolation is divided in two steps:
+
+1. Configuration.
+2. Activation.
+
+Section "Userspace support" describes how to use
+task isolation.
+
+In terms of the interface, the sequence of steps to activate
+task isolation are:
+
+1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
+2. Configure task isolation features (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
+3. Activate or deactivate task isolation features (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
+
+This interface is based on ideas and code from the
+task isolation patchset from Alex Belits:
+https://lwn.net/Articles/816298/
+
+Note: if the need arises to configure an individual quiesce feature
+with its own extensible structure, please add ISOL_F_QUIESCE_ONE
+to PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on
+multiple features per syscall currently).
+
+--------------------
+Feature description
+--------------------
+
+        - ``ISOL_F_QUIESCE``
+
+        This feature allows quiescing selected kernel activities on
+        return from system calls.
+
+---------------------
+Interface description
+---------------------
+
+**PR_ISOL_FEAT**:
+
+        Returns the supported features and feature
+        capabilities, as a bitmask::
+
+                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies whether to return
+        supported features (if zero), or feature capabilities
+        (if not zero). Possible values for 'feat' are:
+
+
+        - ``0``:
+               Return the bitmask of supported features, in the location
+               pointed  to  by  ``(int *)arg3``. The buffer should allow space
+               for 8 bytes.
+
+        - ``ISOL_F_QUIESCE``:
+
+               Return a structure containing which kernel
+               activities are supported for quiescing, in the location
+               pointed to by ``(int *)arg3``::
+
+                        struct task_isol_quiesce_extensions {
+                                __u64 flags;
+                                __u64 supported_quiesce_bits;
+                                __u64 pad[6];
+                        };
+
+               Where:
+
+               *flags*: Additional flags (should be zero).
+
+               *supported_quiesce_bits*: Bitmask indicating
+                which features are supported for quiescing.
+
+               *pad*: Additional space for future enhancements.
+
+
+        Features and its capabilities are defined at
+        include/uapi/linux/task_isolation.h.
+
+**PR_ISOL_CFG_GET**:
+
+        Retrieve task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Return configuration of task isolation features. The 'arg3' argument specifies
+                whether to return configured features (if zero), or individual
+                feature configuration (if not zero), as follows.
+
+                - ``0``:
+
+                        Return the bitmask of configured features, in the location
+                        pointed  to  by  ``(int *)arg4``. The buffer should allow space
+                        for 8 bytes.
+
+                - ``ISOL_F_QUIESCE``:
+
+                        If arg4 is QUIESCE_CONTROL, return the control structure for
+                        quiescing of background kernel activities, in the location
+                        pointed to by ``(int *)arg5``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 quiesce_oneshot_mask;
+                                __u64 pad[5];
+                         };
+
+                        See PR_ISOL_CFG_SET description for meaning of fields.
+
+        - ``I_CFG_INHERIT``:
+
+                Retrieve inheritance configuration across fork/clone.
+
+                Return the structure which configures inheritance
+                across fork/clone, in the location pointed to
+                by ``(int *)arg4``::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                See PR_ISOL_CFG_SET description for meaning of fields.
+
+**PR_ISOL_CFG_SET**:
+
+        Set task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Set configuration of task isolation features. 'arg3' specifies
+                the feature. Possible values are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                        If arg4 is QUIESCE_CONTROL, set the control structure
+                        for quiescing of background kernel activities, from
+                        the location pointed to by ``(int *)arg5``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 quiesce_oneshot_mask;
+                                __u64 pad[5];
+                         };
+
+                        Where:
+
+                        *flags*: Additional flags (should be zero).
+
+                        *quiesce_mask*: A bitmask containing which kernel
+                        activities to quiesce.
+
+                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
+                        activities should behave in oneshot mode, that is, quiescing
+                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
+                        on return of subsequent system calls. The corresponding bit(s)
+                        must also be set at quiesce_mask.
+
+                        *pad*: Additional space for future enhancements.
+
+                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
+
+                        - ``ISOL_F_QUIESCE_VMSTATS``
+
+                        VM statistics are maintained in per-CPU counters to
+                        improve performance. When a CPU modifies a VM statistic,
+                        this modification is kept in the per-CPU counter.
+                        Certain activities require a global count, which
+                        involves requesting each CPU to flush its local counters
+                        to the global VM counters.
+
+                        This flush is implemented via a workqueue item, which
+                        might schedule a workqueue on isolated CPUs.
+
+                        To avoid this interruption, task isolation can be
+                        configured to, upon return from system calls, synchronize
+                        the per-CPU counters to global counters, thus avoiding
+                        the interruption.
+
+        - ``I_CFG_INHERIT``:
+                Set inheritance configuration when a new task
+                is created via fork and clone.
+
+                The ``(int *)arg4`` argument is a pointer to::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                inherit_mask is a bitmask that specifies which part
+                of task isolation should be inherited:
+
+                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
+                  This is the state written via prctl(PR_ISOL_CFG_SET, ...).
+
+                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
+                  (requires ISOL_INHERIT_CONF to be set). The new task
+                  should behave, after fork/clone, in the same manner
+                  as the parent task after it executed:
+
+                        prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);
+
+                Note: the inheritance propagates to all the descendants and not
+                just the immediate children, unless the inheritance is explicitly
+                reconfigured by some children.
+
+**PR_ISOL_ACTIVATE_GET**:
+
+        Retrieve task isolation activation state.
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);
+
+        'pmask' specifies the location of a feature mask, where
+        the current active mask will be copied. See PR_ISOL_ACTIVATE_SET
+        for description of individual bits.
+
+
+**PR_ISOL_ACTIVATE_SET**:
+
+        Set task isolation activation state (activates/deactivates
+        task isolation).
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);
+
+
+        The 'pmask' argument specifies the location of an 8 byte mask
+        containing which features should be activated. Features whose
+        bits are cleared will be deactivated. The possible
+        bits for this mask are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                Activate quiescing of background kernel activities.
+                Quiescing happens on return to userspace from this
+                system call, and on return from subsequent
+                system calls (unless quiesce_oneshot_mask has been set at
+                PR_ISOL_CFG_SET time).
+
+        Quiescing can be adjusted (while active) by
+        prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).
+
+
+==================
+Userspace support
+==================
+
+Task isolation is divided in two main steps: configuration and activation.
+
+Each step can be performed by an external tool or the latency sensitive
+application itself. util-linux contains the "chisol" tool for this
+purpose.
+
+This results in three combinations:
+
+1. Both configuration and activation performed by the
+latency sensitive application.
+Allows fine grained control of what task isolation
+features are enabled and when (see samples section below).
+
+2. Only activation can be performed by the latency sensitive app
+(and configuration performed by chisol).
+This allows the admin/user to control task isolation parameters,
+and applications have to be modified only once.
+
+3. Configuration and activation performed by an external tool.
+This allows unmodified applications to take advantage of
+task isolation. Activation is performed by the "-a" option
+of chisol.
+
+========
+Examples
+========
+
+The ``samples/task_isolation/`` directory contains 3 examples:
+
+* task_isol_userloop.c:
+
+        Example of program with a loop on userspace scenario.
+
+* task_isol_computation.c:
+
+        Example of program that enters task isolated mode,
+        performs an amount of computation, exits task
+        isolated mode, and writes the computation to disk.
+
+* task_isol_oneshot.c:
+
+        Example of program that enables one-shot
+        mode for quiescing, enters a processing loop, then upon an external
+        event performs a number of syscalls to handle that event.
+
+This is a snippet of code to activate task isolation if
+it has been previously configured (by chisol for example)::
+
+        #include <sys/prctl.h>
+        #include <linux/types.h>
+
+        #ifdef PR_ISOL_CFG_GET
+        unsigned long long fmask;
+
+        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+        if (ret != -1 && fmask != 0) {
+                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
+                if (ret == -1) {
+                        perror("prctl PR_ISOL_ACTIVATE_SET");
+                        return ret;
+                }
+        }
+        #endif
+
Index: linux-2.6/samples/task_isolation/task_isol.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include <errno.h>
+#include "task_isol.h"
+
+#ifdef PR_ISOL_FEAT_GET
+int task_isol_setup(int oneshot)
+{
+	int ret;
+	int errnosv;
+	unsigned long long fmask;
+	struct task_isol_quiesce_extensions qext;
+	struct task_isol_quiesce_control qctrl;
+
+	/* Retrieve supported task isolation features */
+	ret = prctl(PR_ISOL_FEAT_GET, 0, &fmask, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT");
+		return ret;
+	}
+	printf("supported features bitmask: 0x%llx\n", fmask);
+
+	/* Retrieve supported ISOL_F_QUIESCE bits */
+	ret = prctl(PR_ISOL_FEAT_GET, ISOL_F_QUIESCE, &qext, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
+		return ret;
+	}
+	printf("supported ISOL_F_QUIESCE bits: 0x%llx\n",
+		qext.supported_quiesce_bits);
+
+	fmask = 0;
+	ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+	errnosv = errno;
+	if (ret != -1 && fmask != 0) {
+		printf("Task isolation parameters already configured!\n");
+		return ret;
+	}
+	if (ret == -1 && errnosv != ENODATA) {
+		perror("prctl PR_ISOL_GET");
+		return ret;
+	}
+	memset(&qctrl, 0, sizeof(struct task_isol_quiesce_control));
+	qctrl.quiesce_mask = ISOL_F_QUIESCE_VMSTATS;
+	if (oneshot)
+		qctrl.quiesce_oneshot_mask = ISOL_F_QUIESCE_VMSTATS;
+
+	ret = prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,
+		    QUIESCE_CONTROL, &qctrl);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CFG_SET");
+		return ret;
+	}
+	return ISOL_F_QUIESCE;
+}
+
+int task_isol_activate_set(unsigned long long mask)
+{
+	int ret;
+
+	ret = prctl(PR_ISOL_ACTIVATE_SET, &mask, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_ACTIVATE_SET");
+		return -1;
+	}
+
+	return 0;
+}
+
+#else
+
+int task_isol_setup(void)
+{
+	return 0;
+}
+
+int task_isol_activate_set(unsigned long long mask)
+{
+	return 0;
+}
+#endif
+
+
Index: linux-2.6/samples/task_isolation/task_isol.h
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TASK_ISOL_H
+#define __TASK_ISOL_H
+
+int task_isol_setup(int oneshot);
+
+int task_isol_activate_set(unsigned long long mask);
+
+#endif /* __TASK_ISOL_H */
Index: linux-2.6/samples/task_isolation/task_isol_userloop.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_userloop.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(0);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+	/* enable quiescing on system call return, oneshot */
+	ret = task_isol_activate_set(mask);
+	if (ret)
+		return EXIT_FAILURE;
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+	/* busy loop */
+	while (ret < NR_LOOPS)  {
+		memset(buf, 0, 4096);
+		ret = ret+1;
+		if (!(ret % NR_PRINT))
+			printf("loops=%d of %d\n", ret, NR_LOOPS);
+	}
+
+
+	ret = task_isol_activate_set(mask & ~ISOL_F_QUIESCE);
+	if (ret)
+		return EXIT_FAILURE;
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/Kconfig
===================================================================
--- linux-2.6.orig/samples/Kconfig
+++ linux-2.6/samples/Kconfig
@@ -250,6 +250,13 @@ config SAMPLE_CORESIGHT_SYSCFG
 	  This demonstrates how a user may create their own CoreSight
 	  configurations and easily load them into the system at runtime.
 
+config SAMPLE_TASK_ISOLATION
+	bool "task isolation sample"
+	depends on CC_CAN_LINK && HEADERS_INSTALL
+	help
+	  Build example userspace program to use prctl task isolation
+	  interface.
+
 endif # SAMPLES
 
 config HAVE_SAMPLE_FTRACE_DIRECT
Index: linux-2.6/samples/Makefile
===================================================================
--- linux-2.6.orig/samples/Makefile
+++ linux-2.6/samples/Makefile
@@ -33,3 +33,4 @@ subdir-$(CONFIG_SAMPLE_WATCHDOG)	+= watc
 subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
 obj-$(CONFIG_SAMPLE_CORESIGHT_SYSCFG)	+= coresight/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION)	+= task_isolation
Index: linux-2.6/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+userprogs-always-y += task_isol_userloop task_isol_computation task_isol_oneshot
+task_isol_userloop-objs := task_isol.o task_isol_userloop.o
+task_isol_computation-objs := task_isol.o task_isol_computation.o
+task_isol_oneshot-objs := task_isol.o task_isol_oneshot.o
+
+userccflags += -I usr/include
+
+
+#$(CC) $^ -o $@
Index: linux-2.6/samples/task_isolation/task_isol_computation.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_computation.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface with a loop:
+ *
+ *	do {
+ *		enable quiescing of kernel activities
+ *		perform computation
+ *		disable quiescing of kernel activities
+ *		write computation results to disk
+ *	} while (condition);
+ *
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd, write_loops;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	fd = open("/tmp/comp_output.data", O_RDWR|O_CREAT);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(0);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+	write_loops = 0;
+	do {
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+		/* enable quiescing on system call return */
+		ret = task_isol_activate_set(mask);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/* busy loop */
+		while (ret < NR_LOOPS)  {
+			memset(buf, 0xf, 4096);
+			ret = ret+1;
+			if (!(ret % NR_PRINT))
+				printf("wloop=%d loops=%d of %d\n", write_loops,
+					ret, NR_LOOPS);
+		}
+		/* disable quiescing on system call return */
+		ret = task_isol_activate_set(mask & ~ISOL_F_QUIESCE);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/*
+		 * write computed data to disk, this would be
+		 * multiple writes on a real application, so
+		 * disabling quiescing is advantageous
+		 */
+		ret = write(fd, buf, 4096);
+		if (ret == -1) {
+			perror("write");
+			return EXIT_FAILURE;
+		}
+
+		write_loops += 1;
+	} while (write_loops < 5);
+
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/task_isolation/task_isol_oneshot.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_oneshot.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface using
+ * oneshot mode for quiescing.
+ *
+ *
+ *      enable oneshot quiescing of kernel activities
+ *	do {
+ *		process data (no system calls)
+ *		if (event) {
+ *			process event with syscalls
+ *			enable oneshot quiescing of kernel activities
+ *		}
+ *	} while (!exit_condition);
+ *
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd, cnt;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	fd = open("/dev/zero", O_RDONLY);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(1);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+
+	/* enable quiescing on system call return, oneshot */
+	ret = task_isol_activate_set(mask);
+	if (ret)
+		return EXIT_FAILURE;
+	/* busy loop */
+	cnt = 0;
+	while (cnt < NR_LOOPS)  {
+		memset(buf, 0xf, 4096);
+		cnt = cnt+1;
+		if (!(cnt % NR_PRINT)) {
+			int i, r;
+
+			/* this could be considered handling an external
+			 * event: with one-shot mode, system calls
+			 * after prctl(PR_SET_ACTIVATE) will not incur
+			 * the penalty of quiescing
+			 */
+			printf("loops=%d of %d\n", cnt, NR_LOOPS);
+			for (i = 0; i < 100; i++) {
+				r = read(fd, buf, 4096);
+				if (r == -1) {
+					perror("read");
+					return EXIT_FAILURE;
+				}
+			}
+
+			ret = munlock(buf, 4096);
+			if (ret) {
+				perror("munlock");
+				return EXIT_FAILURE;
+			}
+
+			ret = mlock(buf, 4096);
+			if (ret) {
+				perror("mlock");
+				return EXIT_FAILURE;
+			}
+
+			/* enable quiescing on system call return */
+			ret = task_isol_activate_set(mask);
+			if (ret)
+				return EXIT_FAILURE;
+		}
+	}
+
+	return EXIT_SUCCESS;
+}
+



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 05/13] task isolation: sync vmstats on return to userspace
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-07 14:57   ` Frederic Weisbecker
  2022-02-04 17:35 ` [patch v11 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, use the task isolation prctl interface to quiesce 
deferred actions when returning to userspace.

This patch adds hooks to fork and exit code paths.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v11: fold patch to add task_isol_exit hooks (Frederic)
     Use _TIF_TASK_ISOL bit on thread flags (Frederic)
     
v6: modify exit_to_user_mode_loop to cover exceptions and interrupts
v5: no changes
v4: add oneshot mode support

 include/linux/task_isolation.h |   16 ++++++++++++++++
 include/linux/vmstat.h         |    8 ++++++++
 kernel/entry/common.c          |   15 +++++++++++----
 kernel/task_isolation.c        |   21 +++++++++++++++++++++
 mm/vmstat.c                    |   21 +++++++++++++++++++++
 5 files changed, 77 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -27,6 +27,13 @@ static inline void task_isol_free(struct
 		__task_isol_free(tsk);
 }
 
+void __task_isol_exit(struct task_struct *tsk);
+static inline void task_isol_exit(struct task_struct *tsk)
+{
+	if (tsk->task_isol_info)
+		__task_isol_exit(tsk);
+}
+
 int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
 			     unsigned long arg4, unsigned long arg5);
 int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
@@ -40,12 +47,22 @@ int prctl_task_isol_activate_set(unsigne
 
 int __copy_task_isol(struct task_struct *tsk);
 
+void task_isol_exit_to_user_mode(void);
+
 #else
 
+static inline void task_isol_exit_to_user_mode(void)
+{
+}
+
 static inline void task_isol_free(struct task_struct *tsk)
 {
 }
 
+static inline void task_isol_exit(struct task_struct *tsk)
+{
+}
+
 static inline int prctl_task_isol_feat_get(unsigned long arg2,
 					   unsigned long arg3,
 					   unsigned long arg4,
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+#if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
+void sync_vmstat(void);
+#else
+static inline void sync_vmstat(void)
+{
+}
+#endif
+
 struct reclaim_stat {
 	unsigned nr_dirty;
 	unsigned nr_unqueued_dirty;
Index: linux-2.6/kernel/entry/common.c
===================================================================
--- linux-2.6.orig/kernel/entry/common.c
+++ linux-2.6/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/task_isolation.h>
 
 #include "common.h"
 
@@ -174,6 +175,9 @@ static unsigned long exit_to_user_mode_l
 		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(regs);
 
+		if (ti_work & _TIF_TASK_ISOL)
+			task_isol_exit_to_user_mode();
+
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -18,6 +18,12 @@
 #include <linux/sysfs.h>
 #include <linux/init.h>
 #include <linux/sched/task.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+
+void __task_isol_exit(struct task_struct *tsk)
+{
+}
 
 void __task_isol_free(struct task_struct *tsk)
 {
@@ -251,6 +257,9 @@ static int cfg_feat_quiesce_set(unsigned
 	task_isol_info->quiesce_mask = i_qctrl->quiesce_mask;
 	task_isol_info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
 	task_isol_info->conf_mask |= ISOL_F_QUIESCE;
+	if (task_isol_info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS)
+		set_thread_flag(TIF_TASK_ISOL);
+
 	ret = 0;
 
 out_free:
@@ -303,6 +312,7 @@ int __copy_task_isol(struct task_struct
 		new_info->active_mask = info->active_mask;
 
 	tsk->task_isol_info = new_info;
+	set_ti_thread_flag(task_thread_info(tsk), TIF_TASK_ISOL);
 
 	return 0;
 }
@@ -328,6 +338,7 @@ int prctl_task_isol_activate_set(unsigne
 		return ret;
 
 	task_isol_info->active_mask = active_mask;
+	set_thread_flag(TIF_TASK_ISOL);
 	ret = 0;
 
 out:
@@ -349,3 +360,24 @@ int prctl_task_isol_activate_get(unsigne
 
 	return 0;
 }
+
+void task_isol_exit_to_user_mode(void)
+{
+	struct task_isol_info *i;
+
+	clear_thread_flag(TIF_TASK_ISOL);
+
+	i = current->task_isol_info;
+	if (!i)
+		return;
+
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return;
+
+	if (i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS) {
+		sync_vmstat();
+		if (i->oneshot_mask & ISOL_F_QUIESCE_VMSTATS)
+			i->quiesce_mask &= ~ISOL_F_QUIESCE_VMSTATS;
+	}
+}
+EXPORT_SYMBOL_GPL(task_isol_exit_to_user_mode);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -2018,6 +2018,29 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void sync_vmstat(void)
+{
+	int cpu;
+
+	cpu = get_cpu();
+
+	refresh_cpu_vm_stats(false);
+	put_cpu();
+
+	/*
+	 * If task is migrated to another CPU between put_cpu
+	 * and cancel_delayed_work_sync, the code below might
+	 * cancel vmstat_update work for a different cpu
+	 * (than the one from which the vmstats were flushed).
+	 *
+	 * However, vmstat shepherd will re-enable it later,
+	 * so its harmless.
+	 */
+	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+#endif
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;
Index: linux-2.6/include/linux/entry-common.h
===================================================================
--- linux-2.6.orig/include/linux/entry-common.h
+++ linux-2.6/include/linux/entry-common.h
@@ -60,7 +60,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_TASK_ISOL | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/compat.h>
 #include <linux/io_uring.h>
 #include <linux/kprobes.h>
+#include <linux/task_isolation.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -759,6 +760,7 @@ void __noreturn do_exit(long code)
 	validate_creds_for_do_exit(tsk);
 
 	io_uring_files_cancel();
+	task_isol_exit(tsk);
 	exit_signals(tsk);  /* sets PF_EXITING */
 
 	/* sync mm's RSS info before statistics gathering */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -2417,6 +2417,7 @@ bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
 bad_fork_cleanup_task_isol:
+	task_isol_exit(p);
 	task_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 06/13] procfs: add per-pid task isolation state
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

Add /proc/pid/task_isolation file, to query the state of
task isolation configuration.

---
 fs/proc/base.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -97,6 +97,8 @@
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
 #include <linux/cn_proc.h>
+#include <linux/prctl.h>
+#include <linux/task_isolation.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -665,6 +667,69 @@ static int proc_pid_syscall(struct seq_f
 }
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
+#ifdef CONFIG_TASK_ISOLATION
+
+struct qoptions {
+	unsigned long mask;
+	char *name;
+};
+
+static struct qoptions iopts[] = {
+	{ISOL_F_QUIESCE, "quiesce"},
+};
+#define ILEN (sizeof(iopts) / sizeof(struct qoptions))
+
+static struct qoptions qopts[] = {
+	{ISOL_F_QUIESCE_VMSTATS, "vmstat_sync"},
+};
+#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
+
+static void show_isolation_state(struct seq_file *m,
+				 struct qoptions *iopt,
+				 int mask,
+				 const char *hdr)
+{
+	int i;
+
+	seq_printf(m, hdr);
+	for (i = 0; i < ILEN; i++) {
+		if (mask & iopt->mask)
+			seq_printf(m, "%s ", iopt->name);
+		iopt++;
+	}
+	if (mask == 0)
+		seq_printf(m, "none ");
+	seq_printf(m, "\n");
+}
+
+int proc_pid_task_isolation(struct seq_file *m, struct pid_namespace *ns,
+			    struct pid *pid, struct task_struct *t)
+{
+	int active_mask, quiesce_mask, conf_mask;
+	struct task_isol_info *task_isol_info;
+	struct inode *inode = m->private;
+	struct task_struct *task = get_proc_task(inode);
+
+	task_isol_info = t->task_isol_info;
+	if (!task_isol_info)
+		active_mask = quiesce_mask = conf_mask = 0;
+	else {
+		active_mask = task_isol_info->active_mask;
+		quiesce_mask = task_isol_info->quiesce_mask;
+		conf_mask = task_isol_info->conf_mask;
+	}
+
+	show_isolation_state(m, iopts, conf_mask, "Configured state: ");
+	show_isolation_state(m, iopts, active_mask, "Active state: ");
+	show_isolation_state(m, qopts, quiesce_mask, "Quiescing: ");
+
+	put_task_struct(task);
+
+	return 0;
+}
+
+#endif /* CONFIG_TASK_ISOLATION */
+
 /************************************************************************/
 /*                       Here the fs part begins                        */
 /************************************************************************/
@@ -3286,6 +3351,9 @@ static const struct pid_entry tgid_base_
 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
 	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	ONE("task_isolation", S_IRUSR, proc_pid_task_isolation),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 07/13] task isolation: sync vmstats conditional on changes
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-07 15:19   ` Frederic Weisbecker
  2022-02-07 15:29   ` Frederic Weisbecker
  2022-02-04 17:35 ` [patch v11 08/13] task isolation: enable return to userspace processing Marcelo Tosatti
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Rather than syncing VM-stats on every return to userspace
(or VM-entry), keep track of changes through a per-CPU bool.

This improves performance when enabling task isolated
for vcpu VMs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v11:
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace     

 include/linux/vmstat.h |   13 ++++++++++++-
 mm/vmstat.c            |   29 ++++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -22,7 +22,18 @@ int sysctl_vm_numa_stat_handler(struct c
 #endif
 
 #if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
-void sync_vmstat(void);
+DECLARE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+extern struct static_key vmstat_sync_enabled;
+
+void __sync_vmstat(void);
+static inline void sync_vmstat(void)
+{
+	if (static_key_false(&vmstat_sync_enabled))
+		__sync_vmstat();
+}
+
+void init_sync_vmstat(void);
 #else
 static inline void sync_vmstat(void)
 {
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -334,6 +334,31 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+struct static_key vmstat_sync_enabled;
+DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void mark_vmstat_dirty(void)
+{
+	if (!static_key_false(&vmstat_sync_enabled))
+		return;
+
+	raw_cpu_write(vmstat_dirty, true);
+	set_thread_flag(TIF_TASK_ISOL);
+}
+
+void init_sync_vmstat(void)
+{
+	raw_cpu_write(vmstat_dirty, true);
+	set_thread_flag(TIF_TASK_ISOL);
+}
+EXPORT_SYMBOL_GPL(vmstat_dirty);
+#else
+static inline void mark_vmstat_dirty(void)
+{
+}
+#endif
+
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -366,6 +391,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -404,6 +430,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -602,6 +629,7 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	mark_vmstat_dirty();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -670,6 +698,7 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	mark_vmstat_dirty();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -1087,6 +1116,7 @@ static void fill_contig_page_info(struct
 			info->free_blocks_suitable += blocks <<
 						(order - suitable_order);
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -1443,6 +1473,7 @@ static void walk_zones_in_node(struct se
 		if (!nolock)
 			spin_unlock_irqrestore(&zone->lock, flags);
 	}
+	mark_vmstat_dirty();
 }
 #endif
 
@@ -1512,6 +1543,7 @@ static void pagetypeinfo_showfree_print(
 		}
 		seq_putc(m, '\n');
 	}
+	mark_vmstat_dirty();
 }
 
 /* Print out the free pages at each order for each migatetype */
@@ -1932,6 +1964,7 @@ static void vmstat_update(struct work_st
 				this_cpu_ptr(&vmstat_work),
 				round_jiffies_relative(sysctl_stat_interval));
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -2019,13 +2052,14 @@ static void vmstat_shepherd(struct work_
 }
 
 #ifdef CONFIG_TASK_ISOLATION
-void sync_vmstat(void)
+void __sync_vmstat(void)
 {
 	int cpu;
 
 	cpu = get_cpu();
 
 	refresh_cpu_vm_stats(false);
+	raw_cpu_write(vmstat_dirty, false);
 	put_cpu();
 
 	/*



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 08/13] task isolation: enable return to userspace processing
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Enable processing of pending task isolation work if per-CPU vmstats
are out of sync with global vmstats.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

v11:
   Fold patch to add task_isol_exit hooks into
  "sync vmstats on return to userspace" patch.          (Frederic W. Weisbecker)

 include/linux/task_isolation.h |   22 +++++++++++++++++++++-
 kernel/task_isolation.c        |   13 +++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -10,7 +10,6 @@
  */
 
 #include <linux/sched.h>
-#include <linux/task_isolation.h>
 #include <linux/prctl.h>
 #include <linux/slab.h>
 #include <linux/kobject.h>
@@ -20,9 +19,17 @@
 #include <linux/sched/task.h>
 #include <linux/mm.h>
 #include <linux/vmstat.h>
+#include <linux/task_isolation.h>
 
 void __task_isol_exit(struct task_struct *tsk)
 {
+	struct task_isol_info *i;
+
+	i = tsk->task_isol_info;
+	if (!i)
+		return;
+
+	static_key_slow_dec(&vmstat_sync_enabled);
 }
 
 void __task_isol_free(struct task_struct *tsk)
@@ -41,6 +48,12 @@ static struct task_isol_info *task_isol_
 	if (unlikely(!info))
 		return ERR_PTR(-ENOMEM);
 
+	preempt_disable();
+	init_sync_vmstat();
+	preempt_enable();
+
+	static_key_slow_inc(&vmstat_sync_enabled);
+
 	return info;
 }
 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 08/13] task isolation: enable return to userspace processing Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-07 15:36   ` Frederic Weisbecker
  2022-02-04 17:35 ` [patch v11 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

If a thread has task isolation activated, is preempted by thread B,
which marks vmstat information dirty, and is preempted back in,
one might return to userspace with vmstat dirty information on the 
CPU in question.

To address this problem, add a preempt notifier that transfers vmstat dirty
information to TIF_TASK_ISOL thread flag.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -19,6 +19,7 @@
 #include <linux/sched/task.h>
 #include <linux/mm.h>
 #include <linux/vmstat.h>
+#include <linux/preempt.h>
 #include <linux/task_isolation.h>
 
 void __task_isol_exit(struct task_struct *tsk)
@@ -30,6 +31,9 @@ void __task_isol_exit(struct task_struct
 		return;
 
 	static_key_slow_dec(&vmstat_sync_enabled);
+
+	preempt_notifier_unregister(&i->preempt_notifier);
+	preempt_notifier_dec();
 }
 
 void __task_isol_free(struct task_struct *tsk)
@@ -40,6 +44,21 @@ void __task_isol_free(struct task_struct
 	tsk->task_isol_info = NULL;
 }
 
+static void task_isol_sched_in(struct preempt_notifier *pn, int cpu)
+{
+	vmstat_dirty_to_thread_flag();
+}
+
+static void task_isol_sched_out(struct preempt_notifier *pn,
+				struct task_struct *next)
+{
+}
+
+static __read_mostly struct preempt_ops task_isol_preempt_ops = {
+	.sched_in	= task_isol_sched_in,
+	.sched_out	= task_isol_sched_out,
+};
+
 static struct task_isol_info *task_isol_alloc_context(void)
 {
 	struct task_isol_info *info;
@@ -48,6 +67,10 @@ static struct task_isol_info *task_isol_
 	if (unlikely(!info))
 		return ERR_PTR(-ENOMEM);
 
+	preempt_notifier_inc();
+	preempt_notifier_init(&info->preempt_notifier, &task_isol_preempt_ops);
+	preempt_notifier_register(&info->preempt_notifier);
+
 	preempt_disable();
 	init_sync_vmstat();
 	preempt_enable();
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -17,6 +17,8 @@ struct task_isol_info {
 	u64 oneshot_mask;
 
 	u8 inherit_mask;
+
+	struct preempt_notifier preempt_notifier;
 };
 
 extern void __task_isol_free(struct task_struct *tsk);
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -34,10 +34,16 @@ static inline void sync_vmstat(void)
 }
 
 void init_sync_vmstat(void);
+
+void vmstat_dirty_to_thread_flag(void);
 #else
 static inline void sync_vmstat(void)
 {
 }
+
+static inline void vmstat_dirty_to_thread_flag(void)
+{
+}
 #endif
 
 struct reclaim_stat {
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -353,6 +353,13 @@ void init_sync_vmstat(void)
 	set_thread_flag(TIF_TASK_ISOL);
 }
 EXPORT_SYMBOL_GPL(vmstat_dirty);
+
+void vmstat_dirty_to_thread_flag(void)
+{
+	if (raw_cpu_read(vmstat_dirty) == true)
+		set_thread_flag(TIF_TASK_ISOL);
+}
+EXPORT_SYMBOL_GPL(vmstat_dirty_to_thread_flag);
 #else
 static inline void mark_vmstat_dirty(void)
 {



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 10/13] KVM: x86: process isolation work from VM-entry code path
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-07 15:47   ` Frederic Weisbecker
  2022-02-04 17:35 ` [patch v11 11/13] mm: vmstat: move need_update Marcelo Tosatti
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

VM-entry code path is an entry point similar to userspace return
when task isolation is concerned.

Call isolation_exit_to_user_mode before VM-enter.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v11
- Add TIF_TASK_ISOL bit to thread info flags and use it
  to decide whether to perform task isolation work on
  return to userspace                                   (Frederic W. Weisbecker)

 include/linux/entry-kvm.h |    4 +++-
 kernel/entry/kvm.c        |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/entry/kvm.c
===================================================================
--- linux-2.6.orig/kernel/entry/kvm.c
+++ linux-2.6/kernel/entry/kvm.c
@@ -2,6 +2,7 @@
 
 #include <linux/entry-kvm.h>
 #include <linux/kvm_host.h>
+#include <linux/task_isolation.h>
 
 static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 {
@@ -22,6 +23,9 @@ static int xfer_to_guest_mode_work(struc
 		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(NULL);
 
+		if (ti_work & _TIF_TASK_ISOL)
+			task_isol_exit_to_user_mode();
+
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)
 			return ret;



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 11/13] mm: vmstat: move need_update
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

Move need_update() function up in vmstat.c, needed by next patch.
No code changes.

Remove a duplicate comment while at it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   58 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1894,6 +1894,35 @@ static const struct seq_operations vmsta
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	pg_data_t *last_pgdat = NULL;
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		struct per_cpu_nodestat *n;
+
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 */
+		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
+			return true;
+
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
+		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
+			return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1975,35 +2004,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 11/13] mm: vmstat: move need_update Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-04 17:35 ` [patch v11 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

It is not necessary to queue work item to run refresh_vm_stats 
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes sosreport hang (which uses vmstat_refresh) with 
spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 44 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1924,6 +1924,31 @@ static bool need_update(int cpu)
 }
 
 #ifdef CONFIG_PROC_FS
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pages *pcp;
+
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		if (!pcp->count)
+			continue;
+
+		if (!pcp->expire)
+			continue;
+
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+#endif
+
+	return false;
+}
+
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
@@ -1933,8 +1958,12 @@ int vmstat_refresh(struct ctl_table *tab
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1948,9 +1977,19 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, refresh_vm_stats);
+		if (need_update(cpu) || need_drain_remote_zones(cpu))
+			schedule_work_on(cpu, work);
+	}
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(works, cpu));
+	cpus_read_unlock();
+	free_percpu(works);
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch v11 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (11 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
@ 2022-02-04 17:35 ` Marcelo Tosatti
  2022-02-07 15:38   ` Frederic Weisbecker
  2022-02-19  8:02 ` [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Oscar Shiang
  2022-03-08  7:20 ` Oscar Shiang
  14 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-04 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang, Marcelo Tosatti

This avoids processing of TIF_TASK_ISOL, when returning to userspace,
for tasks which do not have task isolation configured.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -5,6 +5,8 @@
 
 #ifdef CONFIG_TASK_ISOLATION
 
+#include <uapi/linux/prctl.h>
+
 struct task_isol_info {
 	/* Which features have been configured */
 	u64 conf_mask;
@@ -51,6 +53,24 @@ int __copy_task_isol(struct task_struct
 
 void task_isol_exit_to_user_mode(void);
 
+static inline bool task_isol_quiesce_activated(struct task_struct *tsk,
+					       u64 quiesce_mask)
+{
+	struct task_isol_info *i;
+
+	i = tsk->task_isol_info;
+	if (!i)
+		return false;
+
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return false;
+
+	if ((i->quiesce_mask & quiesce_mask) == quiesce_mask)
+		return true;
+
+	return false;
+}
+
 #else
 
 static inline void task_isol_exit_to_user_mode(void)
@@ -105,6 +125,12 @@ static inline int prctl_task_isol_activa
 	return -EOPNOTSUPP;
 }
 
+static inline bool task_isol_quiesce_activated(struct task_struct *tsk,
+					       u64 quiesce_mask)
+{
+	return false;
+}
+
 #endif /* CONFIG_TASK_ISOLATION */
 
 #endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/task_isolation.h>
 
 #include "internal.h"
 
@@ -344,7 +345,8 @@ static inline void mark_vmstat_dirty(voi
 		return;
 
 	raw_cpu_write(vmstat_dirty, true);
-	set_thread_flag(TIF_TASK_ISOL);
+	if (task_isol_quiesce_activated(current, ISOL_F_QUIESCE_VMSTATS))
+		set_thread_flag(TIF_TASK_ISOL);
 }
 
 void init_sync_vmstat(void)



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 01/13] s390: add support for TIF_TASK_ISOL
  2022-02-04 17:35 ` [patch v11 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
@ 2022-02-07 11:40   ` Sven Schnelle
  0 siblings, 0 replies; 30+ messages in thread
From: Sven Schnelle @ 2022-02-07 11:40 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira, Oscar Shiang, linux-s390

Marcelo Tosatti <mtosatti@redhat.com> writes:

> Add TIF_TASK_ISOL handling for s390.
>
> Cc: linux-s390@vger.kernel.org
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Acked-by: Sven Schnelle <svens@linux.ibm.com>

> Index: linux-2.6/arch/s390/include/asm/thread_info.h
> ===================================================================
> --- linux-2.6.orig/arch/s390/include/asm/thread_info.h
> +++ linux-2.6/arch/s390/include/asm/thread_info.h
> @@ -73,6 +73,7 @@ void arch_setup_new_exec(void);
>  #define TIF_ISOLATE_BP		8	/* Run process with isolated BP */
>  #define TIF_ISOLATE_BP_GUEST	9	/* Run KVM guests with isolated BP */
>  #define TIF_PER_TRAP		10	/* Need to handle PER trap on exit to usermode */
> +#define TIF_TASK_ISOL		11	/* task isolation work pending */
>  
>  #define TIF_31BIT		16	/* 32bit process */
>  #define TIF_MEMDIE		17	/* is terminating due to OOM killer */
> @@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
>  #define _TIF_ISOLATE_BP		BIT(TIF_ISOLATE_BP)
>  #define _TIF_ISOLATE_BP_GUEST	BIT(TIF_ISOLATE_BP_GUEST)
>  #define _TIF_PER_TRAP		BIT(TIF_PER_TRAP)
> +#define _TIF_TASK_ISOL		BIT(TIF_TASK_ISOL)
>  
>  #define _TIF_31BIT		BIT(TIF_31BIT)
>  #define _TIF_SINGLE_STEP	BIT(TIF_SINGLE_STEP)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 05/13] task isolation: sync vmstats on return to userspace
  2022-02-04 17:35 ` [patch v11 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2022-02-07 14:57   ` Frederic Weisbecker
  2022-02-07 15:16     ` Marcelo Tosatti
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2022-02-07 14:57 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Fri, Feb 04, 2022 at 02:35:42PM -0300, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
> 
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
> 
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
> 
> To fix this, use the task isolation prctl interface to quiesce 
> deferred actions when returning to userspace.
> 
> This patch adds hooks to fork and exit code paths.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> v11: fold patch to add task_isol_exit hooks (Frederic)
>      Use _TIF_TASK_ISOL bit on thread flags (Frederic)
>      
> v6: modify exit_to_user_mode_loop to cover exceptions and interrupts
> v5: no changes
> v4: add oneshot mode support
> 
>  include/linux/task_isolation.h |   16 ++++++++++++++++
>  include/linux/vmstat.h         |    8 ++++++++
>  kernel/entry/common.c          |   15 +++++++++++----
>  kernel/task_isolation.c        |   21 +++++++++++++++++++++
>  mm/vmstat.c                    |   21 +++++++++++++++++++++
>  5 files changed, 77 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/task_isolation.h
> ===================================================================
> --- linux-2.6.orig/include/linux/task_isolation.h
> +++ linux-2.6/include/linux/task_isolation.h
> @@ -27,6 +27,13 @@ static inline void task_isol_free(struct
>  		__task_isol_free(tsk);
>  }
>  
> +void __task_isol_exit(struct task_struct *tsk);
> +static inline void task_isol_exit(struct task_struct *tsk)
> +{
> +	if (tsk->task_isol_info)
> +		__task_isol_exit(tsk);
> +}
> +
>  int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
>  			     unsigned long arg4, unsigned long arg5);
>  int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
> @@ -40,12 +47,22 @@ int prctl_task_isol_activate_set(unsigne
>  
>  int __copy_task_isol(struct task_struct *tsk);
>  
> +void task_isol_exit_to_user_mode(void);
> +
>  #else
>  
> +static inline void task_isol_exit_to_user_mode(void)
> +{
> +}
> +
>  static inline void task_isol_free(struct task_struct *tsk)
>  {
>  }
>  
> +static inline void task_isol_exit(struct task_struct *tsk)
> +{
> +}
> +
>  static inline int prctl_task_isol_feat_get(unsigned long arg2,
>  					   unsigned long arg3,
>  					   unsigned long arg4,
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h
> +++ linux-2.6/include/linux/vmstat.h
> @@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
>  		void *buffer, size_t *length, loff_t *ppos);
>  #endif
>  
> +#if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
> +void sync_vmstat(void);
> +#else
> +static inline void sync_vmstat(void)
> +{
> +}
> +#endif
> +
>  struct reclaim_stat {
>  	unsigned nr_dirty;
>  	unsigned nr_unqueued_dirty;
> Index: linux-2.6/kernel/entry/common.c
> ===================================================================
> --- linux-2.6.orig/kernel/entry/common.c
> +++ linux-2.6/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/task_isolation.h>
>  
>  #include "common.h"
>  
> @@ -174,6 +175,9 @@ static unsigned long exit_to_user_mode_l
>  		if (ti_work & _TIF_NOTIFY_RESUME)
>  			tracehook_notify_resume(regs);
>  
> +		if (ti_work & _TIF_TASK_ISOL)
> +			task_isol_exit_to_user_mode();
> +
>  		/* Architecture specific TIF work */
>  		arch_exit_to_user_mode_work(regs, ti_work);
>  
> Index: linux-2.6/kernel/task_isolation.c
> ===================================================================
> --- linux-2.6.orig/kernel/task_isolation.c
> +++ linux-2.6/kernel/task_isolation.c
> @@ -18,6 +18,12 @@
>  #include <linux/sysfs.h>
>  #include <linux/init.h>
>  #include <linux/sched/task.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +
> +void __task_isol_exit(struct task_struct *tsk)
> +{
> +}
>  
>  void __task_isol_free(struct task_struct *tsk)
>  {
> @@ -251,6 +257,9 @@ static int cfg_feat_quiesce_set(unsigned
>  	task_isol_info->quiesce_mask = i_qctrl->quiesce_mask;
>  	task_isol_info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
>  	task_isol_info->conf_mask |= ISOL_F_QUIESCE;
> +	if (task_isol_info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS)
> +		set_thread_flag(TIF_TASK_ISOL);

Should you check if (i->active_mask == ISOL_F_QUIESCE) before setting the
flag?

> +
>  	ret = 0;
>  
>  out_free:
> @@ -303,6 +312,7 @@ int __copy_task_isol(struct task_struct
>  		new_info->active_mask = info->active_mask;
>  
>  	tsk->task_isol_info = new_info;
> +	set_ti_thread_flag(task_thread_info(tsk), TIF_TASK_ISOL);

Same here?

>  
>  	return 0;

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 05/13] task isolation: sync vmstats on return to userspace
  2022-02-07 14:57   ` Frederic Weisbecker
@ 2022-02-07 15:16     ` Marcelo Tosatti
  0 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-07 15:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Mon, Feb 07, 2022 at 03:57:18PM +0100, Frederic Weisbecker wrote:
> On Fri, Feb 04, 2022 at 02:35:42PM -0300, Marcelo Tosatti wrote:
> > The logic to disable vmstat worker thread, when entering
> > nohz full, does not cover all scenarios. For example, it is possible
> > for the following to happen:
> > 
> > 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> > 2) app runs mlock, which increases counters for mlock'ed pages.
> > 3) start -RT loop
> > 
> > Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> > the mlock, vmstat shepherd can restart vmstat worker thread on
> > the CPU in question.
> > 
> > To fix this, use the task isolation prctl interface to quiesce 
> > deferred actions when returning to userspace.
> > 
> > This patch adds hooks to fork and exit code paths.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> > v11: fold patch to add task_isol_exit hooks (Frederic)
> >      Use _TIF_TASK_ISOL bit on thread flags (Frederic)
> >      
> > v6: modify exit_to_user_mode_loop to cover exceptions and interrupts
> > v5: no changes
> > v4: add oneshot mode support
> > 
> >  include/linux/task_isolation.h |   16 ++++++++++++++++
> >  include/linux/vmstat.h         |    8 ++++++++
> >  kernel/entry/common.c          |   15 +++++++++++----
> >  kernel/task_isolation.c        |   21 +++++++++++++++++++++
> >  mm/vmstat.c                    |   21 +++++++++++++++++++++
> >  5 files changed, 77 insertions(+), 4 deletions(-)
> > 
> > Index: linux-2.6/include/linux/task_isolation.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/task_isolation.h
> > +++ linux-2.6/include/linux/task_isolation.h
> > @@ -27,6 +27,13 @@ static inline void task_isol_free(struct
> >  		__task_isol_free(tsk);
> >  }
> >  
> > +void __task_isol_exit(struct task_struct *tsk);
> > +static inline void task_isol_exit(struct task_struct *tsk)
> > +{
> > +	if (tsk->task_isol_info)
> > +		__task_isol_exit(tsk);
> > +}
> > +
> >  int prctl_task_isol_feat_get(unsigned long arg2, unsigned long arg3,
> >  			     unsigned long arg4, unsigned long arg5);
> >  int prctl_task_isol_cfg_get(unsigned long arg2, unsigned long arg3,
> > @@ -40,12 +47,22 @@ int prctl_task_isol_activate_set(unsigne
> >  
> >  int __copy_task_isol(struct task_struct *tsk);
> >  
> > +void task_isol_exit_to_user_mode(void);
> > +
> >  #else
> >  
> > +static inline void task_isol_exit_to_user_mode(void)
> > +{
> > +}
> > +
> >  static inline void task_isol_free(struct task_struct *tsk)
> >  {
> >  }
> >  
> > +static inline void task_isol_exit(struct task_struct *tsk)
> > +{
> > +}
> > +
> >  static inline int prctl_task_isol_feat_get(unsigned long arg2,
> >  					   unsigned long arg3,
> >  					   unsigned long arg4,
> > Index: linux-2.6/include/linux/vmstat.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/vmstat.h
> > +++ linux-2.6/include/linux/vmstat.h
> > @@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
> >  		void *buffer, size_t *length, loff_t *ppos);
> >  #endif
> >  
> > +#if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
> > +void sync_vmstat(void);
> > +#else
> > +static inline void sync_vmstat(void)
> > +{
> > +}
> > +#endif
> > +
> >  struct reclaim_stat {
> >  	unsigned nr_dirty;
> >  	unsigned nr_unqueued_dirty;
> > Index: linux-2.6/kernel/entry/common.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/entry/common.c
> > +++ linux-2.6/kernel/entry/common.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/livepatch.h>
> >  #include <linux/audit.h>
> >  #include <linux/tick.h>
> > +#include <linux/task_isolation.h>
> >  
> >  #include "common.h"
> >  
> > @@ -174,6 +175,9 @@ static unsigned long exit_to_user_mode_l
> >  		if (ti_work & _TIF_NOTIFY_RESUME)
> >  			tracehook_notify_resume(regs);
> >  
> > +		if (ti_work & _TIF_TASK_ISOL)
> > +			task_isol_exit_to_user_mode();
> > +
> >  		/* Architecture specific TIF work */
> >  		arch_exit_to_user_mode_work(regs, ti_work);
> >  
> > Index: linux-2.6/kernel/task_isolation.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/task_isolation.c
> > +++ linux-2.6/kernel/task_isolation.c
> > @@ -18,6 +18,12 @@
> >  #include <linux/sysfs.h>
> >  #include <linux/init.h>
> >  #include <linux/sched/task.h>
> > +#include <linux/mm.h>
> > +#include <linux/vmstat.h>
> > +
> > +void __task_isol_exit(struct task_struct *tsk)
> > +{
> > +}
> >  
> >  void __task_isol_free(struct task_struct *tsk)
> >  {
> > @@ -251,6 +257,9 @@ static int cfg_feat_quiesce_set(unsigned
> >  	task_isol_info->quiesce_mask = i_qctrl->quiesce_mask;
> >  	task_isol_info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
> >  	task_isol_info->conf_mask |= ISOL_F_QUIESCE;
> > +	if (task_isol_info->quiesce_mask & ISOL_F_QUIESCE_VMSTATS)
> > +		set_thread_flag(TIF_TASK_ISOL);
> 
> Should you check if (i->active_mask == ISOL_F_QUIESCE) before setting the
> flag?
> 
> > +
> >  	ret = 0;
> >  
> >  out_free:
> > @@ -303,6 +312,7 @@ int __copy_task_isol(struct task_struct
> >  		new_info->active_mask = info->active_mask;
> >  
> >  	tsk->task_isol_info = new_info;
> > +	set_ti_thread_flag(task_thread_info(tsk), TIF_TASK_ISOL);
> 
> Same here?

Yes, should fix that.

Will wait for more comments before resending -v12.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 07/13] task isolation: sync vmstats conditional on changes
  2022-02-04 17:35 ` [patch v11 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2022-02-07 15:19   ` Frederic Weisbecker
  2022-03-10 16:30     ` Marcelo Tosatti
  2022-02-07 15:29   ` Frederic Weisbecker
  1 sibling, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2022-02-07 15:19 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Fri, Feb 04, 2022 at 02:35:44PM -0300, Marcelo Tosatti wrote:
> Rather than syncing VM-stats on every return to userspace
> (or VM-entry), keep track of changes through a per-CPU bool.
> 
> This improves performance when enabling task isolated
> for vcpu VMs.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> v11:
> - Add TIF_TASK_ISOL bit to thread info flags and use it
>   to decide whether to perform task isolation work on
>   return to userspace     
> 
>  include/linux/vmstat.h |   13 ++++++++++++-
>  mm/vmstat.c            |   29 ++++++++++++++++++++++++++++-
>  2 files changed, 40 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h
> +++ linux-2.6/include/linux/vmstat.h
> @@ -22,7 +22,18 @@ int sysctl_vm_numa_stat_handler(struct c
>  #endif
>  
>  #if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
> -void sync_vmstat(void);
> +DECLARE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +extern struct static_key vmstat_sync_enabled;
> +
> +void __sync_vmstat(void);
> +static inline void sync_vmstat(void)
> +{
> +	if (static_key_false(&vmstat_sync_enabled))
> +		__sync_vmstat();
> +}
> +
> +void init_sync_vmstat(void);
>  #else
>  static inline void sync_vmstat(void)
>  {
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -334,6 +334,31 @@ void set_pgdat_percpu_threshold(pg_data_
>  	}
>  }
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +struct static_key vmstat_sync_enabled;
> +DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +static inline void mark_vmstat_dirty(void)
> +{
> +	if (!static_key_false(&vmstat_sync_enabled))
> +		return;
> +
> +	raw_cpu_write(vmstat_dirty, true);

Why not __this_cpu_write() ? Shouldn't we make sure we are not
preemptible and not mark the wrong CPU?

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 07/13] task isolation: sync vmstats conditional on changes
  2022-02-04 17:35 ` [patch v11 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
  2022-02-07 15:19   ` Frederic Weisbecker
@ 2022-02-07 15:29   ` Frederic Weisbecker
  1 sibling, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2022-02-07 15:29 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Fri, Feb 04, 2022 at 02:35:44PM -0300, Marcelo Tosatti wrote:
> Rather than syncing VM-stats on every return to userspace
> (or VM-entry), keep track of changes through a per-CPU bool.
> 
> This improves performance when enabling task isolated
> for vcpu VMs.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> v11:
> - Add TIF_TASK_ISOL bit to thread info flags and use it
>   to decide whether to perform task isolation work on
>   return to userspace     
> 
>  include/linux/vmstat.h |   13 ++++++++++++-
>  mm/vmstat.c            |   29 ++++++++++++++++++++++++++++-
>  2 files changed, 40 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h
> +++ linux-2.6/include/linux/vmstat.h
> @@ -22,7 +22,18 @@ int sysctl_vm_numa_stat_handler(struct c
>  #endif
>  
>  #if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
> -void sync_vmstat(void);
> +DECLARE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +extern struct static_key vmstat_sync_enabled;
> +
> +void __sync_vmstat(void);
> +static inline void sync_vmstat(void)
> +{
> +	if (static_key_false(&vmstat_sync_enabled))
> +		__sync_vmstat();
> +}
> +
> +void init_sync_vmstat(void);
>  #else
>  static inline void sync_vmstat(void)
>  {
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -334,6 +334,31 @@ void set_pgdat_percpu_threshold(pg_data_
>  	}
>  }
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +struct static_key vmstat_sync_enabled;
> +DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +static inline void mark_vmstat_dirty(void)
> +{
> +	if (!static_key_false(&vmstat_sync_enabled))
> +		return;
> +
> +	raw_cpu_write(vmstat_dirty, true);
> +	set_thread_flag(TIF_TASK_ISOL);

This will force the return-to-user slow path on all CPUs,
whether the current task is isolated or not, right?

Also if we context switch to an isolated TASK B with a dirty vmstat,
we miss the TIF_TASK_ISOL.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info
  2022-02-04 17:35 ` [patch v11 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
@ 2022-02-07 15:36   ` Frederic Weisbecker
  0 siblings, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2022-02-07 15:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Fri, Feb 04, 2022 at 02:35:46PM -0300, Marcelo Tosatti wrote:
> If a thread has task isolation activated, is preempted by thread B,
> which marks vmstat information dirty, and is preempted back in,
> one might return to userspace with vmstat dirty information on the 
> CPU in question.
> 
> To address this problem, add a preempt notifier that transfers vmstat dirty
> information to TIF_TASK_ISOL thread flag.

Ah ok that answers my previous question :)

> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -353,6 +353,13 @@ void init_sync_vmstat(void)
>  	set_thread_flag(TIF_TASK_ISOL);
>  }
>  EXPORT_SYMBOL_GPL(vmstat_dirty);
> +
> +void vmstat_dirty_to_thread_flag(void)
> +{
> +	if (raw_cpu_read(vmstat_dirty) == true)

__this_cpu_read()?

> +		set_thread_flag(TIF_TASK_ISOL);
> +}
> +EXPORT_SYMBOL_GPL(vmstat_dirty_to_thread_flag);
>  #else
>  static inline void mark_vmstat_dirty(void)
>  {
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled
  2022-02-04 17:35 ` [patch v11 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
@ 2022-02-07 15:38   ` Frederic Weisbecker
  0 siblings, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2022-02-07 15:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Fri, Feb 04, 2022 at 02:35:50PM -0300, Marcelo Tosatti wrote:
> This avoids processing of TIF_TASK_ISOL, when returning to userspace,
> for tasks which do not have task isolation configured.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Ok that answers another of my questions :-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 10/13] KVM: x86: process isolation work from VM-entry code path
  2022-02-04 17:35 ` [patch v11 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
@ 2022-02-07 15:47   ` Frederic Weisbecker
  2022-03-10 16:35     ` Marcelo Tosatti
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2022-02-07 15:47 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Fri, Feb 04, 2022 at 02:35:47PM -0300, Marcelo Tosatti wrote:
> VM-entry code path is an entry point similar to userspace return
> when task isolation is concerned.
> 
> Call isolation_exit_to_user_mode before VM-enter.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> v11
> - Add TIF_TASK_ISOL bit to thread info flags and use it
>   to decide whether to perform task isolation work on
>   return to userspace                                   (Frederic W. Weisbecker)
> 
>  include/linux/entry-kvm.h |    4 +++-
>  kernel/entry/kvm.c        |   18 ++++++++++++++----
>  2 files changed, 17 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/kernel/entry/kvm.c
> ===================================================================
> --- linux-2.6.orig/kernel/entry/kvm.c
> +++ linux-2.6/kernel/entry/kvm.c
> @@ -2,6 +2,7 @@
>  
>  #include <linux/entry-kvm.h>
>  #include <linux/kvm_host.h>
> +#include <linux/task_isolation.h>
>  
>  static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
>  {
> @@ -22,6 +23,9 @@ static int xfer_to_guest_mode_work(struc
>  		if (ti_work & _TIF_NOTIFY_RESUME)
>  			tracehook_notify_resume(NULL);
>  
> +		if (ti_work & _TIF_TASK_ISOL)
> +			task_isol_exit_to_user_mode();
> +
>  		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
>  		if (ret)
>  			return ret;
> 
> 

Do you need this?

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 07c878d6e323..3588d6071caf 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
 
 #define XFER_TO_GUEST_MODE_WORK						\
 	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
-	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+	 _TIF_NOTIFY_RESUME | _TIF_TASK_ISOL | ARCH_XFER_TO_GUEST_MODE_WORK)
 
 struct kvm_vcpu;
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (12 preceding siblings ...)
  2022-02-04 17:35 ` [patch v11 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
@ 2022-02-19  8:02 ` Oscar Shiang
  2022-02-23 17:31   ` Marcelo Tosatti
  2022-03-08  7:20 ` Oscar Shiang
  14 siblings, 1 reply; 30+ messages in thread
From: Oscar Shiang @ 2022-02-19  8:02 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira

Hi Marcelo,

I tried to apply your patches to kernel v5.15.18-rt28 and measured
the latencies through oslat [1].

It turns out that the peak latency (around 100us) can drop to about 90us.
The result is impressive since I only changed the guest's kernel
instead of installing the patched kernel to both host and guest.

However, I am still curious about:
1) Why did I catch a bigger maximum latency in almost each of the
   results of applying task isolation patches? Or does it come from
   other reasons?
2) Why did we only get a 10us improvement on quiescing vmstat?

[1]: The result and the test scripts I used can be found at
https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat

Thanks,
Oscar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync
  2022-02-19  8:02 ` [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Oscar Shiang
@ 2022-02-23 17:31   ` Marcelo Tosatti
  2022-03-08  6:32     ` Oscar Shiang
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2022-02-23 17:31 UTC (permalink / raw)
  To: Oscar Shiang
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira

Hi Oscar,

On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> Hi Marcelo,
> 
> I tried to apply your patches to kernel v5.15.18-rt28 and measured
> the latencies through oslat [1].
> 
> It turns out that the peak latency (around 100us) can drop to about 90us.
> The result is impressive since I only changed the guest's kernel
> instead of installing the patched kernel to both host and guest.
> 
> However, I am still curious about:
> 1) Why did I catch a bigger maximum latency in almost each of the
>    results of applying task isolation patches? Or does it come from
>    other reasons?

There are a number of things that need to be done in order to have an 
"well enough" isolated CPU so you can measure latency reliably:

* Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
https://github.com/redhat-performance/tuned.git, which does a bunch of
other things to avoid interruptions to isolated CPUs).
* Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
to util-linux and rt-tests.

Run oslat with chisol:

chisol -q vmstat_sync -I conf oslat -c ...

Where chisol is from patched util-linux and oslat from patched rt-tests.

If you had "-f 1" (FIFO priority), on oslat, then the vmstat work would be hung.

Are you doing those things?

> 2) Why did we only get a 10us improvement on quiescing vmstat?

If you did not have FIFO priority on oslat, then other daemons 
could be interrupting it, so better make sure the 10us improvement 
you see is due to vmstat_flush workqueue work not executing anymore.

The testcase i use is: 

Stock kernel:

terminal 1: 
# oslat -f 1 -c X ...

terminal 2:
# echo 1 > /proc/sys/vm/stat_refresh
(hang)

Patched kernel:

terminal 1: 
# chisol -q vmstat_sync -I conf oslat -f 1 -c X ...

terminal 2:
# echo 1 > /proc/sys/vm/stat_refresh
# 

> [1]: The result and the test scripts I used can be found at
> https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat

OK, you seem to be doing everything necessary for chisol 
to work. Does /proc/pid/task_isolation of the oslat worker thread
(note its not the same pid as the main oslat thread) show "vmstat"
configured and activated for quiesce?

However 100us is really high. You should be able to get < 10us with
realtime-virtual-host (i see 4us on an idle system).

The answer might be: because 10us is what it takes to execute
vmstat_worker on the isolated CPU (you can verify with tracepoints).

That time depends on the number of per-CPU vmstat variables that need flushing, 
i suppose...



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync
  2022-02-23 17:31   ` Marcelo Tosatti
@ 2022-03-08  6:32     ` Oscar Shiang
  2022-03-08 13:12       ` Marcelo Tosatti
  0 siblings, 1 reply; 30+ messages in thread
From: Oscar Shiang @ 2022-03-08  6:32 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira

On Feb 24, 2022, at 1:31 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> Hi Oscar,
> 
> On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> > Hi Marcelo,
> > 
> > I tried to apply your patches to kernel v5.15.18-rt28 and measured
> > the latencies through oslat [1].
> > 
> > It turns out that the peak latency (around 100us) can drop to about 90us.
> > The result is impressive since I only changed the guest's kernel
> > instead of installing the patched kernel to both host and guest.
> > 
> > However, I am still curious about:
> > 1) Why did I catch a bigger maximum latency in almost each of the
> >   results of applying task isolation patches? Or does it come from
> >   other reasons?
> 
> There are a number of things that need to be done in order to have an 
> "well enough" isolated CPU so you can measure latency reliably:
> 
> * Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
> https://github.com/redhat-performance/tuned.git, which does a bunch of
> other things to avoid interruptions to isolated CPUs).
> * Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> to util-linux and rt-tests.
> 
> Run oslat with chisol:
> 
> chisol -q vmstat_sync -I conf oslat -c ...
> 
> Where chisol is from patched util-linux and oslat from patched rt-tests.
> 
> If you had "-f 1" (FIFO priority), on oslat, then the vmstat work would be hung.
> 
> Are you doing those things?
> 
> > 2) Why did we only get a 10us improvement on quiescing vmstat?
> 
> If you did not have FIFO priority on oslat, then other daemons 
> could be interrupting it, so better make sure the 10us improvement 
> you see is due to vmstat_flush workqueue work not executing anymore.
> 
> The testcase i use is: 
> 
> Stock kernel:
> 
> terminal 1: 
> # oslat -f 1 -c X ...
> 
> terminal 2:
> # echo 1 > /proc/sys/vm/stat_refresh
> (hang)
> 
> Patched kernel:
> 
> terminal 1: 
> # chisol -q vmstat_sync -I conf oslat -f 1 -c X ...
> 
> terminal 2:
> # echo 1 > /proc/sys/vm/stat_refresh
> # 

Sure, I did see the terminal hung during oslat with FIFO priority.

BTW, thanks for providing this test case. I used to run all workload stuff to just
verify the improvement of task isolation. It is a more straightr- forward way to do.

> > [1]: The result and the test scripts I used can be found at
> > https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat
> 
> OK, you seem to be doing everything necessary for chisol 
> to work. Does /proc/pid/task_isolation of the oslat worker thread
> (note its not the same pid as the main oslat thread) show "vmstat"
> configured and activated for quiesce?

The status of task_isolation seems to be set properly with "vmstat" and activated

> However 100us is really high. You should be able to get < 10us with
> realtime-virtual-host (i see 4us on an idle system).
> 
> The answer might be: because 10us is what it takes to execute
> vmstat_worker on the isolated CPU (you can verify with tracepoints).
> 
> That time depends on the number of per-CPU vmstat variables that need flushing, 
> i suppose...

Considering the interferences outside of the KVM, I have redone the measurements
directly on my x86_64 computer [1].

As result, most of the latencies are down to 60us (and below). There are still
some latencies larger than 80us, I am working on and trying to figure out the reason.

[1]: https://gist.github.com/OscarShiang/202eb691e649557fe3eaa5ec67a5aa82

Thanks,
Oscar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync
  2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (13 preceding siblings ...)
  2022-02-19  8:02 ` [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Oscar Shiang
@ 2022-03-08  7:20 ` Oscar Shiang
  14 siblings, 0 replies; 30+ messages in thread
From: Oscar Shiang @ 2022-03-08  7:20 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira

Hi Marcelo,

I also tried to enable task isolation on arm64 with the following changes.

Maybe you can consider these in next version :)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 6623c99f0984..c1257bca1763 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -72,6 +72,7 @@ int arch_dup_task_struct(struct task_struct *dst,
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
 #define TIF_SECCOMP		11	/* syscall secure computing */
 #define TIF_SYSCALL_EMU		12	/* syscall emulation active */
+#define TIF_TASK_ISOL		13      /* task isolation work pending */
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
@@ -85,6 +86,7 @@ int arch_dup_task_struct(struct task_struct *dst,
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
+#define _TIF_TASK_ISOL		(1 << TIF_TASK_ISOL)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index c1257bca1763..c136850d623c 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -103,7 +103,7 @@ int arch_dup_task_struct(struct task_struct *dst,
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
 				 _TIF_UPROBE | _TIF_MTE_ASYNC_FAULT | \
-				 _TIF_NOTIFY_SIGNAL)
+				 _TIF_NOTIFY_SIGNAL | _TIF_TASK_ISOL)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index c287b9407f28..8308f6dc5d4b 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/task_isolation.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -945,6 +946,9 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
 
 			if (thread_flags & _TIF_FOREIGN_FPSTATE)
 				fpsimd_restore_current_state();
+
+			if (thread_flags & _TIF_TASK_ISOL)
+				task_isol_exit_to_user_mode();
 		}
 
 		local_daif_mask();

Thanks,
Oscar

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync
  2022-03-08  6:32     ` Oscar Shiang
@ 2022-03-08 13:12       ` Marcelo Tosatti
  2022-03-09 15:31         ` Oscar Shiang
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2022-03-08 13:12 UTC (permalink / raw)
  To: Oscar Shiang
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira

On Tue, Mar 08, 2022 at 02:32:46PM +0800, Oscar Shiang wrote:
> On Feb 24, 2022, at 1:31 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > Hi Oscar,
> > 
> > On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> > > Hi Marcelo,
> > > 
> > > I tried to apply your patches to kernel v5.15.18-rt28 and measured
> > > the latencies through oslat [1].
> > > 
> > > It turns out that the peak latency (around 100us) can drop to about 90us.
> > > The result is impressive since I only changed the guest's kernel
> > > instead of installing the patched kernel to both host and guest.
> > > 
> > > However, I am still curious about:
> > > 1) Why did I catch a bigger maximum latency in almost each of the
> > >   results of applying task isolation patches? Or does it come from
> > >   other reasons?
> > 
> > There are a number of things that need to be done in order to have an 
> > "well enough" isolated CPU so you can measure latency reliably:
> > 
> > * Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
> > https://github.com/redhat-performance/tuned.git, which does a bunch of
> > other things to avoid interruptions to isolated CPUs).
> > * Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> > to util-linux and rt-tests.
> > 
> > Run oslat with chisol:
> > 
> > chisol -q vmstat_sync -I conf oslat -c ...
> > 
> > Where chisol is from patched util-linux and oslat from patched rt-tests.
> > 
> > If you had "-f 1" (FIFO priority), on oslat, then the vmstat work would be hung.
> > 
> > Are you doing those things?
> > 
> > > 2) Why did we only get a 10us improvement on quiescing vmstat?
> > 
> > If you did not have FIFO priority on oslat, then other daemons 
> > could be interrupting it, so better make sure the 10us improvement 
> > you see is due to vmstat_flush workqueue work not executing anymore.
> > 
> > The testcase i use is: 
> > 
> > Stock kernel:
> > 
> > terminal 1: 
> > # oslat -f 1 -c X ...
> > 
> > terminal 2:
> > # echo 1 > /proc/sys/vm/stat_refresh
> > (hang)
> > 
> > Patched kernel:
> > 
> > terminal 1: 
> > # chisol -q vmstat_sync -I conf oslat -f 1 -c X ...
> > 
> > terminal 2:
> > # echo 1 > /proc/sys/vm/stat_refresh
> > # 
> 
> Sure, I did see the terminal hung during oslat with FIFO priority.
> 
> BTW, thanks for providing this test case. I used to run all workload stuff to just
> verify the improvement of task isolation. It is a more straightr- forward way to do.
> 
> > > [1]: The result and the test scripts I used can be found at
> > > https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat
> > 
> > OK, you seem to be doing everything necessary for chisol 
> > to work. Does /proc/pid/task_isolation of the oslat worker thread
> > (note its not the same pid as the main oslat thread) show "vmstat"
> > configured and activated for quiesce?
> 
> The status of task_isolation seems to be set properly with "vmstat" and activated
> 
> > However 100us is really high. You should be able to get < 10us with
> > realtime-virtual-host (i see 4us on an idle system).
> > 
> > The answer might be: because 10us is what it takes to execute
> > vmstat_worker on the isolated CPU (you can verify with tracepoints).
> > 
> > That time depends on the number of per-CPU vmstat variables that need flushing, 
> > i suppose...
> 
> Considering the interferences outside of the KVM, I have redone the measurements
> directly on my x86_64 computer [1].
> 
> As result, most of the latencies are down to 60us (and below). There are still
> some latencies larger than 80us, I am working on and trying to figure out the reason.
> 
> [1]: https://gist.github.com/OscarShiang/202eb691e649557fe3eaa5ec67a5aa82

Oscar,

Did you confirm with hwlatdetect that the BIOS does not have long
running SMIs?

Also, for the software part, you could save time by using the
realtime-virtual-host profile (check /usr/lib/tuned/realtime-virtual-host/
to see what its doing in addition to isolcpus=).


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync
  2022-03-08 13:12       ` Marcelo Tosatti
@ 2022-03-09 15:31         ` Oscar Shiang
  0 siblings, 0 replies; 30+ messages in thread
From: Oscar Shiang @ 2022-03-09 15:31 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner,
	Daniel Bristot de Oliveira

On Mar 8, 2022, at 9:12 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Mar 08, 2022 at 02:32:46PM +0800, Oscar Shiang wrote:
> > On Feb 24, 2022, at 1:31 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > Hi Oscar,
> > > 
> > > On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> > > > Hi Marcelo,
> > > > 
> > > > I tried to apply your patches to kernel v5.15.18-rt28 and measured
> > > > the latencies through oslat [1].
> > > > 
> > > > It turns out that the peak latency (around 100us) can drop to about 90us.
> > > > The result is impressive since I only changed the guest's kernel
> > > > instead of installing the patched kernel to both host and guest.
> > > > 
> > > > However, I am still curious about:
> > > > 1) Why did I catch a bigger maximum latency in almost each of the
> > > >   results of applying task isolation patches? Or does it come from
> > > >   other reasons?
> > > 
> > > There are a number of things that need to be done in order to have an 
> > > "well enough" isolated CPU so you can measure latency reliably:
> > > 
> > > * Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
> > > https://github.com/redhat-performance/tuned.git, which does a bunch of
> > > other things to avoid interruptions to isolated CPUs).
> > > * Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> > > to util-linux and rt-tests.
> > > 
> > > Run oslat with chisol:
> > > 
> > > chisol -q vmstat_sync -I conf oslat -c ...
> > > 
> > > Where chisol is from patched util-linux and oslat from patched rt-tests.
> > > 
> > > If you had "-f 1" (FIFO priority), on oslat, then the vmstat work would be hung.
> > > 
> > > Are you doing those things?
> > > 
> > > > 2) Why did we only get a 10us improvement on quiescing vmstat?
> > > 
> > > If you did not have FIFO priority on oslat, then other daemons 
> > > could be interrupting it, so better make sure the 10us improvement 
> > > you see is due to vmstat_flush workqueue work not executing anymore.
> > > 
> > > The testcase i use is: 
> > > 
> > > Stock kernel:
> > > 
> > > terminal 1: 
> > > # oslat -f 1 -c X ...
> > > 
> > > terminal 2:
> > > # echo 1 > /proc/sys/vm/stat_refresh
> > > (hang)
> > > 
> > > Patched kernel:
> > > 
> > > terminal 1: 
> > > # chisol -q vmstat_sync -I conf oslat -f 1 -c X ...
> > > 
> > > terminal 2:
> > > # echo 1 > /proc/sys/vm/stat_refresh
> > > # 
> > 
> > Sure, I did see the terminal hung during oslat with FIFO priority.
> > 
> > BTW, thanks for providing this test case. I used to run all workload stuff to just
> > verify the improvement of task isolation. It is a more straightr- forward way to do.
> > 
> > > > [1]: The result and the test scripts I used can be found at
> > > > https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat
> > > 
> > > OK, you seem to be doing everything necessary for chisol 
> > > to work. Does /proc/pid/task_isolation of the oslat worker thread
> > > (note its not the same pid as the main oslat thread) show "vmstat"
> > > configured and activated for quiesce?
> > 
> > The status of task_isolation seems to be set properly with "vmstat" and activated
> > 
> > > However 100us is really high. You should be able to get < 10us with
> > > realtime-virtual-host (i see 4us on an idle system).
> > > 
> > > The answer might be: because 10us is what it takes to execute
> > > vmstat_worker on the isolated CPU (you can verify with tracepoints).
> > > 
> > > That time depends on the number of per-CPU vmstat variables that need flushing, 
> > > i suppose...
> > 
> > Considering the interferences outside of the KVM, I have redone the measurements
> > directly on my x86_64 computer [1].
> > 
> > As result, most of the latencies are down to 60us (and below). There are still
> > some latencies larger than 80us, I am working on and trying to figure out the reason.
> > 
> > [1]: https://gist.github.com/OscarShiang/202eb691e649557fe3eaa5ec67a5aa82
> 
> Oscar,
> 
> Did you confirm with hwlatdetect that the BIOS does not have long
> running SMIs?

Marcelo,

I have run hwlatdetect and the result is shown below:

hwlatdetect:  test duration 900 seconds
   detector: tracer
   parameters:
        Latency threshold: 0us
        Sample window:     1000000us
        Sample width:      500000us
     Non-sampling period:  500000us
        Output File:       test.report

Starting test
test finished
Max Latency: 48us
Samples recorded: 37
Samples exceeding threshold: 37
SMIs during run: 0
report saved to test.report (37 samples)
ts: 1646837340.346151918, inner:5, outer:9
ts: 1646837351.550312752, inner:46, outer:45
ts: 1646837381.549331178, inner:45, outer:0
ts: 1646837400.008623200, inner:0, outer:9
ts: 1646837425.578093371, inner:0, outer:8
ts: 1646837429.587363003, inner:0, outer:1
ts: 1646837436.549050243, inner:45, outer:45
ts: 1646837580.005173999, inner:0, outer:9
ts: 1646837605.591017161, inner:7, outer:8
ts: 1646837635.552410329, inner:0, outer:1
ts: 1646837639.246489489, inner:9, outer:5
ts: 1646837645.426611917, inner:9, outer:5
ts: 1646837651.550721975, inner:0, outer:1
ts: 1646837728.549928137, inner:40, outer:47
ts: 1646837756.281606376, inner:1, outer:8
ts: 1646837757.492693661, inner:0, outer:1
ts: 1646837759.355807689, inner:13, outer:13
ts: 1646837761.590570928, inner:0, outer:1
ts: 1646837762.382475433, inner:5, outer:9
ts: 1646837764.185172836, inner:0, outer:9
ts: 1646837768.675668348, inner:1, outer:0
ts: 1646837776.485184319, inner:0, outer:1
ts: 1646837777.544517878, inner:45, outer:41
ts: 1646837855.549140154, inner:45, outer:0
ts: 1646837886.492523509, inner:1, outer:0
ts: 1646837897.544172247, inner:45, outer:0
ts: 1646837933.550015925, inner:48, outer:48
ts: 1646837981.546557947, inner:45, outer:45
ts: 1646837998.051615598, inner:0, outer:1
ts: 1646838030.550099263, inner:45, outer:0
ts: 1646838072.549664559, inner:0, outer:46
ts: 1646838106.571038324, inner:0, outer:1
ts: 1646838107.547228682, inner:1, outer:0
ts: 1646838114.549686904, inner:45, outer:45
ts: 1646838162.549477219, inner:46, outer:47
ts: 1646838180.014465679, inner:5, outer:9
ts: 1646838237.486873064, inner:0, outer:1

It seems that there is no SMI occurring (but some other latencies around 40us?)

> Also, for the software part, you could save time by using the
> realtime-virtual-host profile (check /usr/lib/tuned/realtime-virtual-host/
> to see what its doing in addition to isolcpus=).

Yes, I have switched to realtime-virtual-host profile with its kernel cmdline args.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 07/13] task isolation: sync vmstats conditional on changes
  2022-02-07 15:19   ` Frederic Weisbecker
@ 2022-03-10 16:30     ` Marcelo Tosatti
  0 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-03-10 16:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Mon, Feb 07, 2022 at 04:19:15PM +0100, Frederic Weisbecker wrote:
> On Fri, Feb 04, 2022 at 02:35:44PM -0300, Marcelo Tosatti wrote:
> > Rather than syncing VM-stats on every return to userspace
> > (or VM-entry), keep track of changes through a per-CPU bool.
> > 
> > This improves performance when enabling task isolated
> > for vcpu VMs.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> > v11:
> > - Add TIF_TASK_ISOL bit to thread info flags and use it
> >   to decide whether to perform task isolation work on
> >   return to userspace     
> > 
> >  include/linux/vmstat.h |   13 ++++++++++++-
> >  mm/vmstat.c            |   29 ++++++++++++++++++++++++++++-
> >  2 files changed, 40 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6/include/linux/vmstat.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/vmstat.h
> > +++ linux-2.6/include/linux/vmstat.h
> > @@ -22,7 +22,18 @@ int sysctl_vm_numa_stat_handler(struct c
> >  #endif
> >  
> >  #if defined(CONFIG_SMP) && defined(CONFIG_TASK_ISOLATION)
> > -void sync_vmstat(void);
> > +DECLARE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> > +
> > +extern struct static_key vmstat_sync_enabled;
> > +
> > +void __sync_vmstat(void);
> > +static inline void sync_vmstat(void)
> > +{
> > +	if (static_key_false(&vmstat_sync_enabled))
> > +		__sync_vmstat();
> > +}
> > +
> > +void init_sync_vmstat(void);
> >  #else
> >  static inline void sync_vmstat(void)
> >  {
> > Index: linux-2.6/mm/vmstat.c
> > ===================================================================
> > --- linux-2.6.orig/mm/vmstat.c
> > +++ linux-2.6/mm/vmstat.c
> > @@ -334,6 +334,31 @@ void set_pgdat_percpu_threshold(pg_data_
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_TASK_ISOLATION
> > +struct static_key vmstat_sync_enabled;
> > +DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> > +
> > +static inline void mark_vmstat_dirty(void)
> > +{
> > +	if (!static_key_false(&vmstat_sync_enabled))
> > +		return;
> > +
> > +	raw_cpu_write(vmstat_dirty, true);
> 
> Why not __this_cpu_write() ? Shouldn't we make sure we are not
> preemptible and not mark the wrong CPU?

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
/*
 * If we have cmpxchg_local support then we do not need to incur the overhead
 * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
 *
 * mod_state() modifies the zone counter state through atomic per cpu
 * operations.
 *
 * Overstep mode specifies how overstep should handled:
 *     0       No overstepping
 *     1       Overstepping half of threshold
 *     -1      Overstepping minus half of threshold
*/
static inline void mod_zone_state(struct zone *zone,
       enum zone_stat_item item, long delta, int overstep_mode)
{
        struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
        s8 __percpu *p = pcp->vm_stat_diff + item;
        long o, n, t, z;

        do {
                z = 0;  /* overflow to zone counters */

Perhaps one can

                        n = -os;
                }
        } while (this_cpu_cmpxchg(*p, o, n) != o);
							<-- migrate
        if (z)
                zone_page_state_add(z, zone, item);
        mark_vmstat_dirty();


But we assume the task which is using task isolation is pinned to a single CPU,
so that should not happen.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch v11 10/13] KVM: x86: process isolation work from VM-entry code path
  2022-02-07 15:47   ` Frederic Weisbecker
@ 2022-03-10 16:35     ` Marcelo Tosatti
  0 siblings, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2022-03-10 16:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Oscar Shiang

On Mon, Feb 07, 2022 at 04:47:07PM +0100, Frederic Weisbecker wrote:
> On Fri, Feb 04, 2022 at 02:35:47PM -0300, Marcelo Tosatti wrote:
> > VM-entry code path is an entry point similar to userspace return
> > when task isolation is concerned.
> > 
> > Call isolation_exit_to_user_mode before VM-enter.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> > v11
> > - Add TIF_TASK_ISOL bit to thread info flags and use it
> >   to decide whether to perform task isolation work on
> >   return to userspace                                   (Frederic W. Weisbecker)
> > 
> >  include/linux/entry-kvm.h |    4 +++-
> >  kernel/entry/kvm.c        |   18 ++++++++++++++----
> >  2 files changed, 17 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6/kernel/entry/kvm.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/entry/kvm.c
> > +++ linux-2.6/kernel/entry/kvm.c
> > @@ -2,6 +2,7 @@
> >  
> >  #include <linux/entry-kvm.h>
> >  #include <linux/kvm_host.h>
> > +#include <linux/task_isolation.h>
> >  
> >  static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> >  {
> > @@ -22,6 +23,9 @@ static int xfer_to_guest_mode_work(struc
> >  		if (ti_work & _TIF_NOTIFY_RESUME)
> >  			tracehook_notify_resume(NULL);
> >  
> > +		if (ti_work & _TIF_TASK_ISOL)
> > +			task_isol_exit_to_user_mode();
> > +
> >  		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
> >  		if (ret)
> >  			return ret;
> > 
> > 
> 
> Do you need this?

Yes, dropped it somehow, thanks.

> 
> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
> index 07c878d6e323..3588d6071caf 100644
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -18,7 +18,7 @@
>  
>  #define XFER_TO_GUEST_MODE_WORK						\
>  	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
> -	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> +	 _TIF_NOTIFY_RESUME | _TIF_TASK_ISOL | ARCH_XFER_TO_GUEST_MODE_WORK)
>  
>  struct kvm_vcpu;
>  
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-03-10 17:00 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-04 17:35 [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
2022-02-07 11:40   ` Sven Schnelle
2022-02-04 17:35 ` [patch v11 02/13] x86: " Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 03/13] add basic task isolation prctl interface Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2022-02-07 14:57   ` Frederic Weisbecker
2022-02-07 15:16     ` Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
2022-02-07 15:19   ` Frederic Weisbecker
2022-03-10 16:30     ` Marcelo Tosatti
2022-02-07 15:29   ` Frederic Weisbecker
2022-02-04 17:35 ` [patch v11 08/13] task isolation: enable return to userspace processing Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
2022-02-07 15:36   ` Frederic Weisbecker
2022-02-04 17:35 ` [patch v11 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
2022-02-07 15:47   ` Frederic Weisbecker
2022-03-10 16:35     ` Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 11/13] mm: vmstat: move need_update Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
2022-02-04 17:35 ` [patch v11 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
2022-02-07 15:38   ` Frederic Weisbecker
2022-02-19  8:02 ` [patch v11 00/13] extensible prctl task isolation interface and vmstat sync Oscar Shiang
2022-02-23 17:31   ` Marcelo Tosatti
2022-03-08  6:32     ` Oscar Shiang
2022-03-08 13:12       ` Marcelo Tosatti
2022-03-09 15:31         ` Oscar Shiang
2022-03-08  7:20 ` Oscar Shiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.