All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch v7 00/10] extensible prctl task isolation interface and vmstat sync
@ 2021-11-12 12:35 Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 01/10] add basic task isolation prctl interface Marcelo Tosatti
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, add task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 1) for details. Its attached at the end of this message
in .txt format as well.

Note: the prctl interface is independent of nohz_full=.

The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/

- qemu-task-isolation.patch: activate task isolation from CPU execution loop
- rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
- util-linux-chisol.patch: add chisol tool to util-linux.

---

v7
 - no changes (resending series without changelog corruption).

v6
 - Move oneshot mode enablement to configuration time (Frederic Weisbecker).
 - Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
 - Update docs and samples regarding oneshot mode     (Frederic Weisbecker).
 - Update docs and samples regarding more extensibility of
   CFG_SET of ISOL_F_QUIESCE                          (Frederic Weisbecker).
 - prctl_task_isolation_activate_get should copy active_mask
   to address in arg2.
 - modify exit_to_user_mode_loop to cover exceptions
   and interrupts.
 - split exit hooks into its own patch

v5
  - Add changelogs to individual patches                (Peter Zijlstra).
  - Add documentation to patchset intro                 (Peter Zijlstra).

v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and use drop
   "internal configuration" prctls            (Frederic Weisbecker).
 - Add summary of terms to documentation      (Frederic Weisbecker).
 - Examples for compute and one-shot modes    (Thomas G/Christoph L).

v3:

 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
 - Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.
 - Report configuration/activation state at
   /proc/pid/task_isolation.
 - Condense dirty information of per-CPU vmstats counters
   in a bool.
 - In-kernel KVM support.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).



v5 can be found at:
https://lore.kernel.org/lkml/20211019152431.885037499@fedora.localdomain/

v4 can be found at:
https://lore.kernel.org/all/20211007192346.731667417@fedora.localdomain/

v3 can be found at:
https://lore.kernel.org/lkml/20210824152423.300346181@fuller.cnet/

v2 can be found at:
https://lore.kernel.org/patchwork/project/lkml/list/?series=510225

---

 Documentation/userspace-api/task_isolation.rst |  370 +++++++++++++++++++++++
 fs/proc/base.c                                 |   68 ++++
 include/linux/entry-kvm.h                      |    4 
 include/linux/sched.h                          |    5 
 include/linux/task_isolation.h                 |  138 ++++++++
 include/linux/vmstat.h                         |   19 +
 include/uapi/linux/prctl.h                     |   43 ++
 init/init_task.c                               |    3 
 kernel/Makefile                                |    2 
 kernel/entry/common.c                          |   15 
 kernel/entry/kvm.c                             |   18 -
 kernel/exit.c                                  |    2 
 kernel/fork.c                                  |   23 +
 kernel/sys.c                                   |   16 +
 kernel/task_isolation.c                        |  388 +++++++++++++++++++++++++
 mm/vmstat.c                                    |  155 +++++++--
 samples/Kconfig                                |    7 
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |   11 
 samples/task_isolation/task_isol.c             |   92 +++++
 samples/task_isolation/task_isol.h             |    9 
 samples/task_isolation/task_isol_computation.c |   89 +++++
 samples/task_isolation/task_isol_oneshot.c     |  104 ++++++
 samples/task_isolation/task_isol_userloop.c    |   54 +++
 24 files changed, 1591 insertions(+), 45 deletions(-)

---


Task isolation prctl interface
******************************

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth
networking applications with user-space drivers can fall into the
category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel the start and end of
the latency sensitive application section (with configurable system
behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:

   * PR_ISOL_FEAT_GET: Retrieve supported features.

   * PR_ISOL_CFG_GET: Retrieve task isolation configuration.

   * PR_ISOL_CFG_SET: Set task isolation configuration.

   * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.

   * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.

Summary of terms:

* feature:

     A distinct attribute or aspect of task isolation. Examples of
     features could be logging, new operating modes (eg: syscalls
     disallowed), userspace notifications, etc. The only feature
     currently available is quiescing.

* configuration:

     A specific choice from a given set of possible choices that
     dictate how the particular feature in question should behave.

* activation state:

     The activation state (whether active/inactive) of the task
     isolation features (features must be configured before being
     activated).

Inheritance of the isolation parameters and state, across fork(2) and
clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.

At a high-level, task isolation is divided in two steps:

1. Configuration.

2. Activation.

Section "Userspace support" describes how to use task isolation.

In terms of the interface, the sequence of steps to activate task
isolation are:

1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).

2. Configure task isolation features
   (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).

3. Activate or deactivate task isolation features
   (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).

This interface is based on ideas and code from the task isolation
patchset from Alex Belits: https://lwn.net/Articles/816298/


Feature description
===================

   * "ISOL_F_QUIESCE"

   This feature allows quiescing selected kernel activities on return
   from system calls.


Interface description
=====================

**PR_ISOL_FEAT**:

   Returns the supported features and feature capabilities, as a
   bitmask:

      prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);

   The 'feat' argument specifies whether to return supported features
   (if zero), or feature capabilities (if not zero). Possible values
   for 'feat' are:

   * "0":

        Return the bitmask of supported features, in the location
        pointed  to  by  "(int *)arg3". The buffer should allow space
        for 8 bytes.

   * "ISOL_F_QUIESCE":

        Return a structure containing which kernel activities are
        supported for quiescing, in the location pointed to by "(int
        *)arg3":

           struct task_isol_quiesce_extensions {
                   __u64 flags;
                   __u64 supported_quiesce_bits;
                   __u64 pad[6];
           };

        Where:

        *flags*: Additional flags (should be zero).

        *supported_quiesce_bits*: Bitmask indicating
           which features are supported for quiescing.

        *pad*: Additional space for future enhancements.

   Features and its capabilities are defined at
   include/uapi/linux/task_isolation.h.

**PR_ISOL_CFG_GET**:

   Retrieve task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Return configuration of task isolation features. The 'arg3'
        argument specifies whether to return configured features (if
        zero), or individual feature configuration (if not zero), as
        follows.

        * "0":

             Return the bitmask of configured features, in the
             location pointed  to  by  "(int *)arg4". The buffer
             should allow space for 8 bytes.

        * "ISOL_F_QUIESCE":

             If arg4 is QUIESCE_CONTROL, return the control structure
             for quiescing of background kernel activities, in the
             location pointed to by "(int *)arg5":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 quiesce_oneshot_mask;
                       __u64 pad[5];
                };

             See PR_ISOL_CFG_GET description for meaning of fields.

   * "I_CFG_INHERIT":

        Retrieve inheritance configuration across fork/clone.

        Return the structure which configures inheritance across
        fork/clone, in the location pointed to by "(int *)arg4":

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        See PR_ISOL_CFG_SET description for meaning of fields.

**PR_ISOL_CFG_SET**:

   Set task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Set configuration of task isolation features. 'arg3' specifies
        the feature. Possible values are:

        * "ISOL_F_QUIESCE":

             If arg4 is QUIESCE_CONTROL, set the control structure for
             quiescing of background kernel activities, from the
             location pointed to by "(int *)arg5":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 quiesce_oneshot_mask;
                       __u64 pad[5];
                };

             Where:

             *flags*: Additional flags (should be zero).

             *quiesce_mask*: A bitmask containing which kernel
             activities to quiesce.

             *quiesce_oneshot_mask*: A bitmask indicating which kernel
             activities should behave in oneshot mode, that is,
             quiescing will happen on return from
             prctl(PR_ISOL_ACTIVATE_SET), but not on return of
             subsequent system calls. The corresponding bit(s) must
             also be set at quiesce_mask.

             *pad*: Additional space for future enhancements.

             For quiesce_mask (and quiesce_oneshot_mask), possible bit
             sets are:

             * "ISOL_F_QUIESCE_VMSTATS"

             VM statistics are maintained in per-CPU counters to
             improve performance. When a CPU modifies a VM statistic,
             this modification is kept in the per-CPU counter. Certain
             activities require a global count, which involves
             requesting each CPU to flush its local counters to the
             global VM counters.

             This flush is implemented via a workqueue item, which
             might schedule a workqueue on isolated CPUs.

             To avoid this interruption, task isolation can be
             configured to, upon return from system calls, synchronize
             the per-CPU counters to global counters, thus avoiding
             the interruption.

   * "I_CFG_INHERIT":

        Set inheritance configuration when a new task is created via
        fork and clone.

        The "(int *)arg4" argument is a pointer to:

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        inherit_mask is a bitmask that specifies which part of task
        isolation should be inherited:

        * Bit ISOL_INHERIT_CONF: Inherit task isolation
          configuration. This is the state written via
          prctl(PR_ISOL_CFG_SET, ...).

        * Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
          (requires ISOL_INHERIT_CONF to be set). The new task should
          behave, after fork/clone, in the same manner as the parent
          task after it executed:

             prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);

**PR_ISOL_ACTIVATE_GET**:

   Retrieve task isolation activation state.

   The general format is:

      prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);

   'pmask' specifies the location of a feature mask, where the current
   active mask will be copied. See PR_ISOL_ACTIVATE_SET for
   description of individual bits.

**PR_ISOL_ACTIVATE_SET**:

   Set task isolation activation state (activates/deactivates task
   isolation).

   The general format is:

      prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);

   The 'pmask' argument specifies the location of an 8 byte mask
   containing which features should be activated. Features whose bits
   are cleared will be deactivated. The possible bits for this mask
   are:

      * "ISOL_F_QUIESCE":

      Activate quiescing of background kernel activities. Quiescing
      happens on return to userspace from this system call, and on
      return from subsequent system calls (unless quiesce_oneshot_mask
      has been set at PR_ISOL_CFG_SET time).

   Quiescing can be adjusted (while active) by
   prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).


Userspace support
*****************

Task isolation is divided in two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose.

This results in three combinations:

1. Both configuration and activation performed by the latency
sensitive application. Allows fine grained control of what task
isolation features are enabled and when (see samples section below).

2. Only activation can be performed by the latency sensitive app (and
configuration performed by chisol). This allows the admin/user to
control task isolation parameters, and applications have to be
modified only once.

3. Configuration and activation performed by an external tool. This
allows unmodified applications to take advantage of task isolation.
Activation is performed by the "-a" option of chisol.


Examples
********

The "samples/task_isolation/" directory contains 3 examples:

* task_isol_userloop.c:

     Example of program with a loop on userspace scenario.

* task_isol_computation.c:

     Example of program that enters task isolated mode, performs an
     amount of computation, exits task isolated mode, and writes the
     computation to disk.

* task_isol_oneshot.c:

     Example of program that enables one-shot mode for quiescing,
     enters a processing loop, then upon an external event performs a
     number of syscalls to handle that event.

This is a snippet of code to activate task isolation if it has been
previously configured (by chisol for example):

   #include <sys/prctl.h>
   #include <linux/types.h>

   #ifdef PR_ISOL_CFG_GET
   unsigned long long fmask;

   ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
   if (ret != -1 && fmask != 0) {
           ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
           if (ret == -1) {
                   perror("prctl PR_ISOL_ACTIVATE_SET");
                   return ret;
           }
   }
   #endif




^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 01/10] add basic task isolation prctl interface
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 02/10] add prctl task isolation prctl docs and samples Marcelo Tosatti
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

Add basic prctl task isolation interface, which allows
informing the kernel that application is executing 
latency sensitive code (where interruptions are undesired).

Interface is described by task_isolation.rst (added by
next patch).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v6:
 - Move oneshot mode enablement to configuration time (Frederic Weisbecker).
 - Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
 - prctl_task_isolation_activate_get should copy active_mask 
   to address in arg2.
v5:
 - No changes
v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and drop
   "internal configuration" prctls            (Frederic Weisbecker).

v3:

 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
-  Remove "system default" notion (chisol should
   make it obsolete).

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).
- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

 include/linux/sched.h          |    5 
 include/linux/task_isolation.h |   91 ++++++++++
 include/uapi/linux/prctl.h     |   43 +++++
 init/init_task.c               |    3 
 kernel/Makefile                |    2 
 kernel/fork.c                  |   22 ++
 kernel/sys.c                   |   16 +
 kernel/task_isolation.c        |  350 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 530 insertions(+), 2 deletions(-)

Index: linux-2.6/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/prctl.h
+++ linux-2.6/include/uapi/linux/prctl.h
@@ -269,4 +269,47 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
 # define PR_SCHED_CORE_MAX		4
 
+#define PR_ISOL_FEAT_GET		63
+#define PR_ISOL_CFG_GET			64
+#define PR_ISOL_CFG_SET			65
+
+/* arg2 to CFG_GET/CFG_SET */
+# define I_CFG_FEAT			1
+# define I_CFG_INHERIT			2
+
+#define PR_ISOL_ACTIVATE_GET		66
+#define PR_ISOL_ACTIVATE_SET		67
+
+# define ISOL_F_QUIESCE			(1UL << 0)
+#  define ISOL_F_QUIESCE_VMSTATS	(1UL << 0)
+
+struct task_isol_quiesce_extensions {
+	__u64 flags;
+	__u64 supported_quiesce_bits;
+	__u64 pad[6];
+};
+
+/*
+ * This structure provides control over
+ * inheritance of task isolation across
+ * clone and fork.
+ */
+struct task_isol_inherit_control {
+	__u8	inherit_mask;
+	__u8	flags;
+	__u8	pad[6];
+};
+
+# define ISOL_INHERIT_CONF		(1UL << 0)
+# define ISOL_INHERIT_ACTIVE		(1UL << 1)
+
+struct task_isol_quiesce_control {
+	__u64 flags;
+	__u64 quiesce_mask;
+	__u64 quiesce_oneshot_mask;
+	__u64 pad[5];
+};
+
+# define QUIESCE_CONTROL		(1UL << 0)
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_CPU_ISOLATION) += task_isolation.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -58,6 +58,7 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
 #include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
@@ -2530,6 +2531,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
 #endif
+	case PR_ISOL_FEAT_GET:
+		error = prctl_task_isolation_feat_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CFG_GET:
+		error = prctl_task_isolation_cfg_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CFG_SET:
+		error = prctl_task_isolation_cfg_set(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_GET:
+		error = prctl_task_isolation_activate_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_SET:
+		error = prctl_task_isolation_activate_set(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct isol_info;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1488,6 +1489,10 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_CPU_ISOLATION
+	struct isol_info		*isol_info;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -214,6 +214,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_CPU_ISOLATION
+	.isol_info	= NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/task_isolation.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -746,6 +747,7 @@ void __put_task_struct(struct task_struc
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	tsk_isol_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk, true);
@@ -1585,6 +1587,15 @@ static int copy_io(unsigned long clone_f
 	return 0;
 }
 
+static int copy_task_isolation(struct task_struct *tsk)
+{
+#ifdef CONFIG_CPU_ISOLATION
+	if (current->isol_info)
+		return __copy_task_isolation(tsk);
+#endif
+	return 0;
+}
+
 static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 {
 	struct sighand_struct *sig;
@@ -2159,7 +2170,9 @@ static __latent_entropy struct task_stru
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
-
+#ifdef CONFIG_CPU_ISOLATION
+	p->isol_info = NULL;
+#endif
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
@@ -2203,6 +2216,9 @@ static __latent_entropy struct task_stru
 	retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
+	retval = copy_task_isolation(p);
+	if (retval)
+		goto bad_fork_cleanup_thread;
 
 	stackleak_task_init(p);
 
@@ -2211,7 +2227,7 @@ static __latent_entropy struct task_stru
 				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
-			goto bad_fork_cleanup_thread;
+			goto bad_fork_cleanup_task_isolation;
 		}
 	}
 
@@ -2429,6 +2445,8 @@ bad_fork_put_pidfd:
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
+bad_fork_cleanup_task_isolation:
+	tsk_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);
 bad_fork_cleanup_io:
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/task_isolation.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_CPU_ISOLATION
+
+struct isol_info {
+	/* Which features have been configured */
+	u64 conf_mask;
+	/* Which features are active */
+	u64 active_mask;
+	/* Quiesce mask */
+	u64 quiesce_mask;
+
+	/* Oneshot mask */
+	u64 oneshot_mask;
+
+	u8 inherit_mask;
+};
+
+extern void __tsk_isol_free(struct task_struct *tsk);
+
+static inline void tsk_isol_free(struct task_struct *tsk)
+{
+	if (tsk->isol_info)
+		__tsk_isol_free(tsk);
+}
+
+int prctl_task_isolation_feat_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_cfg_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_cfg_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_activate_get(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_activate_set(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5);
+
+int __copy_task_isolation(struct task_struct *tsk);
+
+#else
+
+static inline void tsk_isol_free(struct task_struct *tsk)
+{
+}
+
+static inline int prctl_task_isolation_feat_get(unsigned long arg2,
+						unsigned long arg3,
+						unsigned long arg4,
+						unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_cfg_get(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_cfg_set(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_activate_get(unsigned long arg2,
+						    unsigned long arg3,
+						    unsigned long arg4,
+						    unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_activate_set(unsigned long arg2,
+						    unsigned long arg3,
+						    unsigned long arg4,
+						    unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/task_isolation.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@belits.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ *   Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/sched/task.h>
+
+void __tsk_isol_free(struct task_struct *tsk)
+{
+	if (!tsk->isol_info)
+		return;
+	kfree(tsk->isol_info);
+	tsk->isol_info = NULL;
+}
+
+static struct isol_info *tsk_isol_alloc_context(void)
+{
+	struct isol_info *info;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (unlikely(!info))
+		return ERR_PTR(-ENOMEM);
+
+	return info;
+}
+
+int prctl_task_isolation_feat_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	switch (arg2) {
+	case 0: {
+		u64 supported_fmask = ISOL_F_QUIESCE;
+
+		ret = 0;
+		if (copy_to_user(addr, &supported_fmask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_extensions *q_ext;
+
+		q_ext = kzalloc(sizeof(struct task_isol_quiesce_extensions),
+			 GFP_KERNEL);
+		if (!q_ext)
+			return -ENOMEM;
+
+		q_ext->supported_quiesce_bits = ISOL_F_QUIESCE_VMSTATS;
+
+		ret = 0;
+		if (copy_to_user(addr, q_ext, sizeof(*q_ext)))
+			ret = -EFAULT;
+		kfree(q_ext);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_get(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	struct task_isol_inherit_control *i_ctrl;
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	if (!current->isol_info)
+		return -EINVAL;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	i_ctrl->inherit_mask = current->isol_info->inherit_mask;
+
+	ret = 0;
+	if (copy_to_user(addr, i_ctrl, sizeof(*i_ctrl)))
+		ret = -EFAULT;
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_get(unsigned long arg3, unsigned long arg4,
+			unsigned long arg5)
+{
+	int ret = 0;
+
+	switch (arg3) {
+	case 0: {
+		void __user *addr = (void __user *)arg4;
+		u64 cfg_mask = 0;
+
+		if (current->isol_info)
+			cfg_mask = current->isol_info->conf_mask;
+
+		if (copy_to_user(addr, &cfg_mask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_control *i_qctrl;
+		void __user *addr = (void __user *)arg5;
+
+		if (arg4 != QUIESCE_CONTROL)
+			return -EINVAL;
+
+		i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+				  GFP_KERNEL);
+		if (!i_qctrl)
+			return -ENOMEM;
+
+		if (current->isol_info)
+			i_qctrl->quiesce_mask = current->isol_info->quiesce_mask;
+
+		if (copy_to_user(addr, i_qctrl, sizeof(*i_qctrl)))
+			ret = -EFAULT;
+
+		kfree(i_qctrl);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int prctl_task_isolation_cfg_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		return cfg_feat_get(arg3, arg4, arg5);
+	case I_CFG_INHERIT:
+		return cfg_inherit_get(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_set(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	int ret = 0;
+	struct task_isol_inherit_control *i_ctrl;
+	const void __user *addr = (const void __user *)arg3;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_ctrl, addr, sizeof(*i_ctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_ctrl->inherit_mask & ~(ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE))
+		goto out_free;
+
+	if (i_ctrl->inherit_mask & ISOL_INHERIT_ACTIVE)
+		if (!(i_ctrl->inherit_mask & ISOL_INHERIT_CONF))
+			goto out_free;
+
+	if (!current->isol_info) {
+		struct isol_info *isol_info;
+
+		isol_info = tsk_isol_alloc_context();
+		if (IS_ERR(isol_info)) {
+			ret = PTR_ERR(isol_info);
+			goto out_free;
+		}
+		current->isol_info = isol_info;
+	}
+
+	ret = 0;
+	current->isol_info->inherit_mask = i_ctrl->inherit_mask;
+
+out_free:
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_quiesce_set(unsigned long arg4, unsigned long arg5)
+{
+	struct isol_info *isol_info;
+	struct task_isol_quiesce_control *i_qctrl;
+	int ret = 0;
+	const void __user *addr = (const void __user *)arg5;
+
+	if (arg4 != QUIESCE_CONTROL)
+		return -EINVAL;
+
+	i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+			 GFP_KERNEL);
+	if (!i_qctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_qctrl, addr, sizeof(*i_qctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_qctrl->flags != 0)
+		goto out_free;
+
+	if (i_qctrl->quiesce_mask != ISOL_F_QUIESCE_VMSTATS &&
+	    i_qctrl->quiesce_mask != 0)
+		goto out_free;
+
+	if ((~i_qctrl->quiesce_mask & i_qctrl->quiesce_oneshot_mask) != 0)
+		goto out_free;
+
+	/* current->isol_info is only allocated/freed from task
+	 * context.
+	 */
+	if (!current->isol_info) {
+		isol_info = tsk_isol_alloc_context();
+		if (IS_ERR(isol_info)) {
+			ret = PTR_ERR(isol_info);
+			goto out_free;
+		}
+		current->isol_info = isol_info;
+	}
+
+	isol_info = current->isol_info;
+
+	isol_info->quiesce_mask = i_qctrl->quiesce_mask;
+	isol_info->oneshot_mask = i_qctrl->quiesce_oneshot_mask;
+	isol_info->conf_mask |= ISOL_F_QUIESCE;
+	ret = 0;
+
+out_free:
+	kfree(i_qctrl);
+
+	return ret;
+}
+
+int prctl_task_isolation_cfg_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		switch (arg3) {
+		case ISOL_F_QUIESCE:
+			return cfg_feat_quiesce_set(arg4, arg5);
+		default:
+			break;
+		}
+		break;
+	case I_CFG_INHERIT:
+		return cfg_inherit_set(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int __copy_task_isolation(struct task_struct *tsk)
+{
+	struct isol_info *info, *new_info;
+
+	info = current->isol_info;
+	if (!(info->inherit_mask & (ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE)))
+		return 0;
+
+	new_info = tsk_isol_alloc_context();
+	if (IS_ERR(new_info))
+		return PTR_ERR(new_info);
+
+	new_info->inherit_mask = info->inherit_mask;
+
+	if (info->inherit_mask & ISOL_INHERIT_CONF) {
+		new_info->quiesce_mask = info->quiesce_mask;
+		new_info->conf_mask = info->conf_mask;
+	}
+
+	if (info->inherit_mask & ISOL_INHERIT_ACTIVE)
+		new_info->active_mask = info->active_mask;
+
+	tsk->isol_info = new_info;
+
+	return 0;
+}
+
+int prctl_task_isolation_activate_set(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	struct isol_info *isol_info;
+	u64 active_mask;
+	const void __user *addr_mask = (const void __user *)arg2;
+
+	ret = -EFAULT;
+	if (copy_from_user(&active_mask, addr_mask, sizeof(u64)))
+		goto out;
+
+	ret = -EINVAL;
+	if (active_mask != ISOL_F_QUIESCE && active_mask != 0)
+		return ret;
+
+	isol_info = current->isol_info;
+	if (!isol_info)
+		return ret;
+
+	isol_info->active_mask = active_mask;
+	ret = 0;
+
+out:
+	return ret;
+}
+
+int prctl_task_isolation_activate_get(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	struct isol_info *isol_info;
+	void __user *addr_mask = (void __user *)arg2;
+
+	isol_info = current->isol_info;
+	if (!isol_info)
+		return -EINVAL;
+
+	if (copy_to_user(addr_mask, &isol_info->active_mask, sizeof(u64)))
+		return -EFAULT;
+
+	return 0;
+}



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 01/10] add basic task isolation prctl interface Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-23 12:36   ` Frederic Weisbecker
  2021-11-23 14:37   ` Frederic Weisbecker
  2021-11-12 12:35 ` [patch v7 03/10] task isolation: sync vmstats on return to userspace Marcelo Tosatti
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

Add documentation and userspace sample code for prctl
task isolation interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v6:
 - Update docs and samples regarding oneshot mode (Frederic Weisbecker).
 - Update docs and samples regarding more extensibility of
   CFG_SET of ISOL_F_QUIESCE                   (Frederic Weisbecker).

v5:
 - Fix documentation typos		      (Frederic Weisbecker).
 - Fix oneshot example comment

v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and use drop
   "internal configuration" prctls            (Frederic Weisbecker).
 - Add summary of terms to documentation      (Frederic Weisbecker).
 - Examples for compute and one-shot modes    (Thomas G/Christoph L).

v3:
 - Split in smaller patches              (Nitesh Lal).
 - Misc cleanups                         (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications
   can activate externally configured
   task isolation parameters.
-  Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).
- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

 Documentation/userspace-api/task_isolation.rst |  370 +++++++++++++++++++++++++
 samples/Kconfig                                |    7 
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |   11 
 samples/task_isolation/task_isol.c             |   92 ++++++
 samples/task_isolation/task_isol.h             |    9 
 samples/task_isolation/task_isol_computation.c |   89 ++++++
 samples/task_isolation/task_isol_oneshot.c     |  104 +++++++
 samples/task_isolation/task_isol_userloop.c    |   54 +++
 9 files changed, 737 insertions(+)

Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,370 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Task isolation prctl interface
+===============================
+
+Certain types of applications benefit from running uninterrupted by
+background OS activities. Realtime systems and high-bandwidth networking
+applications with user-space drivers can fall into the category.
+
+To create an OS noise free environment for the application, this
+interface allows userspace to inform the kernel the start and
+end of the latency sensitive application section (with configurable
+system behaviour for that section).
+
+Note: the prctl interface is independent of nohz_full=.
+
+The prctl options are:
+
+
+        - PR_ISOL_FEAT_GET: Retrieve supported features.
+        - PR_ISOL_CFG_GET: Retrieve task isolation configuration.
+        - PR_ISOL_CFG_SET: Set task isolation configuration.
+        - PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
+        - PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
+
+Summary of terms:
+
+
+- feature:
+
+        A distinct attribute or aspect of task isolation. Examples of
+        features could be logging, new operating modes (eg: syscalls disallowed),
+        userspace notifications, etc. The only feature currently available is quiescing.
+
+- configuration:
+
+        A specific choice from a given set
+        of possible choices that dictate how the particular feature
+        in question should behave.
+
+- activation state:
+
+        The activation state (whether active/inactive) of the task
+        isolation features (features must be configured before
+        being activated).
+
+Inheritance of the isolation parameters and state, across
+fork(2) and clone(2), can be changed via
+PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
+
+
+At a high-level, task isolation is divided in two steps:
+
+1. Configuration.
+2. Activation.
+
+Section "Userspace support" describes how to use
+task isolation.
+
+In terms of the interface, the sequence of steps to activate
+task isolation are:
+
+1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
+2. Configure task isolation features (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
+3. Activate or deactivate task isolation features (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
+
+This interface is based on ideas and code from the
+task isolation patchset from Alex Belits:
+https://lwn.net/Articles/816298/
+
+--------------------
+Feature description
+--------------------
+
+        - ``ISOL_F_QUIESCE``
+
+        This feature allows quiescing selected kernel activities on
+        return from system calls.
+
+---------------------
+Interface description
+---------------------
+
+**PR_ISOL_FEAT**:
+
+        Returns the supported features and feature
+        capabilities, as a bitmask::
+
+                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies whether to return
+        supported features (if zero), or feature capabilities
+        (if not zero). Possible values for 'feat' are:
+
+
+        - ``0``:
+               Return the bitmask of supported features, in the location
+               pointed  to  by  ``(int *)arg3``. The buffer should allow space
+               for 8 bytes.
+
+        - ``ISOL_F_QUIESCE``:
+
+               Return a structure containing which kernel
+               activities are supported for quiescing, in the location
+               pointed to by ``(int *)arg3``::
+
+                        struct task_isol_quiesce_extensions {
+                                __u64 flags;
+                                __u64 supported_quiesce_bits;
+                                __u64 pad[6];
+                        };
+
+               Where:
+
+               *flags*: Additional flags (should be zero).
+
+               *supported_quiesce_bits*: Bitmask indicating
+                which features are supported for quiescing.
+
+               *pad*: Additional space for future enhancements.
+
+
+        Features and its capabilities are defined at
+        include/uapi/linux/task_isolation.h.
+
+**PR_ISOL_CFG_GET**:
+
+        Retrieve task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Return configuration of task isolation features. The 'arg3' argument specifies
+                whether to return configured features (if zero), or individual
+                feature configuration (if not zero), as follows.
+
+                - ``0``:
+
+                        Return the bitmask of configured features, in the location
+                        pointed  to  by  ``(int *)arg4``. The buffer should allow space
+                        for 8 bytes.
+
+                - ``ISOL_F_QUIESCE``:
+
+                        If arg4 is QUIESCE_CONTROL, return the control structure for
+                        quiescing of background kernel activities, in the location
+                        pointed to by ``(int *)arg5``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 quiesce_oneshot_mask;
+                                __u64 pad[5];
+                         };
+
+                        See PR_ISOL_CFG_GET description for meaning of fields.
+
+        - ``I_CFG_INHERIT``:
+
+                Retrieve inheritance configuration across fork/clone.
+
+                Return the structure which configures inheritance
+                across fork/clone, in the location pointed to
+                by ``(int *)arg4``::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                See PR_ISOL_CFG_SET description for meaning of fields.
+
+**PR_ISOL_CFG_SET**:
+
+        Set task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Set configuration of task isolation features. 'arg3' specifies
+                the feature. Possible values are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                        If arg4 is QUIESCE_CONTROL, set the control structure
+                        for quiescing of background kernel activities, from
+                        the location pointed to by ``(int *)arg5``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 quiesce_oneshot_mask;
+                                __u64 pad[5];
+                         };
+
+                        Where:
+
+                        *flags*: Additional flags (should be zero).
+
+                        *quiesce_mask*: A bitmask containing which kernel
+                        activities to quiesce.
+
+                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
+                        activities should behave in oneshot mode, that is, quiescing
+                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
+                        on return of subsequent system calls. The corresponding bit(s)
+                        must also be set at quiesce_mask.
+
+                        *pad*: Additional space for future enhancements.
+
+                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
+
+                        - ``ISOL_F_QUIESCE_VMSTATS``
+
+                        VM statistics are maintained in per-CPU counters to
+                        improve performance. When a CPU modifies a VM statistic,
+                        this modification is kept in the per-CPU counter.
+                        Certain activities require a global count, which
+                        involves requesting each CPU to flush its local counters
+                        to the global VM counters.
+
+                        This flush is implemented via a workqueue item, which
+                        might schedule a workqueue on isolated CPUs.
+
+                        To avoid this interruption, task isolation can be
+                        configured to, upon return from system calls, synchronize
+                        the per-CPU counters to global counters, thus avoiding
+                        the interruption.
+
+        - ``I_CFG_INHERIT``:
+                Set inheritance configuration when a new task
+                is created via fork and clone.
+
+                The ``(int *)arg4`` argument is a pointer to::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                inherit_mask is a bitmask that specifies which part
+                of task isolation should be inherited:
+
+                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
+                  This is the state written via prctl(PR_ISOL_CFG_SET, ...).
+
+                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
+                  (requires ISOL_INHERIT_CONF to be set). The new task
+                  should behave, after fork/clone, in the same manner
+                  as the parent task after it executed:
+
+                        prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);
+
+**PR_ISOL_ACTIVATE_GET**:
+
+        Retrieve task isolation activation state.
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);
+
+        'pmask' specifies the location of a feature mask, where
+        the current active mask will be copied. See PR_ISOL_ACTIVATE_SET
+        for description of individual bits.
+
+
+**PR_ISOL_ACTIVATE_SET**:
+
+        Set task isolation activation state (activates/deactivates
+        task isolation).
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);
+
+
+        The 'pmask' argument specifies the location of an 8 byte mask
+        containing which features should be activated. Features whose
+        bits are cleared will be deactivated. The possible
+        bits for this mask are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                Activate quiescing of background kernel activities.
+                Quiescing happens on return to userspace from this
+                system call, and on return from subsequent
+                system calls (unless quiesce_oneshot_mask has been set at
+                PR_ISOL_CFG_SET time).
+
+        Quiescing can be adjusted (while active) by
+        prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).
+
+
+==================
+Userspace support
+==================
+
+Task isolation is divided in two main steps: configuration and activation.
+
+Each step can be performed by an external tool or the latency sensitive
+application itself. util-linux contains the "chisol" tool for this
+purpose.
+
+This results in three combinations:
+
+1. Both configuration and activation performed by the
+latency sensitive application.
+Allows fine grained control of what task isolation
+features are enabled and when (see samples section below).
+
+2. Only activation can be performed by the latency sensitive app
+(and configuration performed by chisol).
+This allows the admin/user to control task isolation parameters,
+and applications have to be modified only once.
+
+3. Configuration and activation performed by an external tool.
+This allows unmodified applications to take advantage of
+task isolation. Activation is performed by the "-a" option
+of chisol.
+
+========
+Examples
+========
+
+The ``samples/task_isolation/`` directory contains 3 examples:
+
+* task_isol_userloop.c:
+
+        Example of program with a loop on userspace scenario.
+
+* task_isol_computation.c:
+
+        Example of program that enters task isolated mode,
+        performs an amount of computation, exits task
+        isolated mode, and writes the computation to disk.
+
+* task_isol_oneshot.c:
+
+        Example of program that enables one-shot
+        mode for quiescing, enters a processing loop, then upon an external
+        event performs a number of syscalls to handle that event.
+
+This is a snippet of code to activate task isolation if
+it has been previously configured (by chisol for example)::
+
+        #include <sys/prctl.h>
+        #include <linux/types.h>
+
+        #ifdef PR_ISOL_CFG_GET
+        unsigned long long fmask;
+
+        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+        if (ret != -1 && fmask != 0) {
+                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
+                if (ret == -1) {
+                        perror("prctl PR_ISOL_ACTIVATE_SET");
+                        return ret;
+                }
+        }
+        #endif
+
Index: linux-2.6/samples/task_isolation/task_isol.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include <errno.h>
+#include "task_isol.h"
+
+#ifdef PR_ISOL_FEAT_GET
+int task_isol_setup(int oneshot)
+{
+	int ret;
+	int errnosv;
+	unsigned long long fmask;
+	struct task_isol_quiesce_extensions qext;
+	struct task_isol_quiesce_control qctrl;
+
+	/* Retrieve supported task isolation features */
+	ret = prctl(PR_ISOL_FEAT_GET, 0, &fmask, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT");
+		return ret;
+	}
+	printf("supported features bitmask: 0x%llx\n", fmask);
+
+	/* Retrieve supported ISOL_F_QUIESCE bits */
+	ret = prctl(PR_ISOL_FEAT_GET, ISOL_F_QUIESCE, &qext, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
+		return ret;
+	}
+	printf("supported ISOL_F_QUIESCE bits: 0x%llx\n",
+		qext.supported_quiesce_bits);
+
+	fmask = 0;
+	ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+	errnosv = errno;
+	if (ret != -1 && fmask != 0) {
+		printf("Task isolation parameters already configured!\n");
+		return ret;
+	}
+	if (ret == -1 && errnosv != ENODATA) {
+		perror("prctl PR_ISOL_GET");
+		return ret;
+	}
+	memset(&qctrl, 0, sizeof(struct task_isol_quiesce_control));
+	qctrl.quiesce_mask = ISOL_F_QUIESCE_VMSTATS;
+	if (oneshot)
+		qctrl.quiesce_oneshot_mask = ISOL_F_QUIESCE_VMSTATS;
+
+	ret = prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,
+		    QUIESCE_CONTROL, &qctrl);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CFG_SET");
+		return ret;
+	}
+	return ISOL_F_QUIESCE;
+}
+
+int task_isol_activate_set(unsigned long long mask)
+{
+	int ret;
+
+	ret = prctl(PR_ISOL_ACTIVATE_SET, &mask, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_ACTIVATE_SET");
+		return -1;
+	}
+
+	return 0;
+}
+
+#else
+
+int task_isol_setup(void)
+{
+	return 0;
+}
+
+int task_isol_activate_set(unsigned long long mask)
+{
+	return 0;
+}
+#endif
+
+
Index: linux-2.6/samples/task_isolation/task_isol.h
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TASK_ISOL_H
+#define __TASK_ISOL_H
+
+int task_isol_setup(int oneshot);
+
+int task_isol_activate_set(unsigned long long mask);
+
+#endif /* __TASK_ISOL_H */
Index: linux-2.6/samples/task_isolation/task_isol_userloop.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_userloop.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(0);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+	/* enable quiescing on system call return, oneshot */
+	ret = task_isol_activate_set(mask);
+	if (ret)
+		return EXIT_FAILURE;
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+	/* busy loop */
+	while (ret < NR_LOOPS)  {
+		memset(buf, 0, 4096);
+		ret = ret+1;
+		if (!(ret % NR_PRINT))
+			printf("loops=%d of %d\n", ret, NR_LOOPS);
+	}
+
+
+	ret = task_isol_activate_set(mask & ~ISOL_F_QUIESCE);
+	if (ret)
+		return EXIT_FAILURE;
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/Kconfig
===================================================================
--- linux-2.6.orig/samples/Kconfig
+++ linux-2.6/samples/Kconfig
@@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
 	  Build example userspace program to use the new mount_notify(),
 	  sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
 
+config SAMPLE_TASK_ISOLATION
+	bool "task isolation sample"
+	depends on CC_CAN_LINK && HEADERS_INSTALL
+	help
+	  Build example userspace program to use prctl task isolation
+	  interface.
+
 endif # SAMPLES
Index: linux-2.6/samples/Makefile
===================================================================
--- linux-2.6.orig/samples/Makefile
+++ linux-2.6/samples/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
 subdir-$(CONFIG_SAMPLE_WATCHDOG)	+= watchdog
 subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION)	+= task_isolation
Index: linux-2.6/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+userprogs-always-y += task_isol_userloop task_isol_computation task_isol_oneshot
+task_isol_userloop-objs := task_isol.o task_isol_userloop.o
+task_isol_computation-objs := task_isol.o task_isol_computation.o
+task_isol_oneshot-objs := task_isol.o task_isol_oneshot.o
+
+userccflags += -I usr/include
+
+
+#$(CC) $^ -o $@
Index: linux-2.6/samples/task_isolation/task_isol_computation.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_computation.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface with a loop:
+ *
+ *	do {
+ *		enable quiescing of kernel activities
+ *		perform computation
+ *		disable quiescing of kernel activities
+ *		write computation results to disk
+ *	} while (condition);
+ *
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd, write_loops;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	fd = open("/tmp/comp_output.data", O_RDWR|O_CREAT);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(0);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+	write_loops = 0;
+	do {
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+		/* enable quiescing on system call return */
+		ret = task_isol_activate_set(mask);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/* busy loop */
+		while (ret < NR_LOOPS)  {
+			memset(buf, 0xf, 4096);
+			ret = ret+1;
+			if (!(ret % NR_PRINT))
+				printf("wloop=%d loops=%d of %d\n", write_loops,
+					ret, NR_LOOPS);
+		}
+		/* disable quiescing on system call return */
+		ret = task_isol_activate_set(mask & ~ISOL_F_QUIESCE);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/*
+		 * write computed data to disk, this would be
+		 * multiple writes on a real application, so
+		 * disabling quiescing is advantageous
+		 */
+		ret = write(fd, buf, 4096);
+		if (ret == -1) {
+			perror("write");
+			return EXIT_FAILURE;
+		}
+
+		write_loops += 1;
+	} while (write_loops < 5);
+
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/task_isolation/task_isol_oneshot.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_oneshot.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface using
+ * oneshot mode for quiescing.
+ *
+ *
+ *      enable oneshot quiescing of kernel activities
+ *	do {
+ *		process data (no system calls)
+ *		if (event) {
+ *			process event with syscalls
+ *			enable oneshot quiescing of kernel activities
+ *		}
+ *	} while (!exit_condition);
+ *
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd, cnt;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	fd = open("/dev/zero", O_RDONLY);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup(1);
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+
+	/* enable quiescing on system call return, oneshot */
+	ret = task_isol_activate_set(mask);
+	if (ret)
+		return EXIT_FAILURE;
+	/* busy loop */
+	cnt = 0;
+	while (cnt < NR_LOOPS)  {
+		memset(buf, 0xf, 4096);
+		cnt = cnt+1;
+		if (!(cnt % NR_PRINT)) {
+			int i, r;
+
+			/* this could be considered handling an external
+			 * event: with one-shot mode, system calls
+			 * after prctl(PR_SET_ACTIVATE) will not incur
+			 * the penalty of quiescing
+			 */
+			printf("loops=%d of %d\n", cnt, NR_LOOPS);
+			for (i = 0; i < 100; i++) {
+				r = read(fd, buf, 4096);
+				if (r == -1) {
+					perror("read");
+					return EXIT_FAILURE;
+				}
+			}
+
+			ret = munlock(buf, 4096);
+			if (ret) {
+				perror("munlock");
+				return EXIT_FAILURE;
+			}
+
+			ret = mlock(buf, 4096);
+			if (ret) {
+				perror("mlock");
+				return EXIT_FAILURE;
+			}
+
+			/* enable quiescing on system call return */
+			ret = task_isol_activate_set(mask);
+			if (ret)
+				return EXIT_FAILURE;
+		}
+	}
+
+	return EXIT_SUCCESS;
+}
+



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 03/10] task isolation: sync vmstats on return to userspace
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 01/10] add basic task isolation prctl interface Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 02/10] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 04/10] procfs: add per-pid task isolation state Marcelo Tosatti
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, use the task isolation prctl interface to quiesce 
deferred actions when returning to userspace.

Keep task_isol_has_work returning 0 until all elements
are in place.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
v6: modify exit_to_user_mode_loop to cover exceptions and interrupts
v5: no changes
v4: add oneshot mode support

 include/linux/task_isolation.h |   16 ++++++++++++++++
 include/linux/vmstat.h         |    8 ++++++++
 kernel/entry/common.c          |   15 +++++++++++----
 kernel/task_isolation.c        |   21 +++++++++++++++++++++
 mm/vmstat.c                    |   21 +++++++++++++++++++++
 5 files changed, 77 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -40,8 +40,19 @@ int prctl_task_isolation_activate_set(un
 
 int __copy_task_isolation(struct task_struct *tsk);
 
+void isolation_exit_to_user_mode(void);
+
+static inline int task_isol_has_work(void)
+{
+	return 0;
+}
+
 #else
 
+static void isolation_exit_to_user_mode(void)
+{
+}
+
 static inline void tsk_isol_free(struct task_struct *tsk)
 {
 }
@@ -86,6 +97,11 @@ static inline int prctl_task_isolation_a
 	return -EOPNOTSUPP;
 }
 
+static inline int task_isol_has_work(void)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CPU_ISOLATION */
 
 #endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+#ifdef CONFIG_SMP
+void sync_vmstat(void);
+#else
+static inline void sync_vmstat(void)
+{
+}
+#endif
+
 struct reclaim_stat {
 	unsigned nr_dirty;
 	unsigned nr_unqueued_dirty;
Index: linux-2.6/kernel/entry/common.c
===================================================================
--- linux-2.6.orig/kernel/entry/common.c
+++ linux-2.6/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/task_isolation.h>
 
 #include "common.h"
 
@@ -149,13 +150,14 @@ static void handle_signal_work(struct pt
 }
 
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-					    unsigned long ti_work)
+					    unsigned long ti_work,
+					    unsigned long tsk_isol_work)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+	while ((ti_work & EXIT_TO_USER_MODE_WORK) || tsk_isol_work) {
 
 		local_irq_enable_exit_to_user(ti_work);
 
@@ -177,6 +179,9 @@ static unsigned long exit_to_user_mode_l
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
+		if (tsk_isol_work)
+			isolation_exit_to_user_mode();
+
 		/*
 		 * Disable interrupts and reevaluate the work flags as they
 		 * might have changed while interrupts and preemption was
@@ -188,6 +193,7 @@ static unsigned long exit_to_user_mode_l
 		tick_nohz_user_enter_prepare();
 
 		ti_work = READ_ONCE(current_thread_info()->flags);
+		tsk_isol_work = task_isol_has_work();
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
@@ -197,14 +203,15 @@ static unsigned long exit_to_user_mode_l
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long tsk_isol_work = task_isol_has_work();
 
 	lockdep_assert_irqs_disabled();
 
 	/* Flush pending rcuog wakeup before the last need_resched() check */
 	tick_nohz_user_enter_prepare();
 
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+	if (unlikely((ti_work & EXIT_TO_USER_MODE_WORK) || tsk_isol_work))
+		ti_work = exit_to_user_mode_loop(regs, ti_work, tsk_isol_work);
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -18,6 +18,8 @@
 #include <linux/sysfs.h>
 #include <linux/init.h>
 #include <linux/sched/task.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
 
 void __tsk_isol_free(struct task_struct *tsk)
 {
@@ -348,3 +350,22 @@ int prctl_task_isolation_activate_get(un
 
 	return 0;
 }
+
+void isolation_exit_to_user_mode(void)
+{
+	struct isol_info *i;
+
+	i = current->isol_info;
+	if (!i)
+		return;
+
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return;
+
+	if (i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS) {
+		sync_vmstat();
+		if (i->oneshot_mask & ISOL_F_QUIESCE_VMSTATS)
+			i->active_mask &= ~ISOL_F_QUIESCE_VMSTATS;
+	}
+}
+EXPORT_SYMBOL_GPL(isolation_exit_to_user_mode);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -2003,6 +2003,27 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+void sync_vmstat(void)
+{
+	int cpu;
+
+	cpu = get_cpu();
+
+	refresh_cpu_vm_stats(false);
+	put_cpu();
+
+	/*
+	 * If task is migrated to another CPU between put_cpu
+	 * and cancel_delayed_work_sync, the code below might
+	 * cancel vmstat_update work for a different cpu
+	 * (than the one from which the vmstats were flushed).
+	 *
+	 * However, vmstat shepherd will re-enable it later,
+	 * so its harmless.
+	 */
+	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 04/10] procfs: add per-pid task isolation state
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2021-11-12 12:35 ` [patch v7 03/10] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 05/10] task isolation: add hook to task exit Marcelo Tosatti
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

Add /proc/pid/task_isolation file, to query the state of
task isolation configuration.

---
 fs/proc/base.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -96,6 +96,8 @@
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
 #include <linux/cn_proc.h>
+#include <linux/prctl.h>
+#include <linux/task_isolation.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -662,6 +664,69 @@ static int proc_pid_syscall(struct seq_f
 }
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
+#ifdef CONFIG_CPU_ISOLATION
+
+struct qoptions {
+	unsigned long mask;
+	char *name;
+};
+
+static struct qoptions iopts[] = {
+	{ISOL_F_QUIESCE, "quiesce"},
+};
+#define ILEN (sizeof(iopts) / sizeof(struct qoptions))
+
+static struct qoptions qopts[] = {
+	{ISOL_F_QUIESCE_VMSTATS, "vmstat_sync"},
+};
+#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
+
+static void show_isolation_state(struct seq_file *m,
+				 struct qoptions *iopt,
+				 int mask,
+				 const char *hdr)
+{
+	int i;
+
+	seq_printf(m, hdr);
+	for (i = 0; i < ILEN; i++) {
+		if (mask & iopt->mask)
+			seq_printf(m, "%s ", iopt->name);
+		iopt++;
+	}
+	if (mask == 0)
+		seq_printf(m, "none ");
+	seq_printf(m, "\n");
+}
+
+int proc_pid_task_isolation(struct seq_file *m, struct pid_namespace *ns,
+			    struct pid *pid, struct task_struct *t)
+{
+	int active_mask, quiesce_mask, conf_mask;
+	struct isol_info *isol_info;
+	struct inode *inode = m->private;
+	struct task_struct *task = get_proc_task(inode);
+
+	isol_info = t->isol_info;
+	if (!isol_info)
+		active_mask = quiesce_mask = conf_mask = 0;
+	else {
+		active_mask = isol_info->active_mask;
+		quiesce_mask = isol_info->quiesce_mask;
+		conf_mask = isol_info->conf_mask;
+	}
+
+	show_isolation_state(m, iopts, conf_mask, "Configured state: ");
+	show_isolation_state(m, iopts, active_mask, "Active state: ");
+	show_isolation_state(m, qopts, quiesce_mask, "Quiescing: ");
+
+	put_task_struct(task);
+
+	return 0;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
 /************************************************************************/
 /*                       Here the fs part begins                        */
 /************************************************************************/
@@ -3281,6 +3346,9 @@ static const struct pid_entry tgid_base_
 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
 	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
 #endif
+#ifdef CONFIG_CPU_ISOLATION
+	ONE("task_isolation", S_IRUSR, proc_pid_task_isolation),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 05/10] task isolation: add hook to task exit
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2021-11-12 12:35 ` [patch v7 04/10] procfs: add per-pid task isolation state Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 06/10] task isolation: sync vmstats conditional on changes Marcelo Tosatti
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

Add task isolation specific hook to task exit routines.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/task_isolation.h |   11 +++++++++++
 kernel/exit.c                  |    2 ++
 kernel/fork.c                  |    1 +
 kernel/task_isolation.c        |    4 ++++
 4 files changed, 18 insertions(+)

Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/task_isolation.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -778,6 +779,7 @@ void __noreturn do_exit(long code)
 	}
 
 	io_uring_files_cancel();
+	tsk_isol_exit(tsk);
 	exit_signals(tsk);  /* sets PF_EXITING */
 
 	/* sync mm's RSS info before statistics gathering */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -2446,6 +2446,7 @@ bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
 bad_fork_cleanup_task_isolation:
+	tsk_isol_exit(p);
 	tsk_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -21,6 +21,10 @@
 #include <linux/mm.h>
 #include <linux/vmstat.h>
 
+void __tsk_isol_exit(struct task_struct *tsk)
+{
+}
+
 void __tsk_isol_free(struct task_struct *tsk)
 {
 	if (!tsk->isol_info)
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -27,6 +27,13 @@ static inline void tsk_isol_free(struct
 		__tsk_isol_free(tsk);
 }
 
+void __tsk_isol_exit(struct task_struct *tsk);
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+	if (tsk->isol_info)
+		__tsk_isol_exit(tsk);
+}
+
 int prctl_task_isolation_feat_get(unsigned long arg2, unsigned long arg3,
 				  unsigned long arg4, unsigned long arg5);
 int prctl_task_isolation_cfg_get(unsigned long arg2, unsigned long arg3,
@@ -57,6 +64,10 @@ static inline void tsk_isol_free(struct
 {
 }
 
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+}
+
 static inline int prctl_task_isolation_feat_get(unsigned long arg2,
 						unsigned long arg3,
 						unsigned long arg4,



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 06/10] task isolation: sync vmstats conditional on changes
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2021-11-12 12:35 ` [patch v7 05/10] task isolation: add hook to task exit Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 07/10] task isolation: enable return to userspace processing Marcelo Tosatti
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

Rather than syncing VM-stats on every return to userspace
(or VM-entry), keep track of changes through a per-CPU bool.

This improves performance when enabling task isolated
for vcpu VMs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/vmstat.h |   13 ++++++++++++-
 mm/vmstat.c            |   29 ++++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -22,7 +22,18 @@ int sysctl_vm_numa_stat_handler(struct c
 #endif
 
 #ifdef CONFIG_SMP
-void sync_vmstat(void);
+DECLARE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+extern struct static_key vmstat_sync_enabled;
+
+void __sync_vmstat(void);
+static inline void sync_vmstat(void)
+{
+	if (static_key_false(&vmstat_sync_enabled))
+		__sync_vmstat();
+}
+
+void init_sync_vmstat(void);
 #else
 static inline void sync_vmstat(void)
 {
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -306,6 +306,24 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+struct static_key vmstat_sync_enabled;
+DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void mark_vmstat_dirty(void)
+{
+	if (!static_key_false(&vmstat_sync_enabled))
+		return;
+
+	raw_cpu_write(vmstat_dirty, true);
+}
+
+void init_sync_vmstat(void)
+{
+	raw_cpu_write(vmstat_dirty, true);
+}
+
+EXPORT_SYMBOL_GPL(vmstat_dirty);
+
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -338,6 +356,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -376,6 +395,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -574,6 +594,7 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	mark_vmstat_dirty();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -642,6 +663,7 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	mark_vmstat_dirty();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -1082,6 +1104,7 @@ static void fill_contig_page_info(struct
 			info->free_blocks_suitable += blocks <<
 						(order - suitable_order);
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -1434,6 +1457,7 @@ static void walk_zones_in_node(struct se
 		if (!nolock)
 			spin_unlock_irqrestore(&zone->lock, flags);
 	}
+	mark_vmstat_dirty();
 }
 #endif
 
@@ -1499,6 +1523,7 @@ static void pagetypeinfo_showfree_print(
 		}
 		seq_putc(m, '\n');
 	}
+	mark_vmstat_dirty();
 }
 
 /* Print out the free pages at each order for each migatetype */
@@ -1917,6 +1942,7 @@ static void vmstat_update(struct work_st
 				this_cpu_ptr(&vmstat_work),
 				round_jiffies_relative(sysctl_stat_interval));
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -2003,13 +2029,14 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
-void sync_vmstat(void)
+void __sync_vmstat(void)
 {
 	int cpu;
 
 	cpu = get_cpu();
 
 	refresh_cpu_vm_stats(false);
+	raw_cpu_write(vmstat_dirty, false);
 	put_cpu();
 
 	/*



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 07/10] task isolation: enable return to userspace processing
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2021-11-12 12:35 ` [patch v7 06/10] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 08/10] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

Enable processing of pending task isolation work if per-CPU vmstats
are out of sync with global vmstats.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/task_isolation.h |   22 +++++++++++++++++++++-
 kernel/task_isolation.c        |   13 +++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -5,6 +5,9 @@
 
 #ifdef CONFIG_CPU_ISOLATION
 
+#include <linux/vmstat.h>
+#include <uapi/linux/prctl.h>
+
 struct isol_info {
 	/* Which features have been configured */
 	u64 conf_mask;
@@ -51,7 +54,24 @@ void isolation_exit_to_user_mode(void);
 
 static inline int task_isol_has_work(void)
 {
-	return 0;
+	int cpu, ret;
+	struct isol_info *i;
+
+	if (likely(current->isol_info == NULL))
+		return 0;
+
+	i = current->isol_info;
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return 0;
+
+	if (!(i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS))
+		return 0;
+
+	cpu = get_cpu();
+	ret = per_cpu(vmstat_dirty, cpu);
+	put_cpu();
+
+	return ret;
 }
 
 #else
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -23,6 +23,13 @@
 
 void __tsk_isol_exit(struct task_struct *tsk)
 {
+	struct isol_info *i;
+
+	i = tsk->isol_info;
+	if (!i)
+		return;
+
+	static_key_slow_dec(&vmstat_sync_enabled);
 }
 
 void __tsk_isol_free(struct task_struct *tsk)
@@ -41,6 +48,12 @@ static struct isol_info *tsk_isol_alloc_
 	if (unlikely(!info))
 		return ERR_PTR(-ENOMEM);
 
+	preempt_disable();
+	init_sync_vmstat();
+	preempt_enable();
+
+	static_key_slow_inc(&vmstat_sync_enabled);
+
 	return info;
 }
 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 08/10] KVM: x86: process isolation work from VM-entry code path
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2021-11-12 12:35 ` [patch v7 07/10] task isolation: enable return to userspace processing Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 09/10] mm: vmstat: move need_update Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 10/10] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

VM-entry code path is an entry point similar to userspace return
when task isolation is concerned.

Call isolation_exit_to_user_mode before VM-enter.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/entry-kvm.h |    4 +++-
 kernel/entry/kvm.c        |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/entry/kvm.c
===================================================================
--- linux-2.6.orig/kernel/entry/kvm.c
+++ linux-2.6/kernel/entry/kvm.c
@@ -2,8 +2,11 @@
 
 #include <linux/entry-kvm.h>
 #include <linux/kvm_host.h>
+#include <linux/task_isolation.h>
 
-static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
+static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
+				   unsigned long ti_work,
+				   unsigned long tsk_isol_work)
 {
 	do {
 		int ret;
@@ -26,14 +29,20 @@ static int xfer_to_guest_mode_work(struc
 		if (ret)
 			return ret;
 
+		if (tsk_isol_work)
+			isolation_exit_to_user_mode();
+
 		ti_work = READ_ONCE(current_thread_info()->flags);
-	} while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+		tsk_isol_work = task_isol_has_work();
+	} while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched() ||
+		 tsk_isol_work);
 	return 0;
 }
 
 int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
 {
 	unsigned long ti_work;
+	unsigned long tsk_isol_work;
 
 	/*
 	 * This is invoked from the outer guest loop with interrupts and
@@ -44,9 +53,10 @@ int xfer_to_guest_mode_handle_work(struc
 	 * to disable interrupts here.
 	 */
 	ti_work = READ_ONCE(current_thread_info()->flags);
-	if (!(ti_work & XFER_TO_GUEST_MODE_WORK))
+	tsk_isol_work = task_isol_has_work();
+	if (!((ti_work & XFER_TO_GUEST_MODE_WORK) || tsk_isol_work))
 		return 0;
 
-	return xfer_to_guest_mode_work(vcpu, ti_work);
+	return xfer_to_guest_mode_work(vcpu, ti_work, tsk_isol_work);
 }
 EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
Index: linux-2.6/include/linux/entry-kvm.h
===================================================================
--- linux-2.6.orig/include/linux/entry-kvm.h
+++ linux-2.6/include/linux/entry-kvm.h
@@ -8,6 +8,7 @@
 #include <linux/seccomp.h>
 #include <linux/sched.h>
 #include <linux/tick.h>
+#include <linux/task_isolation.h>
 
 /* Transfer to guest mode work */
 #ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
@@ -76,8 +77,9 @@ static inline void xfer_to_guest_mode_pr
 static inline bool __xfer_to_guest_mode_work_pending(void)
 {
 	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long tsk_isol_work = task_isol_has_work();
 
-	return !!(ti_work & XFER_TO_GUEST_MODE_WORK);
+	return !!((ti_work & XFER_TO_GUEST_MODE_WORK) || tsk_isol_work);
 }
 
 /**



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 09/10] mm: vmstat: move need_update
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2021-11-12 12:35 ` [patch v7 08/10] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  2021-11-12 12:35 ` [patch v7 10/10] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

Move need_update() function up in vmstat.c, needed by next patch.
No code changes.

Remove a duplicate comment while at it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   58 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1865,6 +1865,35 @@ static const struct seq_operations vmsta
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	pg_data_t *last_pgdat = NULL;
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		struct per_cpu_nodestat *n;
+
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 */
+		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
+			return true;
+
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
+		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
+			return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1946,35 +1975,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch v7 10/10] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2021-11-12 12:35 ` [patch v7 09/10] mm: vmstat: move need_update Marcelo Tosatti
@ 2021-11-12 12:35 ` Marcelo Tosatti
  9 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-12 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira,
	Marcelo Tosatti

It is not necessary to queue work item to run refresh_vm_stats 
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes sosreport hang (which uses vmstat_refresh) with 
spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 44 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1895,6 +1895,31 @@ static bool need_update(int cpu)
 }
 
 #ifdef CONFIG_PROC_FS
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pages *pcp;
+
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		if (!pcp->count)
+			continue;
+
+		if (!pcp->expire)
+			continue;
+
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+#endif
+
+	return false;
+}
+
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
@@ -1904,8 +1929,12 @@ int vmstat_refresh(struct ctl_table *tab
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1919,9 +1948,19 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, refresh_vm_stats);
+		if (need_update(cpu) || need_drain_remote_zones(cpu))
+			schedule_work_on(cpu, work);
+	}
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(works, cpu));
+	cpus_read_unlock();
+	free_percpu(works);
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-11-12 12:35 ` [patch v7 02/10] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2021-11-23 12:36   ` Frederic Weisbecker
  2021-11-29 15:13     ` Marcelo Tosatti
  2021-11-23 14:37   ` Frederic Weisbecker
  1 sibling, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2021-11-23 12:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> +**PR_ISOL_CFG_SET**:
> +
> +        Set task isolation configuration.
> +        The general format is::
> +
> +                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> +
> +        The 'what' argument specifies what to configure. Possible values are:
> +
> +        - ``I_CFG_FEAT``:
> +
> +                Set configuration of task isolation features. 'arg3' specifies
> +                the feature. Possible values are:
> +
> +                - ``ISOL_F_QUIESCE``:
> +
> +                        If arg4 is QUIESCE_CONTROL, set the control structure
> +                        for quiescing of background kernel activities, from
> +                        the location pointed to by ``(int *)arg5``::
> +
> +                         struct task_isol_quiesce_control {
> +                                __u64 flags;
> +                                __u64 quiesce_mask;
> +                                __u64 quiesce_oneshot_mask;
> +                                __u64 pad[5];
> +                         };
> +
> +                        Where:
> +
> +                        *flags*: Additional flags (should be zero).
> +
> +                        *quiesce_mask*: A bitmask containing which kernel
> +                        activities to quiesce.
> +
> +                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
> +                        activities should behave in oneshot mode, that is, quiescing
> +                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
> +                        on return of subsequent system calls. The corresponding bit(s)
> +                        must also be set at quiesce_mask.
> +
> +                        *pad*: Additional space for future enhancements.
> +
> +                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
> +
> +                        - ``ISOL_F_QUIESCE_VMSTATS``
> +
> +                        VM statistics are maintained in per-CPU counters to
> +                        improve performance. When a CPU modifies a VM statistic,
> +                        this modification is kept in the per-CPU counter.
> +                        Certain activities require a global count, which
> +                        involves requesting each CPU to flush its local counters
> +                        to the global VM counters.
> +
> +                        This flush is implemented via a workqueue item, which
> +                        might schedule a workqueue on isolated CPUs.
> +
> +                        To avoid this interruption, task isolation can be
> +                        configured to, upon return from system calls, synchronize
> +                        the per-CPU counters to global counters, thus avoiding
> +                        the interruption.

Sorry I know this is already v7 but we really don't want to screw up this interface.

What would be more simple and flexible for individual features to quiesce:

   arg3 = ISOL_F_QUIESCE
   arg4 = which feature to quiesce (eg: ISOL_F_QUIESCE_VMSTATS)
   arg5 =

       struct task_isol_quiesce_control {
           __u64 flags; //with ONESHOT as the first and only possible flag for now
           __u64 pad[5];
       };

This way we can really do a finegrained control over each feature to quiesce.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-11-12 12:35 ` [patch v7 02/10] add prctl task isolation prctl docs and samples Marcelo Tosatti
  2021-11-23 12:36   ` Frederic Weisbecker
@ 2021-11-23 14:37   ` Frederic Weisbecker
  2021-11-29 15:19     ` Marcelo Tosatti
  1 sibling, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2021-11-23 14:37 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> +**PR_ISOL_CFG_GET**:
> +
> +        Retrieve task isolation configuration.
> +        The general format is::
> +
> +                prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
> +
> +        The 'what' argument specifies what to configure. Possible values are:
> +
> +        - ``I_CFG_FEAT``:
> +
> +                Return configuration of task isolation features. The 'arg3' argument specifies
> +                whether to return configured features (if zero), or individual
> +                feature configuration (if not zero), as follows.
> +
> +                - ``0``:
> +
> +                        Return the bitmask of configured features, in the location
> +                        pointed  to  by  ``(int *)arg4``. The buffer should allow space
> +                        for 8 bytes.
> +
> +                - ``ISOL_F_QUIESCE``:
> +
> +                        If arg4 is QUIESCE_CONTROL, return the control structure for
> +                        quiescing of background kernel activities, in the location
> +                        pointed to by ``(int *)arg5``::
> +
> +                         struct task_isol_quiesce_control {
> +                                __u64 flags;
> +                                __u64 quiesce_mask;
> +                                __u64 quiesce_oneshot_mask;
> +                                __u64 pad[5];
> +                         };
> +
> +                        See PR_ISOL_CFG_GET description for meaning of
> fields.

PR_ISOL_CFG_SET ?

[...]
> +
> +                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
> +                        activities should behave in oneshot mode, that is, quiescing
> +                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
> +                        on return of subsequent system calls. The corresponding bit(s)
> +                        must also be set at quiesce_mask.

Don't forget to mention interrupts and exceptions.

> +
> +                        *pad*: Additional space for future enhancements.
> +
> +                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
> +
> +                        - ``ISOL_F_QUIESCE_VMSTATS``
> +
> +                        VM statistics are maintained in per-CPU counters to
> +                        improve performance. When a CPU modifies a VM statistic,
> +                        this modification is kept in the per-CPU counter.
> +                        Certain activities require a global count, which
> +                        involves requesting each CPU to flush its local counters
> +                        to the global VM counters.
> +
> +                        This flush is implemented via a workqueue item, which
> +                        might schedule a workqueue on isolated CPUs.
> +
> +                        To avoid this interruption, task isolation can be
> +                        configured to, upon return from system calls, synchronize
> +                        the per-CPU counters to global counters, thus avoiding
> +                        the interruption.
> +
> +        - ``I_CFG_INHERIT``:
> +                Set inheritance configuration when a new task
> +                is created via fork and clone.
> +
> +                The ``(int *)arg4`` argument is a pointer to::
> +
> +                        struct task_isol_inherit_control {
> +                                __u8    inherit_mask;
> +                                __u8    pad[7];
> +                        };
> +
> +                inherit_mask is a bitmask that specifies which part
> +                of task isolation should be inherited:
> +
> +                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
> +                  This is the state written via prctl(PR_ISOL_CFG_SET, ...).
> +
> +                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
> +                  (requires ISOL_INHERIT_CONF to be set). The new task
> +                  should behave, after fork/clone, in the same manner
> +                  as the parent task after it executed:
> +
> +                        prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);

I'm confused, what is the purpose of ISOL_INHERIT_CONF?

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-11-23 12:36   ` Frederic Weisbecker
@ 2021-11-29 15:13     ` Marcelo Tosatti
  2021-12-02 17:13       ` Frederic Weisbecker
  0 siblings, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-29 15:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Tue, Nov 23, 2021 at 01:36:20PM +0100, Frederic Weisbecker wrote:
> On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> > +**PR_ISOL_CFG_SET**:
> > +
> > +        Set task isolation configuration.
> > +        The general format is::
> > +
> > +                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> > +
> > +        The 'what' argument specifies what to configure. Possible values are:
> > +
> > +        - ``I_CFG_FEAT``:
> > +
> > +                Set configuration of task isolation features. 'arg3' specifies
> > +                the feature. Possible values are:
> > +
> > +                - ``ISOL_F_QUIESCE``:
> > +
> > +                        If arg4 is QUIESCE_CONTROL, set the control structure
> > +                        for quiescing of background kernel activities, from
> > +                        the location pointed to by ``(int *)arg5``::
> > +
> > +                         struct task_isol_quiesce_control {
> > +                                __u64 flags;
> > +                                __u64 quiesce_mask;
> > +                                __u64 quiesce_oneshot_mask;
> > +                                __u64 pad[5];
> > +                         };
> > +
> > +                        Where:
> > +
> > +                        *flags*: Additional flags (should be zero).
> > +
> > +                        *quiesce_mask*: A bitmask containing which kernel
> > +                        activities to quiesce.
> > +
> > +                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
> > +                        activities should behave in oneshot mode, that is, quiescing
> > +                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
> > +                        on return of subsequent system calls. The corresponding bit(s)
> > +                        must also be set at quiesce_mask.
> > +
> > +                        *pad*: Additional space for future enhancements.
> > +
> > +                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
> > +
> > +                        - ``ISOL_F_QUIESCE_VMSTATS``
> > +
> > +                        VM statistics are maintained in per-CPU counters to
> > +                        improve performance. When a CPU modifies a VM statistic,
> > +                        this modification is kept in the per-CPU counter.
> > +                        Certain activities require a global count, which
> > +                        involves requesting each CPU to flush its local counters
> > +                        to the global VM counters.
> > +
> > +                        This flush is implemented via a workqueue item, which
> > +                        might schedule a workqueue on isolated CPUs.
> > +
> > +                        To avoid this interruption, task isolation can be
> > +                        configured to, upon return from system calls, synchronize
> > +                        the per-CPU counters to global counters, thus avoiding
> > +                        the interruption.
> 
> Sorry I know this is already v7 but we really don't want to screw up this interface.

No problem.

> What would be more simple and flexible for individual features to quiesce:
> 
>    arg3 = ISOL_F_QUIESCE
>    arg4 = which feature to quiesce (eg: ISOL_F_QUIESCE_VMSTATS)

arg4 is QUIESCE_CONTROL today so one can expand the interface
(by adding new commands).

>    arg5 =
> 
>        struct task_isol_quiesce_control {
>            __u64 flags; //with ONESHOT as the first and only possible flag for now
>            __u64 pad[5];
>        };

So your idea is to allow expansion at this level ? Say while adding
a new

ISOL_F_QUIESCE_NEWITEM

one can add

	struct task_isol_quiesce_control_newitem {
		__u64 flags;
		__u64 pad[5];
	};

And add new fields to "struct task_isol_quiesce_control_newitem".

One downside of this suggestion is that for use-cases of the task_isol_computation.c type,
(see patch 2 "add prctl task isolation prctl docs and samples"), this might need
multiple system calls for each enable/disable cycle. Is that OK?

See more below.

> This way we can really do a finegrained control over each feature to quiesce.

With the patchset above, one can add new values to arg4 
(at the same level of QUIESCE_CONTROL). Your suggestion does not save
room for that.

One could add new values to the same space of I_CFG_FEAT:

          prctl(PR_ISOL_CFG_SET, I_CFG_FEAT_xxx, ...);

But that sounds awkward.

Or change the current ioctl to:

          prctl(PR_ISOL_CFG, I_CFG_FEAT_CONTROL, ...);

Which makes it less awkward.

What do you say?

--- 

And then, what about keeping the bitmaps with enabled/one-shot mode
per feature per bit (to avoid multiple system calls)
but having (in the future) additional per-quiesce instance commands ?









^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-11-23 14:37   ` Frederic Weisbecker
@ 2021-11-29 15:19     ` Marcelo Tosatti
  2021-12-02 17:44       ` Frederic Weisbecker
  0 siblings, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2021-11-29 15:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Tue, Nov 23, 2021 at 03:37:26PM +0100, Frederic Weisbecker wrote:
> On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> > +**PR_ISOL_CFG_GET**:
> > +
> > +        Retrieve task isolation configuration.
> > +        The general format is::
> > +
> > +                prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
> > +
> > +        The 'what' argument specifies what to configure. Possible values are:
> > +
> > +        - ``I_CFG_FEAT``:
> > +
> > +                Return configuration of task isolation features. The 'arg3' argument specifies
> > +                whether to return configured features (if zero), or individual
> > +                feature configuration (if not zero), as follows.
> > +
> > +                - ``0``:
> > +
> > +                        Return the bitmask of configured features, in the location
> > +                        pointed  to  by  ``(int *)arg4``. The buffer should allow space
> > +                        for 8 bytes.
> > +
> > +                - ``ISOL_F_QUIESCE``:
> > +
> > +                        If arg4 is QUIESCE_CONTROL, return the control structure for
> > +                        quiescing of background kernel activities, in the location
> > +                        pointed to by ``(int *)arg5``::
> > +
> > +                         struct task_isol_quiesce_control {
> > +                                __u64 flags;
> > +                                __u64 quiesce_mask;
> > +                                __u64 quiesce_oneshot_mask;
> > +                                __u64 pad[5];
> > +                         };
> > +
> > +                        See PR_ISOL_CFG_GET description for meaning of
> > fields.
> 
> PR_ISOL_CFG_SET ?

Yes, _SET.

> [...]
> > +
> > +                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
> > +                        activities should behave in oneshot mode, that is, quiescing
> > +                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
> > +                        on return of subsequent system calls. The corresponding bit(s)
> > +                        must also be set at quiesce_mask.
> 
> Don't forget to mention interrupts and exceptions.

OK.

> > +
> > +                        *pad*: Additional space for future enhancements.
> > +
> > +                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
> > +
> > +                        - ``ISOL_F_QUIESCE_VMSTATS``
> > +
> > +                        VM statistics are maintained in per-CPU counters to
> > +                        improve performance. When a CPU modifies a VM statistic,
> > +                        this modification is kept in the per-CPU counter.
> > +                        Certain activities require a global count, which
> > +                        involves requesting each CPU to flush its local counters
> > +                        to the global VM counters.
> > +
> > +                        This flush is implemented via a workqueue item, which
> > +                        might schedule a workqueue on isolated CPUs.
> > +
> > +                        To avoid this interruption, task isolation can be
> > +                        configured to, upon return from system calls, synchronize
> > +                        the per-CPU counters to global counters, thus avoiding
> > +                        the interruption.
> > +
> > +        - ``I_CFG_INHERIT``:
> > +                Set inheritance configuration when a new task
> > +                is created via fork and clone.
> > +
> > +                The ``(int *)arg4`` argument is a pointer to::
> > +
> > +                        struct task_isol_inherit_control {
> > +                                __u8    inherit_mask;
> > +                                __u8    pad[7];
> > +                        };
> > +
> > +                inherit_mask is a bitmask that specifies which part
> > +                of task isolation should be inherited:
> > +
> > +                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
> > +                  This is the state written via prctl(PR_ISOL_CFG_SET, ...).
> > +
> > +                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
> > +                  (requires ISOL_INHERIT_CONF to be set). The new task
> > +                  should behave, after fork/clone, in the same manner
> > +                  as the parent task after it executed:
> > +
> > +                        prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);
> 
> I'm confused, what is the purpose of ISOL_INHERIT_CONF?

When ISOL_INHERIT_CONF is set, task isolation configuration (everything
configured through PR_ISOL_CFG_SET) is copied across fork/clone
(but not activation) so one can:

	1) configure task isolation (with chisol, for example).
	2) activate task isolation from the latency sensitive app:

+This is a snippet of code to activate task isolation if
+it has been previously configured (by chisol for example)::
+
+        #include <sys/prctl.h>
+        #include <linux/types.h>
+
+        #ifdef PR_ISOL_CFG_GET
+        unsigned long long fmask;
+
+        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+        if (ret != -1 && fmask != 0) {
+                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
+                if (ret == -1) {
+                        perror("prctl PR_ISOL_ACTIVATE_SET");
+                        return ret;
+                }
+        }
+        #endif

Regarding the 3 possible modes of operation and their relation 
to ISOL_INHERIT_CONF / ISOL_INHERIT_ACTIVE:

+This results in three combinations:
+
+1. Both configuration and activation performed by the
+latency sensitive application.
+Allows fine grained control of what task isolation
+features are enabled and when (see samples section below).

	inherit_mask = 0

+2. Only activation can be performed by the latency sensitive app
+(and configuration performed by chisol).
+This allows the admin/user to control task isolation parameters,
+and applications have to be modified only once.

	inherit_mask = ISOL_INHERIT_CONF

+3. Configuration and activation performed by an external tool.
+This allows unmodified applications to take advantage of
+task isolation. Activation is performed by the "-a" option
+of chisol.

	inherit_mask = ISOL_INHERIT_ACTIVE


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-11-29 15:13     ` Marcelo Tosatti
@ 2021-12-02 17:13       ` Frederic Weisbecker
  2021-12-02 18:29         ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2021-12-02 17:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Mon, Nov 29, 2021 at 12:13:25PM -0300, Marcelo Tosatti wrote:
> On Tue, Nov 23, 2021 at 01:36:20PM +0100, Frederic Weisbecker wrote:
> > On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> > > +**PR_ISOL_CFG_SET**:
> > > +
> > > +        Set task isolation configuration.
> > > +        The general format is::
> > > +
> > > +                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> > > +
> > > +        The 'what' argument specifies what to configure. Possible values are:
> > > +
> > > +        - ``I_CFG_FEAT``:
> > > +
> > > +                Set configuration of task isolation features. 'arg3' specifies
> > > +                the feature. Possible values are:
> > > +
> > > +                - ``ISOL_F_QUIESCE``:
> > > +
> > > +                        If arg4 is QUIESCE_CONTROL, set the control structure
> > > +                        for quiescing of background kernel activities, from
> > > +                        the location pointed to by ``(int *)arg5``::
> > > +
> > > +                         struct task_isol_quiesce_control {
> > > +                                __u64 flags;
> > > +                                __u64 quiesce_mask;
> > > +                                __u64 quiesce_oneshot_mask;
> > > +                                __u64 pad[5];
> > > +                         };
> > > +
> > > +                        Where:
> > > +
> > > +                        *flags*: Additional flags (should be zero).
> > > +
> > > +                        *quiesce_mask*: A bitmask containing which kernel
> > > +                        activities to quiesce.
> > > +
> > > +                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
> > > +                        activities should behave in oneshot mode, that is, quiescing
> > > +                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
> > > +                        on return of subsequent system calls. The corresponding bit(s)
> > > +                        must also be set at quiesce_mask.
> > > +
> > > +                        *pad*: Additional space for future enhancements.
> > > +
> > > +                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
> > > +
> > > +                        - ``ISOL_F_QUIESCE_VMSTATS``
> > > +
> > > +                        VM statistics are maintained in per-CPU counters to
> > > +                        improve performance. When a CPU modifies a VM statistic,
> > > +                        this modification is kept in the per-CPU counter.
> > > +                        Certain activities require a global count, which
> > > +                        involves requesting each CPU to flush its local counters
> > > +                        to the global VM counters.
> > > +
> > > +                        This flush is implemented via a workqueue item, which
> > > +                        might schedule a workqueue on isolated CPUs.
> > > +
> > > +                        To avoid this interruption, task isolation can be
> > > +                        configured to, upon return from system calls, synchronize
> > > +                        the per-CPU counters to global counters, thus avoiding
> > > +                        the interruption.
> > 
> > Sorry I know this is already v7 but we really don't want to screw up this interface.
> 
> No problem.
> 
> > What would be more simple and flexible for individual features to quiesce:
> > 
> >    arg3 = ISOL_F_QUIESCE
> >    arg4 = which feature to quiesce (eg: ISOL_F_QUIESCE_VMSTATS)
> 
> arg4 is QUIESCE_CONTROL today so one can expand the interface
> (by adding new commands).
> 
> >    arg5 =
> > 
> >        struct task_isol_quiesce_control {
> >            __u64 flags; //with ONESHOT as the first and only possible flag for now
> >            __u64 pad[5];
> >        };
> 
> So your idea is to allow expansion at this level ? Say while adding
> a new
> 
> ISOL_F_QUIESCE_NEWITEM
> 
> one can add
> 
> 	struct task_isol_quiesce_control_newitem {
> 		__u64 flags;
> 		__u64 pad[5];
> 	};
> 
> And add new fields to "struct task_isol_quiesce_control_newitem".
> 
> One downside of this suggestion is that for use-cases of the task_isol_computation.c type,
> (see patch 2 "add prctl task isolation prctl docs and samples"), this might need
> multiple system calls for each enable/disable cycle. Is that OK?
> 
> See more below.
> 
> > This way we can really do a finegrained control over each feature to quiesce.
> 
> With the patchset above, one can add new values to arg4 
> (at the same level of QUIESCE_CONTROL). Your suggestion does not save
> room for that.
> 
> One could add new values to the same space of I_CFG_FEAT:
> 
>           prctl(PR_ISOL_CFG_SET, I_CFG_FEAT_xxx, ...);
> 
> But that sounds awkward.
> 
> Or change the current ioctl to:
> 
>           prctl(PR_ISOL_CFG, I_CFG_FEAT_CONTROL, ...);
> 
> Which makes it less awkward.
> 
> What do you say?
> 
> --- 
> 
> And then, what about keeping the bitmaps with enabled/one-shot mode
> per feature per bit (to avoid multiple system calls)
> but having (in the future) additional per-quiesce instance commands ?

Ok got your points.

I guess we can then simply rename ISOL_F_QUIESCE to ISOL_F_QUIESCE_MULTIPLE
for simple all-in-one configuration. Then if the need ever arise in the future,
we can always add ISOL_F_QUIESCE (or ISOL_F_QUIESCE_ONE) to do finegrained
control over a single quiescing feature.

Does that sound ok?

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-11-29 15:19     ` Marcelo Tosatti
@ 2021-12-02 17:44       ` Frederic Weisbecker
  0 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-12-02 17:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Mon, Nov 29, 2021 at 12:19:24PM -0300, Marcelo Tosatti wrote:
> On Tue, Nov 23, 2021 at 03:37:26PM +0100, Frederic Weisbecker wrote:
> > On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> > > +        - ``I_CFG_INHERIT``:
> > > +                Set inheritance configuration when a new task
> > > +                is created via fork and clone.
> > > +
> > > +                The ``(int *)arg4`` argument is a pointer to::
> > > +
> > > +                        struct task_isol_inherit_control {
> > > +                                __u8    inherit_mask;
> > > +                                __u8    pad[7];
> > > +                        };
> > > +
> > > +                inherit_mask is a bitmask that specifies which part
> > > +                of task isolation should be inherited:
> > > +
> > > +                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
> > > +                  This is the state written via prctl(PR_ISOL_CFG_SET, ...).
> > > +
> > > +                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
> > > +                  (requires ISOL_INHERIT_CONF to be set). The new task
> > > +                  should behave, after fork/clone, in the same manner
> > > +                  as the parent task after it executed:
> > > +
> > > +                        prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);
> > 
> > I'm confused, what is the purpose of ISOL_INHERIT_CONF?
> 
> When ISOL_INHERIT_CONF is set, task isolation configuration (everything
> configured through PR_ISOL_CFG_SET) is copied across fork/clone
> (but not activation) so one can:
> 
> 	1) configure task isolation (with chisol, for example).
> 	2) activate task isolation from the latency sensitive app:
> 
> +This is a snippet of code to activate task isolation if
> +it has been previously configured (by chisol for example)::
> +
> +        #include <sys/prctl.h>
> +        #include <linux/types.h>
> +
> +        #ifdef PR_ISOL_CFG_GET
> +        unsigned long long fmask;
> +
> +        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
> +        if (ret != -1 && fmask != 0) {
> +                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
> +                if (ret == -1) {
> +                        perror("prctl PR_ISOL_ACTIVATE_SET");
> +                        return ret;
> +                }
> +        }
> +        #endif
> 
> Regarding the 3 possible modes of operation and their relation 
> to ISOL_INHERIT_CONF / ISOL_INHERIT_ACTIVE:
> 
> +This results in three combinations:
> +
> +1. Both configuration and activation performed by the
> +latency sensitive application.
> +Allows fine grained control of what task isolation
> +features are enabled and when (see samples section below).
> 
> 	inherit_mask = 0
> 
> +2. Only activation can be performed by the latency sensitive app
> +(and configuration performed by chisol).
> +This allows the admin/user to control task isolation parameters,
> +and applications have to be modified only once.
> 
> 	inherit_mask = ISOL_INHERIT_CONF
> 
> +3. Configuration and activation performed by an external tool.
> +This allows unmodified applications to take advantage of
> +task isolation. Activation is performed by the "-a" option
> +of chisol.
> 
> 	inherit_mask = ISOL_INHERIT_ACTIVE
> 

Doh yes of course, I read it too fast but it was actually clear.

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-12-02 17:13       ` Frederic Weisbecker
@ 2021-12-02 18:29         ` Marcelo Tosatti
  2021-12-07 17:05           ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2021-12-02 18:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Thu, Dec 02, 2021 at 06:13:20PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 29, 2021 at 12:13:25PM -0300, Marcelo Tosatti wrote:
> > On Tue, Nov 23, 2021 at 01:36:20PM +0100, Frederic Weisbecker wrote:
> > > On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> > > > +**PR_ISOL_CFG_SET**:
> > > > +
> > > > +        Set task isolation configuration.
> > > > +        The general format is::
> > > > +
> > > > +                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> > > > +
> > > > +        The 'what' argument specifies what to configure. Possible values are:
> > > > +
> > > > +        - ``I_CFG_FEAT``:
> > > > +
> > > > +                Set configuration of task isolation features. 'arg3' specifies
> > > > +                the feature. Possible values are:
> > > > +
> > > > +                - ``ISOL_F_QUIESCE``:
> > > > +
> > > > +                        If arg4 is QUIESCE_CONTROL, set the control structure
> > > > +                        for quiescing of background kernel activities, from
> > > > +                        the location pointed to by ``(int *)arg5``::
> > > > +
> > > > +                         struct task_isol_quiesce_control {
> > > > +                                __u64 flags;
> > > > +                                __u64 quiesce_mask;
> > > > +                                __u64 quiesce_oneshot_mask;
> > > > +                                __u64 pad[5];
> > > > +                         };
> > > > +
> > > > +                        Where:
> > > > +
> > > > +                        *flags*: Additional flags (should be zero).
> > > > +
> > > > +                        *quiesce_mask*: A bitmask containing which kernel
> > > > +                        activities to quiesce.
> > > > +
> > > > +                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
> > > > +                        activities should behave in oneshot mode, that is, quiescing
> > > > +                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
> > > > +                        on return of subsequent system calls. The corresponding bit(s)
> > > > +                        must also be set at quiesce_mask.
> > > > +
> > > > +                        *pad*: Additional space for future enhancements.
> > > > +
> > > > +                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
> > > > +
> > > > +                        - ``ISOL_F_QUIESCE_VMSTATS``
> > > > +
> > > > +                        VM statistics are maintained in per-CPU counters to
> > > > +                        improve performance. When a CPU modifies a VM statistic,
> > > > +                        this modification is kept in the per-CPU counter.
> > > > +                        Certain activities require a global count, which
> > > > +                        involves requesting each CPU to flush its local counters
> > > > +                        to the global VM counters.
> > > > +
> > > > +                        This flush is implemented via a workqueue item, which
> > > > +                        might schedule a workqueue on isolated CPUs.
> > > > +
> > > > +                        To avoid this interruption, task isolation can be
> > > > +                        configured to, upon return from system calls, synchronize
> > > > +                        the per-CPU counters to global counters, thus avoiding
> > > > +                        the interruption.
> > > 
> > > Sorry I know this is already v7 but we really don't want to screw up this interface.
> > 
> > No problem.
> > 
> > > What would be more simple and flexible for individual features to quiesce:
> > > 
> > >    arg3 = ISOL_F_QUIESCE
> > >    arg4 = which feature to quiesce (eg: ISOL_F_QUIESCE_VMSTATS)
> > 
> > arg4 is QUIESCE_CONTROL today so one can expand the interface
> > (by adding new commands).
> > 
> > >    arg5 =
> > > 
> > >        struct task_isol_quiesce_control {
> > >            __u64 flags; //with ONESHOT as the first and only possible flag for now
> > >            __u64 pad[5];
> > >        };
> > 
> > So your idea is to allow expansion at this level ? Say while adding
> > a new
> > 
> > ISOL_F_QUIESCE_NEWITEM
> > 
> > one can add
> > 
> > 	struct task_isol_quiesce_control_newitem {
> > 		__u64 flags;
> > 		__u64 pad[5];
> > 	};
> > 
> > And add new fields to "struct task_isol_quiesce_control_newitem".
> > 
> > One downside of this suggestion is that for use-cases of the task_isol_computation.c type,
> > (see patch 2 "add prctl task isolation prctl docs and samples"), this might need
> > multiple system calls for each enable/disable cycle. Is that OK?
> > 
> > See more below.
> > 
> > > This way we can really do a finegrained control over each feature to quiesce.
> > 
> > With the patchset above, one can add new values to arg4 
> > (at the same level of QUIESCE_CONTROL). Your suggestion does not save
> > room for that.
> > 
> > One could add new values to the same space of I_CFG_FEAT:
> > 
> >           prctl(PR_ISOL_CFG_SET, I_CFG_FEAT_xxx, ...);
> > 
> > But that sounds awkward.
> > 
> > Or change the current ioctl to:
> > 
> >           prctl(PR_ISOL_CFG, I_CFG_FEAT_CONTROL, ...);
> > 
> > Which makes it less awkward.
> > 
> > What do you say?
> > 
> > --- 
> > 
> > And then, what about keeping the bitmaps with enabled/one-shot mode
> > per feature per bit (to avoid multiple system calls)
> > but having (in the future) additional per-quiesce instance commands ?
> 
> Ok got your points.
> 
> I guess we can then simply rename ISOL_F_QUIESCE to ISOL_F_QUIESCE_MULTIPLE
> for simple all-in-one configuration. Then if the need ever arise in the future,
> we can always add ISOL_F_QUIESCE (or ISOL_F_QUIESCE_ONE) to do finegrained
> control over a single quiescing feature.
> 
> Does that sound ok?

Yep, it does, will change that.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch v7 02/10] add prctl task isolation prctl docs and samples
  2021-12-02 18:29         ` Marcelo Tosatti
@ 2021-12-07 17:05           ` Marcelo Tosatti
  0 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-12-07 17:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner, Daniel Bristot de Oliveira

On Thu, Dec 02, 2021 at 03:29:30PM -0300, Marcelo Tosatti wrote:
> On Thu, Dec 02, 2021 at 06:13:20PM +0100, Frederic Weisbecker wrote:
> > On Mon, Nov 29, 2021 at 12:13:25PM -0300, Marcelo Tosatti wrote:
> > > On Tue, Nov 23, 2021 at 01:36:20PM +0100, Frederic Weisbecker wrote:
> > > > On Fri, Nov 12, 2021 at 09:35:33AM -0300, Marcelo Tosatti wrote:
> > > > > +**PR_ISOL_CFG_SET**:
> > > > > +
> > > > > +        Set task isolation configuration.
> > > > > +        The general format is::
> > > > > +
> > > > > +                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> > > > > +
> > > > > +        The 'what' argument specifies what to configure. Possible values are:
> > > > > +
> > > > > +        - ``I_CFG_FEAT``:
> > > > > +
> > > > > +                Set configuration of task isolation features. 'arg3' specifies
> > > > > +                the feature. Possible values are:
> > > > > +
> > > > > +                - ``ISOL_F_QUIESCE``:
> > > > > +
> > > > > +                        If arg4 is QUIESCE_CONTROL, set the control structure
> > > > > +                        for quiescing of background kernel activities, from
> > > > > +                        the location pointed to by ``(int *)arg5``::
> > > > > +
> > > > > +                         struct task_isol_quiesce_control {
> > > > > +                                __u64 flags;
> > > > > +                                __u64 quiesce_mask;
> > > > > +                                __u64 quiesce_oneshot_mask;
> > > > > +                                __u64 pad[5];
> > > > > +                         };
> > > > > +
> > > > > +                        Where:
> > > > > +
> > > > > +                        *flags*: Additional flags (should be zero).
> > > > > +
> > > > > +                        *quiesce_mask*: A bitmask containing which kernel
> > > > > +                        activities to quiesce.
> > > > > +
> > > > > +                        *quiesce_oneshot_mask*: A bitmask indicating which kernel
> > > > > +                        activities should behave in oneshot mode, that is, quiescing
> > > > > +                        will happen on return from prctl(PR_ISOL_ACTIVATE_SET), but not
> > > > > +                        on return of subsequent system calls. The corresponding bit(s)
> > > > > +                        must also be set at quiesce_mask.
> > > > > +
> > > > > +                        *pad*: Additional space for future enhancements.
> > > > > +
> > > > > +                        For quiesce_mask (and quiesce_oneshot_mask), possible bit sets are:
> > > > > +
> > > > > +                        - ``ISOL_F_QUIESCE_VMSTATS``
> > > > > +
> > > > > +                        VM statistics are maintained in per-CPU counters to
> > > > > +                        improve performance. When a CPU modifies a VM statistic,
> > > > > +                        this modification is kept in the per-CPU counter.
> > > > > +                        Certain activities require a global count, which
> > > > > +                        involves requesting each CPU to flush its local counters
> > > > > +                        to the global VM counters.
> > > > > +
> > > > > +                        This flush is implemented via a workqueue item, which
> > > > > +                        might schedule a workqueue on isolated CPUs.
> > > > > +
> > > > > +                        To avoid this interruption, task isolation can be
> > > > > +                        configured to, upon return from system calls, synchronize
> > > > > +                        the per-CPU counters to global counters, thus avoiding
> > > > > +                        the interruption.
> > > > 
> > > > Sorry I know this is already v7 but we really don't want to screw up this interface.
> > > 
> > > No problem.
> > > 
> > > > What would be more simple and flexible for individual features to quiesce:
> > > > 
> > > >    arg3 = ISOL_F_QUIESCE
> > > >    arg4 = which feature to quiesce (eg: ISOL_F_QUIESCE_VMSTATS)
> > > 
> > > arg4 is QUIESCE_CONTROL today so one can expand the interface
> > > (by adding new commands).
> > > 
> > > >    arg5 =
> > > > 
> > > >        struct task_isol_quiesce_control {
> > > >            __u64 flags; //with ONESHOT as the first and only possible flag for now
> > > >            __u64 pad[5];
> > > >        };
> > > 
> > > So your idea is to allow expansion at this level ? Say while adding
> > > a new
> > > 
> > > ISOL_F_QUIESCE_NEWITEM
> > > 
> > > one can add
> > > 
> > > 	struct task_isol_quiesce_control_newitem {
> > > 		__u64 flags;
> > > 		__u64 pad[5];
> > > 	};
> > > 
> > > And add new fields to "struct task_isol_quiesce_control_newitem".
> > > 
> > > One downside of this suggestion is that for use-cases of the task_isol_computation.c type,
> > > (see patch 2 "add prctl task isolation prctl docs and samples"), this might need
> > > multiple system calls for each enable/disable cycle. Is that OK?
> > > 
> > > See more below.
> > > 
> > > > This way we can really do a finegrained control over each feature to quiesce.
> > > 
> > > With the patchset above, one can add new values to arg4 
> > > (at the same level of QUIESCE_CONTROL). Your suggestion does not save
> > > room for that.
> > > 
> > > One could add new values to the same space of I_CFG_FEAT:
> > > 
> > >           prctl(PR_ISOL_CFG_SET, I_CFG_FEAT_xxx, ...);
> > > 
> > > But that sounds awkward.
> > > 
> > > Or change the current ioctl to:
> > > 
> > >           prctl(PR_ISOL_CFG, I_CFG_FEAT_CONTROL, ...);
> > > 
> > > Which makes it less awkward.
> > > 
> > > What do you say?
> > > 
> > > --- 
> > > 
> > > And then, what about keeping the bitmaps with enabled/one-shot mode
> > > per feature per bit (to avoid multiple system calls)
> > > but having (in the future) additional per-quiesce instance commands ?
> > 
> > Ok got your points.
> > 
> > I guess we can then simply rename ISOL_F_QUIESCE to ISOL_F_QUIESCE_MULTIPLE
> > for simple all-in-one configuration. Then if the need ever arise in the future,
> > we can always add ISOL_F_QUIESCE (or ISOL_F_QUIESCE_ONE) to do finegrained
> > control over a single quiescing feature.
> > 
> > Does that sound ok?
> 
> Yep, it does, will change that.

Actually, after performing some of the changes, it turns out that
just adding ISOL_F_QUIESCE_ONE to configure individual features, 
and keeping the current ISOL_F_QUIESCE works just as well.

Fixed the other issues you raised, and added documentation about 
the possibility of ISOL_F_QUIESCE_ONE.

Will resend.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-12-07 17:05 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-12 12:35 [patch v7 00/10] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 01/10] add basic task isolation prctl interface Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 02/10] add prctl task isolation prctl docs and samples Marcelo Tosatti
2021-11-23 12:36   ` Frederic Weisbecker
2021-11-29 15:13     ` Marcelo Tosatti
2021-12-02 17:13       ` Frederic Weisbecker
2021-12-02 18:29         ` Marcelo Tosatti
2021-12-07 17:05           ` Marcelo Tosatti
2021-11-23 14:37   ` Frederic Weisbecker
2021-11-29 15:19     ` Marcelo Tosatti
2021-12-02 17:44       ` Frederic Weisbecker
2021-11-12 12:35 ` [patch v7 03/10] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 04/10] procfs: add per-pid task isolation state Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 05/10] task isolation: add hook to task exit Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 06/10] task isolation: sync vmstats conditional on changes Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 07/10] task isolation: enable return to userspace processing Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 08/10] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 09/10] mm: vmstat: move need_update Marcelo Tosatti
2021-11-12 12:35 ` [patch v7 10/10] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.