* [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2)
@ 2021-07-30 20:18 Marcelo Tosatti
  2021-07-30 20:18 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-07-30 20:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

The logic that disables the vmstat worker thread when entering
nohz_full does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since the refresh_cpu_vm_stats call from the nohz_full logic can happen
_before_ the mlock, the vmstat shepherd can restart the vmstat worker
thread on the CPU in question.

To fix this, add a task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 1) for details.
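
For illustration, a minimal sketch of the intended usage from the
application side (prctl options and flags as added by patch 1, error
handling omitted; patch 1 also carries a fuller sample under
samples/task_isolation/):

    #include <sys/mman.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>

    int main(void)
    {
            volatile long i;

            /* pin memory first, so the mlock-induced vmstat updates
             * happen before quiescing starts */
            mlockall(MCL_CURRENT | MCL_FUTURE);

            /* quiesce vmstat syncing on every return to userspace */
            prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, 0, 0);
            prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);

            /* latency sensitive loop, no further system calls */
            for (i = 0; i < 1000000000; i++)
                    ;

            /* leave isolation before exiting */
            prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
            return 0;
    }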

---------

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).
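
Regarding the ISOL_F_QUIESCE_DEFMASK item: the system-wide default is
seeded through sysfs before starting the application. A rough sketch of
that admin-side step (path and activity name as added by patch 1; a
plain "echo vmstat > /sys/kernel/task_isolation/default_quiesce" does
the same thing):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* activity names come from available_quiesce; "vmstat" is
             * the only one defined so far */
            int fd = open("/sys/kernel/task_isolation/default_quiesce",
                          O_WRONLY);

            if (fd < 0)
                    return 1;
            if (write(fd, "vmstat", strlen("vmstat")) < 0) {
                    close(fd);
                    return 1;
            }
            close(fd);
            return 0;
    }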



* [patch 1/4] add basic task isolation prctl interface
  2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
@ 2021-07-30 20:18 ` Marcelo Tosatti
       [not found]   ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
                     ` (2 more replies)
  2021-07-30 20:18 ` [patch 2/4] task isolation: sync vmstats on return to userspace Marcelo Tosatti
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-07-30 20:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Add a basic prctl task isolation interface, which allows
informing the kernel that the application is executing
latency-sensitive code (where interruptions are undesired).

The interface is described in task_isolation.rst (added by this patch).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,187 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Task isolation prctl interface
+===============================
+
+Certain types of applications benefit from running uninterrupted by
+background OS activities. Realtime systems and high-bandwidth networking
+applications with user-space drivers can fall into this category.
+
+
+To create an OS-noise-free environment for the application, this
+interface allows userspace to inform the kernel of the start and
+end of the latency-sensitive application section (with configurable
+system behaviour for that section).
+
+The prctl options are:
+
+
+        - PR_ISOL_FEAT: Retrieve supported features.
+        - PR_ISOL_GET: Retrieve task isolation parameters.
+        - PR_ISOL_SET: Set task isolation parameters.
+        - PR_ISOL_CTRL_GET: Retrieve task isolation state.
+        - PR_ISOL_CTRL_SET: Set task isolation state (enable/disable task isolation).
+
+The isolation parameters and state are not inherited by
+children created by fork(2) and clone(2). The setting is
+preserved across execve(2).
+
+The sequence of steps to enable task isolation is:
+
+1. Retrieve supported task isolation features (PR_ISOL_FEAT).
+
+2. Configure task isolation features (PR_ISOL_SET/PR_ISOL_GET).
+
+3. Activate or deactivate task isolation features
+   (PR_ISOL_CTRL_GET/PR_ISOL_CTRL_SET).
+
+This interface is based on ideas and code from the
+task isolation patchset from Alex Belits:
+https://lwn.net/Articles/816298/
+
+--------------------
+Feature description
+--------------------
+
+        - ``ISOL_F_QUIESCE``
+
+        This feature allows quiescing select kernel activities on
+        return from system calls.
+
+---------------------
+Interface description
+---------------------
+
+**PR_ISOL_FEAT**:
+
+        Returns the supported features and feature
+        capabilities, as a bitmask. Features and their capabilities
+        are defined in include/uapi/linux/prctl.h::
+
+                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies whether to return
+        supported features (if zero), or feature capabilities
+        (if not zero). Possible non-zero values for 'feat' are:
+
+        - ``ISOL_F_QUIESCE``:
+
+                If arg3 is zero, returns a bitmask containing
+                which kernel activities are supported for quiescing.
+
+                If arg3 is ISOL_F_QUIESCE_DEFMASK, returns
+                default_quiesce_mask, a system-wide configurable.
+                See description of default_quiesce_mask below.
+
+**PR_ISOL_GET**:
+
+        Retrieve task isolation feature configuration.
+        The general format is::
+
+                prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
+
+        Possible values for feat are:
+
+        - ``ISOL_F_QUIESCE``:
+
+                Returns a bitmask containing which kernel
+                activities are enabled for quiescing.
+
+
+**PR_ISOL_SET**:
+
+        Configures task isolation features. The general format is::
+
+                prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies which feature to configure.
+        Possible values for feat are:
+
+        - ``ISOL_F_QUIESCE``:
+
+                The 'arg3' argument is a bitmask specifying which
+                kernel activities to quiesce. Possible bit sets are:
+
+                - ``ISOL_F_QUIESCE_VMSTATS``
+
+                  VM statistics are maintained in per-CPU counters to
+                  improve performance. When a CPU modifies a VM statistic,
+                  this modification is kept in the per-CPU counter.
+                  Certain activities require a global count, which
+                  involves requesting each CPU to flush its local counters
+                  to the global VM counters.
+
+                  This flush is implemented via a workqueue item, which
+                  might schedule a workqueue on isolated CPUs.
+
+                  To avoid this interruption, task isolation can be
+                  configured to, upon return from system calls, synchronize
+                  the per-CPU counters to global counters, thus avoiding
+                  the interruption.
+
+                  To ensure the application returns to userspace
+                  with no modified per-CPU counters, it is necessary to
+                  use mlockall() in addition to this isolation flag.
+
+**PR_ISOL_CTRL_GET**:
+
+        Retrieve task isolation control::
+
+                prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
+
+        Returns which isolation features are active.
+
+**PR_ISOL_CTRL_SET**:
+
+        Activates/deactivates task isolation control::
+
+                prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
+
+        The 'mask' argument specifies which features
+        to activate (bit set) or deactivate (bit clear).
+
+        For ISOL_F_QUIESCE, quiescing of background activities
+        happens on return to userspace from the
+        prctl(PR_ISOL_CTRL_SET) call, and on return from
+        subsequent system calls.
+
+        Quiescing can be adjusted (while active) by
+        prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ...).
+
+--------------------
+Default quiesce mask
+--------------------
+
+Applications can either explicitly specify the individual
+background activities that should be quiesced, or
+use a system-wide configurable default, which is
+set by the system admin or management software.
+
+/sys/kernel/task_isolation/available_quiesce lists, as
+one string per line, the activities which the kernel
+supports quiescing.
+
+To configure the default quiesce mask, write a comma separated
+list of strings (from available_quiesce) to
+/sys/kernel/task_isolation/default_quiesce.
+
+Writing an empty string (``echo > /sys/kernel/task_isolation/default_quiesce``)
+disables all quiescing via ISOL_F_QUIESCE_DEFMASK.
+
+Using ISOL_F_QUIESCE_DEFMASK allows for the application to
+take advantage of future quiescing capabilities without
+modification (provided default_quiesce is configured
+accordingly).
+
+See the PR_ISOL_FEAT subsection of the "Interface description" section
+for more details. samples/task_isolation/task_isolation.c
+contains an example.
+
+Examples
+========
+
+The ``samples/task_isolation/`` directory contains sample
+applications.
+
+
Index: linux-2.6/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/prctl.h
+++ linux-2.6/include/uapi/linux/prctl.h
@@ -267,4 +267,17 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
 # define PR_SCHED_CORE_MAX		4
 
+/* Task isolation control */
+#define PR_ISOL_FEAT			62
+#define PR_ISOL_GET			63
+#define PR_ISOL_SET			64
+#define PR_ISOL_CTRL_GET		65
+#define PR_ISOL_CTRL_SET		66
+
+# define ISOL_F_QUIESCE			(1UL << 0)
+#  define ISOL_F_QUIESCE_VMSTATS	(1UL << 0)
+
+# define ISOL_F_QUIESCE_DEFMASK		(1UL << 0)
+
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_CPU_ISOLATION) += task_isolation.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -58,6 +58,7 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
 #include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
@@ -2567,6 +2568,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
 #endif
+	case PR_ISOL_FEAT:
+		error = prctl_task_isolation_feat(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_GET:
+		error = prctl_task_isolation_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_SET:
+		error = prctl_task_isolation_set(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CTRL_GET:
+		error = prctl_task_isolation_ctrl_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CTRL_SET:
+		error = prctl_task_isolation_ctrl_set(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Index: linux-2.6/samples/task_isolation/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isolation.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+
+int main(void)
+{
+	int ret, defmask;
+	void *buf = malloc(4096);
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = prctl(PR_ISOL_FEAT, 0, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT");
+		return EXIT_FAILURE;
+	}
+	printf("supported features bitmask: 0x%x\n", ret);
+
+	if (!(ret & ISOL_F_QUIESCE)) {
+		printf("quiesce feature unsupported, quitting\n");
+		return EXIT_FAILURE;
+	}
+
+	ret = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
+		return EXIT_FAILURE;
+	}
+	printf("supported ISOL_F_QUIESCE bits: 0x%x\n", ret);
+
+	ret = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE, ISOL_F_QUIESCE_DEFMASK,
+		    0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE, DEFMASK)");
+		return EXIT_FAILURE;
+	}
+
+	defmask = ret;
+	printf("default ISOL_F_QUIESCE bits: 0x%x\n", defmask);
+
+	/*
+	 * Application can either set the value from ISOL_F_QUIESCE_DEFMASK,
+	 * which is configurable through
+	 * /sys/kernel/task_isolation/default_quiesce, or specific values.
+	 *
+	 * Using ISOL_F_QUIESCE_DEFMASK allows for the application to
+	 * take advantage of future quiescing capabilities without
+	 * modification (provided default_quiesce is configured
+	 * accordingly).
+	 */
+	defmask = defmask | ISOL_F_QUIESCE_VMSTATS;
+
+	ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
+		    0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_SET");
+		return EXIT_FAILURE;
+	}
+
+	ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
+		return EXIT_FAILURE;
+	}
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+	/* busy loop */
+	while (ret < NR_LOOPS)  {
+		memset(buf, 0, 4096);
+		ret = ret+1;
+		if (!(ret % NR_PRINT))
+			printf("loops=%d of %d\n", ret, NR_LOOPS);
+	}
+
+	ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CTRL_SET (0)");
+		return EXIT_FAILURE;
+	}
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct isol_info;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1400,6 +1401,10 @@ struct task_struct {
 	struct llist_head               kretprobe_instances;
 #endif
 
+#ifdef CONFIG_CPU_ISOLATION
+	struct isol_info		*isol_info;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -213,6 +213,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_CPU_ISOLATION
+	.isol_info	= NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/task_isolation.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -734,6 +735,7 @@ void __put_task_struct(struct task_struc
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	tsk_isol_exit(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk, true);
@@ -2084,7 +2086,9 @@ static __latent_entropy struct task_stru
 #ifdef CONFIG_BPF_SYSCALL
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 #endif
-
+#ifdef CONFIG_CPU_ISOLATION
+	p->isol_info = NULL;
+#endif
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/task_isolation.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_CPU_ISOLATION
+
+struct isol_info {
+	/* Which features are active */
+	unsigned long active_mask;
+	/* Quiesce mask */
+	unsigned long quiesce_mask;
+};
+
+extern void __tsk_isol_exit(struct task_struct *tsk);
+
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+	if (tsk->isol_info)
+		__tsk_isol_exit(tsk);
+}
+
+
+int prctl_task_isolation_feat(unsigned long arg2, unsigned long arg3,
+			      unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_get(unsigned long arg2, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_set(unsigned long arg2, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_ctrl_set(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5);
+
+#else
+
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+}
+
+static inline int prctl_task_isolation_feat(unsigned long arg2,
+					    unsigned long arg3,
+					    unsigned long arg4,
+					    unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_get(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_set(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_ctrl_get(unsigned long arg2,
+						unsigned long arg3,
+						unsigned long arg4,
+						unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_ctrl_set(unsigned long arg2,
+						unsigned long arg3,
+						unsigned long arg4,
+						unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/task_isolation.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@belits.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ *   Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+
+static unsigned long default_quiesce_mask;
+
+static int tsk_isol_alloc_context(struct task_struct *task)
+{
+	struct isol_info *info;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (unlikely(!info))
+		return -ENOMEM;
+
+	task->isol_info = info;
+	return 0;
+}
+
+void __tsk_isol_exit(struct task_struct *tsk)
+{
+	kfree(tsk->isol_info);
+	tsk->isol_info = NULL;
+}
+
+static int prctl_task_isolation_feat_quiesce(unsigned long type)
+{
+	switch (type) {
+	case 0:
+		return ISOL_F_QUIESCE_VMSTATS;
+	case ISOL_F_QUIESCE_DEFMASK:
+		return default_quiesce_mask;
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+static int task_isolation_get_quiesce(void)
+{
+	if (current->isol_info != NULL)
+		return current->isol_info->quiesce_mask;
+
+	return 0;
+}
+
+static int task_isolation_set_quiesce(unsigned long quiesce_mask)
+{
+	if (quiesce_mask != ISOL_F_QUIESCE_VMSTATS && quiesce_mask != 0)
+		return -EINVAL;
+
+	current->isol_info->quiesce_mask = quiesce_mask;
+	return 0;
+}
+
+int prctl_task_isolation_feat(unsigned long feat, unsigned long arg3,
+			      unsigned long arg4, unsigned long arg5)
+{
+	switch (feat) {
+	case 0:
+		return ISOL_F_QUIESCE;
+	case ISOL_F_QUIESCE:
+		return prctl_task_isolation_feat_quiesce(arg3);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int prctl_task_isolation_get(unsigned long feat, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5)
+{
+	switch (feat) {
+	case ISOL_F_QUIESCE:
+		return task_isolation_get_quiesce();
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int prctl_task_isolation_set(unsigned long feat, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	bool err_free_ctx = false;
+
+	if (current->isol_info == NULL) {
+		err_free_ctx = true;
+		ret = tsk_isol_alloc_context(current);
+		if (ret)
+			return ret;
+	}
+
+	switch (feat) {
+	case ISOL_F_QUIESCE:
+		ret = task_isolation_set_quiesce(arg3);
+		if (ret)
+			break;
+		return 0;
+	default:
+		break;
+	}
+
+	if (err_free_ctx)
+		__tsk_isol_exit(current);
+	return -EINVAL;
+}
+
+int prctl_task_isolation_ctrl_set(unsigned long feat, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	if (current->isol_info == NULL)
+		return -EINVAL;
+
+	if (feat != ISOL_F_QUIESCE && feat != 0)
+		return -EINVAL;
+
+	current->isol_info->active_mask = feat;
+	return 0;
+}
+
+int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	if (current->isol_info == NULL)
+		return 0;
+
+	return current->isol_info->active_mask;
+}
+
+struct qoptions {
+	unsigned long mask;
+	char *name;
+};
+
+static struct qoptions qopts[] = {
+	{ISOL_F_QUIESCE_VMSTATS, "vmstat"},
+};
+
+#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
+
+static ssize_t default_quiesce_store(struct kobject *kobj,
+				     struct kobj_attribute *attr,
+				     const char *buf, size_t count)
+{
+	char *p, *s;
+	unsigned long defmask = 0;
+
+	s = (char *)buf;
+	if (count == 1 && strlen(strim(s)) == 0) {
+		default_quiesce_mask = 0;
+		return count;
+	}
+
+	while ((p = strsep(&s, ",")) != NULL) {
+		int i;
+		bool found = false;
+
+		if (!*p)
+			continue;
+
+		for (i = 0; i < QLEN; i++) {
+			struct qoptions *opt = &qopts[i];
+
+			if (strncmp(strim(p), opt->name, strlen(opt->name)) == 0) {
+				defmask |= opt->mask;
+				found = true;
+				break;
+			}
+		}
+		if (found == true)
+			continue;
+		return -EINVAL;
+	}
+	default_quiesce_mask = defmask;
+
+	return count;
+}
+
+#define MAXARRLEN 100
+
+static ssize_t default_quiesce_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+	int i;
+	char tbuf[MAXARRLEN] = "";
+
+	for (i = 0; i < QLEN; i++) {
+		struct qoptions *opt = &qopts[i];
+
+		if (default_quiesce_mask & opt->mask) {
+			strlcat(tbuf, opt->name, MAXARRLEN);
+			strlcat(tbuf, "\n", MAXARRLEN);
+		}
+	}
+
+	return sprintf(buf, "%s", tbuf);
+}
+
+static struct kobj_attribute default_quiesce_attr =
+				__ATTR_RW(default_quiesce);
+
+static ssize_t available_quiesce_show(struct kobject *kobj,
+				      struct kobj_attribute *attr, char *buf)
+{
+	int i;
+	char tbuf[MAXARRLEN] = "";
+
+	for (i = 0; i < QLEN; i++) {
+		struct qoptions *opt = &qopts[i];
+
+		strlcat(tbuf, opt->name, MAXARRLEN);
+		strlcat(tbuf, "\n", MAXARRLEN);
+	}
+
+	return sprintf(buf, "%s", tbuf);
+}
+
+static struct kobj_attribute available_quiesce_attr =
+				__ATTR_RO(available_quiesce);
+
+static struct attribute *task_isol_attrs[] = {
+	&available_quiesce_attr.attr,
+	&default_quiesce_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group task_isol_attr_group = {
+	.attrs = task_isol_attrs,
+	.bin_attrs = NULL,
+};
+
+static int __init task_isol_ksysfs_init(void)
+{
+	int ret;
+	struct kobject *task_isol_kobj;
+
+	task_isol_kobj = kobject_create_and_add("task_isolation",
+						kernel_kobj);
+	if (!task_isol_kobj) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = sysfs_create_group(task_isol_kobj, &task_isol_attr_group);
+	if (ret)
+		goto out_task_isol_kobj;
+
+	return 0;
+
+out_task_isol_kobj:
+	kobject_put(task_isol_kobj);
+out:
+	return ret;
+}
+
+arch_initcall(task_isol_ksysfs_init);
Index: linux-2.6/samples/Kconfig
===================================================================
--- linux-2.6.orig/samples/Kconfig
+++ linux-2.6/samples/Kconfig
@@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
 	  Build example userspace program to use the new mount_notify(),
 	  sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
 
+config SAMPLE_TASK_ISOLATION
+	bool "task isolation sample"
+	depends on CC_CAN_LINK && HEADERS_INSTALL
+	help
+	  Build example userspace program to use prctl task isolation
+	  interface.
+
 endif # SAMPLES
Index: linux-2.6/samples/Makefile
===================================================================
--- linux-2.6.orig/samples/Makefile
+++ linux-2.6/samples/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
 subdir-$(CONFIG_SAMPLE_WATCHDOG)	+= watchdog
 subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION)	+= task_isolation
Index: linux-2.6/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+userprogs-always-y += task_isolation
+
+userccflags += -I usr/include




* [patch 2/4] task isolation: sync vmstats on return to userspace
  2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
  2021-07-30 20:18 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
@ 2021-07-30 20:18 ` Marcelo Tosatti
  2021-08-03 15:13   ` nsaenzju
  2021-07-30 20:18 ` [patch 3/4] mm: vmstat: move need_update Marcelo Tosatti
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2021-07-30 20:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

The logic that disables the vmstat worker thread when entering
nohz_full does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since the refresh_cpu_vm_stats call from the nohz_full logic can happen
_before_ the mlock, the vmstat shepherd can restart the vmstat worker
thread on the CPU in question.

To fix this, use the task isolation prctl interface to quiesce 
deferred actions when returning to userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -32,8 +32,20 @@ int prctl_task_isolation_ctrl_get(unsign
 int prctl_task_isolation_ctrl_set(unsigned long arg2, unsigned long arg3,
 				  unsigned long arg4, unsigned long arg5);
 
+void __isolation_exit_to_user_mode_prepare(void);
+
+static inline void isolation_exit_to_user_mode_prepare(void)
+{
+	if (current->isol_info != NULL)
+		__isolation_exit_to_user_mode_prepare();
+}
+
 #else
 
+static inline void isolation_exit_to_user_mode_prepare(void)
+{
+}
+
 static inline void tsk_isol_exit(struct task_struct *tsk)
 {
 }
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+#ifdef CONFIG_SMP
+void sync_vmstat(void);
+#else
+static inline void sync_vmstat(void)
+{
+}
+#endif
+
 struct reclaim_stat {
 	unsigned nr_dirty;
 	unsigned nr_unqueued_dirty;
Index: linux-2.6/kernel/entry/common.c
===================================================================
--- linux-2.6.orig/kernel/entry/common.c
+++ linux-2.6/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/task_isolation.h>
 
 #include "common.h"
 
@@ -287,6 +288,7 @@ static void syscall_exit_to_user_mode_pr
 static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
 {
 	syscall_exit_to_user_mode_prepare(regs);
+	isolation_exit_to_user_mode_prepare();
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 }
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -17,6 +17,8 @@
 #include <linux/string.h>
 #include <linux/sysfs.h>
 #include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
 
 static unsigned long default_quiesce_mask;
 
@@ -145,6 +147,17 @@ int prctl_task_isolation_ctrl_get(unsign
 	return current->isol_info->active_mask;
 }
 
+void __isolation_exit_to_user_mode_prepare(void)
+{
+	struct isol_info *i = current->isol_info;
+
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return;
+
+	if (i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS)
+		sync_vmstat();
+}
+
 struct qoptions {
 	unsigned long mask;
 	char *name;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1964,6 +1964,27 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+void sync_vmstat(void)
+{
+	int cpu;
+
+	cpu = get_cpu();
+
+	refresh_cpu_vm_stats(false);
+	put_cpu();
+
+	/*
+	 * If the task is migrated to another CPU between put_cpu
+	 * and cancel_delayed_work_sync, the code below might
+	 * cancel vmstat_update work for a different cpu
+	 * (than the one from which the vmstats were flushed).
+	 *
+	 * However, the vmstat shepherd will re-enable it later,
+	 * so it's harmless.
+	 */
+	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;




* [patch 3/4] mm: vmstat: move need_update
  2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
  2021-07-30 20:18 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
  2021-07-30 20:18 ` [patch 2/4] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2021-07-30 20:18 ` Marcelo Tosatti
  2021-07-30 20:18 ` [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  2021-08-10 16:40 ` [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Thomas Gleixner
  4 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-07-30 20:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Move the need_update() function up in vmstat.c; it is needed by the
next patch. No functional changes.

Remove a duplicate comment while at it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1794,6 +1794,37 @@ static const struct seq_operations vmsta
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	pg_data_t *last_pgdat = NULL;
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		struct per_cpu_nodestat *n;
+
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 */
+		if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
+			       sizeof(pzstats->vm_stat_diff[0])))
+			return true;
+
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
+		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+		if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
+			       sizeof(n->vm_node_stat_diff[0])))
+		    return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1874,42 +1905,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Switch off vmstat processing and then fold all the remaining differentials
- * until the diffs stay at zero. The function is used by NOHZ and can only be
- * invoked when tick processing is not active.
- */
-/*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
-			       sizeof(pzstats->vm_stat_diff[0])))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
-			       sizeof(n->vm_node_stat_diff[0])))
-		    return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.




* [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2021-07-30 20:18 ` [patch 3/4] mm: vmstat: move need_update Marcelo Tosatti
@ 2021-07-30 20:18 ` Marcelo Tosatti
  2021-08-07  2:47   ` Nitesh Lal
  2021-08-10 16:40 ` [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Thomas Gleixner
  4 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2021-07-30 20:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

It is not necessary to queue a work item to run refresh_vm_stats
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes a sosreport hang (sosreport uses vmstat_refresh) in the
presence of a spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6-vmstat-update/mm/vmstat.c
===================================================================
--- linux-2.6-vmstat-update.orig/mm/vmstat.c
+++ linux-2.6-vmstat-update/mm/vmstat.c
@@ -1826,17 +1826,42 @@ static bool need_update(int cpu)
 }
 
 #ifdef CONFIG_PROC_FS
-static void refresh_vm_stats(struct work_struct *work)
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pages *pcp;
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+
+		if (!pcp->count)
+			continue;
+
+		if (!pcp->expire)
+			continue;
+
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+#endif
+
+	return false;
+}
+
+static long refresh_vm_stats(void *arg)
 {
 	refresh_cpu_vm_stats(true);
+	return 0;
 }
 
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1850,9 +1875,15 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		if (need_update(cpu) || need_drain_remote_zones(cpu))
+			work_on_cpu(cpu, refresh_vm_stats, NULL);
+
+		cond_resched();
+	}
+	put_online_cpus();
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.




* Re: [patch 1/4] add basic task isolation prctl interface
       [not found]   ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
@ 2021-07-31  0:49     ` Marcelo Tosatti
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-07-31  0:49 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Fri, Jul 30, 2021 at 07:36:31PM -0400, Nitesh Lal wrote:
> On Fri, Jul 30, 2021 at 4:21 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > Add basic prctl task isolation interface, which allows
> > informing the kernel that application is executing
> > latency sensitive code (where interruptions are undesired).
> >
> > Interface is described by task_isolation.rst (added by this patch).
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/Documentation/userspace-api/task_isolation.rst
> > @@ -0,0 +1,187 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===============================
> > +Task isolation prctl interface
> > +===============================
> > +
> > +Certain types of applications benefit from running uninterrupted by
> > +background OS activities. Realtime systems and high-bandwidth networking
> > +applications with user-space drivers can fall into the category.
> > +
> > +
> > +To create a OS noise free environment for the application, this
> > +interface allows userspace to inform the kernel the start and
> > +end of the latency sensitive application section (with configurable
> > +system behaviour for that section).
> > +
> > +The prctl options are:
> > +
> > +
> > +        - PR_ISOL_FEAT: Retrieve supported features.
> > +        - PR_ISOL_GET: Retrieve task isolation parameters.
> > +        - PR_ISOL_SET: Set task isolation parameters.
> > +        - PR_ISOL_CTRL_GET: Retrieve task isolation state.
> > +        - PR_ISOL_CTRL_SET: Set task isolation state (enable/disable task
> > isolation).
> > +
> >
> 
> Didn't we decide to replace FEAT/FEATURES with MODE?

Searching for the definition of mode:

mode: one of a series of ways that a machine can be made to work
in manual/automatic mode.

mode: a particular way of doing something.

mode: a way of operating, living, or behaving.

So "mode" seems to fit the case where one case can be chosen 
between different choices (exclusively).

Now for this case it seems a composition of things is what is
happening, because quiescing might be functional with both 
"syscalls allowed" and "syscalls not allowed" modes 
(in that case, "mode" makes more sense).

> > +The isolation parameters and state are not inherited by
> > +children created by fork(2) and clone(2). The setting is
> > +preserved across execve(2).
> > +
> > +The sequence of steps to enable task isolation are:
> > +
> > +1. Retrieve supported task isolation features (PR_ISOL_FEAT).
> > +
> > +2. Configure task isolation features (PR_ISOL_SET/PR_ISOL_GET).
> > +
> > +3. Activate or deactivate task isolation features
> > +   (PR_ISOL_CTRL_GET/PR_ISOL_CTRL_SET).
> > +
> > +This interface is based on ideas and code from the
> > +task isolation patchset from Alex Belits:
> > +https://lwn.net/Articles/816298/
> > +
> > +--------------------
> > +Feature description
> > +--------------------
> > +
> > +        - ``ISOL_F_QUIESCE``
> > +
> > +        This feature allows quiescing select kernel activities on
> > +        return from system calls.
> > +
> > +---------------------
> > +Interface description
> > +---------------------
> > +
> > +**PR_ISOL_FEAT**:
> > +
> > +        Returns the supported features and feature
> > +        capabilities, as a bitmask. Features and its capabilities
> > +        are defined at include/uapi/linux/task_isolation.h::
> > +
> > +                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> > +
> > +        The 'feat' argument specifies whether to return
> > +        supported features (if zero), or feature capabilities
> > +        (if not zero). Possible non-zero values for 'feat' are:
> >
> 
> By feature capabilities you mean the kernel activities (vmstat, tlb_flush)?

Not necessarily, but in the case of ISOL_F_QUIESCE, yes, the different
kernel activities that might interrupt the task.

Feature capabilities is a generic term. For example, one might add

ISOL_F_NOTIFY with ISOL_F_NOTIFY_SIGNAL capabilities.
or
ISOL_F_NOTIFY with ISOL_F_NOTIFY_EVENTFD capabilities.
or
ISOL_F_future_feature with ISOL_F_future_feature_capability.

> +
> > +        - ``ISOL_F_QUIESCE``:
> > +
> > +                If arg3 is zero, returns a bitmask containing
> > +                which kernel activities are supported for quiescing.
> > +
> > +                If arg3 is ISOL_F_QUIESCE_DEFMASK, returns
> > +                default_quiesce_mask, a system-wide configurable.
> > +                See description of default_quiesce_mask below.
> > +
> > +**PR_ISOL_GET**:
> > +
> > +        Retrieve task isolation feature configuration.
> > +        The general format is::
> > +
> > +                prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
> > +
> > +        Possible values for feat are:
> > +
> > +        - ``ISOL_F_QUIESCE``:
> > +
> > +                Returns a bitmask containing which kernel
> > +                activities are enabled for quiescing.
> > +
> > +
> > +**PR_ISOL_SET**:
> > +
> > +        Configures task isolation features. The general format is::
> > +
> > +                prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
> > +
> > +        The 'feat' argument specifies which feature to configure.
> > +        Possible values for feat are:
> >
> 
> We should be able to enable multiple features as well via this? Something
> like ISOL_F_QUIESCE|ISOL_F_BLOCK_INTERRUPTORS as you have mentioned in the
> last posting.

One probably would do it separately (PR_ISOL_SET configures each
feature separately):

       ret = prctl(PR_ISOL_FEAT, 0, 0, 0, 0);
       if (ret == -1) {
               perror("prctl PR_ISOL_FEAT");
               return EXIT_FAILURE;
       }

       if (!(ret & ISOL_F_BLOCK_INTERRUPTORS)) {
               printf("ISOL_F_BLOCK_INTERRUPTORS feature unsupported, quitting\n");
               return EXIT_FAILURE;
       }

       ret = prctl(PR_ISOL_SET, ISOL_F_BLOCK_INTERRUPTORS, params...);
       if (ret == -1) {
               perror("prctl PR_ISOL_SET");
               return EXIT_FAILURE;
       }

       /* configure ISOL_F_QUIESCE, ISOL_F_NOTIFY,
 	* ISOL_F_future_feature... */

	ctrl_set_mask = ISOL_F_QUIESCE|ISOL_F_BLOCK_INTERRUPTORS|
			ISOL_F_NOTIFY|ISOL_F_future_feature;

	/* 
	 * activate isolation mode with the features
	 * as configured above
	 */
	ret = prctl(PR_ISOL_CTRL_SET, ctrl_set_mask, 0, 0, 0);
	if (ret == -1) {
		perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
		return EXIT_FAILURE;
	}

	latency sensitive loop

> > +
> > +        - ``ISOL_F_QUIESCE``:
> > +
> > +                The 'arg3' argument is a bitmask specifying which
> > +                kernel activities to quiesce. Possible bit sets are:
> > +
> > +                - ``ISOL_F_QUIESCE_VMSTATS``
> > +
> > +                  VM statistics are maintained in per-CPU counters to
> > +                  improve performance. When a CPU modifies a VM statistic,
> > +                  this modification is kept in the per-CPU counter.
> > +                  Certain activities require a global count, which
> > +                  involves requesting each CPU to flush its local counters
> > +                  to the global VM counters.
> > +
> > +                  This flush is implemented via a workqueue item, which
> > +                  might schedule a workqueue on isolated CPUs.
> > +
> > +                  To avoid this interruption, task isolation can be
> > +                  configured to, upon return from system calls,
> > synchronize
> > +                  the per-CPU counters to global counters, thus avoiding
> > +                  the interruption.
> > +
> > +                  To ensure the application returns to userspace
> > +                  with no modified per-CPU counters, its necessary to
> > +                  use mlockall() in addition to this isolcpus flag.
> > +
> > +**PR_ISOL_CTRL_GET**:
> > +
> > +        Retrieve task isolation control.
> > +
> > +                prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
> > +
> > +        Returns which isolation features are active.
> > +
> > +**PR_ISOL_CTRL_SET**:
> > +
> > +        Activates/deactivates task isolation control.
> > +
> > +                prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > +
> > +        The 'mask' argument specifies which features
> > +        to activate (bit set) or deactivate (bit clear).
> > +
> > +        For ISOL_F_QUIESCE, quiescing of background activities
> > +        happens on return to userspace from the
> > +        prctl(PR_ISOL_CTRL_SET) call, and on return from
> > +        subsequent system calls.
> > +
> > +        Quiescing can be adjusted (while active) by
> > +        prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ...).
> >
> 
> Why do we need this additional control? We should be able to enable or
> disable task isolation using the _GET_ and _SET_ calls, isn't it?

The distinction is so one is able to configure the features separately,
and then enter isolated mode with them activated.

> > +
> > +--------------------
> > +Default quiesce mask
> > +--------------------
> > +
> > +Applications can either explicitly specify individual
> > +background activities that should be quiesced, or
> > +obtain a system configurable value, which is to be
> > +configured by the system admin/mgmt system.
> > +
> > +/sys/kernel/task_isolation/available_quiesce lists, as
> > +one string per line, the activities which the kernel
> > +supports quiescing.
> >
> 
> Probably replace 'quiesce' with 'quiesce_activities' because we are really
> controlling the kernel activities via this control and not the quiesce
> state/feature itself.

OK, makes sense.

> > +
> > +To configure the default quiesce mask, write a comma separated
> > +list of strings (from available_quiesce) to
> > +/sys/kernel/task_isolation/default_quiesce.
> > +
> > +echo > /sys/kernel/task_isolation/default_quiesce disables
> > +all quiescing via ISOL_F_QUIESCE_DEFMASK.
> > +
> > +Using ISOL_F_QUIESCE_DEFMASK allows for the application to
> > +take advantage of future quiescing capabilities without
> > +modification (provided default_quiesce is configured
> > +accordingly).
> >
> 
> ISOL_F_QUIESCE_DEFMASK is really telling to quite all kernel
> activities including the one that is not currently supported or I am
> misinterpreting something?

It's telling the kernel to quiesce the activities that are configured via
/sys/kernel/task_isolation/default_quiesce, including
ones that are not currently supported (in the future,
/sys/kernel/task_isolation/default_quiesce will have to have the bit
for the new feature set).

So userspace can either:

quiesce_mask = value of /sys/kernel/task_isolation/default_quiesce
prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_mask, 0, 0);

(so that new features might be automatically enabled by
a sysadmin).

or

quiesce_mask = application choice of bits
prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_mask, 0, 0);

(so that the application keeps explicit control over which
activities are quiesced).
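
To make this concrete, a rough sketch combining the two options (same
prctl calls as in patch 1, error handling omitted):

    #include <sys/prctl.h>
    #include <linux/prctl.h>

    int main(void)
    {
            unsigned long quiesce_mask;

            /* option 1: pick up the sysadmin-configured default from
             * /sys/kernel/task_isolation/default_quiesce */
            quiesce_mask = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE,
                                 ISOL_F_QUIESCE_DEFMASK, 0, 0);

            /* option 2: the application chooses the bits itself */
            /* quiesce_mask = ISOL_F_QUIESCE_VMSTATS; */

            prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_mask, 0, 0);
            prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);

            /* latency sensitive loop */

            return 0;
    }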





* Re: [patch 1/4] add basic task isolation prctl interface
  2021-07-30 20:18 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
       [not found]   ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
@ 2021-07-31  7:47   ` kernel test robot
       [not found]   ` <CAFki+LkQVQOe+5aNEKWDvLdnjWjxzKWOiqOvBZzeuPWX+G=XgA@mail.gmail.com>
  2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2021-07-31  7:47 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: kbuild-all, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 11061 bytes --]

Hi Marcelo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.14-rc3]
[cannot apply to hnaz-linux-mm/master linux/master tip/sched/core tip/core/entry next-20210730]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Marcelo-Tosatti/extensible-prctl-task-isolation-interface-and-vmstat-sync-v2/20210731-042348
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 4669e13cd67f8532be12815ed3d37e775a9bdc16
config: s390-randconfig-r012-20210730 (attached as .config)
compiler: s390-linux-gcc (GCC) 10.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/c4a772b2c4f14959c65758feac89b3cd0e00a915
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Marcelo-Tosatti/extensible-prctl-task-isolation-interface-and-vmstat-sync-v2/20210731-042348
        git checkout c4a772b2c4f14959c65758feac89b3cd0e00a915
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-10.3.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/sys.c: In function '__do_sys_prctl':
>> kernel/sys.c:2571:2: error: duplicate case value
    2571 |  case PR_ISOL_FEAT:
         |  ^~~~
   kernel/sys.c:2567:2: note: previously used here
    2567 |  case PR_SCHED_CORE:
         |  ^~~~


vim +2571 kernel/sys.c

  2301	
  2302	SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
  2303			unsigned long, arg4, unsigned long, arg5)
  2304	{
  2305		struct task_struct *me = current;
  2306		unsigned char comm[sizeof(me->comm)];
  2307		long error;
  2308	
  2309		error = security_task_prctl(option, arg2, arg3, arg4, arg5);
  2310		if (error != -ENOSYS)
  2311			return error;
  2312	
  2313		error = 0;
  2314		switch (option) {
  2315		case PR_SET_PDEATHSIG:
  2316			if (!valid_signal(arg2)) {
  2317				error = -EINVAL;
  2318				break;
  2319			}
  2320			me->pdeath_signal = arg2;
  2321			break;
  2322		case PR_GET_PDEATHSIG:
  2323			error = put_user(me->pdeath_signal, (int __user *)arg2);
  2324			break;
  2325		case PR_GET_DUMPABLE:
  2326			error = get_dumpable(me->mm);
  2327			break;
  2328		case PR_SET_DUMPABLE:
  2329			if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
  2330				error = -EINVAL;
  2331				break;
  2332			}
  2333			set_dumpable(me->mm, arg2);
  2334			break;
  2335	
  2336		case PR_SET_UNALIGN:
  2337			error = SET_UNALIGN_CTL(me, arg2);
  2338			break;
  2339		case PR_GET_UNALIGN:
  2340			error = GET_UNALIGN_CTL(me, arg2);
  2341			break;
  2342		case PR_SET_FPEMU:
  2343			error = SET_FPEMU_CTL(me, arg2);
  2344			break;
  2345		case PR_GET_FPEMU:
  2346			error = GET_FPEMU_CTL(me, arg2);
  2347			break;
  2348		case PR_SET_FPEXC:
  2349			error = SET_FPEXC_CTL(me, arg2);
  2350			break;
  2351		case PR_GET_FPEXC:
  2352			error = GET_FPEXC_CTL(me, arg2);
  2353			break;
  2354		case PR_GET_TIMING:
  2355			error = PR_TIMING_STATISTICAL;
  2356			break;
  2357		case PR_SET_TIMING:
  2358			if (arg2 != PR_TIMING_STATISTICAL)
  2359				error = -EINVAL;
  2360			break;
  2361		case PR_SET_NAME:
  2362			comm[sizeof(me->comm) - 1] = 0;
  2363			if (strncpy_from_user(comm, (char __user *)arg2,
  2364					      sizeof(me->comm) - 1) < 0)
  2365				return -EFAULT;
  2366			set_task_comm(me, comm);
  2367			proc_comm_connector(me);
  2368			break;
  2369		case PR_GET_NAME:
  2370			get_task_comm(comm, me);
  2371			if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
  2372				return -EFAULT;
  2373			break;
  2374		case PR_GET_ENDIAN:
  2375			error = GET_ENDIAN(me, arg2);
  2376			break;
  2377		case PR_SET_ENDIAN:
  2378			error = SET_ENDIAN(me, arg2);
  2379			break;
  2380		case PR_GET_SECCOMP:
  2381			error = prctl_get_seccomp();
  2382			break;
  2383		case PR_SET_SECCOMP:
  2384			error = prctl_set_seccomp(arg2, (char __user *)arg3);
  2385			break;
  2386		case PR_GET_TSC:
  2387			error = GET_TSC_CTL(arg2);
  2388			break;
  2389		case PR_SET_TSC:
  2390			error = SET_TSC_CTL(arg2);
  2391			break;
  2392		case PR_TASK_PERF_EVENTS_DISABLE:
  2393			error = perf_event_task_disable();
  2394			break;
  2395		case PR_TASK_PERF_EVENTS_ENABLE:
  2396			error = perf_event_task_enable();
  2397			break;
  2398		case PR_GET_TIMERSLACK:
  2399			if (current->timer_slack_ns > ULONG_MAX)
  2400				error = ULONG_MAX;
  2401			else
  2402				error = current->timer_slack_ns;
  2403			break;
  2404		case PR_SET_TIMERSLACK:
  2405			if (arg2 <= 0)
  2406				current->timer_slack_ns =
  2407						current->default_timer_slack_ns;
  2408			else
  2409				current->timer_slack_ns = arg2;
  2410			break;
  2411		case PR_MCE_KILL:
  2412			if (arg4 | arg5)
  2413				return -EINVAL;
  2414			switch (arg2) {
  2415			case PR_MCE_KILL_CLEAR:
  2416				if (arg3 != 0)
  2417					return -EINVAL;
  2418				current->flags &= ~PF_MCE_PROCESS;
  2419				break;
  2420			case PR_MCE_KILL_SET:
  2421				current->flags |= PF_MCE_PROCESS;
  2422				if (arg3 == PR_MCE_KILL_EARLY)
  2423					current->flags |= PF_MCE_EARLY;
  2424				else if (arg3 == PR_MCE_KILL_LATE)
  2425					current->flags &= ~PF_MCE_EARLY;
  2426				else if (arg3 == PR_MCE_KILL_DEFAULT)
  2427					current->flags &=
  2428							~(PF_MCE_EARLY|PF_MCE_PROCESS);
  2429				else
  2430					return -EINVAL;
  2431				break;
  2432			default:
  2433				return -EINVAL;
  2434			}
  2435			break;
  2436		case PR_MCE_KILL_GET:
  2437			if (arg2 | arg3 | arg4 | arg5)
  2438				return -EINVAL;
  2439			if (current->flags & PF_MCE_PROCESS)
  2440				error = (current->flags & PF_MCE_EARLY) ?
  2441					PR_MCE_KILL_EARLY : PR_MCE_KILL_LATE;
  2442			else
  2443				error = PR_MCE_KILL_DEFAULT;
  2444			break;
  2445		case PR_SET_MM:
  2446			error = prctl_set_mm(arg2, arg3, arg4, arg5);
  2447			break;
  2448		case PR_GET_TID_ADDRESS:
  2449			error = prctl_get_tid_address(me, (int __user * __user *)arg2);
  2450			break;
  2451		case PR_SET_CHILD_SUBREAPER:
  2452			me->signal->is_child_subreaper = !!arg2;
  2453			if (!arg2)
  2454				break;
  2455	
  2456			walk_process_tree(me, propagate_has_child_subreaper, NULL);
  2457			break;
  2458		case PR_GET_CHILD_SUBREAPER:
  2459			error = put_user(me->signal->is_child_subreaper,
  2460					 (int __user *)arg2);
  2461			break;
  2462		case PR_SET_NO_NEW_PRIVS:
  2463			if (arg2 != 1 || arg3 || arg4 || arg5)
  2464				return -EINVAL;
  2465	
  2466			task_set_no_new_privs(current);
  2467			break;
  2468		case PR_GET_NO_NEW_PRIVS:
  2469			if (arg2 || arg3 || arg4 || arg5)
  2470				return -EINVAL;
  2471			return task_no_new_privs(current) ? 1 : 0;
  2472		case PR_GET_THP_DISABLE:
  2473			if (arg2 || arg3 || arg4 || arg5)
  2474				return -EINVAL;
  2475			error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
  2476			break;
  2477		case PR_SET_THP_DISABLE:
  2478			if (arg3 || arg4 || arg5)
  2479				return -EINVAL;
  2480			if (mmap_write_lock_killable(me->mm))
  2481				return -EINTR;
  2482			if (arg2)
  2483				set_bit(MMF_DISABLE_THP, &me->mm->flags);
  2484			else
  2485				clear_bit(MMF_DISABLE_THP, &me->mm->flags);
  2486			mmap_write_unlock(me->mm);
  2487			break;
  2488		case PR_MPX_ENABLE_MANAGEMENT:
  2489		case PR_MPX_DISABLE_MANAGEMENT:
  2490			/* No longer implemented: */
  2491			return -EINVAL;
  2492		case PR_SET_FP_MODE:
  2493			error = SET_FP_MODE(me, arg2);
  2494			break;
  2495		case PR_GET_FP_MODE:
  2496			error = GET_FP_MODE(me);
  2497			break;
  2498		case PR_SVE_SET_VL:
  2499			error = SVE_SET_VL(arg2);
  2500			break;
  2501		case PR_SVE_GET_VL:
  2502			error = SVE_GET_VL();
  2503			break;
  2504		case PR_GET_SPECULATION_CTRL:
  2505			if (arg3 || arg4 || arg5)
  2506				return -EINVAL;
  2507			error = arch_prctl_spec_ctrl_get(me, arg2);
  2508			break;
  2509		case PR_SET_SPECULATION_CTRL:
  2510			if (arg4 || arg5)
  2511				return -EINVAL;
  2512			error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
  2513			break;
  2514		case PR_PAC_RESET_KEYS:
  2515			if (arg3 || arg4 || arg5)
  2516				return -EINVAL;
  2517			error = PAC_RESET_KEYS(me, arg2);
  2518			break;
  2519		case PR_PAC_SET_ENABLED_KEYS:
  2520			if (arg4 || arg5)
  2521				return -EINVAL;
  2522			error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
  2523			break;
  2524		case PR_PAC_GET_ENABLED_KEYS:
  2525			if (arg2 || arg3 || arg4 || arg5)
  2526				return -EINVAL;
  2527			error = PAC_GET_ENABLED_KEYS(me);
  2528			break;
  2529		case PR_SET_TAGGED_ADDR_CTRL:
  2530			if (arg3 || arg4 || arg5)
  2531				return -EINVAL;
  2532			error = SET_TAGGED_ADDR_CTRL(arg2);
  2533			break;
  2534		case PR_GET_TAGGED_ADDR_CTRL:
  2535			if (arg2 || arg3 || arg4 || arg5)
  2536				return -EINVAL;
  2537			error = GET_TAGGED_ADDR_CTRL();
  2538			break;
  2539		case PR_SET_IO_FLUSHER:
  2540			if (!capable(CAP_SYS_RESOURCE))
  2541				return -EPERM;
  2542	
  2543			if (arg3 || arg4 || arg5)
  2544				return -EINVAL;
  2545	
  2546			if (arg2 == 1)
  2547				current->flags |= PR_IO_FLUSHER;
  2548			else if (!arg2)
  2549				current->flags &= ~PR_IO_FLUSHER;
  2550			else
  2551				return -EINVAL;
  2552			break;
  2553		case PR_GET_IO_FLUSHER:
  2554			if (!capable(CAP_SYS_RESOURCE))
  2555				return -EPERM;
  2556	
  2557			if (arg2 || arg3 || arg4 || arg5)
  2558				return -EINVAL;
  2559	
  2560			error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
  2561			break;
  2562		case PR_SET_SYSCALL_USER_DISPATCH:
  2563			error = set_syscall_user_dispatch(arg2, arg3, arg4,
  2564							  (char __user *) arg5);
  2565			break;
  2566	#ifdef CONFIG_SCHED_CORE
  2567		case PR_SCHED_CORE:
  2568			error = sched_core_share_pid(arg2, arg3, arg4, arg5);
  2569			break;
  2570	#endif
> 2571		case PR_ISOL_FEAT:
  2572			error = prctl_task_isolation_feat(arg2, arg3, arg4, arg5);
  2573			break;
  2574		case PR_ISOL_GET:
  2575			error = prctl_task_isolation_get(arg2, arg3, arg4, arg5);
  2576			break;
  2577		case PR_ISOL_SET:
  2578			error = prctl_task_isolation_set(arg2, arg3, arg4, arg5);
  2579			break;
  2580		case PR_ISOL_CTRL_GET:
  2581			error = prctl_task_isolation_ctrl_get(arg2, arg3, arg4, arg5);
  2582			break;
  2583		case PR_ISOL_CTRL_SET:
  2584			error = prctl_task_isolation_ctrl_set(arg2, arg3, arg4, arg5);
  2585			break;
  2586		default:
  2587			error = -EINVAL;
  2588			break;
  2589		}
  2590		return error;
  2591	}
  2592	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35453 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/4] add basic task isolation prctl interface
       [not found]   ` <CAFki+LkQVQOe+5aNEKWDvLdnjWjxzKWOiqOvBZzeuPWX+G=XgA@mail.gmail.com>
@ 2021-08-02 14:16     ` Marcelo Tosatti
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-08-02 14:16 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Mon, Aug 02, 2021 at 10:02:03AM -0400, Nitesh Lal wrote:
> On Fri, Jul 30, 2021 at 4:21 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > Add basic prctl task isolation interface, which allows
> > informing the kernel that application is executing
> > latency sensitive code (where interruptions are undesired).
> >
> > Interface is described by task_isolation.rst (added by this patch).
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> >
>  [...]
> 
> +extern void __tsk_isol_exit(struct task_struct *tsk);
> > +
> > +static inline void tsk_isol_exit(struct task_struct *tsk)
> > +{
> > +       if (tsk->isol_info)
> > +               __tsk_isol_exit(tsk);
> > +}
> > +
> > +
> >
> 
> nit: we can get rid of this extra line.
> 
> 
> > +int prctl_task_isolation_feat(unsigned long arg2, unsigned long arg3,
> > +                             unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_get(unsigned long arg2, unsigned long arg3,
> > +                            unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_set(unsigned long arg2, unsigned long arg3,
> > +                            unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
> > +                                 unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_ctrl_set(unsigned long arg2, unsigned long arg3,
> > +                                 unsigned long arg4, unsigned long arg5);
> > +
> > +#else
> > +
> > +static inline void tsk_isol_exit(struct task_struct *tsk)
> > +{
> > +}
> > +
> > +static inline int prctl_task_isolation_feat(unsigned long arg2,
> > +                                           unsigned long arg3,
> > +                                           unsigned long arg4,
> > +                                           unsigned long arg5)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_get(unsigned long arg2,
> > +                                          unsigned long arg3,
> > +                                          unsigned long arg4,
> > +                                          unsigned long arg5)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_set(unsigned long arg2,
> > +                                          unsigned long arg3,
> > +                                          unsigned long arg4,
> > +                                          unsigned long arg5)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_ctrl_get(unsigned long arg2,
> > +                                               unsigned long arg3,
> > +                                               unsigned long arg4,
> > +                                               unsigned long arg5)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_ctrl_set(unsigned long arg2,
> > +                                               unsigned long arg3,
> > +                                               unsigned long arg4,
> > +                                               unsigned long arg5)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +#endif /* CONFIG_CPU_ISOLATION */
> > +
> > +#endif /* __LINUX_TASK_ISOL_H */
> > Index: linux-2.6/kernel/task_isolation.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/kernel/task_isolation.c
> > @@ -0,0 +1,274 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + *  Implementation of task isolation.
> > + *
> > + * Authors:
> > + *   Chris Metcalf <cmetcalf@mellanox.com>
> > + *   Alex Belits <abelits@belits.com>
> > + *   Yuri Norov <ynorov@marvell.com>
> > + *   Marcelo Tosatti <mtosatti@redhat.com>
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/task_isolation.h>
> > +#include <linux/prctl.h>
> > +#include <linux/slab.h>
> > +#include <linux/kobject.h>
> > +#include <linux/string.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/init.h>
> > +
> > +static unsigned long default_quiesce_mask;
> > +
> > +static int tsk_isol_alloc_context(struct task_struct *task)
> > +{
> > +       struct isol_info *info;
> > +
> > +       info = kzalloc(sizeof(*info), GFP_KERNEL);
> > +       if (unlikely(!info))
> > +               return -ENOMEM;
> > +
> > +       task->isol_info = info;
> > +       return 0;
> > +}
> > +
> > +void __tsk_isol_exit(struct task_struct *tsk)
> > +{
> > +       kfree(tsk->isol_info);
> > +       tsk->isol_info = NULL;
> > +}
> > +
> > +static int prctl_task_isolation_feat_quiesce(unsigned long type)
> > +{
> > +       switch (type) {
> > +       case 0:
> > +               return ISOL_F_QUIESCE_VMSTATS;
> > +       case ISOL_F_QUIESCE_DEFMASK:
> > +               return default_quiesce_mask;
> > +       default:
> > +               break;
> > +       }
> > +
> > +       return -EINVAL;
> > +}
> > +
> > +static int task_isolation_get_quiesce(void)
> > +{
> > +       if (current->isol_info != NULL)
> >
> 
> Should replace the above with just 'if (current->isol_info)'.
> 
> +               return current->isol_info->quiesce_mask;
> > +
> > +       return 0;
> > +}
> > +
> > +static int task_isolation_set_quiesce(unsigned long quiesce_mask)
> > +{
> > +       if (quiesce_mask != ISOL_F_QUIESCE_VMSTATS && quiesce_mask != 0)
> > +               return -EINVAL;
> > +
> > +       current->isol_info->quiesce_mask = quiesce_mask;
> > +       return 0;
> > +}
> > +
> > +int prctl_task_isolation_feat(unsigned long feat, unsigned long arg3,
> > +                             unsigned long arg4, unsigned long arg5)
> > +{
> > +       switch (feat) {
> > +       case 0:
> > +               return ISOL_F_QUIESCE;
> > +       case ISOL_F_QUIESCE:
> > +               return prctl_task_isolation_feat_quiesce(arg3);
> > +       default:
> > +               break;
> > +       }
> > +       return -EINVAL;
> > +}
> > +
> > +int prctl_task_isolation_get(unsigned long feat, unsigned long arg3,
> > +                            unsigned long arg4, unsigned long arg5)
> > +{
> > +       switch (feat) {
> > +       case ISOL_F_QUIESCE:
> > +               return task_isolation_get_quiesce();
> > +       default:
> > +               break;
> > +       }
> > +       return -EINVAL;
> > +}
> > +
> > +int prctl_task_isolation_set(unsigned long feat, unsigned long arg3,
> > +                            unsigned long arg4, unsigned long arg5)
> > +{
> > +       int ret;
> > +       bool err_free_ctx = false;
> > +
> > +       if (current->isol_info == NULL)
> >
> 
> Can replace this with 'if (!current->isol_info)'.
> There are other places below where a similar improvement can be made.
> 
> 
> > +               err_free_ctx = true;
> > +
> > +       ret = tsk_isol_alloc_context(current);
> > +       if (ret)
> > +               return ret;
> > +
> > +       switch (feat) {
> > +       case ISOL_F_QUIESCE:
> > +               ret = task_isolation_set_quiesce(arg3);
> > +               if (ret)
> > +                       break;
> > +               return 0;
> > +       default:
> > +               break;
> > +       }
> > +
> > +       if (err_free_ctx)
> > +               __tsk_isol_exit(current);
> > +       return -EINVAL;
> > +}
> > +
> > +int prctl_task_isolation_ctrl_set(unsigned long feat, unsigned long arg3,
> > +                                 unsigned long arg4, unsigned long arg5)
> > +{
> > +       if (current->isol_info == NULL)
> > +               return -EINVAL;
> > +
> > +       if (feat != ISOL_F_QUIESCE && feat != 0)
> > +               return -EINVAL;
> > +
> > +       current->isol_info->active_mask = feat;
> > +       return 0;
> > +}
> > +
> > +int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
> > +                                 unsigned long arg4, unsigned long arg5)
> > +{
> > +       if (current->isol_info == NULL)
> > +               return 0;
> > +
> > +       return current->isol_info->active_mask;
> > +}
> > +
> > +struct qoptions {
> > +       unsigned long mask;
> > +       char *name;
> > +};
> > +
> > +static struct qoptions qopts[] = {
> > +       {ISOL_F_QUIESCE_VMSTATS, "vmstat"},
> > +};
> > +
> > +#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
> > +
> > +static ssize_t default_quiesce_store(struct kobject *kobj,
> > +                                    struct kobj_attribute *attr,
> > +                                    const char *buf, size_t count)
> > +{
> > +       char *p, *s;
> > +       unsigned long defmask = 0;
> > +
> > +       s = (char *)buf;
> > +       if (count == 1 && strlen(strim(s)) == 0) {
> > +               default_quiesce_mask = 0;
> > +               return count;
> > +       }
> > +
> > +       while ((p = strsep(&s, ",")) != NULL) {
> > +               int i;
> > +               bool found = false;
> > +
> > +               if (!*p)
> > +                       continue;
> > +
> > +               for (i = 0; i < QLEN; i++) {
> > +                       struct qoptions *opt = &qopts[i];
> > +
> > +                       if (strncmp(strim(p), opt->name,
> > +                                   strlen(opt->name)) == 0) {
> > +                               defmask |= opt->mask;
> > +                               found = true;
> > +                               break;
> > +                       }
> > +               }
> > +               if (found == true)
> > +                       continue;
> > +               return -EINVAL;
> > +       }
> > +       default_quiesce_mask = defmask;
> > +
> > +       return count;
> > +}
> > +
> > +#define MAXARRLEN 100
> > +
> > +static ssize_t default_quiesce_show(struct kobject *kobj,
> > +                                   struct kobj_attribute *attr, char *buf)
> > +{
> > +       int i;
> > +       char tbuf[MAXARRLEN] = "";
> > +
> > +       for (i = 0; i < QLEN; i++) {
> > +               struct qoptions *opt = &qopts[i];
> > +
> > +               if (default_quiesce_mask & opt->mask) {
> > +                       strlcat(tbuf, opt->name, MAXARRLEN);
> > +                       strlcat(tbuf, "\n", MAXARRLEN);
> > +               }
> > +       }
> > +
> > +       return sprintf(buf, "%s", tbuf);
> > +}
> > +
> > +static struct kobj_attribute default_quiesce_attr =
> > +                               __ATTR_RW(default_quiesce);
> > +
> > +static ssize_t available_quiesce_show(struct kobject *kobj,
> > +                                     struct kobj_attribute *attr, char *buf)
> > +{
> > +       int i;
> > +       char tbuf[MAXARRLEN] = "";
> > +
> > +       for (i = 0; i < QLEN; i++) {
> > +               struct qoptions *opt = &qopts[i];
> > +
> > +               strlcat(tbuf, opt->name, MAXARRLEN);
> > +               strlcat(tbuf, "\n", MAXARRLEN);
> > +       }
> > +
> > +       return sprintf(buf, "%s", tbuf);
> > +}
> > +
> > +static struct kobj_attribute available_quiesce_attr =
> > +                               __ATTR_RO(available_quiesce);
> > +
> > +static struct attribute *task_isol_attrs[] = {
> > +       &available_quiesce_attr.attr,
> > +       &default_quiesce_attr.attr,
> > +       NULL,
> > +};
> > +
> > +static const struct attribute_group task_isol_attr_group = {
> > +       .attrs = task_isol_attrs,
> > +       .bin_attrs = NULL,
> > +};
> > +
> > +static int __init task_isol_ksysfs_init(void)
> > +{
> > +       int ret;
> > +       struct kobject *task_isol_kobj;
> > +
> > +       task_isol_kobj = kobject_create_and_add("task_isolation",
> > +                                               kernel_kobj);
> > +       if (!task_isol_kobj) {
> > +               ret = -ENOMEM;
> > +               goto out;
> > +       }
> > +
> > +       ret = sysfs_create_group(task_isol_kobj, &task_isol_attr_group);
> > +       if (ret)
> > +               goto out_task_isol_kobj;
> > +
> > +       return 0;
> > +
> > +out_task_isol_kobj:
> > +       kobject_put(task_isol_kobj);
> > +out:
> > +       return ret;
> > +}
> > +
> > +arch_initcall(task_isol_ksysfs_init);
> > Index: linux-2.6/samples/Kconfig
> > ===================================================================
> > --- linux-2.6.orig/samples/Kconfig
> > +++ linux-2.6/samples/Kconfig
> > @@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
> >           Build example userspace program to use the new mount_notify(),
> >           sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
> >
> > +config SAMPLE_TASK_ISOLATION
> > +       bool "task isolation sample"
> > +       depends on CC_CAN_LINK && HEADERS_INSTALL
> > +       help
> > +         Build example userspace program to use prctl task isolation
> > +         interface.
> > +
> >  endif # SAMPLES
> > Index: linux-2.6/samples/Makefile
> > ===================================================================
> > --- linux-2.6.orig/samples/Makefile
> > +++ linux-2.6/samples/Makefile
> > @@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI)                += mei/
> >  subdir-$(CONFIG_SAMPLE_WATCHDOG)       += watchdog
> >  subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)    += watch_queue
> >  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)      += kmemleak/
> > +subdir-$(CONFIG_SAMPLE_TASK_ISOLATION) += task_isolation
> > Index: linux-2.6/samples/task_isolation/Makefile
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/samples/task_isolation/Makefile
> > @@ -0,0 +1,4 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +userprogs-always-y += task_isolation
> > +
> > +userccflags += -I usr/include
> >
> >
> >
> I am wondering if it is possible to further split this patch into smaller
> ones?

OK, will try to split this into smaller patches and fix the
style issues.

Thanks.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 2/4] task isolation: sync vmstats on return to userspace
  2021-07-30 20:18 ` [patch 2/4] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2021-08-03 15:13   ` nsaenzju
  2021-08-03 16:44     ` Marcelo Tosatti
  0 siblings, 1 reply; 16+ messages in thread
From: nsaenzju @ 2021-08-03 15:13 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu

On Fri, 2021-07-30 at 17:18 -0300, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
> 
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
> 
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
> 
> To fix this, use the task isolation prctl interface to quiesce 
> deferred actions when returning to userspace.

Even though this is mostly targeted at nohz_full users, I believe I haven't
seen anything in this series that forces the feature to be run on nohz_full
CPUs (this is a good thing IMO). So, I'd suggest to reword the patch
description so it doesn't imply nohz_full is necessary to use this.

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 2/4] task isolation: sync vmstats on return to userspace
  2021-08-03 15:13   ` nsaenzju
@ 2021-08-03 16:44     ` Marcelo Tosatti
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-08-03 16:44 UTC (permalink / raw)
  To: nsaenzju
  Cc: linux-kernel, Nitesh Lal, Frederic Weisbecker, Christoph Lameter,
	Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu

On Tue, Aug 03, 2021 at 05:13:03PM +0200, nsaenzju@redhat.com wrote:
> On Fri, 2021-07-30 at 17:18 -0300, Marcelo Tosatti wrote:
> > The logic to disable vmstat worker thread, when entering
> > nohz full, does not cover all scenarios. For example, it is possible
> > for the following to happen:
> > 
> > 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> > 2) app runs mlock, which increases counters for mlock'ed pages.
> > 3) start -RT loop
> > 
> > Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> > the mlock, vmstat shepherd can restart vmstat worker thread on
> > the CPU in question.
> > 
> > To fix this, use the task isolation prctl interface to quiesce 
> > deferred actions when returning to userspace.
> 
> Even though this is mostly targeted at nohz_full users, I believe I haven't
> seen anything in this series that forces the feature to be run on nohz_full
> CPUs (this is a good thing IMO). 

I don't think there is such a dependency either.

> So, I'd suggest rewording the patch
> description so it doesn't imply nohz_full is necessary to use this.

It's describing a fact about nohz_full: it can't guarantee entering
userspace with vmstat turned off (which is a reply to Christoph's
earlier comment that "this should just work with nohz_full and
logic to shutdown the vmstat delayed work timer").

Will add a comment to make it explicit that the series does not depend
on nohz_full.

Thanks.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-07-30 20:18 ` [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
@ 2021-08-07  2:47   ` Nitesh Lal
  2021-08-09 17:34     ` Marcelo Tosatti
  0 siblings, 1 reply; 16+ messages in thread
From: Nitesh Lal @ 2021-08-07  2:47 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

Hi Marcelo,

On Fri, Jul 30, 2021 at 4:21 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> It is not necessary to queue work item to run refresh_vm_stats
> on a remote CPU if that CPU has no dirty stats and no per-CPU
> allocations for remote nodes.

[...]

>         /*
>          * The regular update, every sysctl_stat_interval, may come later
> @@ -1850,9 +1875,15 @@ int vmstat_refresh(struct ctl_table *tab
>          * transiently negative values, report an error here if any of
>          * the stats is negative, so we know to go looking for imbalance.
>          */
> -       err = schedule_on_each_cpu(refresh_vm_stats);
> -       if (err)
> -               return err;
> +       get_online_cpus();
> +       for_each_online_cpu(cpu) {
> +               if (need_update(cpu) || need_drain_remote_zones(cpu))
> +                       work_on_cpu(cpu, refresh_vm_stats, NULL);
> +
> +               cond_resched();
> +       }
> +       put_online_cpus();

While testing this patch-set by enabling the debug config options, I
ran into the following splat:

[  945.149209] ======================================================
[  945.156111] WARNING: possible circular locking dependency detected
[  945.163013] 5.14.0-rc3+ #2 Tainted: G S
[  945.168754] ------------------------------------------------------
[  945.175655] sysctl/6820 is trying to acquire lock:
[  945.181006] ffff8881679efb00
((work_completion)(&wfc.work)){+.+.}-{0:0}, at:
__flush_work+0x5c2/0x8c0
[  945.191332]
[  945.191332] but task is already holding lock:
[  945.197846] ffffffff93b43660 (cpu_hotplug_lock){++++}-{0:0}, at:
vmstat_refresh+0x4a/0x450
[  945.207098]
[  945.207098] which lock already depends on the new lock.
[  945.207098]
[  945.216228]
[  945.216228] the existing dependency chain (in reverse order) is:
[  945.224574]
[  945.224574] -> #1 (cpu_hotplug_lock){++++}-{0:0}:
[  945.231488]        lock_acquire+0x1d7/0x540
[  945.236157]        cpus_read_lock+0x3b/0xc0
[  945.240834]        alloc_workqueue+0x884/0xd00
[  945.245800]        scsi_host_alloc+0xbdb/0xf60
[  945.250769]        megasas_probe_one+0x164/0x800 [megaraid_sas]
[  945.257402]        local_pci_probe+0xd8/0x170
[  945.262270]        work_for_cpu_fn+0x51/0xa0
[  945.267041]        process_one_work+0x960/0x16a0
[  945.272200]        worker_thread+0x55e/0xbf0
[  945.276970]        kthread+0x371/0x440
[  945.281158]        ret_from_fork+0x22/0x30
[  945.285737]
[  945.285737] -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  945.293813]        check_prevs_add+0x3fd/0x2470
[  945.298874]        __lock_acquire+0x23f7/0x2f80
[  945.303934]        lock_acquire+0x1d7/0x540
[  945.308605]        __flush_work+0x5e2/0x8c0
[  945.313278]        work_on_cpu+0xee/0x140
[  945.317753]        vmstat_refresh+0x12f/0x450
[  945.322622]        proc_sys_call_handler+0x389/0x500
[  945.328170]        new_sync_read+0x3af/0x610
[  945.332942]        vfs_read+0x268/0x490
[  945.337225]        ksys_read+0xf1/0x1c0
[  945.341511]        do_syscall_64+0x42/0x90
[  945.346091]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  945.352318]
[  945.352318] other info that might help us debug this:
[  945.352318]
[  945.361254]  Possible unsafe locking scenario:
[  945.361254]
[  945.367863]        CPU0                    CPU1
[  945.372920]        ----                    ----
[  945.377976]   lock(cpu_hotplug_lock);
[  945.382072]
lock((work_completion)(&wfc.work));
[  945.390140]                                lock(cpu_hotplug_lock);
[  945.397046]   lock((work_completion)(&wfc.work));
[  945.402301]
[  945.402301]  *** DEADLOCK ***
[  945.402301]
[  945.408909] 1 lock held by sysctl/6820:
[  945.413194]  #0: ffffffff93b43660 (cpu_hotplug_lock){++++}-{0:0},
at: vmstat_refresh+0x4a/0x450
[  945.422929]
[  945.422929] stack backtrace:
[  945.427793] CPU: 25 PID: 6820 Comm: sysctl Tainted: G S
   5.14.0-rc3+ #2
[  945.436540] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS
2.6.0 10/31/2017
[  945.444886] Call Trace:
[  945.447623]  dump_stack_lvl+0x44/0x57
[  945.451720]  check_noncircular+0x280/0x320
[  945.456299]  ? print_circular_bug.isra.49+0x440/0x440
[  945.461943]  ? mark_lock.part.54+0x107/0x1370
[  945.466813]  ? mark_lock.part.54+0x107/0x1370
[  945.471683]  ? native_sched_clock+0x32/0x130
[  945.476449]  ? lock_chain_count+0x20/0x20
[  945.480929]  ? sched_clock_cpu+0x151/0x1d0
[  945.485512]  check_prevs_add+0x3fd/0x2470
[  945.489999]  ? native_sched_clock+0x32/0x130
[  945.494770]  ? sched_clock_cpu+0x151/0x1d0
[  945.499347]  ? find_held_lock+0x3a/0x1c0
[  945.503731]  ? check_irq_usage+0xe10/0xe10
[  945.508306]  ? lockdep_lock+0xbf/0x1b0
[  945.512495]  ? static_obj+0xc0/0xc0
[  945.516392]  ? sched_clock_cpu+0x151/0x1d0
[  945.520975]  __lock_acquire+0x23f7/0x2f80
[  945.525461]  ? rcu_read_lock_bh_held+0xc0/0xc0
[  945.530430]  lock_acquire+0x1d7/0x540
[  945.534523]  ? __flush_work+0x5c2/0x8c0
[  945.538811]  ? rcu_read_unlock+0x50/0x50
[  945.543196]  ? native_sched_clock+0x32/0x130
[  945.547969]  ? __queue_work+0x4a3/0xfd0
[  945.552256]  ? rcu_read_lock_sched_held+0xaf/0xe0
[  945.557514]  __flush_work+0x5e2/0x8c0
[  945.561604]  ? __flush_work+0x5c2/0x8c0
[  945.565894]  ? queue_delayed_work_on+0x80/0x80
[  945.570853]  ? lock_downgrade+0x700/0x700
[  945.575339]  ? mark_held_locks+0xb7/0x120
[  945.579829]  ? lockdep_hardirqs_on_prepare+0x28f/0x3e0
[  945.585572]  ? queue_work_on+0x48/0x80
[  945.589763]  ? trace_hardirqs_on+0x32/0x170
[  945.594436]  work_on_cpu+0xee/0x140
[  945.598327]  ? flush_delayed_work+0xc0/0xc0
[  945.603004]  ? work_debug_hint+0x40/0x40
[  945.607388]  ? refresh_cpu_vm_stats+0x530/0x530
[  945.612452]  ? need_update+0x162/0x210
[  945.616646]  vmstat_refresh+0x12f/0x450
[  945.620938]  proc_sys_call_handler+0x389/0x500
[  945.625905]  ? proc_sys_permission+0x120/0x120
[  945.630872]  ? native_sched_clock+0x32/0x130
[  945.635644]  new_sync_read+0x3af/0x610
[  945.639838]  ? __x64_sys_llseek+0x2e0/0x2e0
[  945.644505]  ? native_sched_clock+0x32/0x130
[  945.649290]  ? fsnotify+0xf10/0xf10
[  945.653199]  vfs_read+0x268/0x490
[  945.656913]  ksys_read+0xf1/0x1c0
[  945.660619]  ? vfs_write+0x910/0x910
[  945.664615]  ? syscall_trace_enter.isra.17+0x18c/0x260
[  945.670369]  do_syscall_64+0x42/0x90
[  945.674371]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Maybe we should replace get/put_online_cpus() with cpu_hotplug_disable()/cpu_hotplug_enable()?
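
A minimal sketch of what that could look like, keeping the loop from the
patch above otherwise unchanged (untested; whether it actually makes the
lockdep report go away would need to be verified):

	cpu_hotplug_disable();
	for_each_online_cpu(cpu) {
		if (need_update(cpu) || need_drain_remote_zones(cpu))
			work_on_cpu(cpu, refresh_vm_stats, NULL);

		cond_resched();
	}
	cpu_hotplug_enable();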

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-08-07  2:47   ` Nitesh Lal
@ 2021-08-09 17:34     ` Marcelo Tosatti
  2021-08-09 19:13       ` Nitesh Lal
  0 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2021-08-09 17:34 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Fri, Aug 06, 2021 at 10:47:40PM -0400, Nitesh Lal wrote:
> Hi Marcelo,
> 
> On Fri, Jul 30, 2021 at 4:21 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > It is not necessary to queue work item to run refresh_vm_stats
> > on a remote CPU if that CPU has no dirty stats and no per-CPU
> > allocations for remote nodes.
> 
> [...]
> 
> >         /*
> >          * The regular update, every sysctl_stat_interval, may come later
> > @@ -1850,9 +1875,15 @@ int vmstat_refresh(struct ctl_table *tab
> >          * transiently negative values, report an error here if any of
> >          * the stats is negative, so we know to go looking for imbalance.
> >          */
> > -       err = schedule_on_each_cpu(refresh_vm_stats);
> > -       if (err)
> > -               return err;
> > +       get_online_cpus();
> > +       for_each_online_cpu(cpu) {
> > +               if (need_update(cpu) || need_drain_remote_zones(cpu))
> > +                       work_on_cpu(cpu, refresh_vm_stats, NULL);
> > +
> > +               cond_resched();
> > +       }
> > +       put_online_cpus();
> 
> While testing this patch-set by enabling the debug config options, I
> ran into the following splat:
> 
> [  945.149209] ======================================================
> [  945.156111] WARNING: possible circular locking dependency detected
> [  945.163013] 5.14.0-rc3+ #2 Tainted: G S
> [  945.168754] ------------------------------------------------------
> [  945.175655] sysctl/6820 is trying to acquire lock:
> [  945.181006] ffff8881679efb00
> ((work_completion)(&wfc.work)){+.+.}-{0:0}, at:
> __flush_work+0x5c2/0x8c0
> [  945.191332]
> [  945.191332] but task is already holding lock:
> [  945.197846] ffffffff93b43660 (cpu_hotplug_lock){++++}-{0:0}, at:
> vmstat_refresh+0x4a/0x450
> [  945.207098]
> [  945.207098] which lock already depends on the new lock.
> [  945.207098]
> [  945.216228]
> [  945.216228] the existing dependency chain (in reverse order) is:
> [  945.224574]
> [  945.224574] -> #1 (cpu_hotplug_lock){++++}-{0:0}:
> [  945.231488]        lock_acquire+0x1d7/0x540
> [  945.236157]        cpus_read_lock+0x3b/0xc0
> [  945.240834]        alloc_workqueue+0x884/0xd00
> [  945.245800]        scsi_host_alloc+0xbdb/0xf60
> [  945.250769]        megasas_probe_one+0x164/0x800 [megaraid_sas]
> [  945.257402]        local_pci_probe+0xd8/0x170
> [  945.262270]        work_for_cpu_fn+0x51/0xa0
> [  945.267041]        process_one_work+0x960/0x16a0
> [  945.272200]        worker_thread+0x55e/0xbf0
> [  945.276970]        kthread+0x371/0x440
> [  945.281158]        ret_from_fork+0x22/0x30
> [  945.285737]
> [  945.285737] -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
> [  945.293813]        check_prevs_add+0x3fd/0x2470
> [  945.298874]        __lock_acquire+0x23f7/0x2f80
> [  945.303934]        lock_acquire+0x1d7/0x540
> [  945.308605]        __flush_work+0x5e2/0x8c0
> [  945.313278]        work_on_cpu+0xee/0x140
> [  945.317753]        vmstat_refresh+0x12f/0x450
> [  945.322622]        proc_sys_call_handler+0x389/0x500
> [  945.328170]        new_sync_read+0x3af/0x610
> [  945.332942]        vfs_read+0x268/0x490
> [  945.337225]        ksys_read+0xf1/0x1c0
> [  945.341511]        do_syscall_64+0x42/0x90
> [  945.346091]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  945.352318]
> [  945.352318] other info that might help us debug this:
> [  945.352318]
> [  945.361254]  Possible unsafe locking scenario:
> [  945.361254]
> [  945.367863]        CPU0                    CPU1
> [  945.372920]        ----                    ----
> [  945.377976]   lock(cpu_hotplug_lock);
> [  945.382072]
> lock((work_completion)(&wfc.work));
> [  945.390140]                                lock(cpu_hotplug_lock);
> [  945.397046]   lock((work_completion)(&wfc.work));
> [  945.402301]
> [  945.402301]  *** DEADLOCK ***
> [  945.402301]
> [  945.408909] 1 lock held by sysctl/6820:
> [  945.413194]  #0: ffffffff93b43660 (cpu_hotplug_lock){++++}-{0:0},
> at: vmstat_refresh+0x4a/0x450
> [  945.422929]
> [  945.422929] stack backtrace:
> [  945.427793] CPU: 25 PID: 6820 Comm: sysctl Tainted: G S
>    5.14.0-rc3+ #2
> [  945.436540] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS
> 2.6.0 10/31/2017
> [  945.444886] Call Trace:
> [  945.447623]  dump_stack_lvl+0x44/0x57
> [  945.451720]  check_noncircular+0x280/0x320
> [  945.456299]  ? print_circular_bug.isra.49+0x440/0x440
> [  945.461943]  ? mark_lock.part.54+0x107/0x1370
> [  945.466813]  ? mark_lock.part.54+0x107/0x1370
> [  945.471683]  ? native_sched_clock+0x32/0x130
> [  945.476449]  ? lock_chain_count+0x20/0x20
> [  945.480929]  ? sched_clock_cpu+0x151/0x1d0
> [  945.485512]  check_prevs_add+0x3fd/0x2470
> [  945.489999]  ? native_sched_clock+0x32/0x130
> [  945.494770]  ? sched_clock_cpu+0x151/0x1d0
> [  945.499347]  ? find_held_lock+0x3a/0x1c0
> [  945.503731]  ? check_irq_usage+0xe10/0xe10
> [  945.508306]  ? lockdep_lock+0xbf/0x1b0
> [  945.512495]  ? static_obj+0xc0/0xc0
> [  945.516392]  ? sched_clock_cpu+0x151/0x1d0
> [  945.520975]  __lock_acquire+0x23f7/0x2f80
> [  945.525461]  ? rcu_read_lock_bh_held+0xc0/0xc0
> [  945.530430]  lock_acquire+0x1d7/0x540
> [  945.534523]  ? __flush_work+0x5c2/0x8c0
> [  945.538811]  ? rcu_read_unlock+0x50/0x50
> [  945.543196]  ? native_sched_clock+0x32/0x130
> [  945.547969]  ? __queue_work+0x4a3/0xfd0
> [  945.552256]  ? rcu_read_lock_sched_held+0xaf/0xe0
> [  945.557514]  __flush_work+0x5e2/0x8c0
> [  945.561604]  ? __flush_work+0x5c2/0x8c0
> [  945.565894]  ? queue_delayed_work_on+0x80/0x80
> [  945.570853]  ? lock_downgrade+0x700/0x700
> [  945.575339]  ? mark_held_locks+0xb7/0x120
> [  945.579829]  ? lockdep_hardirqs_on_prepare+0x28f/0x3e0
> [  945.585572]  ? queue_work_on+0x48/0x80
> [  945.589763]  ? trace_hardirqs_on+0x32/0x170
> [  945.594436]  work_on_cpu+0xee/0x140
> [  945.598327]  ? flush_delayed_work+0xc0/0xc0
> [  945.603004]  ? work_debug_hint+0x40/0x40
> [  945.607388]  ? refresh_cpu_vm_stats+0x530/0x530
> [  945.612452]  ? need_update+0x162/0x210
> [  945.616646]  vmstat_refresh+0x12f/0x450
> [  945.620938]  proc_sys_call_handler+0x389/0x500
> [  945.625905]  ? proc_sys_permission+0x120/0x120
> [  945.630872]  ? native_sched_clock+0x32/0x130
> [  945.635644]  new_sync_read+0x3af/0x610
> [  945.639838]  ? __x64_sys_llseek+0x2e0/0x2e0
> [  945.644505]  ? native_sched_clock+0x32/0x130
> [  945.649290]  ? fsnotify+0xf10/0xf10
> [  945.653199]  vfs_read+0x268/0x490
> [  945.656913]  ksys_read+0xf1/0x1c0
> [  945.660619]  ? vfs_write+0x910/0x910
> [  945.664615]  ? syscall_trace_enter.isra.17+0x18c/0x260
> [  945.670369]  do_syscall_64+0x42/0x90
> [  945.674371]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Maybe we should replace get/put_online_cpus() with cpu_hotplug_disable()/cpu_hotplug_enable()?

schedule_on_each_cpu has a similar pattern and should exhibit
the same problem.

So perhaps this problem already existed... 
Don't see the warning on a VM.

Can you revert the last patch and confirm that it introduces this
issue?

Thanks for testing!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-08-09 17:34     ` Marcelo Tosatti
@ 2021-08-09 19:13       ` Nitesh Lal
  0 siblings, 0 replies; 16+ messages in thread
From: Nitesh Lal @ 2021-08-09 19:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Mon, Aug 9, 2021 at 1:34 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Fri, Aug 06, 2021 at 10:47:40PM -0400, Nitesh Lal wrote:
> > Hi Marcelo,
> >
> > On Fri, Jul 30, 2021 at 4:21 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > It is not necessary to queue work item to run refresh_vm_stats
> > > on a remote CPU if that CPU has no dirty stats and no per-CPU
> > > allocations for remote nodes.
> >
> > [...]
> >
> > >         /*
> > >          * The regular update, every sysctl_stat_interval, may come later
> > > @@ -1850,9 +1875,15 @@ int vmstat_refresh(struct ctl_table *tab
> > >          * transiently negative values, report an error here if any of
> > >          * the stats is negative, so we know to go looking for imbalance.
> > >          */
> > > -       err = schedule_on_each_cpu(refresh_vm_stats);
> > > -       if (err)
> > > -               return err;
> > > +       get_online_cpus();
> > > +       for_each_online_cpu(cpu) {
> > > +               if (need_update(cpu) || need_drain_remote_zones(cpu))
> > > +                       work_on_cpu(cpu, refresh_vm_stats, NULL);
> > > +
> > > +               cond_resched();
> > > +       }
> > > +       put_online_cpus();
> >
> > While testing this patch-set by enabling the debug config options, I
> > ran into the following splat:
> >
> > [  945.149209] ======================================================
> > [  945.156111] WARNING: possible circular locking dependency detected
> > [  945.163013] 5.14.0-rc3+ #2 Tainted: G S
> > [  945.168754] ------------------------------------------------------
> > [  945.175655] sysctl/6820 is trying to acquire lock:
> > [  945.181006] ffff8881679efb00
> > ((work_completion)(&wfc.work)){+.+.}-{0:0}, at:
> > __flush_work+0x5c2/0x8c0
> > [  945.191332]
> > [  945.191332] but task is already holding lock:
> > [  945.197846] ffffffff93b43660 (cpu_hotplug_lock){++++}-{0:0}, at:
> > vmstat_refresh+0x4a/0x450
> > [  945.207098]
> > [  945.207098] which lock already depends on the new lock.
> > [  945.207098]
> > [  945.216228]
> > [  945.216228] the existing dependency chain (in reverse order) is:
> > [  945.224574]
> > [  945.224574] -> #1 (cpu_hotplug_lock){++++}-{0:0}:
> > [  945.231488]        lock_acquire+0x1d7/0x540
> > [  945.236157]        cpus_read_lock+0x3b/0xc0
> > [  945.240834]        alloc_workqueue+0x884/0xd00
> > [  945.245800]        scsi_host_alloc+0xbdb/0xf60
> > [  945.250769]        megasas_probe_one+0x164/0x800 [megaraid_sas]
> > [  945.257402]        local_pci_probe+0xd8/0x170
> > [  945.262270]        work_for_cpu_fn+0x51/0xa0
> > [  945.267041]        process_one_work+0x960/0x16a0
> > [  945.272200]        worker_thread+0x55e/0xbf0
> > [  945.276970]        kthread+0x371/0x440
> > [  945.281158]        ret_from_fork+0x22/0x30
> > [  945.285737]
> > [  945.285737] -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
> > [  945.293813]        check_prevs_add+0x3fd/0x2470
> > [  945.298874]        __lock_acquire+0x23f7/0x2f80
> > [  945.303934]        lock_acquire+0x1d7/0x540
> > [  945.308605]        __flush_work+0x5e2/0x8c0
> > [  945.313278]        work_on_cpu+0xee/0x140
> > [  945.317753]        vmstat_refresh+0x12f/0x450
> > [  945.322622]        proc_sys_call_handler+0x389/0x500
> > [  945.328170]        new_sync_read+0x3af/0x610
> > [  945.332942]        vfs_read+0x268/0x490
> > [  945.337225]        ksys_read+0xf1/0x1c0
> > [  945.341511]        do_syscall_64+0x42/0x90
> > [  945.346091]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  945.352318]
> > [  945.352318] other info that might help us debug this:
> > [  945.352318]
> > [  945.361254]  Possible unsafe locking scenario:
> > [  945.361254]
> > [  945.367863]        CPU0                    CPU1
> > [  945.372920]        ----                    ----
> > [  945.377976]   lock(cpu_hotplug_lock);
> > [  945.382072]
> > lock((work_completion)(&wfc.work));
> > [  945.390140]                                lock(cpu_hotplug_lock);
> > [  945.397046]   lock((work_completion)(&wfc.work));
> > [  945.402301]
> > [  945.402301]  *** DEADLOCK ***
> > [  945.402301]
> > [  945.408909] 1 lock held by sysctl/6820:
> > [  945.413194]  #0: ffffffff93b43660 (cpu_hotplug_lock){++++}-{0:0},
> > at: vmstat_refresh+0x4a/0x450
> > [  945.422929]
> > [  945.422929] stack backtrace:
> > [  945.427793] CPU: 25 PID: 6820 Comm: sysctl Tainted: G S
> >    5.14.0-rc3+ #2
> > [  945.436540] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS
> > 2.6.0 10/31/2017
> > [  945.444886] Call Trace:
> > [  945.447623]  dump_stack_lvl+0x44/0x57
> > [  945.451720]  check_noncircular+0x280/0x320
> > [  945.456299]  ? print_circular_bug.isra.49+0x440/0x440
> > [  945.461943]  ? mark_lock.part.54+0x107/0x1370
> > [  945.466813]  ? mark_lock.part.54+0x107/0x1370
> > [  945.471683]  ? native_sched_clock+0x32/0x130
> > [  945.476449]  ? lock_chain_count+0x20/0x20
> > [  945.480929]  ? sched_clock_cpu+0x151/0x1d0
> > [  945.485512]  check_prevs_add+0x3fd/0x2470
> > [  945.489999]  ? native_sched_clock+0x32/0x130
> > [  945.494770]  ? sched_clock_cpu+0x151/0x1d0
> > [  945.499347]  ? find_held_lock+0x3a/0x1c0
> > [  945.503731]  ? check_irq_usage+0xe10/0xe10
> > [  945.508306]  ? lockdep_lock+0xbf/0x1b0
> > [  945.512495]  ? static_obj+0xc0/0xc0
> > [  945.516392]  ? sched_clock_cpu+0x151/0x1d0
> > [  945.520975]  __lock_acquire+0x23f7/0x2f80
> > [  945.525461]  ? rcu_read_lock_bh_held+0xc0/0xc0
> > [  945.530430]  lock_acquire+0x1d7/0x540
> > [  945.534523]  ? __flush_work+0x5c2/0x8c0
> > [  945.538811]  ? rcu_read_unlock+0x50/0x50
> > [  945.543196]  ? native_sched_clock+0x32/0x130
> > [  945.547969]  ? __queue_work+0x4a3/0xfd0
> > [  945.552256]  ? rcu_read_lock_sched_held+0xaf/0xe0
> > [  945.557514]  __flush_work+0x5e2/0x8c0
> > [  945.561604]  ? __flush_work+0x5c2/0x8c0
> > [  945.565894]  ? queue_delayed_work_on+0x80/0x80
> > [  945.570853]  ? lock_downgrade+0x700/0x700
> > [  945.575339]  ? mark_held_locks+0xb7/0x120
> > [  945.579829]  ? lockdep_hardirqs_on_prepare+0x28f/0x3e0
> > [  945.585572]  ? queue_work_on+0x48/0x80
> > [  945.589763]  ? trace_hardirqs_on+0x32/0x170
> > [  945.594436]  work_on_cpu+0xee/0x140
> > [  945.598327]  ? flush_delayed_work+0xc0/0xc0
> > [  945.603004]  ? work_debug_hint+0x40/0x40
> > [  945.607388]  ? refresh_cpu_vm_stats+0x530/0x530
> > [  945.612452]  ? need_update+0x162/0x210
> > [  945.616646]  vmstat_refresh+0x12f/0x450
> > [  945.620938]  proc_sys_call_handler+0x389/0x500
> > [  945.625905]  ? proc_sys_permission+0x120/0x120
> > [  945.630872]  ? native_sched_clock+0x32/0x130
> > [  945.635644]  new_sync_read+0x3af/0x610
> > [  945.639838]  ? __x64_sys_llseek+0x2e0/0x2e0
> > [  945.644505]  ? native_sched_clock+0x32/0x130
> > [  945.649290]  ? fsnotify+0xf10/0xf10
> > [  945.653199]  vfs_read+0x268/0x490
> > [  945.656913]  ksys_read+0xf1/0x1c0
> > [  945.660619]  ? vfs_write+0x910/0x910
> > [  945.664615]  ? syscall_trace_enter.isra.17+0x18c/0x260
> > [  945.670369]  do_syscall_64+0x42/0x90
> > [  945.674371]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >
> > Maybe we should replace get/put_online_cpus() with cpu_hotplug_disable()/cpu_hotplug_enable()?
>
> schedule_on_each_cpu has a similar pattern and should exhibit
> the same problem.
>
> So perhaps this problem already existed...
> Don't see the warning on a VM.

I triggered this issue on bare-metal, not sure about the VM.

>
> Can you revert the last patch and confirm that it introduces this
> issue?

Yes, on reverting this patch I don't see the splat.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2)
  2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2021-07-30 20:18 ` [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
@ 2021-08-10 16:40 ` Thomas Gleixner
  2021-08-10 18:37   ` Marcelo Tosatti
  4 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2021-08-10 16:40 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

Marcelo,

On Fri, Jul 30 2021 at 17:18, Marcelo Tosatti wrote:

can you pretty please:

 1) Add a version number to your patch series right where it belongs:

    [patch V2 N/M]

    Just having a (v2) at the end of the subject line of 0/M is sloppy
    at best.

 2) Provide a lore link to the previous version

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2)
  2021-08-10 16:40 ` [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Thomas Gleixner
@ 2021-08-10 18:37   ` Marcelo Tosatti
  2021-08-10 19:15     ` Marcelo Tosatti
  0 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2021-08-10 18:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu

On Tue, Aug 10, 2021 at 06:40:48PM +0200, Thomas Gleixner wrote:
> Marcelo,
> 
> On Fri, Jul 30 2021 at 17:18, Marcelo Tosatti wrote:
> 
> can you pretty please:
> 
>  1) Add a version number to your patch series right where it belongs:
> 
>     [patch V2 N/M]
> 
>     Just having a (v2) at the end of the subject line of 0/M is sloppy
>     at best.
> 
>  2) Provide a lore link to the previous version
> 
> Thanks,
> 
>         tglx

Thomas,

Sure, will resend -v3 once done with the following:

1) Adding support for KVM.

2) Adding a tool called "chisol" to util-linux, similar 
to chrt/taskset, to prctl+exec (for unmodified applications).

This raises the question of whether to add an option to preserve
the task parameters across fork (I think the answer is yes).

--

But the following points are unclear to me (in quotes are earlier 
comments you made):

1) "It's about silencing different and largely independent parts of the OS
on a particular CPU. Just defining upfront that there is only the choice
of all or nothing _is_ policy.

There is a very wide range of use case scenarios out there and just
because the ones which you care about needs X does not mean that X is
the right thing for everybody else. You still can have X and let other
people define their own set of things they want to be protected
against.

Aside of that having it selectively is a plus for debuggability, testing
etc."

So for the ability to individually select what parts of the OS 
on a particular CPU are quiesced, there is:

+	defmask = defmask | ISOL_F_QUIESCE_VMSTATS;
+
+	ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
+                   0, 0);
+	if (ret == -1) {
+               perror("prctl PR_ISOL_SET");
+               return EXIT_FAILURE;
+	}

There is a feeling that implementation details are being exposed
to userspace... However, that seems to be alright: what could happen is that
the feature ceases to exist in the kernel (say vmstat sync), and the bit
is kept for compatibility (but the kernel does nothing about it).

That of course means whatever "vmstat sync" replacement comes up should
avoid IPIs as well.

Any thoughts on this?

2) "Again: I fundamentaly disagree with the proposed task isolation patches
approach as they leave no choice at all.

There is a reasonable middle ground where an application is willing to
pay the price (delay) until the requested quiescing has taken place in
order to run undisturbed (hint: cache ...) and also is willing to take
the additional overhead of an occasional syscall in the slow path without
tripping some OS imposed isolation safe guard.

Aside of that such a granular approach does not necessarily require the
application to be aware of it. If the admin knows the computational
pattern of the application, e.g.

 1     read_data_set() <- involving syscalls/OS obviously
 2     compute_set()   <- let me alone
 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement.

That's what we need to think about and once we figured out how to do
that it gives Marcelo the mechanism to solve his 'run virt undisturbed
by vmstat or whatever' problem and it allows Alex to build his stuff on
it.

Summary: The problem to be solved cannot be restricted to

    self_defined_important_task(OWN_WORLD);

Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.

Up to the point where the kernel puts the line of last defence, policy
is defined by the user/admin via mechanisms provided by the kernel.

Emphasis on "mechanisms provided by the kernel", aka. user API.

Just in case, I hope that I don't have to explain what level of scrutiny
and thought this requires."

OK, so perhaps a handful of use-cases can clarify whether the proposed
interface requires changes?

The example on samples/task_isolation/ is focused on "enter task isolation
and very rarely exit".

There are two other cases i am aware of:

A) Christoph's use-case:

	1) Enter task-isolation.
	2) Latency sensitive loop begins.
	3) Some event interrupts latency sensitive section.

	4) Handling of the event requires N syscalls, which the programmer
	   would prefer to happen without quiescing at
	   every system call return. The current scheme would be:


       /*
        * Application can either set the value from ISOL_F_QUIESCE_DEFMASK,
        * which is configurable through
        * /sys/kernel/task_isolation/default_quiesce_activities,
        * or specific values.
        *
        * Using ISOL_F_QUIESCE_DEFMASK allows for the application to
        * take advantage of future quiescing capabilities without
        * modification (provided default_quiesce_activities is
        * configured accordingly).
        */
       defmask = defmask | ISOL_F_QUIESCE_VMSTATS;

       ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
                   0, 0);
       if (ret == -1) {
               perror("prctl PR_ISOL_SET");
               return EXIT_FAILURE;
       }

lat_loop:
       ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);
       if (ret == -1) {
               perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
               return EXIT_FAILURE;
       }

       latency sensitive loop

       if (event == 1) {
		/* disables quiescing of all features, while maintaining
		 * other features such as logging and avoidance of
		 * interruptions enabled.
		 */
		ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
		syscall1
		syscall2
		...
		syscallN
		/* reenter isolated mode with quiescing */
		goto lat_loop;
	}
	...

Should it be possible to toggle individual quiescing features
while maintaining isolated mode? Yes, that seems to be desired.
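
A rough sketch of how that could look with the interface as posted, assuming
PR_ISOL_SET may be called again while isolation is active (that assumption is
exactly the open point here); error handling omitted:

	/* isolated mode with vmstat quiescing */
	prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, 0, 0);
	prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);

	latency sensitive loop

	/* drop vmstat quiescing only; active_mask is untouched, so any
	 * other isolation features stay enabled */
	prctl(PR_ISOL_SET, ISOL_F_QUIESCE, 0, 0, 0);

	handle the event (N syscalls)

	/* quiesce vmstat again before re-entering the loop */
	prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, 0, 0);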


The other use-case (from you) seems to be:

 1     read_data_set() <- involving syscalls/OS obviously
 2     compute_set()   <- let me alone
 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

Well, reusing the scheme from Christoph's use-case above does not
seem that bad either:

 1     read_data_set() <- involving syscalls/OS obviously
	/* disables quiescing of all (or some, if desired)
	 * features, while maintaining other features such
	 * as logging and avoidance of interruptions enabled.
	 */
	ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);

 2     compute_set()   <- let me alone

	ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);

 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

What kind of different behaviour, other than enabling/disabling
quiescing, would be desired in this use-case?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2)
  2021-08-10 18:37   ` Marcelo Tosatti
@ 2021-08-10 19:15     ` Marcelo Tosatti
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-08-10 19:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu

On Tue, Aug 10, 2021 at 03:37:46PM -0300, Marcelo Tosatti wrote:
> On Tue, Aug 10, 2021 at 06:40:48PM +0200, Thomas Gleixner wrote:
> > Marcelo,
> > 
> > On Fri, Jul 30 2021 at 17:18, Marcelo Tosatti wrote:
> > 
> > can you pretty please:
> > 
> >  1) Add a version number to your patch series right where it belongs:
> > 
> >     [patch V2 N/M]
> > 
> >     Just having a (v2) at the end of the subject line of 0/M is sloppy
> >     at best.
> > 
> >  2) Provide a lore link to the previous version
> > 
> > Thanks,
> > 
> >         tglx
> 
> Thomas,
> 
> Sure, will resend -v3 once done with the following:
> 
> 1) Adding support for KVM.
> 
> 2) Adding a tool called "chisol" to util-linux, similar 
> to chrt/taskset, to prctl+exec (for unmodified applications).
> 
> This raises the question of whether to add an option to preserve
> the task parameters across fork (I think the answer is yes).
> 
> --
> 
> But the following points are unclear to me (in quotes are earlier 
> comments you made):
> 
> 1) "It's about silencing different and largely independent parts of the OS
> on a particular CPU. Just defining upfront that there is only the choice
> of all or nothing _is_ policy.
> 
> There is a very wide range of use case scenarios out there and just
> because the ones which you care about needs X does not mean that X is
> the right thing for everybody else. You still can have X and let other
> people define their own set of things they want to be protected
> against.
> 
> Aside of that having it selectively is a plus for debuggability, testing
> etc."
> 
> So for the ability to individually select what parts of the OS 
> on a particular CPU are quiesced, there is:
> 
> +	defmask = defmask | ISOL_F_QUIESCE_VMSTATS;
> +
> +	ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
> +                   0, 0);
> +	if (ret == -1) {
> +               perror("prctl PR_ISOL_SET");
> +               return EXIT_FAILURE;
> +	}
> 
> There is a feeling that implementation details are being exposed
> to userspace... However, that seems to be alright: what could happen is that
> the feature ceases to exist in the kernel (say vmstat sync), and the bit
> is kept for compatibility (but the kernel does nothing about it).
> 
> That of course means whatever "vmstat sync" replacement comes up should
> avoid IPIs as well.
> 
> Any thoughts on this?
> 
> 2) "Again: I fundamentaly disagree with the proposed task isolation patches
> approach as they leave no choice at all.
> 
> There is a reasonable middle ground where an application is willing to
> pay the price (delay) until the requested quiescing has taken place in
> order to run undisturbed (hint: cache ...) and also is willing to take
> the additional overhead of an occasional syscall in the slow path without
> tripping some OS imposed isolation safe guard.
> 
> Aside of that such a granular approach does not necessarily require the
> application to be aware of it. If the admin knows the computational
> pattern of the application, e.g.
> 
>  1     read_data_set() <- involving syscalls/OS obviously
>  2     compute_set()   <- let me alone
>  3     save_data_set() <- involving syscalls/OS obviously
> 
>        repeat the above...
> 
> then it's at his discretion to decide to inflict a particular isolation
> set on the task which is obviously ineffective while doing #1 and #3 but
> might provide the so desired 0.9% boost for compute_set() which
> dominates the judgement.
> 
> That's what we need to think about and once we figured out how to do
> that it gives Marcelo the mechanism to solve his 'run virt undisturbed
> by vmstat or whatever' problem and it allows Alex to build his stuff on
> it.
> 
> Summary: The problem to be solved cannot be restricted to
> 
>     self_defined_important_task(OWN_WORLD);
> 
> Policy is not a binary on/off problem. It's manifold across all levels
> of the stack and only a kernel problem when it comes down to the last
> line of defence.
> 
> Up to the point where the kernel puts the line of last defence, policy
> is defined by the user/admin via mechanisms provided by the kernel.
> 
> Emphasis on "mechanisms provided by the kernel", aka. user API.
> 
> Just in case, I hope that I don't have to explain what level of scrutiny
> and thought this requires."
> 
> OK, so perhaps a handful of use-cases can clarify whether the proposed
> interface requires changes?
> 
> The example in samples/task_isolation/ is focused on "enter task isolation
> and very rarely exit".
> 
> There are two other cases I am aware of:
> 
> A) Christoph's use-case:
> 
> 	1) Enter task-isolation.
> 	2) Latency sensitive loop begins.
> 	3) Some event interrupts latency sensitive section.
> 
> 	4) Handling of the event requires N syscalls, which the programmer
> 	   would prefer to perform without quiescing at every system call
> 	   return. The current scheme would be:
> 
> 
>        /*
>         * Application can either set the value from ISOL_F_QUIESCE_DEFMASK,
>         * which is configurable through
>         * /sys/kernel/task_isolation/default_quiesce_activities,
>         * or specific values.
>         *
>         * Using ISOL_F_QUIESCE_DEFMASK allows for the application to
>         * take advantage of future quiescing capabilities without
>         * modification (provided default_quiesce_activities is
>         * configured accordingly).
>         */
>        defmask = defmask | ISOL_F_QUIESCE_VMSTATS;
> 
>        ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
>                    0, 0);
>        if (ret == -1) {
>                perror("prctl PR_ISOL_SET");
>                return EXIT_FAILURE;
>        }
> 
> lat_loop:
>        ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);
>        if (ret == -1) {
>                perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
>                return EXIT_FAILURE;
>        }
> 
>        latency sensitive loop
> 
>        if (event == 1) {
> 		/* disables quiescing of all features, while maintaining
> 		 * other features such as logging and avoidance of
> 		 * interruptions enabled.
> 		 */
> 		ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
> 		syscall1
> 		syscall2
> 		...
> 		syscallN
> 		/* reenter isolated mode with quiescing */
> 		goto lat_loop;
> 	}
> 	...
> 
> Should it be possible to modify individual quiescing parts while
> maintaining isolated mode? Yes, that seems to be desirable.
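
An untested sketch of how that could look; this assumes the quiesce
mask set via PR_ISOL_SET can simply be re-issued while ISOL_F_QUIESCE
is active, which is exactly the semantics that would need to be nailed
down:

	/* isolated mode active, quiescing enabled with mask "curmask" */

	/* stop quiescing vmstat syncing only, keep the rest of the mask */
	curmask &= ~ISOL_F_QUIESCE_VMSTATS;
	if (prctl(PR_ISOL_SET, ISOL_F_QUIESCE, curmask, 0, 0) == -1) {
		perror("prctl PR_ISOL_SET");
		return EXIT_FAILURE;
	}

	/* re-activate quiescing with the reduced mask, still isolated */
	if (prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0) == -1) {
		perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
		return EXIT_FAILURE;
	}
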
> 
> 
> The other use-case (from you) seems to be:
> 
>  1     read_data_set() <- involving syscalls/OS obviously
>  2     compute_set()   <- let me alone
>  3     save_data_set() <- involving syscalls/OS obviously
> 
>        repeat the above...
> 
> Well, the scheme from Christoph's use-case above does not seem
> to fit too badly here either:
> 
>  1     read_data_set() <- involving syscalls/OS obviously
> 	/* enables quiescing of all (or some, if desired)
> 	 * features for the computation, while maintaining other
> 	 * features such as logging and avoidance of interruptions
> 	 * enabled.
> 	 */
> 	ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);
> 
>  2     compute_set()   <- let me alone
> 
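> 	/* leave the quiesced section: disable quiescing again for the
> 	 * syscalls in save_data_set(), while remaining in isolated mode.
> 	 */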
> 	ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
> 
>  3     save_data_set() <- involving syscalls/OS obviously
> 
>        repeat the above...
> 
> What kind of different behaviour, other than enabling/disabling
> quiescing, would be desired in this use-case?
> 

And 3) Is a global ISOL_F_QUIESCE_DEFMASK sufficient, or should this
be configurable per-task or per-cgroup?

       /*
        * Application can either set the value from ISOL_F_QUIESCE_DEFMASK,
        * which is configurable through
        * /sys/kernel/task_isolation/default_quiesce_activities,
        * or specific values.
        *
        * Using ISOL_F_QUIESCE_DEFMASK allows for the application to
        * take advantage of future quiescing capabilities without
        * modification (provided default_quiesce_activities is
        * configured accordingly).
        */
       defmask = defmask | ISOL_F_QUIESCE_VMSTATS;

       ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
                   0, 0);
       if (ret == -1) {
               perror("prctl PR_ISOL_SET");
               return EXIT_FAILURE;
       }
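
For completeness, a self-contained, untested sketch of the two options
described in the comment above (it assumes the PR_ISOL_* and ISOL_F_*
defines from this series are available to userspace, and uses the sysfs
path from patch 1):

#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>

int main(void)
{
	unsigned long quiesce_mask;

	/*
	 * Option A: use the system-wide default, which the admin tunes
	 * via /sys/kernel/task_isolation/default_quiesce_activities.
	 * The application then picks up future quiescing capabilities
	 * without modification.
	 */
	quiesce_mask = ISOL_F_QUIESCE_DEFMASK;

	/*
	 * Option B: name the activities explicitly, e.g. only vmstat
	 * syncing:
	 *
	 *	quiesce_mask = ISOL_F_QUIESCE_VMSTATS;
	 */

	if (prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_mask, 0, 0) == -1) {
		perror("prctl PR_ISOL_SET");
		return EXIT_FAILURE;
	}

	if (prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0) == -1) {
		perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
		return EXIT_FAILURE;
	}

	/* latency sensitive section runs here */

	return EXIT_SUCCESS;
}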


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-08-10 19:15 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
2021-07-30 20:18 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
     [not found]   ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
2021-07-31  0:49     ` Marcelo Tosatti
2021-07-31  7:47   ` kernel test robot
     [not found]   ` <CAFki+LkQVQOe+5aNEKWDvLdnjWjxzKWOiqOvBZzeuPWX+G=XgA@mail.gmail.com>
2021-08-02 14:16     ` Marcelo Tosatti
2021-07-30 20:18 ` [patch 2/4] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2021-08-03 15:13   ` nsaenzju
2021-08-03 16:44     ` Marcelo Tosatti
2021-07-30 20:18 ` [patch 3/4] mm: vmstat: move need_update Marcelo Tosatti
2021-07-30 20:18 ` [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
2021-08-07  2:47   ` Nitesh Lal
2021-08-09 17:34     ` Marcelo Tosatti
2021-08-09 19:13       ` Nitesh Lal
2021-08-10 16:40 ` [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Thomas Gleixner
2021-08-10 18:37   ` Marcelo Tosatti
2021-08-10 19:15     ` Marcelo Tosatti
