linux-kernel.vger.kernel.org archive mirror
* [patch v4 0/8] extensible prctl task isolation interface and vmstat sync
@ 2021-10-07 19:23 Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 1/8] add basic task isolation prctl interface Marcelo Tosatti
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

The logic that disables the vmstat worker thread when entering
nohz_full does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from the nohz_full logic can happen _before_
the mlock, the vmstat shepherd can restart the vmstat worker thread on
the CPU in question.

To fix this, add a task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the 
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 1) for details.

Note: the prctl interface is independent of nohz_full=.

---------

v4:
 - Switch to structures for parameters when possible
   (which are more extensible).
 - Switch to CFG_{S,G}ET naming and drop
   "internal configuration" prctls	      (Frederic Weisbecker).
 - Add summary of terms to documentation      (Frederic Weisbecker).
 - Examples for compute and one-shot modes    (Thomas G/Christoph L).

v3: 

 - Split in smaller patches		 (Nitesh Lal).
 - Misc cleanups			 (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Fix incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications  
   can activate externally configured
   task isolation parameters.
 - Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.
 - Report configuration/activation state at
   /proc/pid/task_isolation.
 - Condense dirty information of per-CPU vmstats counters 
   in a bool.
 - In-kernel KVM support.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

v3 can be found at:
https://lore.kernel.org/lkml/20210824152423.300346181@fuller.cnet/

v2 can be found at:
https://lore.kernel.org/patchwork/project/lkml/list/?series=510225







* [patch v4 1/8] add basic task isolation prctl interface
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  2021-10-12 13:05   ` Peter Zijlstra
  2021-10-07 19:23 ` [patch v4 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Add a basic prctl task isolation interface, which allows
informing the kernel that the application is executing
latency-sensitive code (where interruptions are undesired).

The interface is described in task_isolation.rst (added by
the next patch).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/sched.h          |    5 +
 include/linux/task_isolation.h |   87 +++++++++++++++++
 include/uapi/linux/prctl.h     |   10 ++
 init/init_task.c               |    3 
 kernel/Makefile                |    2 
 kernel/fork.c                  |    6 +
 kernel/sys.c                   |   16 +++
 kernel/task_isolation.c        |  203 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 331 insertions(+), 1 deletion(-)

Index: linux-2.6/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/prctl.h
+++ linux-2.6/include/uapi/linux/prctl.h
@@ -269,4 +269,52 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
 # define PR_SCHED_CORE_MAX		4
 
+#define PR_ISOL_FEAT_GET		63
+#define PR_ISOL_CFG_GET			64
+#define PR_ISOL_CFG_SET			65
+
+/* arg2 to CFG_GET/CFG_SET */
+# define I_CFG_FEAT			1
+# define I_CFG_INHERIT			2
+# define I_CFG_ONESHOT			3
+
+#define PR_ISOL_ACTIVATE_GET		66
+#define PR_ISOL_ACTIVATE_SET		67
+
+# define ISOL_F_QUIESCE			(1UL << 0)
+#  define ISOL_F_QUIESCE_VMSTATS	(1UL << 0)
+
+struct task_isol_quiesce_extensions {
+	__u64 flags;
+	__u64 supported_quiesce_bits;
+	__u64 pad[6];
+};
+
+/*
+ * This structure provides control over
+ * inheritance of task isolation across
+ * clone and fork.
+ */
+struct task_isol_inherit_control {
+	__u8	inherit_mask;
+	__u8	flags;
+	__u8	pad[6];
+};
+
+# define ISOL_INHERIT_CONF		(1UL << 0)
+# define ISOL_INHERIT_ACTIVE		(1UL << 1)
+
+struct task_isol_activate_control {
+	__u64 flags;
+	__u64 quiesce_oneshot_mask;
+	__u64 pad[6];
+};
+
+struct task_isol_quiesce_control {
+	__u64 flags;
+	__u64 quiesce_mask;
+	__u64 pad[6];
+};
+
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_CPU_ISOLATION) += task_isolation.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -58,6 +58,7 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
 #include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
@@ -2530,6 +2531,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
 #endif
+	case PR_ISOL_FEAT_GET:
+		error = prctl_task_isolation_feat_get(arg2, arg3, arg4, arg5);
+		break;
+
+	case PR_ISOL_CFG_GET:
+		error = prctl_task_isolation_cfg_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CFG_SET:
+		error = prctl_task_isolation_cfg_set(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_GET:
+		error = prctl_task_isolation_activate_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_ACTIVATE_SET:
+		error = prctl_task_isolation_activate_set(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct isol_info;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1488,6 +1489,10 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_CPU_ISOLATION
+	struct isol_info		*isol_info;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -214,6 +214,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_CPU_ISOLATION
+	.isol_info	= NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/task_isolation.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -746,6 +747,7 @@ void __put_task_struct(struct task_struc
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	tsk_isol_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk, true);
@@ -1585,6 +1587,15 @@ static int copy_io(unsigned long clone_f
 	return 0;
 }
 
+static int copy_task_isolation(struct task_struct *tsk)
+{
+#ifdef CONFIG_CPU_ISOLATION
+	if (current->isol_info)
+		return __copy_task_isolation(tsk);
+#endif
+	return 0;
+}
+
 static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 {
 	struct sighand_struct *sig;
@@ -2159,7 +2170,9 @@ static __latent_entropy struct task_stru
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
-
+#ifdef CONFIG_CPU_ISOLATION
+	p->isol_info = NULL;
+#endif
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
@@ -2203,6 +2216,9 @@ static __latent_entropy struct task_stru
 	retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
+	retval = copy_task_isolation(p);
+	if (retval)
+		goto bad_fork_cleanup_thread;
 
 	stackleak_task_init(p);
 
@@ -2211,7 +2227,7 @@ static __latent_entropy struct task_stru
 				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
-			goto bad_fork_cleanup_thread;
+			goto bad_fork_cleanup_task_isolation;
 		}
 	}
 
@@ -2429,6 +2445,8 @@ bad_fork_put_pidfd:
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
+bad_fork_cleanup_task_isolation:
+	tsk_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);
 bad_fork_cleanup_io:
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/task_isolation.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_CPU_ISOLATION
+
+struct isol_info {
+	/* Which features have been configured */
+	u64 conf_mask;
+	/* Which features are active */
+	u64 active_mask;
+	/* Quiesce mask */
+	u64 quiesce_mask;
+
+	/* Oneshot mask */
+	u64 oneshot_mask;
+
+	u8 inherit_mask;
+};
+
+extern void __tsk_isol_free(struct task_struct *tsk);
+
+static inline void tsk_isol_free(struct task_struct *tsk)
+{
+	if (tsk->isol_info)
+		__tsk_isol_free(tsk);
+}
+
+int prctl_task_isolation_feat_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_cfg_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_cfg_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_activate_get(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_activate_set(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5);
+
+int __copy_task_isolation(struct task_struct *tsk);
+
+#else
+
+static inline void tsk_isol_free(struct task_struct *tsk)
+{
+}
+
+static inline int prctl_task_isolation_feat_get(unsigned long arg2,
+						unsigned long arg3,
+						unsigned long arg4,
+						unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_cfg_get(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_cfg_set(unsigned long arg2,
+					       unsigned long arg3,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_activate_get(unsigned long arg2,
+						    unsigned long arg3,
+						    unsigned long arg4,
+						    unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_activate_set(unsigned long arg2,
+						    unsigned long arg3,
+						    unsigned long arg4,
+						    unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/task_isolation.c
@@ -0,0 +1,357 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@belits.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ *   Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/sched/task.h>
+
+void __tsk_isol_free(struct task_struct *tsk)
+{
+	if (!tsk->isol_info)
+		return;
+	kfree(tsk->isol_info);
+	tsk->isol_info = NULL;
+}
+
+static struct isol_info *tsk_isol_alloc_context(void)
+{
+	struct isol_info *info;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (unlikely(!info))
+		return ERR_PTR(-ENOMEM);
+
+	return info;
+}
+
+int prctl_task_isolation_feat_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	switch (arg2) {
+	case 0: {
+		u64 supported_fmask = ISOL_F_QUIESCE;
+
+		ret = 0;
+		if (copy_to_user(addr, &supported_fmask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_extensions *q_ext;
+
+		q_ext = kzalloc(sizeof(struct task_isol_quiesce_extensions),
+			 GFP_KERNEL);
+		if (!q_ext)
+			return -ENOMEM;
+
+		q_ext->supported_quiesce_bits = ISOL_F_QUIESCE_VMSTATS;
+
+		ret = 0;
+		if (copy_to_user(addr, q_ext, sizeof(*q_ext)))
+			ret = -EFAULT;
+		kfree(q_ext);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_get(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	struct task_isol_inherit_control *i_ctrl;
+	int ret;
+	void __user *addr = (void __user *) arg3;
+
+	if (!current->isol_info)
+		return -EINVAL;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	i_ctrl->inherit_mask = current->isol_info->inherit_mask;
+
+	ret = 0;
+	if (copy_to_user(addr, i_ctrl, sizeof(*i_ctrl)))
+		ret = -EFAULT;
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_get(unsigned long arg3, unsigned long arg4,
+			unsigned long arg5)
+{
+	int ret = 0;
+	void __user *addr = (void __user *)arg4;
+
+	switch (arg3) {
+	case 0: {
+		u64 cfg_mask = 0;
+
+		if (current->isol_info)
+			cfg_mask = current->isol_info->conf_mask;
+
+		if (copy_to_user(addr, &cfg_mask, sizeof(u64)))
+			ret = -EFAULT;
+
+		return ret;
+	}
+	case ISOL_F_QUIESCE: {
+		struct task_isol_quiesce_control *i_qctrl;
+
+		i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+				  GFP_KERNEL);
+		if (!i_qctrl)
+			return -ENOMEM;
+
+		if (current->isol_info)
+			i_qctrl->quiesce_mask = current->isol_info->quiesce_mask;
+
+		if (copy_to_user(addr, i_qctrl, sizeof(*i_qctrl)))
+			ret = -EFAULT;
+
+		kfree(i_qctrl);
+		return ret;
+	}
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int prctl_task_isolation_cfg_get(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		return cfg_feat_get(arg3, arg4, arg5);
+	case I_CFG_INHERIT:
+		return cfg_inherit_get(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int cfg_inherit_set(unsigned long arg3, unsigned long arg4,
+			   unsigned long arg5)
+{
+	int ret = 0;
+	struct task_isol_inherit_control *i_ctrl;
+	const void __user *addr = (const void __user *)arg3;
+
+	i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+			 GFP_KERNEL);
+	if (!i_ctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_ctrl, addr, sizeof(*i_ctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_ctrl->inherit_mask & ~(ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE))
+		goto out_free;
+
+	if (i_ctrl->inherit_mask & ISOL_INHERIT_ACTIVE)
+		if (!(i_ctrl->inherit_mask & ISOL_INHERIT_CONF))
+			goto out_free;
+
+	if (!current->isol_info) {
+		struct isol_info *isol_info;
+
+		isol_info = tsk_isol_alloc_context();
+		if (IS_ERR(isol_info)) {
+			ret = PTR_ERR(isol_info);
+			goto out_free;
+		}
+		current->isol_info = isol_info;
+	}
+
+	ret = 0;
+	current->isol_info->inherit_mask = i_ctrl->inherit_mask;
+
+out_free:
+	kfree(i_ctrl);
+
+	return ret;
+}
+
+static int cfg_feat_quiesce_set(unsigned long arg4, unsigned long arg5)
+{
+	int ret = 0;
+	struct isol_info *isol_info;
+	struct task_isol_quiesce_control *i_qctrl;
+	const void __user *addr = (const void __user *)arg4;
+
+	i_qctrl = kzalloc(sizeof(struct task_isol_quiesce_control),
+			 GFP_KERNEL);
+	if (!i_qctrl)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(i_qctrl, addr, sizeof(*i_qctrl)))
+		goto out_free;
+
+	ret = -EINVAL;
+	if (i_qctrl->flags != 0)
+		goto out_free;
+
+	if (i_qctrl->quiesce_mask != ISOL_F_QUIESCE_VMSTATS &&
+	    i_qctrl->quiesce_mask != 0)
+		goto out_free;
+
+	/*
+	 * current->isol_info is only allocated/freed from task context.
+	 */
+	if (!current->isol_info) {
+		isol_info = tsk_isol_alloc_context();
+		if (IS_ERR(isol_info)) {
+			ret = PTR_ERR(isol_info);
+			goto out_free;
+		}
+		current->isol_info = isol_info;
+	}
+
+	isol_info = current->isol_info;
+
+	isol_info->quiesce_mask = i_qctrl->quiesce_mask;
+	isol_info->conf_mask |= ISOL_F_QUIESCE;
+	ret = 0;
+
+out_free:
+	kfree(i_qctrl);
+
+	return ret;
+}
+
+int prctl_task_isolation_cfg_set(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	switch (arg2) {
+	case I_CFG_FEAT:
+		switch (arg3) {
+		case ISOL_F_QUIESCE:
+			return cfg_feat_quiesce_set(arg4, arg5);
+		default:
+			break;
+		}
+		break;
+	case I_CFG_INHERIT:
+		return cfg_inherit_set(arg3, arg4, arg5);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+int __copy_task_isolation(struct task_struct *tsk)
+{
+	struct isol_info *info, *new_info;
+
+	info = current->isol_info;
+	if (!(info->inherit_mask & (ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE)))
+		return 0;
+
+	new_info = tsk_isol_alloc_context();
+	if (IS_ERR(new_info))
+		return PTR_ERR(new_info);
+
+	new_info->inherit_mask = info->inherit_mask;
+
+	if (info->inherit_mask & ISOL_INHERIT_CONF) {
+		new_info->quiesce_mask = info->quiesce_mask;
+		new_info->conf_mask = info->conf_mask;
+	}
+
+	if (info->inherit_mask & ISOL_INHERIT_ACTIVE)
+		new_info->active_mask = info->active_mask;
+
+	tsk->isol_info = new_info;
+
+	return 0;
+}
+
+int prctl_task_isolation_activate_set(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	struct isol_info *isol_info;
+	u64 oneshot_mask, active_mask;
+	const void __user *addr_mask = (const void __user *)arg2;
+	const void __user *addr_ctrl = (const void __user *)arg3;
+
+	oneshot_mask = active_mask = 0;
+
+	ret = -EFAULT;
+	if (copy_from_user(&active_mask, addr_mask, sizeof(u64)))
+		goto out;
+
+	ret = -EINVAL;
+	if (active_mask != ISOL_F_QUIESCE && active_mask != 0)
+		return ret;
+
+	isol_info = current->isol_info;
+	if (!isol_info)
+		return ret;
+
+	if (addr_ctrl != NULL) {
+		struct task_isol_activate_control act_ctrl;
+
+		ret = -EFAULT;
+		if (copy_from_user(&act_ctrl, addr_ctrl, sizeof(act_ctrl)))
+			goto out;
+
+		ret = -EINVAL;
+		if (act_ctrl.flags)
+			goto out;
+
+		if (act_ctrl.quiesce_oneshot_mask != ISOL_F_QUIESCE_VMSTATS &&
+		    act_ctrl.quiesce_oneshot_mask != 0)
+			goto out;
+
+		oneshot_mask = act_ctrl.quiesce_oneshot_mask;
+	}
+
+	isol_info->active_mask = active_mask;
+	isol_info->oneshot_mask = oneshot_mask;
+	ret = 0;
+
+out:
+	return ret;
+}
+
+int prctl_task_isolation_activate_get(unsigned long arg2, unsigned long arg3,
+				      unsigned long arg4, unsigned long arg5)
+{
+	u64 active_mask = 0;
+	void __user *pmask = (void __user *)arg2;
+
+	if (current->isol_info)
+		active_mask = current->isol_info->active_mask;
+
+	if (copy_to_user(pmask, &active_mask, sizeof(u64)))
+		return -EFAULT;
+
+	return 0;
+}
+




* [patch v4 2/8] add prctl task isolation prctl docs and samples
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 1/8] add basic task isolation prctl interface Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Add documentation and userspace sample code for prctl
task isolation interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 Documentation/userspace-api/task_isolation.rst |  211 +++++++++++++++++++++++++
 samples/Kconfig                                |    7 
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |    9 +
 samples/task_isolation/task_isol.c             |   83 +++++++++
 samples/task_isolation/task_isol.h             |    9 +
 samples/task_isolation/task_isol_userloop.c    |   56 ++++++
 7 files changed, 376 insertions(+)

Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,389 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Task isolation prctl interface
+===============================
+
+Certain types of applications benefit from running uninterrupted by
+background OS activities. Realtime systems and high-bandwidth networking
+applications with user-space drivers can fall into this category.
+
+To create an OS noise free environment for the application, this
+interface allows userspace to inform the kernel of the start and
+end of the latency sensitive application section (with configurable
+system behaviour for that section).
+
+Note: the prctl interface is independent of nohz_full=.
+
+The prctl options are:
+
+
+        - PR_ISOL_FEAT_GET: Retrieve supported features.
+        - PR_ISOL_CFG_GET: Retrieve task isolation configuration.
+        - PR_ISOL_CFG_SET: Set task isolation configuration.
+        - PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
+        - PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
+
+Summary of terms:
+
+
+- feature:
+
+        A distinct attribute or aspect of task isolation. Examples of
+        features could include logging, new operating modes (e.g. syscalls disallowed),
+        userspace notifications, etc. The only feature currently available is quiescing.
+
+- configuration:
+
+        A specific choice from a given set
+        of possible choices that dictates how the particular feature
+        in question should behave.
+
+- activation state:
+
+        The activation state (whether active or inactive) of the task
+        isolation features (features must be configured before
+        being activated).
+
+Inheritance of the isolation parameters and state, across
+fork(2) and clone(2), can be changed via
+PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
+
+
+At a high level, task isolation is divided into two steps:
+
+1. Configuration.
+2. Activation.
+
+Section "Userspace support" describes how to use
+task isolation.
+
+In terms of the interface, the sequence of steps to activate
+task isolation is:
+
+1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
+2. Configure task isolation features (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
+3. Activate or deactivate task isolation features (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
+
+This interface is based on ideas and code from the
+task isolation patchset from Alex Belits:
+https://lwn.net/Articles/816298/
+
+--------------------
+Feature description
+--------------------
+
+        - ``ISOL_F_QUIESCE``
+
+        This feature allows quiescing select kernel activities on
+        return from system calls.
+
+---------------------
+Interface description
+---------------------
+
+**PR_ISOL_FEAT_GET**:
+
+        Returns the supported features and feature
+        capabilities, as a bitmask::
+
+                prctl(PR_ISOL_FEAT_GET, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies whether to return
+        supported features (if zero), or feature capabilities
+        (if not zero). Possible values for 'feat' are:
+
+
+        - ``0``:
+               Return the bitmask of supported features, in the location
+               pointed to by ``(__u64 *)arg3``. The buffer should allow space
+               for 8 bytes.
+
+        - ``ISOL_F_QUIESCE``:
+
+               Return a structure containing which kernel
+               activities are supported for quiescing, in the location
+               pointed to by ``(int *)arg3``::
+
+                        struct task_isol_quiesce_extensions {
+                                __u64 flags;
+                                __u64 supported_quiesce_bits;
+                                __u64 pad[6];
+                        };
+
+               Where:
+
+               *flags*: Additional flags (should be zero).
+
+               *supported_quiesce_bits*: Bitmask indicating
+                which features are supported for quiescing.
+
+               *pad*: Additional space for future enhancements.
+
+
+        Features and their capabilities are defined in
+        include/uapi/linux/prctl.h.
+
+**PR_ISOL_CFG_GET**:
+
+        Retrieve task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Return configuration of task isolation features. The 'arg3' argument specifies
+                whether to return configured features (if zero), or individual
+                feature configuration (if not zero), as follows.
+
+                - ``0``:
+
+                        Return the bitmask of configured features, in the location
+                        pointed to by ``(__u64 *)arg4``. The buffer should allow space
+                        for 8 bytes.
+
+                - ``ISOL_F_QUIESCE``:
+
+                        Return the control structure for quiescing of background
+                        kernel activities, in the location pointed to by
+                        ``(int *)arg4``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 pad[6];
+                         };
+
+                        Where:
+
+                        *flags*: Additional flags (should be zero).
+
+                        *quiesce_mask*: A bitmask containing which activities
+                        are configured for quiescing.
+
+                        *pad*: Additional space for future enhancements.
+
+        - ``I_CFG_INHERIT``:
+
+                Retrieve inheritance configuration across fork/clone.
+
+                Return the structure which configures inheritance
+                across fork/clone, in the location pointed to
+                by ``(int *)arg4``::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                See PR_ISOL_CFG_SET description for meaning of bits.
+
+**PR_ISOL_CFG_SET**:
+
+        Set task isolation configuration.
+        The general format is::
+
+                prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
+
+        The 'what' argument specifies what to configure. Possible values are:
+
+        - ``I_CFG_FEAT``:
+
+                Set configuration of task isolation features. 'arg3' specifies
+                the feature. Possible values are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                        Set the control structure for quiescing of background
+                        kernel activities, from the location pointed to by
+                        ``(int *)arg4``::
+
+                         struct task_isol_quiesce_control {
+                                __u64 flags;
+                                __u64 quiesce_mask;
+                                __u64 pad[6];
+                         };
+
+                        Where:
+
+                        *flags*: Additional flags (should be zero).
+
+                        *quiesce_mask*: A bitmask containing which kernel
+                        activities to quiesce.
+
+                        *pad*: Additional space for future enhancements.
+
+                        For quiesce_mask, possible bit sets are:
+
+                        - ``ISOL_F_QUIESCE_VMSTATS``
+
+                        VM statistics are maintained in per-CPU counters to
+                        improve performance. When a CPU modifies a VM statistic,
+                        this modification is kept in the per-CPU counter.
+                        Certain activities require a global count, which
+                        involves requesting each CPU to flush its local counters
+                        to the global VM counters.
+
+                        This flush is implemented via a workqueue item, which
+                        might schedule a workqueue on isolated CPUs.
+
+                        To avoid this interruption, task isolation can be
+                        configured to, upon return from system calls, synchronize
+                        the per-CPU counters to global counters, thus avoiding
+                        the interruption.
+
+                        To ensure the application returns to userspace
+                        with no modified per-CPU counters, it is necessary to
+                        use mlockall() in addition to this quiesce flag.
+
+        - ``I_CFG_INHERIT``:
+                Set inheritance configuration when a new task
+                is created via fork and clone.
+
+                The ``(int *)arg4`` argument is a pointer to::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                inherit_mask is a bitmask that specifies which part
+                of task isolation should be inherited:
+
+                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
+                  This is the state written via prctl(PR_ISOL_CFG_SET, ...).
+
+                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
+                  (requires ISOL_INHERIT_CONF to be set). The new task
+                  should behave, after fork/clone, in the same manner
+                  as the parent task after it executed:
+
+                        prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);
+
+**PR_ISOL_ACTIVATE_GET**:
+
+        Retrieve task isolation activation state.
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);
+
+        'pmask' specifies the location of a feature mask, where
+        the current active mask will be copied. See PR_ISOL_ACTIVATE_SET
+        for a description of the individual bits.
+
+
+**PR_ISOL_ACTIVATE_SET**:
+
+        Set task isolation activation state (activates/deactivates
+        task isolation).
+
+        The general format is::
+
+                prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);
+
+
+        The 'pmask' argument specifies the location of an 8 byte mask
+        containing which features should be activated. Features whose
+        bits are cleared will be deactivated. The possible
+        bits for this mask are:
+
+                - ``ISOL_F_QUIESCE``:
+
+                Activate quiescing of background kernel activities.
+                Quiescing happens on return to userspace from this
+                system call, and on return from subsequent
+                system calls (unless quiesce_oneshot_mask is configured,
+                see below).
+
+        If the arg3 argument is non-zero, it specifies a pointer to::
+
+         struct task_isol_activate_control {
+                 __u64 flags;
+                 __u64 quiesce_oneshot_mask;
+                 __u64 pad[6];
+         };
+
+        Where:
+
+         *flags*: Additional flags (should be zero).
+
+         *quiesce_oneshot_mask*: Quiescing for the kernel activities
+          with bits set on this mask will happen on the return
+          from this system call, but not on return from subsequent ones.
+
+        Quiescing can be adjusted (while active) by
+        prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).
+
+
+==================
+Userspace support
+==================
+
+Task isolation is divided into two main steps: configuration and activation.
+
+Each step can be performed by an external tool or the latency sensitive
+application itself. util-linux contains the "chisol" tool for this
+purpose.
+
+This results in three combinations:
+
+1. Both configuration and activation are performed by the
+   latency sensitive application.
+   This allows fine grained control of which task isolation
+   features are enabled and when (see the samples section below).
+
+2. Only activation is performed by the latency sensitive app
+   (configuration is performed by chisol).
+   This allows the admin/user to control task isolation parameters,
+   and applications have to be modified only once.
+
+3. Both configuration and activation are performed by an external
+   tool. This allows unmodified applications to take advantage of
+   task isolation. Activation is performed by the "-a" option
+   of chisol.
+
+========
+Examples
+========
+
+The ``samples/task_isolation/`` directory contains 3 examples:
+
+* task_isol_userloop.c:
+
+        Example of a program that runs a busy loop in userspace.
+
+* task_isol_computation.c:
+
+        Example of a program that enters task isolated mode,
+        performs a fixed amount of computation, exits task
+        isolated mode, and writes the computation results to disk.
+
+* task_isol_oneshot.c:
+
+        Example of a program that enables one-shot
+        mode for quiescing, enters a processing loop, then upon an external
+        event performs a number of syscalls to handle that event.
+
+This is a snippet of code to activate task isolation if
+it has been previously configured (by chisol for example)::
+
+        #include <sys/prctl.h>
+        #include <linux/types.h>
+
+        #ifdef PR_ISOL_CFG_GET
+        int ret;
+        unsigned long long fmask;
+
+        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+        if (ret != -1 && fmask != 0) {
+                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
+                if (ret == -1) {
+                        perror("prctl PR_ISOL_ACTIVATE_SET");
+                        return ret;
+                }
+        }
+        #endif
+
Index: linux-2.6/samples/task_isolation/task_isol.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include <errno.h>
+#include "task_isol.h"
+
+#ifdef PR_ISOL_FEAT_GET
+int task_isol_setup(void)
+{
+	int ret;
+	int errnosv;
+	unsigned long long fmask;
+	struct task_isol_quiesce_extensions qext;
+	struct task_isol_quiesce_control qctrl;
+
+	/* Retrieve supported task isolation features */
+	ret = prctl(PR_ISOL_FEAT_GET, 0, &fmask, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT");
+		return ret;
+	}
+	printf("supported features bitmask: 0x%llx\n", fmask);
+
+	/* Retrieve supported ISOL_F_QUIESCE bits */
+	ret = prctl(PR_ISOL_FEAT_GET, ISOL_F_QUIESCE, &qext, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
+		return ret;
+	}
+	printf("supported ISOL_F_QUIESCE bits: 0x%llx\n",
+		qext.supported_quiesce_bits);
+
+	fmask = 0;
+	ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
+	errnosv = errno;
+	if (ret != -1 && fmask != 0) {
+		printf("Task isolation parameters already configured!\n");
+		return ret;
+	}
+	if (ret == -1 && errnosv != ENODATA) {
+		perror("prctl PR_ISOL_GET");
+		return ret;
+	}
+	memset(&qctrl, 0, sizeof(struct task_isol_quiesce_control));
+	qctrl.quiesce_mask = ISOL_F_QUIESCE_VMSTATS;
+	ret = prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE, &qctrl, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CFG_SET");
+		return ret;
+	}
+	return ISOL_F_QUIESCE;
+}
+
+int task_isol_activate_set(unsigned long long mask, int oneshot)
+{
+	int ret;
+	struct task_isol_activate_control act_ctrl;
+
+	memset(&act_ctrl, 0, sizeof(act_ctrl));
+	if (oneshot)
+		act_ctrl.quiesce_oneshot_mask = ISOL_F_QUIESCE_VMSTATS;
+
+	ret = prctl(PR_ISOL_ACTIVATE_SET, &mask, &act_ctrl, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_ACTIVATE_SET");
+		return -1;
+	}
+
+	return 0;
+}
+
+#else
+
+int task_isol_setup(void)
+{
+	return 0;
+}
+
+int task_isol_activate_set(unsigned long long mask, int oneshot)
+{
+	return 0;
+}
+#endif
+
+
Index: linux-2.6/samples/task_isolation/task_isol.h
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TASK_ISOL_H
+#define __TASK_ISOL_H
+
+int task_isol_setup(void);
+
+int task_isol_activate_set(unsigned long long mask, int oneshot);
+
+#endif /* __TASK_ISOL_H */
Index: linux-2.6/samples/task_isolation/task_isol_userloop.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_userloop.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret;
+	void *buf = malloc(4096);
+	unsigned long long mask;
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup();
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ISOL_F_QUIESCE;
+	ret = prctl(PR_ISOL_ACTIVATE_SET, &mask, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_ACTIVATE_SET (ISOL_F_QUIESCE)");
+		return EXIT_FAILURE;
+	}
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+	/* busy loop */
+	while (ret < NR_LOOPS)  {
+		memset(buf, 0, 4096);
+		ret = ret+1;
+		if (!(ret % NR_PRINT))
+			printf("loops=%d of %d\n", ret, NR_LOOPS);
+	}
+
+
+	mask = 0;
+	ret = prctl(PR_ISOL_ACTIVATE_SET, &mask, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_ACTIVATE_SET (0)");
+		return EXIT_FAILURE;
+	}
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/Kconfig
===================================================================
--- linux-2.6.orig/samples/Kconfig
+++ linux-2.6/samples/Kconfig
@@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
 	  Build example userspace program to use the new mount_notify(),
 	  sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
 
+config SAMPLE_TASK_ISOLATION
+	bool "task isolation sample"
+	depends on CC_CAN_LINK && HEADERS_INSTALL
+	help
+	  Build example userspace programs that use the prctl task
+	  isolation interface.
+
 endif # SAMPLES
Index: linux-2.6/samples/Makefile
===================================================================
--- linux-2.6.orig/samples/Makefile
+++ linux-2.6/samples/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
 subdir-$(CONFIG_SAMPLE_WATCHDOG)	+= watchdog
 subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION)	+= task_isolation
Index: linux-2.6/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+userprogs-always-y += task_isol_userloop task_isol_computation task_isol_oneshot
+task_isol_userloop-objs := task_isol.o task_isol_userloop.o
+task_isol_computation-objs := task_isol.o task_isol_computation.o
+task_isol_oneshot-objs := task_isol.o task_isol_oneshot.o
+
+userccflags += -I usr/include
Index: linux-2.6/samples/task_isolation/task_isol_computation.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_computation.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface with a loop:
+ *
+ *	do {
+ *		enable quiescing of kernel activities
+ *		perform computation
+ *		disable quiescing of kernel activities
+ *		write computation results to disk
+ *	} while (condition);
+ *
+ * Copyright (C) 2021 Red Hat, Inc.
+ *
+ * Marcelo Tosatti <mtosatti@redhat.com>
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd, write_loops;
+	void *buf = malloc(4096);
+	unsigned long long mask;
+
+	fd = open("/tmp/comp_output.data", O_RDWR|O_CREAT, 0644);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup();
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+	write_loops = 0;
+	do {
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+		/* enable quiescing on system call return */
+		ret = task_isol_activate_set(mask, 0);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/* busy loop */
+		while (ret < NR_LOOPS)  {
+			memset(buf, 0xf, 4096);
+			ret = ret+1;
+			if (!(ret % NR_PRINT))
+				printf("wloop=%d loops=%d of %d\n", write_loops,
+					ret, NR_LOOPS);
+		}
+		/* disable quiescing on system call return */
+		ret = task_isol_activate_set(mask & ~ISOL_F_QUIESCE, 0);
+		if (ret)
+			return EXIT_FAILURE;
+
+		/*
+		 * write computed data to disk, this would be
+		 * multiple writes on a real application, so
+		 * disabling quiescing is advantageous
+		 */
+		ret = write(fd, buf, 4096);
+		if (ret == -1) {
+			perror("write");
+			return EXIT_FAILURE;
+		}
+
+		write_loops += 1;
+	} while (write_loops < 5);
+
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/task_isolation/task_isol_oneshot.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_oneshot.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Example of task isolation prctl interface using
+ * oneshot mode for quiescing.
+ *
+ *	do {
+ *		enable oneshot quiescing of kernel activities
+ *		process data (no system calls)
+ *		if (event) {
+ *			process event with syscalls
+ *			enable oneshot quiescing of kernel activities
+ *		}
+ *	} while (exit_condition);
+ *
+ * Copyright (C) 2021 Red Hat, Inc.
+ *
+ * Marcelo Tosatti <mtosatti@redhat.com>
+ */
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret, fd;
+	void *buf = malloc(4096);
+	unsigned long long mask;
+
+	fd = open("/dev/zero", O_RDONLY);
+	if (fd == -1) {
+		perror("open");
+		return EXIT_FAILURE;
+	}
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup();
+	if (ret == -1)
+		return EXIT_FAILURE;
+
+	mask = ret;
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+
+	/* enable quiescing on system call return, oneshot */
+	ret = task_isol_activate_set(mask, 1);
+	if (ret)
+		return EXIT_FAILURE;
+	/* busy loop */
+	while (ret < NR_LOOPS)  {
+		memset(buf, 0xf, 4096);
+		ret = ret+1;
+		if (!(ret % NR_PRINT)) {
+			int i, r;
+
+			/*
+			 * This could be considered handling an external
+			 * event: with one-shot mode, system calls
+			 * after prctl(PR_ISOL_ACTIVATE_SET) will not incur
+			 * the penalty of quiescing.
+			 */
+			printf("loops=%d of %d\n", ret, NR_LOOPS);
+			for (i = 0; i < 100; i++) {
+				r = read(fd, buf, 4096);
+				if (r == -1) {
+					perror("read");
+					return EXIT_FAILURE;
+				}
+			}
+			/* enable quiescing on system call return, oneshot */
+			ret = task_isol_activate_set(mask, 1);
+			if (ret)
+				return EXIT_FAILURE;
+		}
+	}
+
+	return EXIT_SUCCESS;
+}
+



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch v4 3/8] task isolation: sync vmstats on return to userspace
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 1/8] add basic task isolation prctl interface Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 4/8] procfs: add per-pid task isolation state Marcelo Tosatti
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, use the task isolation prctl interface to quiesce 
deferred actions when returning to userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/task_isolation.h |   12 ++++++++++++
 include/linux/vmstat.h         |    8 ++++++++
 kernel/entry/common.c          |    2 ++
 kernel/task_isolation.c        |   26 ++++++++++++++++++++++++++
 mm/vmstat.c                    |   21 +++++++++++++++++++++
 5 files changed, 69 insertions(+)

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -40,8 +40,20 @@ int prctl_task_isolation_activate_set(un
 
 int __copy_task_isolation(struct task_struct *tsk);
 
+void __isolation_exit_to_user_mode_prepare(void);
+
+static inline void isolation_exit_to_user_mode_prepare(void)
+{
+	if (current->isol_info)
+		__isolation_exit_to_user_mode_prepare();
+}
+
 #else
 
+static inline void isolation_exit_to_user_mode_prepare(void)
+{
+}
+
 static inline void tsk_isol_free(struct task_struct *tsk)
 {
 }
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+#ifdef CONFIG_SMP
+void sync_vmstat(void);
+#else
+static inline void sync_vmstat(void)
+{
+}
+#endif
+
 struct reclaim_stat {
 	unsigned nr_dirty;
 	unsigned nr_unqueued_dirty;
Index: linux-2.6/kernel/entry/common.c
===================================================================
--- linux-2.6.orig/kernel/entry/common.c
+++ linux-2.6/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/task_isolation.h>
 
 #include "common.h"
 
@@ -285,6 +286,7 @@ static void syscall_exit_to_user_mode_pr
 static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
 {
 	syscall_exit_to_user_mode_prepare(regs);
+	isolation_exit_to_user_mode_prepare();
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 }
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -18,6 +18,8 @@
 #include <linux/sysfs.h>
 #include <linux/init.h>
 #include <linux/sched/task.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
 
 void __tsk_isol_free(struct task_struct *tsk)
 {
@@ -355,3 +357,23 @@ int prctl_task_isolation_activate_get(un
 	return ret;
 }
 
+void __isolation_exit_to_user_mode_prepare(void)
+{
+	struct isol_info *i;
+
+	i = current->isol_info;
+	if (!i)
+		return;
+
+	if (!(i->active_mask & ISOL_F_QUIESCE))
+		return;
+
+	if (i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS) {
+		sync_vmstat();
+		if (i->oneshot_mask & ISOL_F_QUIESCE_VMSTATS) {
+			i->oneshot_mask &= ~ISOL_F_QUIESCE_VMSTATS;
+			i->quiesce_mask &= ~ISOL_F_QUIESCE_VMSTATS;
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(__isolation_exit_to_user_mode_prepare);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -2003,6 +2003,27 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+void sync_vmstat(void)
+{
+	int cpu;
+
+	cpu = get_cpu();
+
+	refresh_cpu_vm_stats(false);
+	put_cpu();
+
+	/*
+	 * If task is migrated to another CPU between put_cpu
+	 * and cancel_delayed_work_sync, the code below might
+	 * cancel vmstat_update work for a different cpu
+	 * (than the one from which the vmstats were flushed).
+	 *
+	 * However, vmstat shepherd will re-enable it later,
+	 * so it's harmless.
+	 */
+	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;




* [patch v4 4/8] procfs: add per-pid task isolation state
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2021-10-07 19:23 ` [patch v4 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

Add a /proc/pid/task_isolation file to query the state of
task isolation configuration.
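As a sketch, the new file could be queried like this; the field names in the comment come from show_isolation_state() in this patch, and the file only exists on kernels carrying this series with CONFIG_CPU_ISOLATION=y:

```shell
# Query the task isolation state of a given PID (here: the shell itself).
f="/proc/$$/task_isolation"
if [ -r "$f" ]; then
    cat "$f"
    # e.g.:
    #   Configured state: quiesce
    #   Active state: quiesce
    #   Quiescing: vmstat_sync
else
    echo "task_isolation: not available on this kernel"
fi
```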

---
 fs/proc/base.c                 |    4 ++
 include/linux/task_isolation.h |    2 +
 kernel/task_isolation.c        |   60 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -96,6 +96,8 @@
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
 #include <linux/cn_proc.h>
+#include <linux/prctl.h>
+#include <linux/task_isolation.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -662,6 +664,69 @@ static int proc_pid_syscall(struct seq_f
 }
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
+#ifdef CONFIG_CPU_ISOLATION
+
+struct qoptions {
+	unsigned long mask;
+	char *name;
+};
+
+static struct qoptions iopts[] = {
+	{ISOL_F_QUIESCE, "quiesce"},
+};
+#define ILEN ARRAY_SIZE(iopts)
+
+static struct qoptions qopts[] = {
+	{ISOL_F_QUIESCE_VMSTATS, "vmstat_sync"},
+};
+#define QLEN ARRAY_SIZE(qopts)
+
+static void show_isolation_state(struct seq_file *m,
+				 struct qoptions *opts,
+				 int nr_opts,
+				 int mask,
+				 const char *hdr)
+{
+	int i;
+
+	seq_puts(m, hdr);
+	for (i = 0; i < nr_opts; i++) {
+		if (mask & opts->mask)
+			seq_printf(m, "%s ", opts->name);
+		opts++;
+	}
+	if (mask == 0)
+		seq_puts(m, "none ");
+	seq_putc(m, '\n');
+}
+
+int proc_pid_task_isolation(struct seq_file *m, struct pid_namespace *ns,
+			    struct pid *pid, struct task_struct *t)
+{
+	int active_mask, quiesce_mask, conf_mask;
+	struct isol_info *isol_info;
+	struct inode *inode = m->private;
+	struct task_struct *task = get_proc_task(inode);
+
+	isol_info = t->isol_info;
+	if (!isol_info) {
+		active_mask = quiesce_mask = conf_mask = 0;
+	} else {
+		active_mask = isol_info->active_mask;
+		quiesce_mask = isol_info->quiesce_mask;
+		conf_mask = isol_info->conf_mask;
+	}
+
+	show_isolation_state(m, iopts, ILEN, conf_mask, "Configured state: ");
+	show_isolation_state(m, iopts, ILEN, active_mask, "Active state: ");
+	show_isolation_state(m, qopts, QLEN, quiesce_mask, "Quiescing: ");
+
+	put_task_struct(task);
+
+	return 0;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
 /************************************************************************/
 /*                       Here the fs part begins                        */
 /************************************************************************/
@@ -3281,6 +3346,9 @@ static const struct pid_entry tgid_base_
 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
 	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
 #endif
+#ifdef CONFIG_CPU_ISOLATION
+	ONE("task_isolation", S_IRUSR, proc_pid_task_isolation),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)




* [patch v4 5/8] task isolation: sync vmstats conditional on changes
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2021-10-07 19:23 ` [patch v4 4/8] procfs: add per-pid task isolation state Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 6/8] KVM: x86: call isolation prepare from VM-entry code path Marcelo Tosatti
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Rather than syncing VM-stats on every return to userspace
(or VM-entry), keep track of changes through a per-CPU bool.

This improves performance when task isolation is enabled
for vCPU threads of VMs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -22,7 +22,16 @@ int sysctl_vm_numa_stat_handler(struct c
 #endif
 
 #ifdef CONFIG_SMP
-void sync_vmstat(void);
+extern struct static_key vmstat_sync_enabled;
+
+void __sync_vmstat(void);
+static inline void sync_vmstat(void)
+{
+	if (static_key_false(&vmstat_sync_enabled))
+		__sync_vmstat();
+}
+
+void init_sync_vmstat(void);
 #else
 static inline void sync_vmstat(void)
 {
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -21,6 +21,17 @@
 #include <linux/mm.h>
 #include <linux/vmstat.h>
 
+void __tsk_isol_exit(struct task_struct *tsk)
+{
+	struct isol_info *i;
+
+	i = tsk->isol_info;
+	if (!i)
+		return;
+
+	static_key_slow_dec(&vmstat_sync_enabled);
+}
+
 void __tsk_isol_free(struct task_struct *tsk)
 {
 	if (!tsk->isol_info)
@@ -37,6 +48,12 @@ static struct isol_info *tsk_isol_alloc_
 	if (unlikely(!info))
 		return ERR_PTR(-ENOMEM);
 
+	preempt_disable();
+	init_sync_vmstat();
+	preempt_enable();
+
+	static_key_slow_inc(&vmstat_sync_enabled);
+
 	return info;
 }
 
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -306,6 +307,22 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+struct static_key vmstat_sync_enabled;
+static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void mark_vmstat_dirty(void)
+{
+	if (!static_key_false(&vmstat_sync_enabled))
+		return;
+
+	raw_cpu_write(vmstat_dirty, true);
+}
+
+void init_sync_vmstat(void)
+{
+	raw_cpu_write(vmstat_dirty, true);
+}
+
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -338,6 +355,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -376,6 +394,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
@@ -574,6 +593,7 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	mark_vmstat_dirty();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -642,6 +662,7 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	mark_vmstat_dirty();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -1082,6 +1103,7 @@ static void fill_contig_page_info(struct
 			info->free_blocks_suitable += blocks <<
 						(order - suitable_order);
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -1434,6 +1456,7 @@ static void walk_zones_in_node(struct se
 		if (!nolock)
 			spin_unlock_irqrestore(&zone->lock, flags);
 	}
+	mark_vmstat_dirty();
 }
 #endif
 
@@ -1499,6 +1522,7 @@ static void pagetypeinfo_showfree_print(
 		}
 		seq_putc(m, '\n');
 	}
+	mark_vmstat_dirty();
 }
 
 /* Print out the free pages at each order for each migatetype */
@@ -1917,6 +1941,7 @@ static void vmstat_update(struct work_st
 				this_cpu_ptr(&vmstat_work),
 				round_jiffies_relative(sysctl_stat_interval));
 	}
+	mark_vmstat_dirty();
 }
 
 /*
@@ -2003,13 +2028,18 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
-void sync_vmstat(void)
+void __sync_vmstat(void)
 {
 	int cpu;
 
 	cpu = get_cpu();
+	if (!per_cpu(vmstat_dirty, cpu)) {
+		put_cpu();
+		return;
+	}
 
 	refresh_cpu_vm_stats(false);
+	raw_cpu_write(vmstat_dirty, false);
 	put_cpu();
 
 	/*
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -27,6 +27,13 @@ static inline void tsk_isol_free(struct
 		__tsk_isol_free(tsk);
 }
 
+void __tsk_isol_exit(struct task_struct *tsk);
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+	if (tsk->isol_info)
+		__tsk_isol_exit(tsk);
+}
+
 int prctl_task_isolation_feat_get(unsigned long arg2, unsigned long arg3,
 				  unsigned long arg4, unsigned long arg5);
 int prctl_task_isolation_cfg_get(unsigned long arg2, unsigned long arg3,
@@ -58,6 +65,10 @@ static inline void tsk_isol_free(struct
 {
 }
 
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+}
+
 static inline int prctl_task_isolation_feat_get(unsigned long arg2,
 						unsigned long arg3,
 						unsigned long arg4,
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/task_isolation.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -778,6 +779,7 @@ void __noreturn do_exit(long code)
 	}
 
 	io_uring_files_cancel();
+	tsk_isol_exit(tsk);
 	exit_signals(tsk);  /* sets PF_EXITING */
 
 	/* sync mm's RSS info before statistics gathering */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -2446,6 +2446,7 @@ bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
 bad_fork_cleanup_task_isolation:
+	tsk_isol_exit(p);
 	tsk_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);




* [patch v4 6/8] KVM: x86: call isolation prepare from VM-entry code path
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2021-10-07 19:23 ` [patch v4 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 7/8] mm: vmstat: move need_update Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  7 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

The VM-entry code path is an entry point similar to userspace
return as far as task isolation is concerned.

Call isolation_exit_to_user_mode_prepare() before VM-entry.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kvm/x86.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/arch/x86/kvm/x86.c
===================================================================
--- linux-2.6.orig/arch/x86/kvm/x86.c
+++ linux-2.6/arch/x86/kvm/x86.c
@@ -59,6 +59,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/entry-kvm.h>
 #include <linux/suspend.h>
+#include <linux/task_isolation.h>
 
 #include <trace/events/kvm.h>
 
@@ -9574,6 +9575,8 @@ static int vcpu_enter_guest(struct kvm_v
 		goto cancel_injection;
 	}
 
+	isolation_exit_to_user_mode_prepare();
+
 	preempt_disable();
 
 	static_call(kvm_x86_prepare_guest_switch)(vcpu);




* [patch v4 7/8] mm: vmstat: move need_update
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2021-10-07 19:23 ` [patch v4 6/8] KVM: x86: call isolation prepare from VM-entry code path Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  2021-10-07 19:23 ` [patch v4 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  7 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Move the need_update() function up in vmstat.c, as it is needed by the
next patch. No functional changes.

Remove a duplicate comment while at it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   67 +++++++++++++++++++++++++++---------------------------------
 1 file changed, 31 insertions(+), 36 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1864,6 +1864,35 @@ static const struct seq_operations vmsta
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	pg_data_t *last_pgdat = NULL;
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		struct per_cpu_nodestat *n;
+
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 */
+		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
+			return true;
+
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
+		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
+			return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1945,35 +1974,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.




* [patch v4 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2021-10-07 19:23 ` [patch v4 7/8] mm: vmstat: move need_update Marcelo Tosatti
@ 2021-10-07 19:23 ` Marcelo Tosatti
  7 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-07 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

It is not necessary to queue a work item to run refresh_vm_stats
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes a sosreport hang (sosreport uses vmstat_refresh) in the
presence of a spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 44 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1894,6 +1894,31 @@ static bool need_update(int cpu)
 }
 
 #ifdef CONFIG_PROC_FS
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pages *pcp;
+
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		if (!pcp->count)
+			continue;
+
+		if (!pcp->expire)
+			continue;
+
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+#endif
+
+	return false;
+}
+
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
@@ -1903,8 +1928,12 @@ int vmstat_refresh(struct ctl_table *tab
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1918,9 +1947,19 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, refresh_vm_stats);
+		if (need_update(cpu) || need_drain_remote_zones(cpu))
+			schedule_work_on(cpu, work);
+	}
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(works, cpu));
+	cpus_read_unlock();
+	free_percpu(works);
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.




* Re: [patch v4 1/8] add basic task isolation prctl interface
  2021-10-07 19:23 ` [patch v4 1/8] add basic task isolation prctl interface Marcelo Tosatti
@ 2021-10-12 13:05   ` Peter Zijlstra
  2021-10-13 10:56     ` Marcelo Tosatti
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-10-12 13:05 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli, Alex Belits,
	Peter Xu

On Thu, Oct 07, 2021 at 04:23:47PM -0300, Marcelo Tosatti wrote:
> Add basic prctl task isolation interface, which allows
> informing the kernel that application is executing 
> latency sensitive code (where interruptions are undesired).
> 
> Interface is described by task_isolation.rst (added by
> next patch).

That does not absolve you from actually writing a changelog here.
Life is too short to try and read rst shit.

What is the envisioned usage of these isolating prctl() thingies,
including the kill-me-on-any-interruption thing, vs the inherently racy
nature of some of the don't disturb me stuff.

Also, see:

  https://lkml.kernel.org/r/20210929152429.186930629@infradead.org

Suppose:

	CPU0					CPU1

	sys_prctl()
	<kernel entry>
	  // marks task 'important'
						text_poke_sync()
						  // checks CPU0, not userspace, queues IPI
	<kernel exit>

	$important userspace			  arch_send_call_function_ipi_mask()
	<IPI>
	  // finds task is 'important' and
	  // can't take interrupts
	  sigkill()

*Whoopsie*


Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
therefore CPU0 can't wait for quiescence in kernelspace, but if it goes
to userspace, it'll get killed on interruption. Catch-22.



* Re: [patch v4 1/8] add basic task isolation prctl interface
  2021-10-12 13:05   ` Peter Zijlstra
@ 2021-10-13 10:56     ` Marcelo Tosatti
  2021-10-13 11:37       ` Marcelo Tosatti
  2021-10-13 15:01       ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-13 10:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli, Alex Belits,
	Peter Xu

[-- Attachment #1: Type: text/plain, Size: 5840 bytes --]

Hi Peter,

On Tue, Oct 12, 2021 at 03:05:34PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 07, 2021 at 04:23:47PM -0300, Marcelo Tosatti wrote:
> > Add basic prctl task isolation interface, which allows
> > informing the kernel that application is executing 
> > latency sensitive code (where interruptions are undesired).
> > 
> > Interface is described by task_isolation.rst (added by
> > next patch).
> 
> That does not absolve you from actually writing a changelog here.
> Life is too short to try and read rst shit.

The rst is concise and contains all necessary information.

Changelog is on the patch header (I would appreciate reviews of
the interface itself, not sure why the changelog is important).

The rst compiled to PDF form is attached. It's only 6 pages long and it
describes the interface (please comment if you can think of any improvement
to that document, and not only to the interface).

Also check the examples on the second patch.

"===============================
Task isolation prctl interface
===============================

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth networking
applications with user-space drivers can fall into this category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel of the start and
end of the latency sensitive application section (with configurable
system behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:


        - PR_ISOL_FEAT_GET: Retrieve supported features.
        - PR_ISOL_CFG_GET: Retrieve task isolation configuration.
        - PR_ISOL_CFG_SET: Set task isolation configuration.
        - PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
        - PR_ISOL_ACTIVATE_SET: Set task isolation activation state."

> What is the envisioned usage of these isolating prctl() thingies,

To inform the kernel that an application is entering/exiting
task isolated mode, so that the kernel can behave in different ways
(depending on which commands you implement using the interface).

At the moment, only a single action is performed: to sync any pending
per-CPU vm statistics that might wake up a CPU's vmstat_worker 
thread.

The point of the interface, as requested, is that it is extensible 
allowing for new features to be implemented.

For example, one could implement blocking of certain kernel activities
which require interruption to userspace.

> including the kill-me-on-any-interruption thing, vs the inherently racy
> nature of some of the don't disturb me stuff.

The kill-me-on-any-interruption thing is not included in this patchset
(maybe it can be implemented using this interface, if desired, but we
do not see the need for such feature at the moment).

The inherently racy nature of some of the don't disturb me stuff:

> 
> Also, see:
> 
>   https://lkml.kernel.org/r/20210929152429.186930629@infradead.org

As you can see from the below pseudocode, we were thinking of queueing
the request (invalidate icache or TLB flush) while the app is in userspace,
to be performed on return to kernel space, but the approach in your patch
might be superior (it will take some time to parse that thread...).

> Suppose:
> 
> 	CPU0					CPU1
> 
> 	sys_prctl()
> 	<kernel entry>
> 	  // marks task 'important'
> 						text_poke_sync()
> 						  // checks CPU0, not userspace, queues IPI
> 	<kernel exit>
> 
> 	$important userspace			  arch_send_call_function_ipi_mask()
> 	<IPI>
> 	  // finds task is 'important' and
> 	  // can't take interrupts
> 	  sigkill()
> 
> *Whoopsie*
> 
> 
> Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> therefore CPU0 can't wait for quescence in kernelspace, but if it goes
> to userspace, it'll get killed on interruption. Catch-22.

We have been using a BPF tool for logging and monitoring of
interruptions:
https://github.com/xzpeter/rt-trace-bpf

But there is no such thing as 

         // finds task is 'important' and
         // can't take interrupts
         sigkill()

in this patchset.

We have been discussing something like this to avoid TLB flush /
i-cache invalidation IPIs:

 #define CPU_REQ_FLUSH_TLB       0x1     /*      __flush_tlb_all()       */ 
 #define CPU_REQ_INV_ICACHE      0x2     /*      sync_core()             */ 
 
 #define IN_USER_MODE            (0x1 << 16) 
 
 /* when CPU is in kernel mode, should IPI rather than queuing the 
    request on per-cpu state variable */ 
 #define IN_KERNEL_MODE                (0) 
 
 Then on entry/exit would have to add a conditional branch: 
 
 Exit to userspace: 
 ----------------- 
 
       cpu = smp_processor_id(); 
 
       if (isolation_enabled(cpu)) { 
               atomic_or(IN_USER_MODE, &percpudata->user_kernel_state); 
       } 
 
 Kernel entry: 
 ------------- 
 
       cpu = smp_processor_id(); 
 
       if (isolation_enabled(cpu)) { 
               reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE); 
               if (reqs & CPU_REQ_FLUSH_TLB) 
                       flush_tlb_all(); 
               if (reqs & CPU_REQ_INV_ICACHE) 
                       invalidate_icache(); 
       } 
 
 Request side: 
 ------------- 
 
       int targetcpu; 
 
       do { 
               struct percpudata *pcpudata = per_cpu(&percpudata, targetcpu); 
 
               old_state = pcpudata->user_kernel_state; 
 
               /* in kernel mode ? */ 
               if (!(old_state & IN_USER_MODE)) { 
                       smp_call_function_single(request_fn, targetcpu, 1); 
                       break; 
               } 
               new_state = old_state | CPU_REQ_FLUSH_TLB; /* (or CPU_REQ_INV_ICACHE) */
       } while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state); 


Thanks for your comments!

[-- Attachment #2: task_isolation.pdf --]
[-- Type: application/pdf, Size: 90947 bytes --]


* Re: [patch v4 1/8] add basic task isolation prctl interface
  2021-10-13 10:56     ` Marcelo Tosatti
@ 2021-10-13 11:37       ` Marcelo Tosatti
  2021-10-13 15:01       ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-13 11:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli, Alex Belits,
	Peter Xu

On Wed, Oct 13, 2021 at 07:56:37AM -0300, Marcelo Tosatti wrote:
> Hi Peter,
> 
> On Tue, Oct 12, 2021 at 03:05:34PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 07, 2021 at 04:23:47PM -0300, Marcelo Tosatti wrote:
> > > Add basic prctl task isolation interface, which allows
> > > informing the kernel that application is executing 
> > > latency sensitive code (where interruptions are undesired).
> > > 
> > > Interface is described by task_isolation.rst (added by
> > > next patch).
> > 
> > That does not absolve you from actually writing a changelog here.
> > Life is too short to try and read rst shit.
> 
> The rst is concise and contains all necessary information.
> 
> Changelog is on the patch header (I would appreciate reviews of
> the interface itself, not sure why the changelog is important).

Err, changelog is on the patchset intro.



* Re: [patch v4 1/8] add basic task isolation prctl interface
  2021-10-13 10:56     ` Marcelo Tosatti
  2021-10-13 11:37       ` Marcelo Tosatti
@ 2021-10-13 15:01       ` Peter Zijlstra
  2021-10-13 16:06         ` Marcelo Tosatti
  2021-10-19 15:07         ` Marcelo Tosatti
  1 sibling, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2021-10-13 15:01 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli, Alex Belits,
	Peter Xu

On Wed, Oct 13, 2021 at 07:56:37AM -0300, Marcelo Tosatti wrote:
> Hi Peter,
> 
> On Tue, Oct 12, 2021 at 03:05:34PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 07, 2021 at 04:23:47PM -0300, Marcelo Tosatti wrote:
> > > Add basic prctl task isolation interface, which allows
> > > informing the kernel that application is executing 
> > > latency sensitive code (where interruptions are undesired).
> > > 
> > > Interface is described by task_isolation.rst (added by
> > > next patch).
> > 
> > That does not absolve you from actually writing a changelog here.
> > Life is too short to try and read rst shit.
> 
> The rst is concise and contains all necessary information.

rst is a piece of crap and makes it harder to read plain text files.

> Changelog is on the patch header (I would appreciate reviews of
> the interface itself, not sure why the changelog is important).

Adding an interface without Changelog is a no-no. Changelogs go on the
patches themselves, not someplace random.

> The rst compiled in PDF form is attached. Its only 6 pages long, it
> described the interface (please if you think of any improvement 
> to that document, and not only the interface).

Wth would I want to read a pdf? Plain text is what I want, without
added crap on. That's my complaint with rst, it makes reading the actual
document harder by having all that nonsense sprinkled in.

I spend 99% of my time looking at fixed width text in a text editor.
Having to open a browser or some random other crap gui program is a
fail.

> > including the kill-me-on-any-interruption thing, vs the inherently racy
> > nature of some of the don't disturb me stuff.
> 
> The kill-me-on-any-interruption thing is not included in this patchset
> (maybe it can be implemented using this interface, if desired, but we
> do not see the need for such feature at the moment).

It's something that has been requested lots, and has been part of
previous series. Since it's a known and desired feature, any proposed
interface ought to be able to deal with it. Otherwise we need to invent
yet another interface once that feature does get around to be
implemented.

> > Also, see:
> > 
> >   https://lkml.kernel.org/r/20210929152429.186930629@infradead.org
> 
> As you can see from the below pseudocode, we were thinking of queueing
> the (invalidate icache or TLB flush) in case app is in userspace,
> to perform on return to kernel space, but the approach in your patch might be
> superior (will take sometime to parse that thread...).

Let me assume you're talking about kernel TLB invalidates, otherwise it
would be terribly broken.

> > Suppose:
> > 
> > 	CPU0					CPU1
> > 
> > 	sys_prctl()
> > 	<kernel entry>
> > 	  // marks task 'important'
> > 						text_poke_sync()
> > 						  // checks CPU0, not userspace, queues IPI
> > 	<kernel exit>
> > 
> > 	$important userspace			  arch_send_call_function_ipi_mask()
> > 	<IPI>
> > 	  // finds task is 'important' and
> > 	  // can't take interrupts
> > 	  sigkill()
> > 
> > *Whoopsie*
> > 
> > 
> > Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> > therefore CPU0 can't wait for quescence in kernelspace, but if it goes
> > to userspace, it'll get killed on interruption. Catch-22.
> 
> We have been using a BPF tool for logging and monitoring of
> interruptions:
> https://github.com/xzpeter/rt-trace-bpf

I've no idea what you need bpf for, we have tracepoints in the entry
paths that should suffice.

> But there is no such thing as 
> 
>          // finds task is 'important' and
>          // can't take interrupts
>          sigkill()
> 
> On this patchset.

But the interface should allow for it.

> We have been discussing something like this to avoid TLBs / invalidate
> i-cache IPIs, but 
> 
>  #define CPU_REQ_FLUSH_TLB       0x1     /*      __flush_tlb_all()       */ 
>  #define CPU_REQ_INV_ICACHE      0x2     /*      sync_core()             */ 

sync_core() is *NOT* an I$ flush.

>  
>  #define IN_USER_MODE            (0x1 << 16) 
>  
>  /* when CPU is in kernel mode, should IPI rather than queuing the 
>     request on per-cpu state variable */ 
>  #define IN_KERNEL_MODE                (0) 
>  
>  Then on entry/exit would have to add a conditional branch: 
>  
>  Exit to userspace: 
>  ----------------- 
>  
>        cpu = smp_processor_id(); 
>  
>        if (isolation_enabled(cpu)) { 
>                atomic_or(IN_USER_MODE, &percpudata->user_kernel_state); 
>        } 
>  
>  Kernel entry: 
>  ------------- 
>  
>        cpu = smp_processor_id(); 
>  
>        if (isolation_enabled(cpu)) { 
>                reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE); 
>                if (reqs & CPU_REQ_FLUSH_TLB) 
>                        flush_tlb_all(); 
>                if (reqs & CPU_REQ_INV_ICACHE) 
>                        invalidate_icache(); 
>        } 
>  
>  Request side: 
>  ------------- 
>  
>        int targetcpu; 
>  
>        do { 
>                struct percpudata *pcpudata = per_cpu(&percpudata, targetcpu); 
>  
>                old_state = pcpudata->user_kernel_state; 
>  
>                /* in kernel mode ? */ 
>                if (!(old_state & IN_USER_MODE)) { 
>                        smp_call_function_single(request_fn, targetcpu, 1); 
>                        break; 
>                } 
>                new_state = remote_state | CPU_REQ_FLUSH_TLB; // (or CPU_REQ_INV_ICACHE)
>        } while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state); 
> 

That's absolutely terrible :/ you're adding extra unconditinal atomics
to the entry/exit path instead of using the ones that are already there.
That's no good.

Also, you're very much not dealing with that race either.

Also, I think you're broken vs instrumentation, all of this needs to
happen super early on entry, possibly while still on the entry stack,
otherwise the missed TLBi might be handled too late and we just used a
stale TLB entry.




* Re: [patch v4 1/8] add basic task isolation prctl interface
  2021-10-13 15:01       ` Peter Zijlstra
@ 2021-10-13 16:06         ` Marcelo Tosatti
  2021-10-14 13:02           ` Marcelo Tosatti
  2021-10-19 15:07         ` Marcelo Tosatti
  1 sibling, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-13 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli, Alex Belits,
	Peter Xu

On Wed, Oct 13, 2021 at 05:01:29PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 13, 2021 at 07:56:37AM -0300, Marcelo Tosatti wrote:
> > Hi Peter,
> > 
> > On Tue, Oct 12, 2021 at 03:05:34PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 07, 2021 at 04:23:47PM -0300, Marcelo Tosatti wrote:
> > > > Add basic prctl task isolation interface, which allows
> > > > informing the kernel that application is executing 
> > > > latency sensitive code (where interruptions are undesired).
> > > > 
> > > > Interface is described by task_isolation.rst (added by
> > > > next patch).
> > > 
> > > That does not absolve you from actually writing a changelog here.
> > > Life is too short to try and read rst shit.
> > 
> > The rst is concise and contains all necessary information.
> 
> rst is a piece of crap and makes it harder to read plain text files.
> 
> > Changelog is on the patch header (I would appreciate reviews of
> > the interface itself, not sure why the changelog is important).
> 
> Adding an interface without Changelog is a no-no. Changelogs go on the
> patches themselves, not someplace random.

OK, can move the changelog to the patches.

> > The rst compiled in PDF form is attached. Its only 6 pages long, it
> > described the interface (please if you think of any improvement 
> > to that document, and not only the interface).
> 
> Wth would I want to read a pdf? Plain text is what I want, without
> added crap on. That's my complaint with rst, it makes reading the actual
> document harder by having all that nonsense sprinkled in.

I understand. At the end of this message is a rst2txt conversion
for your convenience.

> I spend 99% of my time looking at fixed width text in a text editor.
> Having to open a browser or some random other crap gui program is a
> fail.

OK. Please review the API document at the end of this message.

> > > including the kill-me-on-any-interruption thing, vs the inherently racy
> > > nature of some of the don't disturb me stuff.
> > 
> > The kill-me-on-any-interruption thing is not included in this patchset
> > (maybe it can be implemented using this interface, if desired, but we
> > do not see the need for such feature at the moment).
> 
> It's something that has been requested lots, and has been part of
> previous series. 

So what I understand from Alex's comments is that the kill-on-interruption
feature is useful when developing/improving a program to be interruption-free:
"The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate   
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

The last stage of enabling task isolation happens in
task_isolation_exit_to_user_mode() that runs last before returning
to userspace and changes ll_isol_flags (see later) to prevent other
CPUs from interfering with isolated task.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will send it a signal to indicate
isolation loss. In addition to sending a signal, the code supports a
kernel command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution."

First of all, the isolation breaking event might be unrelated
to the current state of the task isolated program: it can be caused
by an unrelated event (such as a TLB flush IPI) triggered by the
activity of other programs.

So this patchset does not support sending a signal on
isolation breaking. There is a bcc tool which allows one
to see the kernel functions which interrupt the application:

Overview
This is a repo that keeps RT related tracing tools implemented using eBPF. Currently it only contains rt-trace-bcc.py.

rt-trace-bcc.py
rt-trace-bcc.py is a python-bcc based tool for tracking RT related issues. When it's executed in the background, it'll dump suspicious kernel func-calls that may affect the real-time determinism of target cores. It can also record all the relevant information and report it when it quits (by either Ctrl-C from the command line, or a kill $PID with SIGTERM).

You should be able to run this script on any modern kernel; however, it is mainly targeted at RHEL8.

Install dependencies
sudo dnf install python3 python-bcc
We can also install it using pip3:

pip3 install bcc
How to use
A help message captured from v0.1.7:

$ ./rt-trace-bcc.py -v
Version: 0.1.6

$ ./rt-trace-bcc.py -h
usage: rt-trace-bcc.py [-h] [--cpu-list CPU_LIST] [--backtrace] [--debug] [--version] [--summary] [--quiet]

Bcc-based trace tool for Real-Time workload.

optional arguments:
  -h, --help            show this help message and exit
  --cpu-list CPU_LIST, -c CPU_LIST
                        Cores to trace interruptions (e.g., 1,2-5,8)
  --backtrace, -b       Whether dump backtrace when possible (default: off)
  --debug, -d           Whether run with debug mode (default: off)
  --version, -v         Dump version information (current: 0.1.7)
  --summary, -s         Dump summary when stop/SIGINT (default: off)
  --quiet, -q           Quiet mode; only dump summary (implies "-s" too)
Here the --cpu-list parameter is required and specifies the target cores of the tracing. Normally it should be the isolated cores, or a subset of the isolated cores on the system.

Example usage
Run the below command in the background to start dumping to stdout suspicious calls happening on cores 36-39:

$ sudo ./rt-trace-bcc.py -c 36-39
[There can be some warnings dumped; we can ignore them]
Enabled hook point: process_one_work
Enabled hook point: __queue_work
Enabled hook point: __queue_delayed_work
Enabled hook point: generic_exec_single
Enabled hook point: smp_call_function_many_cond
Enabled hook point: irq_work_queue
Enabled hook point: irq_work_queue_on
TIME(s)            COMM                 CPU  PID      MSG
0.009599293        rcuc/8               8    75       irq_work_queue_on (target=36, func=nohz_full_kick_func)
0.009603039        rcuc/8               8    75       irq_work_queue_on (target=37, func=nohz_full_kick_func)
0.009604047        rcuc/8               8    75       irq_work_queue_on (target=38, func=nohz_full_kick_func)
0.009604848        rcuc/8               8    75       irq_work_queue_on (target=39, func=nohz_full_kick_func)
0.103600589        rcuc/8               8    75       irq_work_queue_on (target=36, func=nohz_full_kick_func)
0.103604182        rcuc/8               8    75       irq_work_queue_on (target=37, func=nohz_full_kick_func)
0.103605222        rcuc/8               8    75       irq_work_queue_on (target=38, func=nohz_full_kick_func)
Use Ctrl-C to stop tracing.

Run the below command in the background to start recording suspicious calls happening on cores 36-39, with backtraces enabled and in quiet mode (so no record is dumped immediately; the result will only be dumped in JSON when the script quits):

$ sudo ./rt-trace-bcc.py -c 36-39 -b -q
[There can be some warnings dumped; we can ignore them]
Enabled hook point: process_one_work
Enabled hook point: __queue_work
Enabled hook point: __queue_delayed_work
Enabled hook point: generic_exec_single
Enabled hook point: smp_call_function_many_cond
Enabled hook point: irq_work_queue
Enabled hook point: irq_work_queue_on
Press Ctrl-C to show results..
^CDump summary of messages:
{
    "irq_work_queue_on (target=36, func=nohz_full_kick_func)": {
        "count": 66,
        "backtrace": [
            "irq_work_queue_on+0x1",
            "tick_nohz_dep_set_all+0x55",
            "rcu_do_batch+0x435",
            "rcu_core+0x175",
            "rcu_cpu_kthread+0xa5",
            "smpboot_thread_fn+0x1d6",
            "kthread+0x15d",
            "ret_from_fork+0x35"
        ]
    },
    "irq_work_queue (cpu=36, func=nohz_full_kick_func)": {
        "count": 1,
        "backtrace": [
            "irq_work_queue+0x1"
        ]
    },
    "__queue_work (target=36, func=vmstat_update)": {
        "count": 2,
        "backtrace": [
            "__queue_work+0x1",
            "call_timer_fn+0x32",
            "run_timer_softirq+0x482",
            "__do_softirq+0xa5",
            "run_ksoftirqd+0x38",
            "smpboot_thread_fn+0x1d6",
            "kthread+0x15d",
            "ret_from_fork+0x35"
        ]
    },
    "__queue_delayed_work (target=37, func=vmstat_update, delay=9801)": {
        "count": 1,
        "backtrace": [
            "__queue_delayed_work+0x1",
            "queue_delayed_work_on+0x36",
            "process_one_work+0x18f",
            "worker_thread+0x30",
            "kthread+0x15d",
            "ret_from_fork+0x35"
        ]
    },
    ...
}

> Since it's a known and desired feature, any proposed
> interface ought to be able to deal with it. 

One can easily come up with an extension to this interface
that would allow this feature to be implemented (ignoring
any signal races).

> Otherwise we need to invent
> yet another interface once that feature does get around to be
> implemented.

This interface is extensible. So you can add a new "isolation feature"
to configure it.

What are the requirements of the signal exactly (and why is it popular)?
Because the interruption event can be due to:

* An IPI.
* A system call.

In the "full task isolation mode" patchset (the one from Alex),
performing a system call automatically generates a SIGKILL
(after the prctl to enable task isolated mode, but
before the prctl to disable it).
This can be implemented, if desired, by SECCOMP syscall blocking
(which already exists).

For other interruptions, which happen through IPIs, one can print
the stack trace of the program (or interrupt) that generated
the IPI to find out the cause (which is what rt-trace-bcc.py is doing).

An alternative would be to add tracepoints so that one can
find out which function in the kernel caused the CPU and
task to become "a target for interruptions".

> > > Also, see:
> > > 
> > >   https://lkml.kernel.org/r/20210929152429.186930629@infradead.org
> > 
> > As you can see from the below pseudocode, we were thinking of queueing
> > the (invalidate icache or TLB flush) in case app is in userspace,
> > to perform on return to kernel space, but the approach in your patch might be
> > superior (will take sometime to parse that thread...).
> 
> Let me assume you're talking about kernel TLB invalidates, otherwise it
> would be terribly broken.
> 
> > > Suppose:
> > > 
> > > 	CPU0					CPU1
> > > 
> > > 	sys_prctl()
> > > 	<kernel entry>
> > > 	  // marks task 'important'
> > > 						text_poke_sync()
> > > 						  // checks CPU0, not userspace, queues IPI
> > > 	<kernel exit>
> > > 
> > > 	$important userspace			  arch_send_call_function_ipi_mask()
> > > 	<IPI>
> > > 	  // finds task is 'important' and
> > > 	  // can't take interrupts
> > > 	  sigkill()
> > > 
> > > *Whoopsie*
> > > 
> > > 
> > > Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> > > therefore CPU0 can't wait for quiescence in kernelspace, but if it goes
> > > to userspace, it'll get killed on interruption. Catch-22.
> > 
> > We have been using a BPF tool for logging and monitoring of
> > interruptions:
> > https://github.com/xzpeter/rt-trace-bpf
> 
> I've no idea what you need bpf for, we have tracepoints in the entry
> paths that should suffice.

See the output of rt-trace-bcc.py above (it prints useful information,
for example it shows which CPU generated the interruption to an 
isolated CPU, and prints the backtrace on that "interruptor CPU").

> > But there is no such thing as 
> > 
> >          // finds task is 'important' and
> >          // can't take interrupts
> >          sigkill()
> > 
> > On this patchset.
> 
> But the interface should allow for it.

Well, you can add a feature, query whether the kernel supports it.

However, the attention paid to the races around SIGKILL does 
not seem useful to me (you either know your app performed a 
system call, or you get an IPI, in which case you have enough 
information to find the source of the IPI?).

> > We have been discussing something like this to avoid TLBs / invalidate
> > i-cache IPIs, but 
> > 
> >  #define CPU_REQ_FLUSH_TLB       0x1     /*      __flush_tlb_all()       */ 
> >  #define CPU_REQ_INV_ICACHE      0x2     /*      sync_core()             */ 
> 
> sync_core() is *NOT* an I$ flush.
> 
> >  
> >  #define IN_USER_MODE            (0x1 << 16) 
> >  
> >  /* when CPU is in kernel mode, should IPI rather than queuing the 
> >     request on per-cpu state variable */ 
> >  #define IN_KERNEL_MODE                (0) 
> >  
> >  Then on entry/exit would have to add a conditional branch: 
> >  
> >  Exit to userspace: 
> >  ----------------- 
> >  
> >        cpu = smp_processor_id(); 
> >  
> >        if (isolation_enabled(cpu)) { 
> >                atomic_or(IN_USER_MODE, &percpudata->user_kernel_state); 
> >        } 
> >  
> >  Kernel entry: 
> >  ------------- 
> >  
> >        cpu = smp_processor_id(); 
> >  
> >        if (isolation_enabled(cpu)) { 
> >                reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE); 
> >                if (reqs & CPU_REQ_FLUSH_TLB) 
> >                        flush_tlb_all(); 
> >                if (reqs & CPU_REQ_INV_ICACHE) 
> >                        invalidate_icache(); 
> >        } 
> >  
> >  Request side: 
> >  ------------- 
> >  
> >        int targetcpu; 
> >  
> >        do { 
> >                struct percpudata *pcpudata = per_cpu(&percpudata, targetcpu); 
> >  
> >                old_state = pcpudata->user_kernel_state; 
> >  
> >                /* in kernel mode ? */ 
> >                if (!(old_state & IN_USER_MODE)) { 
> >                        smp_call_function_single(request_fn, targetcpu, 1); 
> >                        break; 
> >                } 
> >                new_state = remote_state | CPU_REQ_FLUSH_TLB; // (or CPU_REQ_INV_ICACHE)
> >        } while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state); 
> > 
> 
> That's absolutely terrible :/ you're adding extra unconditional atomics
> to the entry/exit path instead of using the ones that are already there.
> That's no good.

Ok, will look into what is already there (in the entry/exit paths).

> Also, you're very much not dealing with that race either.

Not sure why the race you mention is relevant. So upon isolation
breaking, must the signal (informing of the isolation breaking) be sent
at a specific point in execution? Why?

> Also, I think you're broken vs instrumentation, all of this needs to
> happen super early on entry, possibly while still on the entry stack,
> otherwise the missed TLBi might be handled too late and we just used a
> stale TLB entry.

Probably, yes.

-----------

Task isolation prctl interface
******************************

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth
networking applications with user-space drivers can fall into this
category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel of the start and end of
the latency sensitive application section (with configurable system
behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:

   * PR_ISOL_FEAT_GET: Retrieve supported features.

   * PR_ISOL_CFG_GET: Retrieve task isolation configuration.

   * PR_ISOL_CFG_SET: Set task isolation configuration.

   * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.

   * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.

Summary of terms:

* feature:

     A distinct attribute or aspect of task isolation. Examples of
     features could be logging, new operating modes (eg: syscalls
     disallowed), userspace notifications, etc. The only feature
     currently available is quiescing.

* configuration:

     A specific choice from a given set of possible choices that
     dictate how the particular feature in question should behave.

* activation state:

     The activation state (whether active or inactive) of the task
     isolation features (features must be configured before being
     activated).

Inheritance of the isolation parameters and state, across fork(2) and
clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.

At a high-level, task isolation is divided into two steps:

1. Configuration.

2. Activation.

Section "Userspace support" describes how to use task isolation.

In terms of the interface, the sequence of steps to activate task
isolation is:

1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).

2. Configure task isolation features
   (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).

3. Activate or deactivate task isolation features
   (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).

This interface is based on ideas and code from the task isolation
patchset from Alex Belits: https://lwn.net/Articles/816298/


Feature description
===================

   * "ISOL_F_QUIESCE"

   This feature allows quiescing select kernel activities on return
   from system calls.


Interface description
=====================

**PR_ISOL_FEAT_GET**:

   Returns the supported features and feature capabilities, as a
   bitmask:

      prctl(PR_ISOL_FEAT_GET, feat, arg3, arg4, arg5);

   The 'feat' argument specifies whether to return supported features
   (if zero), or feature capabilities (if not zero). Possible values
   for 'feat' are:

   * "0":

        Return the bitmask of supported features, in the location
        pointed  to  by  "(int *)arg3". The buffer should allow space
        for 8 bytes.

   * "ISOL_F_QUIESCE":

        Return a structure containing which kernel activities are
        supported for quiescing, in the location pointed to by "(int
        *)arg3":

           struct task_isol_quiesce_extensions {
                   __u64 flags;
                   __u64 supported_quiesce_bits;
                   __u64 pad[6];
           };

        Where:

        *flags*: Additional flags (should be zero).

        *supported_quiesce_bits*: Bitmask indicating
           which features are supported for quiescing.

        *pad*: Additional space for future enhancements.

   Features and their capabilities are defined in
   include/uapi/linux/task_isolation.h.

**PR_ISOL_CFG_GET**:

   Retrieve task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Return configuration of task isolation features. The 'arg3'
        argument specifies whether to return configured features (if
        zero), or individual feature configuration (if not zero), as
        follows.

        * "0":

             Return the bitmask of configured features, in the
             location pointed  to  by  "(int *)arg4". The buffer
             should allow space for 8 bytes.

        * "ISOL_F_QUIESCE":

             Return the control structure for quiescing of background
             kernel activities, in the location pointed to by "(int
             *)arg4":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 pad[6];
                };

             Where:

             *flags*: Additional flags (should be zero).

             *quiesce_mask*: A bitmask containing which activities are
             configured for quiescing.

             *pad*: Additional space for future enhancements.

   * "I_CFG_INHERIT":

        Retrieve inheritance configuration across fork/clone.

        Return the structure which configures inheritance across
        fork/clone, in the location pointed to by "(int *)arg4":

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        See PR_ISOL_CFG_SET description for meaning of bits.

**PR_ISOL_CFG_SET**:

   Set task isolation configuration. The general format is:

      prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);

   The 'what' argument specifies what to configure. Possible values
   are:

   * "I_CFG_FEAT":

        Set configuration of task isolation features. 'arg3' specifies
        the feature. Possible values are:

        * "ISOL_F_QUIESCE":

             Set the control structure for quiescing of background
             kernel activities, from the location pointed to by "(int
             *)arg4":

                struct task_isol_quiesce_control {
                       __u64 flags;
                       __u64 quiesce_mask;
                       __u64 pad[6];
                };

             Where:

             *flags*: Additional flags (should be zero).

             *quiesce_mask*: A bitmask containing which kernel
             activities to quiesce.

             *pad*: Additional space for future enhancements.

             For quiesce_mask, possible bit sets are:

             * "ISOL_F_QUIESCE_VMSTATS"

             VM statistics are maintained in per-CPU counters to
             improve performance. When a CPU modifies a VM statistic,
             this modification is kept in the per-CPU counter. Certain
             activities require a global count, which involves
             requesting each CPU to flush its local counters to the
             global VM counters.

             This flush is implemented via a workqueue item, which
             might schedule a workqueue on isolated CPUs.

             To avoid this interruption, task isolation can be
             configured to, upon return from system calls, synchronize
             the per-CPU counters to global counters, thus avoiding
             the interruption.

             To ensure the application returns to userspace with no
             modified per-CPU counters, it is necessary to use
             mlockall() in addition to this flag.

   * "I_CFG_INHERIT":

        Set inheritance configuration when a new task is created via
        fork and clone.

        The "(int *)arg4" argument is a pointer to:

           struct task_isol_inherit_control {
                   __u8    inherit_mask;
                   __u8    pad[7];
           };

        inherit_mask is a bitmask that specifies which part of task
        isolation should be inherited:

        * Bit ISOL_INHERIT_CONF: Inherit task isolation
          configuration. This is the state written via
          prctl(PR_ISOL_CFG_SET, ...).

        * Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
          (requires ISOL_INHERIT_CONF to be set). The new task should
          behave, after fork/clone, in the same manner as the parent
          task after it executed:

             prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);

**PR_ISOL_ACTIVATE_GET**:

   Retrieve task isolation activation state.

   The general format is:

      prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);

   'pmask' specifies the location of a feature mask, where the current
   active mask will be copied. See PR_ISOL_ACTIVATE_SET for
   description of individual bits.

**PR_ISOL_ACTIVATE_SET**:

   Set task isolation activation state (activates/deactivates task
   isolation).

   The general format is:

      prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);

   The 'pmask' argument specifies the location of an 8 byte mask
   containing which features should be activated. Features whose bits
   are cleared will be deactivated. The possible bits for this mask
   are:

      * "ISOL_F_QUIESCE":

      Activate quiescing of background kernel activities. Quiescing
      happens on return to userspace from this system call, and on
      return from subsequent system calls (unless quiesce_oneshot_mask
      is configured, see below).

   If the arg3 argument is non-zero, it specifies a pointer to:

      struct task_isol_activate_control {
              __u64 flags;
              __u64 quiesce_oneshot_mask;
              __u64 pad[6];
      };

   Where:

      *flags*: Additional flags (should be zero).

      *quiesce_oneshot_mask*: Quiescing for the kernel activities
         with bits set on this mask will happen on the return from
         this system call, but not on return from subsequent ones.

   Quiescing can be adjusted (while active) by
   prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).


Userspace support
*****************

Task isolation is divided into two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose.

This results in three combinations:

1. Both configuration and activation performed by the latency
sensitive application. Allows fine-grained control of which task
isolation features are enabled and when (see the samples section below).

2. Only activation can be performed by the latency sensitive app (and
configuration performed by chisol). This allows the admin/user to
control task isolation parameters, and applications have to be
modified only once.

3. Configuration and activation performed by an external tool. This
allows unmodified applications to take advantage of task isolation.
Activation is performed by the "-a" option of chisol.


Examples
********

The "samples/task_isolation/" directory contains 3 examples:

* task_isol_userloop.c:

     Example of program with a loop on userspace scenario.

* task_isol_computation.c:

     Example of program that enters task isolated mode, performs an
     amount of computation, exits task isolated mode, and writes the
     computation to disk.

* task_isol_oneshot.c:

     Example of program that enables one-shot mode for quiescing,
     enters a processing loop, then upon an external event performs a
     number of syscalls to handle that event.

This is a snippet of code to activate task isolation if it has been
previously configured (by chisol for example):

   #include <sys/prctl.h>
   #include <linux/types.h>

   #ifdef PR_ISOL_CFG_GET
   int ret;
   unsigned long long fmask;

   ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
   if (ret != -1 && fmask != 0) {
           ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
           if (ret == -1) {
                   perror("prctl PR_ISOL_ACTIVATE_SET");
                   return ret;
           }
   }
   #endif




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch v4 1/8] add basic task isolation prctl interface
  2021-10-13 16:06         ` Marcelo Tosatti
@ 2021-10-14 13:02           ` Marcelo Tosatti
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-14 13:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli, Alex Belits,
	Peter Xu

<snip>

> What are the requirements of the signal exactly (and why it is popular) ?
> Because the interruption event can be due to:
> 
> * An IPI.
> * A system call.

IRQs (easy to trace), exceptions.

> In the "full task isolation mode" patchset (the one from Alex), a system call
> will automatically generate a SIGKILL once a system call is performed
> (after the prctl to enable task isolated mode, but
> before the prctl to disable task isolated mode).
> This can be implemented, if desired, by SECCOMP syscall blocking
> (which already exists).
> 
> For other interruptions, which happen through IPIs, one can print
> the stack trace of the program (or interrupt) that generated
> the IPI to find out the cause (which is what rt-trace-bcc.py is doing).
> 
> An alternative would be to add tracepoints so that one can
> find out which function in the kernel caused the CPU and
> task to become "a target for interruptions".

For example, adding a tracepoint to the mark_vmstat_dirty() function
(allowing one to see how that function was invoked on a given CPU, and
by whom) appears to provide sufficient information to debug problems.

(mark_vmstat_dirty() from
[patch v4 5/8] task isolation: sync vmstats conditional on changes)

Instead of a coredump image with a SIGKILL sent at that point.

Looking at

https://github.com/abelits/libtmc

One can see the notification via SIGUSR1 being used.

To support something similar to it, one would add a new bit to 
flags field of:

+struct task_isol_activate_control {
+       __u64 flags;
+       __u64 quiesce_oneshot_mask;
+       __u64 pad[6];
+};

Remove 

+ 	       ret = -EINVAL;
+               if (act_ctrl.flags)
+                       goto out;

from the handler, shrink the padded space and use it.

> 
> > > > Also, see:
> > > > 
> > > >   https://lkml.kernel.org/r/20210929152429.186930629@infradead.org
> > > 
> > > As you can see from the below pseudocode, we were thinking of queueing
> > > the (invalidate icache or TLB flush) in case app is in userspace,
> > > to perform on return to kernel space, but the approach in your patch might be
> > > superior (will take sometime to parse that thread...).
> > 
> > Let me assume you're talking about kernel TLB invalidates, otherwise it
> > would be terribly broken.
> > 
> > > > Suppose:
> > > > 
> > > > 	CPU0					CPU1
> > > > 
> > > > 	sys_prctl()
> > > > 	<kernel entry>
> > > > 	  // marks task 'important'
> > > > 						text_poke_sync()
> > > > 						  // checks CPU0, not userspace, queues IPI
> > > > 	<kernel exit>
> > > > 
> > > > 	$important userspace			  arch_send_call_function_ipi_mask()
> > > > 	<IPI>
> > > > 	  // finds task is 'important' and
> > > > 	  // can't take interrupts
> > > > 	  sigkill()
> > > > 
> > > > *Whoopsie*
> > > > 
> > > > 
> > > > Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> > > > therefore CPU0 can't wait for quiescence in kernelspace, but if it goes
> > > > to userspace, it'll get killed on interruption. Catch-22.

To reiterate on this point:

> > > >         CPU0                                    CPU1
> > > > 
> > > >         sys_prctl()
> > > >         <kernel entry>
> > > >           // marks task 'important'
> > > >                                                 text_poke_sync()
> > > >                                                   // checks CPU0, not userspace, queues IPI
> > > >         <kernel exit>

1) Such races can be fixed by proper uses of atomic variables.

2) If a signal to an application is desired, I fail to see why this
interface (ignoring bugs related to the particular mechanism) does not
allow it.

So hopefully this addresses your comments.



* Re: [patch v4 1/8] add basic task isolation prctl interface
  2021-10-13 15:01       ` Peter Zijlstra
  2021-10-13 16:06         ` Marcelo Tosatti
@ 2021-10-19 15:07         ` Marcelo Tosatti
  1 sibling, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2021-10-19 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Christoph Lameter, Juri Lelli, Alex Belits,
	Peter Xu

On Wed, Oct 13, 2021 at 05:01:29PM +0200, Peter Zijlstra wrote:
> That's absolutely terrible :/ you're adding extra unconditional atomics
> to the entry/exit path instead of using the ones that are already there.
> That's no good.
> 
> Also, you're very much not dealing with that race either.

Again, because we haven't seen any use for the notification signal.

But anyway, about the general race:

CPU0                                    CPU1

sys_prctl()
<kernel entry>
                                        if (target_cpu->isolated)
                                          // checks CPU0, not userspace, queues IPI
mark task 'task isolated'
<kernel exit>

		                         arch_send_call_function_ipi_mask()

task is interrupted

----

A possible solution would be to synchronize_rcu (or _expedited if
necessary):


CPU0                                    	CPU1

sys_prctl()
<kernel entry>
						rcu_read_lock()
                                        	if (target_cpu->isolated) {
                                          		// checks CPU0, not userspace, queues IPI
cpu0->isolated = true
synchronize_rcu/synchronize_rcu_expedited

		                         		arch_send_call_function_ipi_mask()
						}
						rcu_read_unlock()

<kernel exit>

----

But that cpu0->isolated variable is not usable for the TLB flush
stuff, for example.

But regarding the question
"the inherently racy nature of some of the don't disturb me stuff":
i think it depends on the case (as in what piece of kernel code).
For example, for the TLB flush case the atomic cmpxchg loop gets rid of the race.

But the above RCU protection might be sufficient for other cases.

And about the notification to userspace: I can't see a reason why it
can't be performed in an asynchronous fashion (say via eventfd to a
manager thread rather than to the isolated threads themselves).

> Also, I think you're broken vs instrumentation, all of this needs to
> happen super early on entry, possibly while still on the entry stack,
> otherwise the missed TLBi might be handled too late and we just used a
> stale TLB entry.

Alright, right after the switch to kernel CR3, and right before the
switch to user CR3 (or as early/late as possible).



end of thread, other threads:[~2021-10-19 15:50 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 1/8] add basic task isolation prctl interface Marcelo Tosatti
2021-10-12 13:05   ` Peter Zijlstra
2021-10-13 10:56     ` Marcelo Tosatti
2021-10-13 11:37       ` Marcelo Tosatti
2021-10-13 15:01       ` Peter Zijlstra
2021-10-13 16:06         ` Marcelo Tosatti
2021-10-14 13:02           ` Marcelo Tosatti
2021-10-19 15:07         ` Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 4/8] procfs: add per-pid task isolation state Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 6/8] KVM: x86: call isolation prepare from VM-entry code path Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 7/8] mm: vmstat: move need_update Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
