linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch V3 0/8] extensible prctl task isolation interface and vmstat sync
@ 2021-08-24 15:24 Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 1/8] add basic task isolation prctl interface Marcelo Tosatti
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu


The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, add task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the 
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 2) for details.

Note: the prctl interface is independent of nohz_full=.

---------

v3: 

 - Split in smaller patches		 (Nitesh Lal).
 - Misc cleanups			 (Nitesh Lal).
 - Clarify nohz_full is not a dependency (Nicolas Saenz).
 - Incorrect values for prctl definitions (kernel robot).
 - Save configured state, so applications  
   can activate externally configured
   task isolation parameters.
 - Remove "system default" notion (chisol should
   make it obsolete).
 - Update documentation: add new section with explanation
   about configuration/activation and code example.
 - Update samples.
 - Report configuration/activation state at
   /proc/pid/task_isolation.
 - Condense dirty information of per-CPU vmstats counters 
   in a bool.
 - In-kernel KVM support.
 - Add support to configure inheritance on fork and exec.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
  to use ISOL_F_QUIESCE_DEFMASK (whose default value
  is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).

v2 can be found at:
https://lore.kernel.org/patchwork/project/lkml/list/?series=510225


---

 Documentation/userspace-api/task_isolation.rst |  281 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c                             |    3 
 fs/proc/base.c                                 |   68 +++++++++++++++++++
 include/linux/sched.h                          |    5 +
 include/linux/task_isolation.h                 |  131 ++++++++++++++++++++++++++++++++++++++
 include/linux/vmstat.h                         |   17 ++++
 include/uapi/linux/prctl.h                     |   27 +++++++
 init/init_task.c                               |    3 
 kernel/Makefile                                |    2 
 kernel/entry/common.c                          |    2 
 kernel/exit.c                                  |    2 
 kernel/fork.c                                  |   23 ++++++
 kernel/sys.c                                   |   26 +++++++
 kernel/task_isolation.c                        |  315 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c                                    |  167 ++++++++++++++++++++++++++++++++++++------------
 samples/Kconfig                                |    7 ++
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |    9 ++
 samples/task_isolation/task_isol.c             |   83 ++++++++++++++++++++++++
 samples/task_isolation/task_isol.h             |    9 ++
 samples/task_isolation/task_isol_userloop.c    |   56 ++++++++++++++++
 21 files changed, 1194 insertions(+), 43 deletions(-)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 1/8] add basic task isolation prctl interface
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Add basic prctl task isolation interface, which allows
informing the kernel that application is executing 
latency sensitive code (where interruptions are undesired).

Interface is described by task_isolation.rst (added by
next patch).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/sched.h          |    5 +
 include/linux/task_isolation.h |   87 +++++++++++++++++
 include/uapi/linux/prctl.h     |   10 ++
 init/init_task.c               |    3 
 kernel/Makefile                |    2 
 kernel/fork.c                  |    6 +
 kernel/sys.c                   |   16 +++
 kernel/task_isolation.c        |  203 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 331 insertions(+), 1 deletion(-)

Index: linux-2.6/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/prctl.h
+++ linux-2.6/include/uapi/linux/prctl.h
@@ -267,4 +267,31 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
 # define PR_SCHED_CORE_MAX		4
 
+/* Task isolation control */
+#define PR_ISOL_INT_GET			63
+#define PR_ISOL_INT_SET			64
+# define INHERIT_CFG			0x0
+
+/*
+ * This structure provides control over
+ * inheritance of task isolation across
+ * clone and fork.
+ */
+struct task_isol_inherit_control {
+	__u8	inherit_mask;
+	__u8	pad[7];
+};
+
+# define ISOL_INHERIT_CONF		(1UL << 0)
+# define ISOL_INHERIT_ACTIVE		(1UL << 1)
+
+#define PR_ISOL_FEAT			65
+#define PR_ISOL_GET			66
+#define PR_ISOL_SET			67
+#define PR_ISOL_CTRL_GET		68
+#define PR_ISOL_CTRL_SET		69
+
+# define ISOL_F_QUIESCE			(1UL << 0)
+#  define ISOL_F_QUIESCE_VMSTATS	(1UL << 0)
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_CPU_ISOLATION) += task_isolation.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -58,6 +58,7 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
 #include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
@@ -2567,6 +2568,31 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
 #endif
+	case PR_ISOL_INT_GET:
+		error = prctl_task_isolation_int_get(arg2,
+						     (void __user *)arg3, arg4,
+						     arg5);
+		break;
+	case PR_ISOL_INT_SET:
+		error = prctl_task_isolation_int_set(arg2,
+						    (void __user *)arg3, arg4,
+						     arg5);
+		break;
+	case PR_ISOL_FEAT:
+		error = prctl_task_isolation_feat(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_GET:
+		error = prctl_task_isolation_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_SET:
+		error = prctl_task_isolation_set(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CTRL_GET:
+		error = prctl_task_isolation_ctrl_get(arg2, arg3, arg4, arg5);
+		break;
+	case PR_ISOL_CTRL_SET:
+		error = prctl_task_isolation_ctrl_set(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct isol_info;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1400,6 +1401,10 @@ struct task_struct {
 	struct llist_head               kretprobe_instances;
 #endif
 
+#ifdef CONFIG_CPU_ISOLATION
+	struct isol_info		*isol_info;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -213,6 +213,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_CPU_ISOLATION
+	.isol_info	= NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/task_isolation.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -734,6 +735,7 @@ void __put_task_struct(struct task_struc
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	tsk_isol_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk, true);
@@ -1511,6 +1513,15 @@ static int copy_io(unsigned long clone_f
 	return 0;
 }
 
+static int copy_task_isolation(struct task_struct *tsk)
+{
+#ifdef CONFIG_CPU_ISOLATION
+	if (current->isol_info)
+		return __copy_task_isolation(tsk);
+#endif
+	return 0;
+}
+
 static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 {
 	struct sighand_struct *sig;
@@ -2084,7 +2095,9 @@ static __latent_entropy struct task_stru
 #ifdef CONFIG_BPF_SYSCALL
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 #endif
-
+#ifdef CONFIG_CPU_ISOLATION
+	p->isol_info = NULL;
+#endif
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
@@ -2128,6 +2141,9 @@ static __latent_entropy struct task_stru
 	retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
+	retval = copy_task_isolation(p);
+	if (retval)
+		goto bad_fork_cleanup_thread;
 
 	stackleak_task_init(p);
 
@@ -2136,7 +2152,7 @@ static __latent_entropy struct task_stru
 				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
-			goto bad_fork_cleanup_thread;
+			goto bad_fork_cleanup_task_isolation;
 		}
 	}
 
@@ -2354,6 +2370,8 @@ bad_fork_put_pidfd:
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
+bad_fork_cleanup_task_isolation:
+	tsk_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);
 bad_fork_cleanup_io:
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/task_isolation.h
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_CPU_ISOLATION
+
+struct isol_info {
+	/* Which features have been configured */
+	int conf_mask;
+	/* Which features are active */
+	int active_mask;
+	/* Quiesce mask */
+	int quiesce_mask;
+
+	u8 inherit_mask;
+};
+
+extern void __tsk_isol_free(struct task_struct *tsk);
+
+static inline void tsk_isol_free(struct task_struct *tsk)
+{
+	if (tsk->isol_info)
+		__tsk_isol_free(tsk);
+}
+
+int prctl_task_isolation_int_get(unsigned long cmd, void __user *addr,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_int_set(unsigned long cmd, void __user *addr,
+				 unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_feat(unsigned long arg2, unsigned long arg3,
+			      unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_get(unsigned long arg2, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_set(unsigned long arg2, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_ctrl_set(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5);
+
+int __copy_task_isolation(struct task_struct *tsk);
+
+#else
+
+static inline void tsk_isol_free(struct task_struct *tsk)
+{
+}
+
+static inline int prctl_task_isolation_feat(unsigned long arg2,
+					    unsigned long arg3,
+					    unsigned long arg4,
+					    unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_int_get(unsigned long cmd,
+					       void __user *addr,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_int_set(unsigned long cmd,
+					       void __user *addr,
+					       unsigned long arg4,
+					       unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_get(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_set(unsigned long arg2,
+					   unsigned long arg3,
+					   unsigned long arg4,
+					   unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_ctrl_get(unsigned long arg2,
+						unsigned long arg3,
+						unsigned long arg4,
+						unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_ctrl_set(unsigned long arg2,
+						unsigned long arg3,
+						unsigned long arg4,
+						unsigned long arg5)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/task_isolation.c
@@ -0,0 +1,280 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@belits.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ *   Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/sched/task.h>
+
+void __tsk_isol_free(struct task_struct *tsk)
+{
+	if (!tsk->isol_info)
+		return;
+	kfree(tsk->isol_info);
+	tsk->isol_info = NULL;
+}
+
+static int prctl_task_isolation_feat_quiesce(int type)
+{
+	switch (type) {
+	case 0:
+		return ISOL_F_QUIESCE_VMSTATS;
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+int prctl_task_isolation_feat(unsigned long feat, unsigned long arg3,
+			      unsigned long arg4, unsigned long arg5)
+{
+	switch (feat) {
+	case 0:
+		return ISOL_F_QUIESCE;
+	case ISOL_F_QUIESCE:
+		return prctl_task_isolation_feat_quiesce(arg3);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static int task_isolation_get_quiesce(void)
+{
+	int ret = 0;
+
+	if (current->isol_info)
+		ret = current->isol_info->quiesce_mask;
+
+	return ret;
+}
+
+int prctl_task_isolation_get(unsigned long feat, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5)
+{
+	switch (feat) {
+	case 0: {
+		int ret = -ENODATA;
+
+		if (current->isol_info)
+			ret = current->isol_info->conf_mask;
+		return ret;
+	}
+	case ISOL_F_QUIESCE:
+		return task_isolation_get_quiesce();
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+static struct isol_info *tsk_isol_alloc_context(void)
+{
+	struct isol_info *info;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (unlikely(!info))
+		return ERR_PTR(-ENOMEM);
+
+	return info;
+}
+
+static int prepare_set_quiesce(int quiesce_mask)
+{
+	if (quiesce_mask != ISOL_F_QUIESCE_VMSTATS && quiesce_mask != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+int prctl_task_isolation_set(unsigned long feat, unsigned long arg3,
+			     unsigned long arg4, unsigned long arg5)
+{
+	int ret;
+	struct isol_info *isol_info;
+
+	/* Validate input */
+	switch (feat) {
+	case ISOL_F_QUIESCE:
+		ret = prepare_set_quiesce(arg3);
+		if (ret)
+			return -EINVAL;
+		break;
+	default:
+		break;
+	}
+
+	/* current->isol_info is only allocated/freed from task
+	 * context.
+	 */
+	if (!current->isol_info) {
+		isol_info = tsk_isol_alloc_context();
+		if (IS_ERR(isol_info))
+			return PTR_ERR(isol_info);
+		current->isol_info = isol_info;
+	}
+
+	isol_info = current->isol_info;
+	switch (feat) {
+	case ISOL_F_QUIESCE:
+		isol_info->quiesce_mask = arg3;
+		isol_info->conf_mask |= ISOL_F_QUIESCE;
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
+int __copy_task_isolation(struct task_struct *tsk)
+{
+	struct isol_info *info, *new_info;
+
+	info = current->isol_info;
+	if (!(info->inherit_mask & (ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE)))
+		return 0;
+
+	new_info = tsk_isol_alloc_context();
+	if (IS_ERR(new_info))
+		return PTR_ERR(new_info);
+
+	new_info->inherit_mask = info->inherit_mask;
+
+	if (info->inherit_mask & ISOL_INHERIT_CONF) {
+		new_info->quiesce_mask = info->quiesce_mask;
+		new_info->conf_mask = info->conf_mask;
+	}
+
+	if (info->inherit_mask & ISOL_INHERIT_ACTIVE)
+		new_info->active_mask = info->active_mask;
+
+	tsk->isol_info = new_info;
+
+	return 0;
+}
+
+int prctl_task_isolation_int_get(unsigned long cmd, void __user *addr,
+				 unsigned long arg4, unsigned long arg5)
+{
+	int ret = 0;
+
+	switch (cmd) {
+	case INHERIT_CFG: {
+		struct task_isol_inherit_control *i_ctrl;
+
+		if (!current->isol_info)
+			return -EINVAL;
+
+		i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+				 GFP_KERNEL);
+		if (!i_ctrl)
+			return -ENOMEM;
+
+		i_ctrl->inherit_mask = current->isol_info->inherit_mask;
+
+		if (copy_to_user(addr, i_ctrl, sizeof(*i_ctrl)))
+			ret = -EFAULT;
+		kfree(i_ctrl);
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+int prctl_task_isolation_int_set(unsigned long cmd, void __user *addr,
+				 unsigned long arg4, unsigned long arg5)
+{
+	int ret = 0;
+
+	switch (cmd) {
+	case INHERIT_CFG: {
+		struct task_isol_inherit_control *i_ctrl;
+
+		i_ctrl = kzalloc(sizeof(struct task_isol_inherit_control),
+				 GFP_KERNEL);
+		if (!i_ctrl)
+			return -ENOMEM;
+
+		ret = -EFAULT;
+		if (copy_from_user(i_ctrl, addr, sizeof(*i_ctrl)))
+			goto out_free;
+
+		ret = -EINVAL;
+		if (i_ctrl->inherit_mask & ~(ISOL_INHERIT_CONF|ISOL_INHERIT_ACTIVE))
+			goto out_free;
+
+		if (i_ctrl->inherit_mask & ISOL_INHERIT_ACTIVE)
+			if (!(i_ctrl->inherit_mask & ISOL_INHERIT_CONF))
+				goto out_free;
+
+		if (!current->isol_info) {
+			struct isol_info *isol_info;
+
+			isol_info = tsk_isol_alloc_context();
+			if (IS_ERR(isol_info)) {
+				ret = PTR_ERR(isol_info);
+				goto out_free;
+			}
+			current->isol_info = isol_info;
+		}
+
+		ret = 0;
+		current->isol_info->inherit_mask = i_ctrl->inherit_mask;
+out_free:
+		kfree(i_ctrl);
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+
+int prctl_task_isolation_ctrl_set(unsigned long feat, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	struct isol_info *isol_info;
+
+	if (feat != ISOL_F_QUIESCE && feat != 0)
+		return -EINVAL;
+
+	isol_info = current->isol_info;
+	if (!isol_info)
+		return -EINVAL;
+	isol_info->active_mask = feat;
+
+	return 0;
+}
+
+int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
+				  unsigned long arg4, unsigned long arg5)
+{
+	int ret = 0;
+
+	if (current->isol_info)
+		ret = current->isol_info->active_mask;
+
+	return ret;
+}



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 1/8] add basic task isolation prctl interface Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-08-26  9:59   ` Frederic Weisbecker
  2021-09-01 13:11   ` Nitesh Lal
  2021-08-24 15:24 ` [patch V3 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Add documentation and userspace sample code for prctl
task isolation interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 Documentation/userspace-api/task_isolation.rst |  211 +++++++++++++++++++++++++
 samples/Kconfig                                |    7 
 samples/Makefile                               |    1 
 samples/task_isolation/Makefile                |    9 +
 samples/task_isolation/task_isol.c             |   83 +++++++++
 samples/task_isolation/task_isol.h             |    9 +
 samples/task_isolation/task_isol_userloop.c    |   56 ++++++
 7 files changed, 376 insertions(+)

Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,281 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Task isolation prctl interface
+===============================
+
+Certain types of applications benefit from running uninterrupted by
+background OS activities. Realtime systems and high-bandwidth networking
+applications with user-space drivers can fall into the category.
+
+To create an OS noise free environment for the application, this
+interface allows userspace to inform the kernel the start and
+end of the latency sensitive application section (with configurable
+system behaviour for that section).
+
+Note: the prctl interface is independent of nohz_full=.
+
+The prctl options are:
+
+
+        - PR_ISOL_FEAT: Retrieve supported features.
+        - PR_ISOL_GET: Retrieve task isolation parameters.
+        - PR_ISOL_SET: Set task isolation parameters.
+        - PR_ISOL_CTRL_GET: Retrieve task isolation state.
+        - PR_ISOL_CTRL_SET: Set task isolation state.
+        - PR_ISOL_GET_INT: Retrieve internal parameters.
+        - PR_ISOL_SET_INT: Retrieve internal parameters.
+
+
+
+Inheritance of the isolation parameters and state, across
+fork(2) and clone(2), can be changed via
+PR_ISOL_GET_INT/PR_ISOL_SET_INT.
+
+At a high-level, task isolation is divided in two steps:
+
+1. Configuration.
+2. Activation.
+
+Section "Userspace support" describes how to use
+task isolation.
+
+In terms of the interface, the sequence of steps to activate
+task isolation are:
+
+1. Retrieve supported task isolation features (PR_ISOL_FEAT).
+2. Configure task isolation features (PR_ISOL_SET/PR_ISOL_GET).
+3. Activate or deactivate task isolation features
+   (PR_ISOL_CTRL_GET/PR_ISOL_CTRL_SET).
+4. Optionally configure inheritance (PR_ISOL_GET_INT/PR_ISOL_SET_INT).
+
+This interface is based on ideas and code from the
+task isolation patchset from Alex Belits:
+https://lwn.net/Articles/816298/
+
+--------------------
+Feature description
+--------------------
+
+        - ``ISOL_F_QUIESCE``
+
+        This feature allows quiescing select kernel activities on
+        return from system calls.
+
+---------------------
+Interface description
+---------------------
+
+**PR_ISOL_FEAT**:
+
+        Returns the supported features and feature
+        capabilities, as a bitmask::
+
+                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies whether to return
+        supported features (if zero), or feature capabilities
+        (if not zero). Possible non-zero values for 'feat' are:
+
+        - ``ISOL_F_QUIESCE``:
+
+                Returns a bitmask containing which kernel
+                activities are supported for quiescing.
+
+        Features and its capabilities are defined at include/uapi/linux/task_isolation.h.
+
+**PR_ISOL_GET**:
+
+        Retrieve task isolation feature configuration.
+        The general format is::
+
+                prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies whether to return
+        configured features (if zero), or individual feature
+        configuration (if not zero). Possible non-zero
+        values for 'feat' are:
+
+        - ``ISOL_F_QUIESCE``:
+
+                Returns a bitmask containing which kernel
+                activities are enabled for quiescing.
+
+
+**PR_ISOL_SET**:
+
+        Configures task isolation features. The general format is::
+
+                prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
+
+        The 'feat' argument specifies which feature to configure.
+        Possible values for feat are:
+
+        - ``ISOL_F_QUIESCE``:
+
+                The 'arg3' argument is a bitmask specifying which
+                kernel activities to quiesce. Possible bit sets are:
+
+                - ``ISOL_F_QUIESCE_VMSTATS``
+
+                  VM statistics are maintained in per-CPU counters to
+                  improve performance. When a CPU modifies a VM statistic,
+                  this modification is kept in the per-CPU counter.
+                  Certain activities require a global count, which
+                  involves requesting each CPU to flush its local counters
+                  to the global VM counters.
+
+                  This flush is implemented via a workqueue item, which
+                  might schedule a workqueue on isolated CPUs.
+
+                  To avoid this interruption, task isolation can be
+                  configured to, upon return from system calls, synchronize
+                  the per-CPU counters to global counters, thus avoiding
+                  the interruption.
+
+                  To ensure the application returns to userspace
+                  with no modified per-CPU counters, its necessary to
+                  use mlockall() in addition to this isolcpus flag.
+
+**PR_ISOL_CTRL_GET**:
+
+        Retrieve task isolation control.
+
+                prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
+
+        Returns which isolation features are active.
+
+**PR_ISOL_CTRL_SET**:
+
+        Activates/deactivates task isolation control.
+
+                prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
+
+        The 'mask' argument specifies which features
+        to activate (bit set) or deactivate (bit clear).
+
+        For ISOL_F_QUIESCE, quiescing of background activities
+        happens on return to userspace from the
+        prctl(PR_ISOL_CTRL_SET) call, and on return from
+        subsequent system calls.
+
+        Quiescing can be adjusted (while active) by
+        prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ...).
+
+**PR_ISOL_GET_INT**:
+
+        Retrieves task isolation internal parameters.
+
+        The general format is::
+
+                prctl(PR_ISOL_GET_INT, cmd, arg3, arg4, arg5);
+
+        The 'cmd' argument specifies which parameter to configure.
+        Possible values for cmd are:
+
+        - ``INHERIT_CFG``:
+
+                Retrieve inheritance configuration.
+
+                The 'arg3' argument is a pointer to a struct
+                inherit_control::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                See PR_ISOL_SET_INT description below for meaning
+                of structure fields.
+
+**PR_ISOL_SET_INT**:
+
+        Sets task isolation internal parameters.
+
+        The general format is::
+
+                prctl(PR_ISOL_SET_INT, cmd, arg3, arg4, arg5);
+
+        The 'cmd' argument specifies which parameter to configure.
+        Possible values for cmd are:
+
+        - ``INHERIT_CFG``:
+
+                Set inheritance configuration when a new task
+                is created via fork and clone.
+
+                The 'arg3' argument is a pointer to a struct
+                inherit_control::
+
+                        struct task_isol_inherit_control {
+                                __u8    inherit_mask;
+                                __u8    pad[7];
+                        };
+
+                inherit_mask is a bitmask that specifies which part
+                of task isolation to be inherited:
+
+                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
+                  This is the stated written via prctl(PR_ISOL_SET, ...).
+
+                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
+                  (requires ISOL_INHERIT_CONF to be set). The new task
+                  should behave, right after fork/clone, in the same manner
+                  as the parent task _after_ it executed:
+
+                        prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
+
+                  with a valid mask.
+
+==================
+Userspace support
+==================
+
+Task isolation is divided in two main steps: configuration and activation.
+
+Each step can be performed by an external tool or the latency sensitive
+application itself. util-linux contains the "chisol" tool for this
+purpose.
+
+This results in three combinations:
+
+1. Both configuration and activation performed by the
+latency sensitive application.
+Allows fine grained control of what task isolation
+features are enabled and when (see samples section below).
+
+2. Only activation can be performed by the latency sensitive app
+(and configuration performed by chisol).
+This allows the admin/user to control task isolation parameters,
+and applications have to be modified only once.
+
+3. Configuration and activation performed by an external tool.
+This allows unmodified applications to take advantage of
+task isolation. Activation is performed by the "-a" option
+of chisol.
+
+========
+Examples
+========
+
+The ``samples/task_isolation/`` directory contains sample
+code.
+
+This is a snippet of code to activate task isolation if
+it has been previously configured (by chisol for example)::
+
+        #include <sys/prctl.h>
+
+        #ifdef PR_ISOL_GET
+        ret = prctl(PR_ISOL_GET, 0, 0, 0, 0);
+        if (ret != -1) {
+                unsigned long mask = ret;
+
+                ret = prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
+                if (ret == -1) {
+                        perror("prctl PR_ISOL_CTRL_SET");
+                        return ret;
+                }
+        }
+        #endif
+
Index: linux-2.6/samples/task_isolation/task_isol.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include <errno.h>
+#include "task_isol.h"
+
+#ifdef PR_ISOL_FEAT
+int task_isol_setup(void)
+{
+	int ret;
+	int errnosv;
+
+	/* Retrieve supported task isolation features */
+	ret = prctl(PR_ISOL_FEAT, 0, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT");
+		return ret;
+	}
+	printf("supported features bitmask: 0x%x\n", ret);
+
+	/* Retrieve supported ISOL_F_QUIESCE bits */
+	ret = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
+		return ret;
+	}
+	printf("supported ISOL_F_QUIESCE bits: 0x%x\n", ret);
+
+	/* Do not configure task isolation attributes if already set */
+	ret = prctl(PR_ISOL_GET, 0, 0, 0, 0);
+	errnosv = errno;
+	if (ret != -1) {
+		printf("Task isolation parameters already configured!\n");
+		return ret;
+	}
+	if (ret == -1 && errnosv != ENODATA) {
+		perror("prctl PR_ISOL_GET");
+		return ret;
+	}
+	ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS,
+		    0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_SET");
+		return ret;
+	}
+	return ISOL_F_QUIESCE;
+}
+
+int task_isol_ctrl_set(unsigned long mask)
+{
+	int ret;
+
+	ret = prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
+		return -1;
+	}
+
+	return 0;
+}
+
+#else
+
+int task_isol_setup(void)
+{
+	return 0;
+}
+
+int task_isol_ctrl_set(unsigned long mask)
+{
+	return 0;
+}
+#endif
+
+
Index: linux-2.6/samples/task_isolation/task_isol.h
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TASK_ISOL_H
+#define __TASK_ISOL_H
+
+int task_isol_setup(void);
+
+int task_isol_ctrl_set(unsigned long mask);
+
+#endif /* __TASK_ISOL_H */
Index: linux-2.6/samples/task_isolation/task_isol_userloop.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isol_userloop.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+#include "task_isol.h"
+
+int main(void)
+{
+	int ret;
+	void *buf = malloc(4096);
+	unsigned long mask;
+
+	memset(buf, 1, 4096);
+	ret = mlock(buf, 4096);
+	if (ret) {
+		perror("mlock");
+		return EXIT_FAILURE;
+	}
+
+	ret = task_isol_setup();
+	if (ret)
+		return EXIT_FAILURE;
+	mask = ret;
+
+	ret = prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
+		return EXIT_FAILURE;
+	}
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+	/* busy loop */
+	while (ret < NR_LOOPS)  {
+		memset(buf, 0, 4096);
+		ret = ret+1;
+		if (!(ret % NR_PRINT))
+			printf("loops=%d of %d\n", ret, NR_LOOPS);
+	}
+
+	ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
+	if (ret == -1) {
+		perror("prctl PR_ISOL_CTRL_SET (0)");
+		exit(0);
+	}
+
+	return EXIT_SUCCESS;
+}
+
Index: linux-2.6/samples/Kconfig
===================================================================
--- linux-2.6.orig/samples/Kconfig
+++ linux-2.6/samples/Kconfig
@@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
 	  Build example userspace program to use the new mount_notify(),
 	  sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
 
+config SAMPLE_TASK_ISOLATION
+	bool "task isolation sample"
+	depends on CC_CAN_LINK && HEADERS_INSTALL
+	help
+	  Build example userspace program to use prctl task isolation
+	  interface.
+
 endif # SAMPLES
Index: linux-2.6/samples/Makefile
===================================================================
--- linux-2.6.orig/samples/Makefile
+++ linux-2.6/samples/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
 subdir-$(CONFIG_SAMPLE_WATCHDOG)	+= watchdog
 subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION)	+= task_isolation
Index: linux-2.6/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+userprogs-always-y += task_isol_userloop
+task_isol_userloop-objs := task_isol.o task_isol_userloop.o
+
+userccflags += -I usr/include
+
+
+#$(CC) $^ -o $@



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 3/8] task isolation: sync vmstats on return to userspace
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 1/8] add basic task isolation prctl interface Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-09-10 13:49   ` nsaenzju
  2021-08-24 15:24 ` [patch V3 4/8] procfs: add per-pid task isolation state Marcelo Tosatti
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, use the task isolation prctl interface to quiesce 
deferred actions when returning to userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/task_isolation.h |   12 ++++++++++++
 include/linux/vmstat.h         |    8 ++++++++
 kernel/entry/common.c          |    2 ++
 kernel/task_isolation.c        |   26 ++++++++++++++++++++++++++
 mm/vmstat.c                    |   21 +++++++++++++++++++++
 5 files changed, 69 insertions(+)

Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -41,8 +41,20 @@ int prctl_task_isolation_ctrl_set(unsign
 
 int __copy_task_isolation(struct task_struct *tsk);
 
+void __isolation_exit_to_user_mode_prepare(void);
+
+static inline void isolation_exit_to_user_mode_prepare(void)
+{
+	if (current->isol_info)
+		__isolation_exit_to_user_mode_prepare();
+}
+
 #else
 
+static void isolation_exit_to_user_mode_prepare(void)
+{
+}
+
 static inline void tsk_isol_free(struct task_struct *tsk)
 {
 }
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+#ifdef CONFIG_SMP
+void sync_vmstat(void);
+#else
+static inline void sync_vmstat(void)
+{
+}
+#endif
+
 struct reclaim_stat {
 	unsigned nr_dirty;
 	unsigned nr_unqueued_dirty;
Index: linux-2.6/kernel/entry/common.c
===================================================================
--- linux-2.6.orig/kernel/entry/common.c
+++ linux-2.6/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/task_isolation.h>
 
 #include "common.h"
 
@@ -287,6 +288,7 @@ static void syscall_exit_to_user_mode_pr
 static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
 {
 	syscall_exit_to_user_mode_prepare(regs);
+	isolation_exit_to_user_mode_prepare();
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 }
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -18,6 +18,8 @@
 #include <linux/sysfs.h>
 #include <linux/init.h>
 #include <linux/sched/task.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
 
 void __tsk_isol_free(struct task_struct *tsk)
 {
@@ -278,3 +280,19 @@ int prctl_task_isolation_ctrl_get(unsign
 
 	return ret;
 }
+
+void __isolation_exit_to_user_mode_prepare(void)
+{
+	struct isol_info *i;
+
+ 	i = current->isol_info;
+	if (!i)
+		return;
+
+	if (i->active_mask != ISOL_F_QUIESCE)
+		return;
+
+	if (i->quiesce_mask & ISOL_F_QUIESCE_VMSTATS)
+		sync_vmstat();
+}
+EXPORT_SYMBOL_GPL(__isolation_exit_to_user_mode_prepare);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1964,6 +1964,27 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+void sync_vmstat(void)
+{
+	int cpu;
+
+	cpu = get_cpu();
+
+	refresh_cpu_vm_stats(false);
+	put_cpu();
+
+	/*
+	 * If task is migrated to another CPU between put_cpu
+	 * and cancel_delayed_work_sync, the code below might
+	 * cancel vmstat_update work for a different cpu
+	 * (than the one from which the vmstats were flushed).
+	 *
+	 * However, vmstat shepherd will re-enable it later,
+	 * so its harmless.
+	 */
+	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 4/8] procfs: add per-pid task isolation state
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2021-08-24 15:24 ` [patch V3 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

Add /proc/pid/task_isolation file, to query the state of
task isolation configuration.

---
 fs/proc/base.c                 |    4 ++
 include/linux/task_isolation.h |    2 +
 kernel/task_isolation.c        |   60 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -95,6 +95,8 @@
 #include <linux/posix-timers.h>
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
+#include <linux/prctl.h>
+#include <linux/task_isolation.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -661,6 +663,69 @@ static int proc_pid_syscall(struct seq_f
 }
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
+#ifdef CONFIG_CPU_ISOLATION
+
+struct qoptions {
+	unsigned long mask;
+	char *name;
+};
+
+static struct qoptions iopts[] = {
+	{ISOL_F_QUIESCE, "quiesce"},
+};
+#define ILEN (sizeof(iopts) / sizeof(struct qoptions))
+
+static struct qoptions qopts[] = {
+	{ISOL_F_QUIESCE_VMSTATS, "vmstat_sync"},
+};
+#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
+
+static void show_isolation_state(struct seq_file *m,
+				 struct qoptions *iopt,
+				 int mask,
+				 const char *hdr)
+{
+	int i;
+
+	seq_printf(m, hdr);
+	for (i = 0; i < ILEN; i++) {
+		if (mask & iopt->mask)
+			seq_printf(m, "%s ", iopt->name);
+		iopt++;
+	}
+	if (mask == 0)
+		seq_printf(m, "none ");
+	seq_printf(m, "\n");
+}
+
+int proc_pid_task_isolation(struct seq_file *m, struct pid_namespace *ns,
+			    struct pid *pid, struct task_struct *t)
+{
+	int active_mask, quiesce_mask, conf_mask;
+	struct isol_info *isol_info;
+	struct inode *inode = m->private;
+	struct task_struct *task = get_proc_task(inode);
+
+	isol_info = t->isol_info;
+	if (!isol_info)
+		active_mask = quiesce_mask = conf_mask = 0;
+	else {
+		active_mask = isol_info->active_mask;
+		quiesce_mask = isol_info->quiesce_mask;
+		conf_mask = isol_info->conf_mask;
+	}
+
+	show_isolation_state(m, iopts, conf_mask, "Configured state: ");
+	show_isolation_state(m, iopts, active_mask, "Active state: ");
+	show_isolation_state(m, qopts, quiesce_mask, "Quiescing: ");
+
+	put_task_struct(task);
+
+	return 0;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
 /************************************************************************/
 /*                       Here the fs part begins                        */
 /************************************************************************/
@@ -3278,6 +3343,9 @@ static const struct pid_entry tgid_base_
 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
 	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
 #endif
+#ifdef CONFIG_CPU_ISOLATION
+	ONE("task_isolation", S_IRUSR, proc_pid_task_isolation),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 5/8] task isolation: sync vmstats conditional on changes
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2021-08-24 15:24 ` [patch V3 4/8] procfs: add per-pid task isolation state Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-08-25  9:46   ` Christoph Lameter
  2021-08-24 15:24 ` [patch V3 6/8] KVM: x86: call isolation prepare from VM-entry code path Marcelo Tosatti
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Rather than syncing VM-stats on every return to userspace
(or VM-entry), keep track of changes through a per-CPU bool.

This improves performance when enabling task isolated
for vcpu VMs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -22,7 +22,16 @@ int sysctl_vm_numa_stat_handler(struct c
 #endif
 
 #ifdef CONFIG_SMP
-void sync_vmstat(void);
+extern struct static_key vmstat_sync_enabled;
+
+void __sync_vmstat(void);
+static inline void sync_vmstat(void)
+{
+	if (static_key_false(&vmstat_sync_enabled))
+		__sync_vmstat();
+}
+
+void init_sync_vmstat(void);
 #else
 static inline void sync_vmstat(void)
 {
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- linux-2.6.orig/kernel/task_isolation.c
+++ linux-2.6/kernel/task_isolation.c
@@ -21,6 +21,17 @@
 #include <linux/mm.h>
 #include <linux/vmstat.h>
 
+void __tsk_isol_exit(struct task_struct *tsk)
+{
+	struct isol_info *i;
+
+	i = tsk->isol_info;
+	if (!i)
+		return;
+
+	static_key_slow_dec(&vmstat_sync_enabled);
+}
+
 void __tsk_isol_free(struct task_struct *tsk)
 {
 	if (!tsk->isol_info)
@@ -92,6 +103,12 @@ static struct isol_info *tsk_isol_alloc_
 	if (unlikely(!info))
 		return ERR_PTR(-ENOMEM);
 
+	preempt_disable();
+	init_sync_vmstat();
+	preempt_enable();
+
+	static_key_slow_inc(&vmstat_sync_enabled);
+
 	return info;
 }
 
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -306,6 +307,22 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+struct static_key vmstat_sync_enabled;
+static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void mark_vmstat_dirty(void)
+{
+	if (!static_key_false(&vmstat_sync_enabled))
+		return;
+
+	raw_cpu_write(vmstat_dirty, true);
+}
+
+void init_sync_vmstat(void)
+{
+	raw_cpu_write(vmstat_dirty, true);
+}
+
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -328,6 +345,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
@@ -359,6 +377,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 }
 EXPORT_SYMBOL(__mod_node_page_state);
 
@@ -399,6 +418,7 @@ void __inc_zone_state(struct zone *zone,
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -417,6 +437,7 @@ void __inc_node_state(struct pglist_data
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -445,6 +466,7 @@ void __dec_zone_state(struct zone *zone,
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -463,6 +485,7 @@ void __dec_node_state(struct pglist_data
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -526,6 +549,7 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	mark_vmstat_dirty();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -594,6 +618,7 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	mark_vmstat_dirty();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -1964,13 +1989,18 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
-void sync_vmstat(void)
+void __sync_vmstat(void)
 {
 	int cpu;
 
 	cpu = get_cpu();
+	if (per_cpu(vmstat_dirty, cpu) == false) {
+		put_cpu();
+		return;
+	}
 
 	refresh_cpu_vm_stats(false);
+	raw_cpu_write(vmstat_dirty, false);
 	put_cpu();
 
 	/*
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- linux-2.6.orig/include/linux/task_isolation.h
+++ linux-2.6/include/linux/task_isolation.h
@@ -24,6 +24,13 @@ static inline void tsk_isol_free(struct
 		__tsk_isol_free(tsk);
 }
 
+void __tsk_isol_exit(struct task_struct *tsk);
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+       if (tsk->isol_info)
+               __tsk_isol_exit(tsk);
+}
+
 int prctl_task_isolation_int_get(unsigned long cmd, void __user *addr,
 				 unsigned long arg4, unsigned long arg5);
 int prctl_task_isolation_int_set(unsigned long cmd, void __user *addr,
@@ -59,6 +66,10 @@ static inline void tsk_isol_free(struct
 {
 }
 
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+}
+
 static inline int prctl_task_isolation_feat(unsigned long arg2,
 					    unsigned long arg3,
 					    unsigned long arg4,
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/task_isolation.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -778,6 +779,7 @@ void __noreturn do_exit(long code)
 	}
 
 	io_uring_files_cancel(tsk->files);
+	tsk_isol_exit(tsk);
 	exit_signals(tsk);  /* sets PF_EXITING */
 
 	/* sync mm's RSS info before statistics gathering */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -2371,6 +2371,7 @@ bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
 bad_fork_cleanup_task_isolation:
+	tsk_isol_exit(p);
 	tsk_isol_free(p);
 bad_fork_cleanup_thread:
 	exit_thread(p);



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 6/8] KVM: x86: call isolation prepare from VM-entry code path
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2021-08-24 15:24 ` [patch V3 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 7/8] mm: vmstat: move need_update Marcelo Tosatti
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

VM-entry code path is an entry point similar to userspace return
when task isolation is concerned.

Call isolation_exit_to_user_mode_prepare before VM-enter.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kvm/x86.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/arch/x86/kvm/x86.c
===================================================================
--- linux-2.6.orig/arch/x86/kvm/x86.c
+++ linux-2.6/arch/x86/kvm/x86.c
@@ -59,6 +59,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/entry-kvm.h>
 #include <linux/suspend.h>
+#include <linux/task_isolation.h>
 
 #include <trace/events/kvm.h>
 
@@ -9539,6 +9540,8 @@ static int vcpu_enter_guest(struct kvm_v
 		goto cancel_injection;
 	}
 
+	isolation_exit_to_user_mode_prepare();
+
 	preempt_disable();
 
 	static_call(kvm_x86_prepare_guest_switch)(vcpu);



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 7/8] mm: vmstat: move need_update
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2021-08-24 15:24 ` [patch V3 6/8] KVM: x86: call isolation prepare from VM-entry code path Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-08-24 15:24 ` [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  2021-08-25 10:02 ` [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  8 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

Move need_update() function up in vmstat.c, needed by next patch.
No code changes.

Remove a duplicate comment while at it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   67 +++++++++++++++++++++++++++---------------------------------
 1 file changed, 31 insertions(+), 36 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1819,6 +1819,37 @@ static const struct seq_operations vmsta
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	pg_data_t *last_pgdat = NULL;
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		struct per_cpu_nodestat *n;
+
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 */
+		if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
+			       sizeof(pzstats->vm_stat_diff[0])))
+			return true;
+
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
+		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+		if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
+			       sizeof(n->vm_node_stat_diff[0])))
+		    return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1899,42 +1930,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Switch off vmstat processing and then fold all the remaining differentials
- * until the diffs stay at zero. The function is used by NOHZ and can only be
- * invoked when tick processing is not active.
- */
-/*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
-			       sizeof(pzstats->vm_stat_diff[0])))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
-			       sizeof(n->vm_node_stat_diff[0])))
-		    return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2021-08-24 15:24 ` [patch V3 7/8] mm: vmstat: move need_update Marcelo Tosatti
@ 2021-08-24 15:24 ` Marcelo Tosatti
  2021-08-25  9:30   ` Christoph Lameter
  2021-09-01 13:05   ` Nitesh Lal
  2021-08-25 10:02 ` [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
  8 siblings, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-24 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Marcelo Tosatti

It is not necessary to queue work item to run refresh_vm_stats 
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes sosreport hang (which uses vmstat_refresh) with 
spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/vmstat.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 44 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1851,6 +1851,31 @@ static bool need_update(int cpu)
 }
 
 #ifdef CONFIG_PROC_FS
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pages *pcp;
+
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		if (!pcp->count)
+			continue;
+
+		if (!pcp->expire)
+			continue;
+
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+#endif
+
+	return false;
+}
+
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
@@ -1860,8 +1885,12 @@ int vmstat_refresh(struct ctl_table *tab
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1875,9 +1904,19 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, refresh_vm_stats);
+		if (need_update(cpu) || need_drain_remote_zones(cpu))
+			schedule_work_on(cpu, work);
+	}
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(works, cpu));
+	put_online_cpus();
+	free_percpu(works);
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-08-24 15:24 ` [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
@ 2021-08-25  9:30   ` Christoph Lameter
  2021-09-01 13:05   ` Nitesh Lal
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2021-08-25  9:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Tue, 24 Aug 2021, Marcelo Tosatti wrote:

> It is not necessary to queue work item to run refresh_vm_stats
> on a remote CPU if that CPU has no dirty stats and no per-CPU
> allocations for remote nodes.

The issue in the past was whether the effort to check is adding overhead
that is comparable to run refresh_vm_stats. YMMV.

> This fixes sosreport hang (which uses vmstat_refresh) with
> spinning SCHED_FIFO process.

Ughhh.. SCHED_FIFO is evil....

>  #ifdef CONFIG_PROC_FS
> +static bool need_drain_remote_zones(int cpu)

Well this is not related to vm stats but per cpu pages of the page
allocator. Maybe call this need_drain_remote_pcp or something?

> @@ -1860,8 +1885,12 @@ int vmstat_refresh(struct ctl_table *tab
>  		   void *buffer, size_t *lenp, loff_t *ppos)
>  {
>  	long val;
> -	int err;
> -	int i;
> +	int i, cpu;
> +	struct work_struct __percpu *works;
> +
> +	works = alloc_percpu(struct work_struct);

Do malloc instead? Using the percpu allocator frequently in a function to
allocator temporary variables can cause needless fragmentation there. The
percpu allocator does not have the frag management features of the slab
allocators.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 5/8] task isolation: sync vmstats conditional on changes
  2021-08-24 15:24 ` [patch V3 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
@ 2021-08-25  9:46   ` Christoph Lameter
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2021-08-25  9:46 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Frederic Weisbecker, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Tue, 24 Aug 2021, Marcelo Tosatti wrote:

> Rather than syncing VM-stats on every return to userspace
> (or VM-entry), keep track of changes through a per-CPU bool.
>
> This improves performance when enabling task isolated
> for vcpu VMs.

And it adds overhead for each time the counters are updated. The static
check is not that bad but the per cpu reference causes an extra cacheline
hit in potentially performance sensitive vm code.

On the other hand: Once we have an indicator that the vmstats have been
updated then the checks for the need to perform a vmstat update can be
simplified using that percpu variable and made much faster.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 0/8] extensible prctl task isolation interface and vmstat sync
  2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2021-08-24 15:24 ` [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
@ 2021-08-25 10:02 ` Marcelo Tosatti
  8 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-25 10:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu, Thomas Gleixner


+CC Thomas.

On Tue, Aug 24, 2021 at 12:24:23PM -0300, Marcelo Tosatti wrote:
> 
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
> 
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
> 
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
> 
> To fix this, add task isolation prctl interface to quiesce
> deferred actions when returning to userspace.
> 
> The patchset is based on ideas and code from the 
> task isolation patchset from Alex Belits:
> https://lwn.net/Articles/816298/
> 
> Please refer to Documentation/userspace-api/task_isolation.rst
> (patch 2) for details.
> 
> Note: the prctl interface is independent of nohz_full=.
> 
> ---------
> 
> v3: 
> 
>  - Split in smaller patches		 (Nitesh Lal).
>  - Misc cleanups			 (Nitesh Lal).
>  - Clarify nohz_full is not a dependency (Nicolas Saenz).
>  - Incorrect values for prctl definitions (kernel robot).
>  - Save configured state, so applications  
>    can activate externally configured
>    task isolation parameters.
>  - Remove "system default" notion (chisol should
>    make it obsolete).
>  - Update documentation: add new section with explanation
>    about configuration/activation and code example.
>  - Update samples.
>  - Report configuration/activation state at
>    /proc/pid/task_isolation.
>  - Condense dirty information of per-CPU vmstats counters 
>    in a bool.
>  - In-kernel KVM support.
>  - Add support to configure inheritance on fork and exec.
> 
> v2:
> 
> - Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).
> 
> - Avoid potential regressions by allowing applications
>   to use ISOL_F_QUIESCE_DEFMASK (whose default value
>   is configurable in /sys/).         (Nitesh Lal / Nicolas Saenz).
> 
> v2 can be found at:
> https://lore.kernel.org/patchwork/project/lkml/list/?series=510225
> 
> 
> ---
> 
>  Documentation/userspace-api/task_isolation.rst |  281 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c                             |    3 
>  fs/proc/base.c                                 |   68 +++++++++++++++++++
>  include/linux/sched.h                          |    5 +
>  include/linux/task_isolation.h                 |  131 ++++++++++++++++++++++++++++++++++++++
>  include/linux/vmstat.h                         |   17 ++++
>  include/uapi/linux/prctl.h                     |   27 +++++++
>  init/init_task.c                               |    3 
>  kernel/Makefile                                |    2 
>  kernel/entry/common.c                          |    2 
>  kernel/exit.c                                  |    2 
>  kernel/fork.c                                  |   23 ++++++
>  kernel/sys.c                                   |   26 +++++++
>  kernel/task_isolation.c                        |  315 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmstat.c                                    |  167 ++++++++++++++++++++++++++++++++++++------------
>  samples/Kconfig                                |    7 ++
>  samples/Makefile                               |    1 
>  samples/task_isolation/Makefile                |    9 ++
>  samples/task_isolation/task_isol.c             |   83 ++++++++++++++++++++++++
>  samples/task_isolation/task_isol.h             |    9 ++
>  samples/task_isolation/task_isol_userloop.c    |   56 ++++++++++++++++
>  21 files changed, 1194 insertions(+), 43 deletions(-)
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-24 15:24 ` [patch V3 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
@ 2021-08-26  9:59   ` Frederic Weisbecker
  2021-08-26 12:11     ` Marcelo Tosatti
  2021-09-01 13:11   ` Nitesh Lal
  1 sibling, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2021-08-26  9:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Tue, Aug 24, 2021 at 12:24:25PM -0300, Marcelo Tosatti wrote:
> Add documentation and userspace sample code for prctl
> task isolation interface.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  Documentation/userspace-api/task_isolation.rst |  211 +++++++++++++++++++++++++
>  samples/Kconfig                                |    7 
>  samples/Makefile                               |    1 
>  samples/task_isolation/Makefile                |    9 +
>  samples/task_isolation/task_isol.c             |   83 +++++++++
>  samples/task_isolation/task_isol.h             |    9 +
>  samples/task_isolation/task_isol_userloop.c    |   56 ++++++
>  7 files changed, 376 insertions(+)
> 
> Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
> ===================================================================
> --- /dev/null
> +++ linux-2.6/Documentation/userspace-api/task_isolation.rst
> @@ -0,0 +1,281 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +Task isolation prctl interface
> +===============================
> +
> +Certain types of applications benefit from running uninterrupted by
> +background OS activities. Realtime systems and high-bandwidth networking
> +applications with user-space drivers can fall into the category.
> +
> +To create an OS noise free environment for the application, this
> +interface allows userspace to inform the kernel the start and
> +end of the latency sensitive application section (with configurable
> +system behaviour for that section).
> +
> +Note: the prctl interface is independent of nohz_full=.
> +
> +The prctl options are:
> +
> +
> +        - PR_ISOL_FEAT: Retrieve supported features.
> +        - PR_ISOL_GET: Retrieve task isolation parameters.
> +        - PR_ISOL_SET: Set task isolation parameters.
> +        - PR_ISOL_CTRL_GET: Retrieve task isolation state.
> +        - PR_ISOL_CTRL_SET: Set task isolation state.
> +        - PR_ISOL_GET_INT: Retrieve internal parameters.
> +        - PR_ISOL_SET_INT: Retrieve internal parameters.

There should be some short summary here to explain the difference
between parameter, state, and internal parameter. I personally have
no clue so far.

> +
> +
> +Inheritance of the isolation parameters and state, across
> +fork(2) and clone(2), can be changed via
> +PR_ISOL_GET_INT/PR_ISOL_SET_INT.
> +
> +At a high-level, task isolation is divided in two steps:
> +
> +1. Configuration.
> +2. Activation.
> +
> +Section "Userspace support" describes how to use
> +task isolation.
> +
> +In terms of the interface, the sequence of steps to activate
> +task isolation are:
> +
> +1. Retrieve supported task isolation features (PR_ISOL_FEAT).
> +2. Configure task isolation features (PR_ISOL_SET/PR_ISOL_GET).
> +3. Activate or deactivate task isolation features
> +   (PR_ISOL_CTRL_GET/PR_ISOL_CTRL_SET).
> +4. Optionally configure inheritance (PR_ISOL_GET_INT/PR_ISOL_SET_INT).
> +
> +This interface is based on ideas and code from the
> +task isolation patchset from Alex Belits:
> +https://lwn.net/Articles/816298/
> +
> +--------------------
> +Feature description
> +--------------------
> +
> +        - ``ISOL_F_QUIESCE``
> +
> +        This feature allows quiescing select kernel activities on
> +        return from system calls.
> +
> +---------------------
> +Interface description
> +---------------------
> +
> +**PR_ISOL_FEAT**:
> +
> +        Returns the supported features and feature
> +        capabilities, as a bitmask::
> +
> +                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> +
> +        The 'feat' argument specifies whether to return
> +        supported features (if zero), or feature capabilities
> +        (if not zero). Possible non-zero values for 'feat' are:
> +
> +        - ``ISOL_F_QUIESCE``:
> +
> +                Returns a bitmask containing which kernel
> +                activities are supported for quiescing.
> +
> +        Features and its capabilities are defined at
> include/uapi/linux/task_isolation.h.

Preferably have feat a parameter name. We never know if we want
to extend it in the future.

> +
> +**PR_ISOL_GET**:
> +
> +        Retrieve task isolation feature configuration.
> +        The general format is::
> +
> +                prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
> +
> +        The 'feat' argument specifies whether to return
> +        configured features (if zero), or individual feature
> +        configuration (if not zero).

You might need to elaborate a bit on the "feat" behaviour difference.

> +        values for 'feat' are:
> +
> +        - ``ISOL_F_QUIESCE``:
> +
> +                Returns a bitmask containing which kernel
> +                activities are enabled for quiescing.
> +
> +
> +**PR_ISOL_SET**:
> +
> +        Configures task isolation features. The general format is::
> +
> +                prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
> +
> +        The 'feat' argument specifies which feature to configure.
> +        Possible values for feat are:
> +
> +        - ``ISOL_F_QUIESCE``:
> +
> +                The 'arg3' argument is a bitmask specifying which
> +                kernel activities to quiesce. Possible bit sets are:
> +
> +                - ``ISOL_F_QUIESCE_VMSTATS``
> +
> +                  VM statistics are maintained in per-CPU counters to
> +                  improve performance. When a CPU modifies a VM statistic,
> +                  this modification is kept in the per-CPU counter.
> +                  Certain activities require a global count, which
> +                  involves requesting each CPU to flush its local counters
> +                  to the global VM counters.
> +
> +                  This flush is implemented via a workqueue item, which
> +                  might schedule a workqueue on isolated CPUs.
> +
> +                  To avoid this interruption, task isolation can be
> +                  configured to, upon return from system calls, synchronize
> +                  the per-CPU counters to global counters, thus avoiding
> +                  the interruption.
> +
> +                  To ensure the application returns to userspace
> +                  with no modified per-CPU counters, its necessary to
> +                  use mlockall() in addition to this isolcpus flag.

So prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, ...) will quiesce
on all subsequent return to userspace, right?

> +
> +**PR_ISOL_CTRL_GET**:
> +
> +        Retrieve task isolation control.
> +
> +                prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
> +
> +        Returns which isolation features are active.
> +
> +**PR_ISOL_CTRL_SET**:
> +
> +        Activates/deactivates task isolation control.
> +
> +                prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> +
> +        The 'mask' argument specifies which features
> +        to activate (bit set) or deactivate (bit clear).
> +
> +        For ISOL_F_QUIESCE, quiescing of background activities
> +        happens on return to userspace from the
> +        prctl(PR_ISOL_CTRL_SET) call, and on return from
> +        subsequent system calls.

Now I'm lost again on the difference with PR_ISOL_SET

> +
> +        Quiescing can be adjusted (while active) by
> +        prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ...).
> +
> +**PR_ISOL_GET_INT**:
> +
> +        Retrieves task isolation internal parameters.
> +
> +        The general format is::
> +
> +                prctl(PR_ISOL_GET_INT, cmd, arg3, arg4, arg5);
> +
> +        The 'cmd' argument specifies which parameter to configure.
> +        Possible values for cmd are:
> +
> +        - ``INHERIT_CFG``:
> +
> +                Retrieve inheritance configuration.
> +
> +                The 'arg3' argument is a pointer to a struct
> +                inherit_control::
> +
> +                        struct task_isol_inherit_control {
> +                                __u8    inherit_mask;
> +                                __u8    pad[7];
> +                        };
> +
> +                See PR_ISOL_SET_INT description below for meaning
> +                of structure fields.
> +
> +**PR_ISOL_SET_INT**:
> +
> +        Sets task isolation internal parameters.
> +
> +        The general format is::
> +
> +                prctl(PR_ISOL_SET_INT, cmd, arg3, arg4, arg5);
> +
> +        The 'cmd' argument specifies which parameter to configure.
> +        Possible values for cmd are:
> +
> +        - ``INHERIT_CFG``:
> +
> +                Set inheritance configuration when a new task
> +                is created via fork and clone.
> +
> +                The 'arg3' argument is a pointer to a struct
> +                inherit_control::
> +
> +                        struct task_isol_inherit_control {
> +                                __u8    inherit_mask;
> +                                __u8    pad[7];
> +                        };
> +
> +                inherit_mask is a bitmask that specifies which part
> +                of task isolation to be inherited:
> +
> +                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
> +                  This is the stated written via prctl(PR_ISOL_SET, ...).
> +
> +                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
> +                  (requires ISOL_INHERIT_CONF to be set). The new task
> +                  should behave, right after fork/clone, in the same manner
> +                  as the parent task _after_ it executed:
> +
> +                        prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> +
> +                  with a valid mask.

I'm wondering if those things shouldn't be set on arg4 for PR_ISOL_SET instead?
Arguably having a whole prctl for that makes it easier to extend. But then
PR_ISOL_SET_INT must always be called before PR_ISOL_SET, otherwise we create
noise, right?

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-26  9:59   ` Frederic Weisbecker
@ 2021-08-26 12:11     ` Marcelo Tosatti
  2021-08-26 19:15       ` Christoph Lameter
  2021-08-27 13:08       ` Frederic Weisbecker
  0 siblings, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-26 12:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

Hi Frederic,

On Thu, Aug 26, 2021 at 11:59:58AM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 24, 2021 at 12:24:25PM -0300, Marcelo Tosatti wrote:
> > Add documentation and userspace sample code for prctl
> > task isolation interface.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> >  Documentation/userspace-api/task_isolation.rst |  211 +++++++++++++++++++++++++
> >  samples/Kconfig                                |    7 
> >  samples/Makefile                               |    1 
> >  samples/task_isolation/Makefile                |    9 +
> >  samples/task_isolation/task_isol.c             |   83 +++++++++
> >  samples/task_isolation/task_isol.h             |    9 +
> >  samples/task_isolation/task_isol_userloop.c    |   56 ++++++
> >  7 files changed, 376 insertions(+)
> > 
> > Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/Documentation/userspace-api/task_isolation.rst
> > @@ -0,0 +1,281 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===============================
> > +Task isolation prctl interface
> > +===============================
> > +
> > +Certain types of applications benefit from running uninterrupted by
> > +background OS activities. Realtime systems and high-bandwidth networking
> > +applications with user-space drivers can fall into the category.
> > +
> > +To create an OS noise free environment for the application, this
> > +interface allows userspace to inform the kernel the start and
> > +end of the latency sensitive application section (with configurable
> > +system behaviour for that section).
> > +
> > +Note: the prctl interface is independent of nohz_full=.
> > +
> > +The prctl options are:
> > +
> > +
> > +        - PR_ISOL_FEAT: Retrieve supported features.
> > +        - PR_ISOL_GET: Retrieve task isolation parameters.
> > +        - PR_ISOL_SET: Set task isolation parameters.
> > +        - PR_ISOL_CTRL_GET: Retrieve task isolation state.
> > +        - PR_ISOL_CTRL_SET: Set task isolation state.
> > +        - PR_ISOL_GET_INT: Retrieve internal parameters.
> > +        - PR_ISOL_SET_INT: Retrieve internal parameters.
> 
> There should be some short summary here to explain the difference
> between parameter, state, and internal parameter. I personally have
> no clue so far.

Yes, those have been written without clear definitions and can be
improved (it makes sense to me, so please indicate what is not 
clear to you). So:

* "Feature": a generic name for a task isolation feature.
Examples of features could be logging, new operating modes (syscalls
disallowed), userspace notifications, etc. One feature is quiescing.

* "Parameter": a specific choice from a given set of possible choices
that dictate how the particular feature in question should act.

* "Internal parameter": A parameter (as in above) but not related to
task isolation features themselves, but to "internal characteristics"
(well, there is only one example of internal parameter so far
and that is inheritance across clone/fork).

Maybe "internal parameter" is a bad name and something different should
be used instead ?

Should i add the description aboves to the document file?

> > +Inheritance of the isolation parameters and state, across
> > +fork(2) and clone(2), can be changed via
> > +PR_ISOL_GET_INT/PR_ISOL_SET_INT.
> > +
> > +At a high-level, task isolation is divided in two steps:
> > +
> > +1. Configuration.
> > +2. Activation.
> > +
> > +Section "Userspace support" describes how to use
> > +task isolation.
> > +
> > +In terms of the interface, the sequence of steps to activate
> > +task isolation are:
> > +
> > +1. Retrieve supported task isolation features (PR_ISOL_FEAT).
> > +2. Configure task isolation features (PR_ISOL_SET/PR_ISOL_GET).
> > +3. Activate or deactivate task isolation features
> > +   (PR_ISOL_CTRL_GET/PR_ISOL_CTRL_SET).
> > +4. Optionally configure inheritance (PR_ISOL_GET_INT/PR_ISOL_SET_INT).
> > +
> > +This interface is based on ideas and code from the
> > +task isolation patchset from Alex Belits:
> > +https://lwn.net/Articles/816298/
> > +
> > +--------------------
> > +Feature description
> > +--------------------
> > +
> > +        - ``ISOL_F_QUIESCE``
> > +
> > +        This feature allows quiescing select kernel activities on
> > +        return from system calls.
> > +
> > +---------------------
> > +Interface description
> > +---------------------
> > +
> > +**PR_ISOL_FEAT**:
> > +
> > +        Returns the supported features and feature
> > +        capabilities, as a bitmask::
> > +
> > +                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> > +
> > +        The 'feat' argument specifies whether to return
> > +        supported features (if zero), or feature capabilities
> > +        (if not zero). Possible non-zero values for 'feat' are:
> > +
> > +        - ``ISOL_F_QUIESCE``:
> > +
> > +                Returns a bitmask containing which kernel
> > +                activities are supported for quiescing.
> > +
> > +        Features and its capabilities are defined at
> > include/uapi/linux/task_isolation.h.
> 
> Preferably have feat a parameter name. We never know if we want
> to extend it in the future.

It is a parameter name:

prctl(PR_ISOL_FEAT, feat-A, arg3, arg4, arg5);

prctl(PR_ISOL_FEAT, feat-B, arg3, arg4, arg5);

And yes, the idea is that new features can be added.

So unless i misunderstood you, there are no changes necessary here.

> > +
> > +**PR_ISOL_GET**:
> > +
> > +        Retrieve task isolation feature configuration.
> > +        The general format is::
> > +
> > +                prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
> > +
> > +        The 'feat' argument specifies whether to return
> > +        configured features (if zero), or individual feature
> > +        configuration (if not zero).
> 
> You might need to elaborate a bit on the "feat" behaviour difference.

Not sure what you mean? There is only one "feat" yet, which is
ISOL_F_QUIESCE:

> > +        values for 'feat' are:
> > +
> > +        - ``ISOL_F_QUIESCE``:
> > +
> > +                Returns a bitmask containing which kernel
> > +                activities are enabled for quiescing.

If one were to add a new feature, he would add:

	     - ``ISOL_F_FEATURE2``:

		     Returns a ... 

Unclear what improvement are you suggesting?

> > +**PR_ISOL_SET**:
> > +
> > +        Configures task isolation features. The general format is::
> > +
> > +                prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
> > +
> > +        The 'feat' argument specifies which feature to configure.
> > +        Possible values for feat are:
> > +
> > +        - ``ISOL_F_QUIESCE``:
> > +
> > +                The 'arg3' argument is a bitmask specifying which
> > +                kernel activities to quiesce. Possible bit sets are:
> > +
> > +                - ``ISOL_F_QUIESCE_VMSTATS``
> > +
> > +                  VM statistics are maintained in per-CPU counters to
> > +                  improve performance. When a CPU modifies a VM statistic,
> > +                  this modification is kept in the per-CPU counter.
> > +                  Certain activities require a global count, which
> > +                  involves requesting each CPU to flush its local counters
> > +                  to the global VM counters.
> > +
> > +                  This flush is implemented via a workqueue item, which
> > +                  might schedule a workqueue on isolated CPUs.
> > +
> > +                  To avoid this interruption, task isolation can be
> > +                  configured to, upon return from system calls, synchronize
> > +                  the per-CPU counters to global counters, thus avoiding
> > +                  the interruption.
> > +
> > +                  To ensure the application returns to userspace
> > +                  with no modified per-CPU counters, its necessary to
> > +                  use mlockall() in addition to this isolcpus flag.
> 
> So prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, ...) will quiesce
> on all subsequent return to userspace, right?

Yes. Hum, i think i dropped that clarification (by mistake). Will re-add
it.

> > +
> > +**PR_ISOL_CTRL_GET**:
> > +
> > +        Retrieve task isolation control.
> > +
> > +                prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
> > +
> > +        Returns which isolation features are active.
> > +
> > +**PR_ISOL_CTRL_SET**:
> > +
> > +        Activates/deactivates task isolation control.
> > +
> > +                prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > +
> > +        The 'mask' argument specifies which features
> > +        to activate (bit set) or deactivate (bit clear).
> > +
> > +        For ISOL_F_QUIESCE, quiescing of background activities
> > +        happens on return to userspace from the
> > +        prctl(PR_ISOL_CTRL_SET) call, and on return from
> > +        subsequent system calls.
> 
> Now I'm lost again on the difference with PR_ISOL_SET

PR_ISOL_SET configures the features parameters.

PR_ISOL_CTRL_SET _activates_ task isolation.

You earlier wrote:

"I would rather decouple the above with, for modes:

  PR_TASK_ISOLATION_SET
  PR_TASK_ISOLATION_GET

And for oneshot requests:

  PR_TASK_ISOLATION_REQUEST"

Now we have PR_ISOL_SET/PR_ISOL_GET (to configure the parameters of 
task isolation features), and PR_ISOL_CTRL_SET to activate that
isolation (and we pass a bitmask to PR_ISOL_CTRL_SET indicating what
features should be active). How the particular features behave 
is determined at PR_ISOL_SET time.

This allows the administrator to, via chisol, configure task isolation:

+
+       if (argc - optind < 1) {
+	       warnx(_("bad usage"));
+               errtryhelp(EXIT_FAILURE);
+ 	}
+
+       if (quiesce_act_mask) {
+ 	        ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_act_mask, 0, 0);
+	       if (ret == -1) {
+	                perror("prctl PR_ISOL_SET");
+                       exit(EXIT_FAILURE);
+	       }
+       }

And the application, which has to be modified only once with:

+#ifdef PR_ISOL_GET
+       ret = prctl(PR_ISOL_GET, 0, 0, 0, 0);
+	if (ret != -1) {
+               unsigned long mask = ret;
+
+               TEST0(prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0));
+	}
+#endif
+
        frc(&ts2);
        do {
                workload_fn(t->dst_buf, t->src_buf, g.workload_mem_size);

Makes sense?

(BTW, please let me know how the wording would be so it is 
easier to understand... if it is not simple to you, it won't
be simple to others as well).

> > +
> > +        Quiescing can be adjusted (while active) by
> > +        prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ...).
> > +
> > +**PR_ISOL_GET_INT**:
> > +
> > +        Retrieves task isolation internal parameters.
> > +
> > +        The general format is::
> > +
> > +                prctl(PR_ISOL_GET_INT, cmd, arg3, arg4, arg5);
> > +
> > +        The 'cmd' argument specifies which parameter to configure.
> > +        Possible values for cmd are:
> > +
> > +        - ``INHERIT_CFG``:
> > +
> > +                Retrieve inheritance configuration.
> > +
> > +                The 'arg3' argument is a pointer to a struct
> > +                inherit_control::
> > +
> > +                        struct task_isol_inherit_control {
> > +                                __u8    inherit_mask;
> > +                                __u8    pad[7];
> > +                        };
> > +
> > +                See PR_ISOL_SET_INT description below for meaning
> > +                of structure fields.
> > +
> > +**PR_ISOL_SET_INT**:
> > +
> > +        Sets task isolation internal parameters.
> > +
> > +        The general format is::
> > +
> > +                prctl(PR_ISOL_SET_INT, cmd, arg3, arg4, arg5);
> > +
> > +        The 'cmd' argument specifies which parameter to configure.
> > +        Possible values for cmd are:
> > +
> > +        - ``INHERIT_CFG``:
> > +
> > +                Set inheritance configuration when a new task
> > +                is created via fork and clone.
> > +
> > +                The 'arg3' argument is a pointer to a struct
> > +                inherit_control::
> > +
> > +                        struct task_isol_inherit_control {
> > +                                __u8    inherit_mask;
> > +                                __u8    pad[7];
> > +                        };
> > +
> > +                inherit_mask is a bitmask that specifies which part
> > +                of task isolation to be inherited:
> > +
> > +                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
> > +                  This is the stated written via prctl(PR_ISOL_SET, ...).
> > +
> > +                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
> > +                  (requires ISOL_INHERIT_CONF to be set). The new task
> > +                  should behave, right after fork/clone, in the same manner
> > +                  as the parent task _after_ it executed:
> > +
> > +                        prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > +
> > +                  with a valid mask.
> 
> I'm wondering if those things shouldn't be set on arg4 for PR_ISOL_SET instead?
> Arguably having a whole prctl for that makes it easier to extend. But then
> PR_ISOL_SET_INT must always be called before PR_ISOL_SET, otherwise we create
> noise, right?

It has to be called before PR_ISOL_CTRL_SET, yes.

Decided on a separate prctl because the inheritance control
is not a feature itself: it acts on all features (or how task isolation
features are inherited across fork/clone).

So yes, first idea was to "lets add this to PR_ISOL_SET", but then it
became weird to have something that controls the features as a feature
itself... It would be ISOL_F_INHERIT_CONTROL. Can change to that, if 
you prefer.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-26 12:11     ` Marcelo Tosatti
@ 2021-08-26 19:15       ` Christoph Lameter
  2021-08-26 20:37         ` Marcelo Tosatti
  2021-08-27 13:08       ` Frederic Weisbecker
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2021-08-26 19:15 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Thu, 26 Aug 2021, Marcelo Tosatti wrote:

> Decided on a separate prctl because the inheritance control
> is not a feature itself: it acts on all features (or how task isolation
> features are inherited across fork/clone).

I am having a hard time imagening use cases for such a feature since I
usally see special code sections optimized to run without OS jitter and
not whole processes. AFAICT You would not want to have any of these on
because they cause performance regression if you must do syscalls related
to process startup and termination.

Since we are adding docs: Could we have some sample use cases for
when these features are useful?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-26 19:15       ` Christoph Lameter
@ 2021-08-26 20:37         ` Marcelo Tosatti
  0 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-26 20:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, linux-kernel, Nitesh Lal,
	Nicolas Saenz Julienne, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Thu, Aug 26, 2021 at 09:15:57PM +0200, Christoph Lameter wrote:
> On Thu, 26 Aug 2021, Marcelo Tosatti wrote:
> 
> > Decided on a separate prctl because the inheritance control
> > is not a feature itself: it acts on all features (or how task isolation
> > features are inherited across fork/clone).
> 
> I am having a hard time imagening use cases for such a feature since I
> usally see special code sections optimized to run without OS jitter and
> not whole processes. AFAICT You would not want to have any of these on
> because they cause performance regression if you must do syscalls related
> to process startup and termination.

The documentation has:

+==================
+Userspace support
+==================
+
+Task isolation is divided in two main steps: configuration and activation.
+
+Each step can be performed by an external tool or the latency sensitive
+application itself. util-linux contains the "chisol" tool for this
+purpose.
+
+This results in three combinations:
+
+1. Both configuration and activation performed by the
+latency sensitive application.
+Allows fine grained control of what task isolation
+features are enabled and when (see samples section below).
+
+2. Only activation can be performed by the latency sensitive app
+(and configuration performed by chisol).
+This allows the admin/user to control task isolation parameters,
+and applications have to be modified only once.
+
+3. Configuration and activation performed by an external tool.
+This allows unmodified applications to take advantage of
+task isolation. Activation is performed by the "-a" option
+of chisol.

The util-linux patch changelog has:

util-linux: add chisol tool to configure task isolation
  
Add chisol tool to configure task isolation. See chisol -h
for details.

For example, to launch a version of oslat that activates
task isolation:

chisol -q vmstat_sync -I conf ./oslat -f 1 -c 5 -D 5m

-q vmstat_sync: enable quiescing of per-CPU vmstats 
-I conf: inherit task isolation configuration.

====

So you can _configure_ the parameters of task isolation outside 
your application, but activation (just before the realtime loop),
can be done inside it (which requires modification of the 
application). See the oslat/cyclictest patches.

So to answer your question: chisol allows task isolation to be
configured and activated externally from a latency sensitive app 
(which at this moment means every system call, after activation,
will sync the vmstats on return to userspace... which obviously
slow things down). But you can then run unmodified applications
(while paying the cost of slower startup times).

Activation of different features before exec'ing a new application
will of course depend on what the feature does...

> Since we are adding docs: Could we have some sample use cases for
> when these features are useful?

There is one sample at samples/task_isolation/task_isol_userloop.c,
do you mean more samples would be useful ? (there are two i know of:
the one you mentioned and the one Thomas mentioned).


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-26 12:11     ` Marcelo Tosatti
  2021-08-26 19:15       ` Christoph Lameter
@ 2021-08-27 13:08       ` Frederic Weisbecker
  2021-08-27 14:44         ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2021-08-27 13:08 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Thu, Aug 26, 2021 at 09:11:31AM -0300, Marcelo Tosatti wrote:
> Hi Frederic,
> 
> On Thu, Aug 26, 2021 at 11:59:58AM +0200, Frederic Weisbecker wrote:
> > > +Note: the prctl interface is independent of nohz_full=.
> > > +
> > > +The prctl options are:
> > > +
> > > +
> > > +        - PR_ISOL_FEAT: Retrieve supported features.
> > > +        - PR_ISOL_GET: Retrieve task isolation parameters.
> > > +        - PR_ISOL_SET: Set task isolation parameters.
> > > +        - PR_ISOL_CTRL_GET: Retrieve task isolation state.
> > > +        - PR_ISOL_CTRL_SET: Set task isolation state.
> > > +        - PR_ISOL_GET_INT: Retrieve internal parameters.
> > > +        - PR_ISOL_SET_INT: Retrieve internal parameters.
> > 
> > There should be some short summary here to explain the difference
> > between parameter, state, and internal parameter. I personally have
> > no clue so far.
> 
> Yes, those have been written without clear definitions and can be
> improved (it makes sense to me, so please indicate what is not 
> clear to you). So:
> 
> * "Feature": a generic name for a task isolation feature.
> Examples of features could be logging, new operating modes (syscalls
> disallowed), userspace notifications, etc. One feature is quiescing.
> 
> * "Parameter": a specific choice from a given set of possible choices
> that dictate how the particular feature in question should act.
> 
> * "Internal parameter": A parameter (as in above) but not related to
> task isolation features themselves, but to "internal characteristics"
> (well, there is only one example of internal parameter so far
> and that is inheritance across clone/fork).
> 
> Maybe "internal parameter" is a bad name and something different should
> be used instead ?
> 
> Should i add the description aboves to the document file?

Ok so to make things clearer, may I suggest:

   s/PR_ISOL_FEAT/PR_ISOL_GET_FEAT

to make it more obvious that we are not going to write or configure something.

Also:

  s/PR_ISOL_SET/PR_ISOL_CFG_SET  or  s/PR_ISOL_SET/PR_ISOL_PARAM_SET
  s/PR_ISOL_GET/PR_ISOL_CFG_GET  or  s/PR_ISOL_GET/PR_ISOL_PARAM_GET

because SET or GET alone are too general. I first thought they were the
activation interface whereas they are only the configuration stage.

And then PR_ISOL_CTRL_GET/SET look good. Although perhaps
PR_ISOL_ACTIVATE_SET/GET would probably be clearer. Or even this is where
the trimmed name PR_ISOL_SET/GET would make sense.

> > > +---------------------
> > > +Interface description
> > > +---------------------
> > > +
> > > +**PR_ISOL_FEAT**:
> > > +
> > > +        Returns the supported features and feature
> > > +        capabilities, as a bitmask::
> > > +
> > > +                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> > > +
> > > +        The 'feat' argument specifies whether to return
> > > +        supported features (if zero), or feature capabilities
> > > +        (if not zero). Possible non-zero values for 'feat' are:
> > > +
> > > +        - ``ISOL_F_QUIESCE``:
> > > +
> > > +                Returns a bitmask containing which kernel
> > > +                activities are supported for quiescing.
> > > +
> > > +        Features and its capabilities are defined at
> > > include/uapi/linux/task_isolation.h.
> > 
> > Preferably have feat a parameter name. We never know if we want
> > to extend it in the future.
> 
> It is a parameter name:
> 
> prctl(PR_ISOL_FEAT, feat-A, arg3, arg4, arg5);
> 
> prctl(PR_ISOL_FEAT, feat-B, arg3, arg4, arg5);
> 
> And yes, the idea is that new features can be added.
> 
> So unless i misunderstood you, there are no changes necessary here.

Ok, indeed that was my misunderstanding.

> > > +**PR_ISOL_GET**:
> > > +
> > > +        Retrieve task isolation feature configuration.
> > > +        The general format is::
> > > +
> > > +                prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
> > > +
> > > +        The 'feat' argument specifies whether to return
> > > +        configured features (if zero), or individual feature
> > > +        configuration (if not zero).
> > 
> > You might need to elaborate a bit on the "feat" behaviour difference.
> 
> Not sure what you mean? There is only one "feat" yet, which is
> ISOL_F_QUIESCE:

Sorry my misunderstanding again. So if I understand correctly prctl(PR_ISOL_GET,
0, ...) returns a mask of all features that have been configured through
PR_ISOL_SET(), right?

> > > +**PR_ISOL_SET**:
> > > +
> > > +        Configures task isolation features. The general format is::
> > > +
> > > +                prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
> > > +
> > > +        The 'feat' argument specifies which feature to configure.
> > > +        Possible values for feat are:
> > > +
> > > +        - ``ISOL_F_QUIESCE``:
> > > +
> > > +                The 'arg3' argument is a bitmask specifying which
> > > +                kernel activities to quiesce. Possible bit sets are:
> > > +
> > > +                - ``ISOL_F_QUIESCE_VMSTATS``
> > > +
> > > +                  VM statistics are maintained in per-CPU counters to
> > > +                  improve performance. When a CPU modifies a VM statistic,
> > > +                  this modification is kept in the per-CPU counter.
> > > +                  Certain activities require a global count, which
> > > +                  involves requesting each CPU to flush its local counters
> > > +                  to the global VM counters.
> > > +
> > > +                  This flush is implemented via a workqueue item, which
> > > +                  might schedule a workqueue on isolated CPUs.
> > > +
> > > +                  To avoid this interruption, task isolation can be
> > > +                  configured to, upon return from system calls, synchronize
> > > +                  the per-CPU counters to global counters, thus avoiding
> > > +                  the interruption.
> > > +
> > > +                  To ensure the application returns to userspace
> > > +                  with no modified per-CPU counters, its necessary to
> > > +                  use mlockall() in addition to this isolcpus flag.
> > 
> > So prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, ...) will quiesce
> > on all subsequent return to userspace, right?
> 
> Yes. Hum, i think i dropped that clarification (by mistake). Will re-add
> it.

So how are we going to implement oneshot quiescing? As in quiescing only upon
the return of a given prctl().

Maybe using a feature something like ISOL_F_QUIESCE_ONCE?

But then suppose I do this:

prctl(PR_ISOL_SET, ISOL_F_QUIESCE_ONCE, ISOL_F_QUIESCE_VMSTATS, ...)
prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE_ONCE, ...) //will quiesce on this return only
prctl(PR_ISOL_CTRL_GET,  ...)

What should PR_ISOL_CTRL_GET return above? Probably nothing.

> 
> > > +
> > > +**PR_ISOL_CTRL_GET**:
> > > +
> > > +        Retrieve task isolation control.
> > > +
> > > +                prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
> > > +
> > > +        Returns which isolation features are active.
> > > +
> > > +**PR_ISOL_CTRL_SET**:
> > > +
> > > +        Activates/deactivates task isolation control.
> > > +
> > > +                prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > > +
> > > +        The 'mask' argument specifies which features
> > > +        to activate (bit set) or deactivate (bit clear).
> > > +
> > > +        For ISOL_F_QUIESCE, quiescing of background activities
> > > +        happens on return to userspace from the
> > > +        prctl(PR_ISOL_CTRL_SET) call, and on return from
> > > +        subsequent system calls.
> > 
> > Now I'm lost again on the difference with PR_ISOL_SET
> 
> PR_ISOL_SET configures the features parameters.
> 
> PR_ISOL_CTRL_SET _activates_ task isolation.
> 
> You earlier wrote:
> 
> "I would rather decouple the above with, for modes:
> 
>   PR_TASK_ISOLATION_SET
>   PR_TASK_ISOLATION_GET
> 
> And for oneshot requests:
> 
>   PR_TASK_ISOLATION_REQUEST"
> 
> Now we have PR_ISOL_SET/PR_ISOL_GET (to configure the parameters of 
> task isolation features), and PR_ISOL_CTRL_SET to activate that
> isolation (and we pass a bitmask to PR_ISOL_CTRL_SET indicating what
> features should be active). How the particular features behave 
> is determined at PR_ISOL_SET time.

I guess that makes sense. This way we can quiesce everything in one go
instead of issuing a prctl() for each features, which adds further noise.
Sounds proper.

> 
> This allows the administrator to, via chisol, configure task isolation:
> 
> +
> +       if (argc - optind < 1) {
> +	       warnx(_("bad usage"));
> +               errtryhelp(EXIT_FAILURE);
> + 	}
> +
> +       if (quiesce_act_mask) {
> + 	        ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_act_mask, 0, 0);
> +	       if (ret == -1) {
> +	                perror("prctl PR_ISOL_SET");
> +                       exit(EXIT_FAILURE);
> +	       }
> +       }
> 
> And the application, which has to be modified only once with:
> 
> +#ifdef PR_ISOL_GET
> +       ret = prctl(PR_ISOL_GET, 0, 0, 0, 0);
> +	if (ret != -1) {
> +               unsigned long mask = ret;
> +
> +               TEST0(prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0));
> +	}
> +#endif
> +
>         frc(&ts2);
>         do {
>                 workload_fn(t->dst_buf, t->src_buf, g.workload_mem_size);
> 
> Makes sense?

Yes! Btw you might want to fetch the mask of PR_ISOL_GET into the
second parameter instead of using the return value which is only
32 bits or prctl() and the highest significant bit is even reserved
for the error.

> > > +**PR_ISOL_GET_INT**:
> > > +
> > > +        Retrieves task isolation internal parameters.
> > > +
> > > +        The general format is::
> > > +
> > > +                prctl(PR_ISOL_GET_INT, cmd, arg3, arg4, arg5);
> > > +
> > > +        The 'cmd' argument specifies which parameter to configure.
> > > +        Possible values for cmd are:
> > > +
> > > +        - ``INHERIT_CFG``:
> > > +
> > > +                Retrieve inheritance configuration.
> > > +
> > > +                The 'arg3' argument is a pointer to a struct
> > > +                inherit_control::
> > > +
> > > +                        struct task_isol_inherit_control {
> > > +                                __u8    inherit_mask;
> > > +                                __u8    pad[7];
> > > +                        };
> > > +
> > > +                See PR_ISOL_SET_INT description below for meaning
> > > +                of structure fields.
> > > +
> > > +**PR_ISOL_SET_INT**:
> > > +
> > > +        Sets task isolation internal parameters.
> > > +
> > > +        The general format is::
> > > +
> > > +                prctl(PR_ISOL_SET_INT, cmd, arg3, arg4, arg5);
> > > +
> > > +        The 'cmd' argument specifies which parameter to configure.
> > > +        Possible values for cmd are:
> > > +
> > > +        - ``INHERIT_CFG``:
> > > +
> > > +                Set inheritance configuration when a new task
> > > +                is created via fork and clone.
> > > +
> > > +                The 'arg3' argument is a pointer to a struct
> > > +                inherit_control::
> > > +
> > > +                        struct task_isol_inherit_control {
> > > +                                __u8    inherit_mask;
> > > +                                __u8    pad[7];
> > > +                        };
> > > +
> > > +                inherit_mask is a bitmask that specifies which part
> > > +                of task isolation to be inherited:
> > > +
> > > +                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
> > > +                  This is the stated written via prctl(PR_ISOL_SET, ...).
> > > +
> > > +                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
> > > +                  (requires ISOL_INHERIT_CONF to be set). The new task
> > > +                  should behave, right after fork/clone, in the same manner
> > > +                  as the parent task _after_ it executed:
> > > +
> > > +                        prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > > +
> > > +                  with a valid mask.
> > 
> > I'm wondering if those things shouldn't be set on arg4 for PR_ISOL_SET instead?
> > Arguably having a whole prctl for that makes it easier to extend. But then
> > PR_ISOL_SET_INT must always be called before PR_ISOL_SET, otherwise we create
> > noise, right?
> 
> It has to be called before PR_ISOL_CTRL_SET, yes.
> 
> Decided on a separate prctl because the inheritance control
> is not a feature itself: it acts on all features (or how task isolation
> features are inherited across fork/clone).
> 
> So yes, first idea was to "lets add this to PR_ISOL_SET", but then it
> became weird to have something that controls the features as a feature
> itself... It would be ISOL_F_INHERIT_CONTROL. Can change to that, if 
> you prefer.

Funny but that would work. Either way, let's keep things that way for now.
Just the naming is unfortunate.

Well that could be a clone flag after all... But what about exec()? Should we
make its inheritance tunable? Well we can still extend the interface later if
necessary for that.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-27 13:08       ` Frederic Weisbecker
@ 2021-08-27 14:44         ` Marcelo Tosatti
  2021-08-30 11:38           ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2021-08-27 14:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Fri, Aug 27, 2021 at 03:08:20PM +0200, Frederic Weisbecker wrote:
> On Thu, Aug 26, 2021 at 09:11:31AM -0300, Marcelo Tosatti wrote:
> > Hi Frederic,
> > 
> > On Thu, Aug 26, 2021 at 11:59:58AM +0200, Frederic Weisbecker wrote:
> > > > +Note: the prctl interface is independent of nohz_full=.
> > > > +
> > > > +The prctl options are:
> > > > +
> > > > +
> > > > +        - PR_ISOL_FEAT: Retrieve supported features.
> > > > +        - PR_ISOL_GET: Retrieve task isolation parameters.
> > > > +        - PR_ISOL_SET: Set task isolation parameters.
> > > > +        - PR_ISOL_CTRL_GET: Retrieve task isolation state.
> > > > +        - PR_ISOL_CTRL_SET: Set task isolation state.
> > > > +        - PR_ISOL_GET_INT: Retrieve internal parameters.
> > > > +        - PR_ISOL_SET_INT: Retrieve internal parameters.
> > > 
> > > There should be some short summary here to explain the difference
> > > between parameter, state, and internal parameter. I personally have
> > > no clue so far.
> > 
> > Yes, those have been written without clear definitions and can be
> > improved (it makes sense to me, so please indicate what is not 
> > clear to you). So:
> > 
> > * "Feature": a generic name for a task isolation feature.
> > Examples of features could be logging, new operating modes (syscalls
> > disallowed), userspace notifications, etc. One feature is quiescing.
> > 
> > * "Parameter": a specific choice from a given set of possible choices
> > that dictate how the particular feature in question should act.
> > 
> > * "Internal parameter": A parameter (as in above) but not related to
> > task isolation features themselves, but to "internal characteristics"
> > (well, there is only one example of internal parameter so far
> > and that is inheritance across clone/fork).
> > 
> > Maybe "internal parameter" is a bad name and something different should
> > be used instead ?
> > 
> > Should i add the description aboves to the document file?
> 
> Ok so to make things clearer, may I suggest:
> 
>    s/PR_ISOL_FEAT/PR_ISOL_GET_FEAT
> 
> to make it more obvious that we are not going to write or configure something.
> 
> Also:
> 
>   s/PR_ISOL_SET/PR_ISOL_CFG_SET  or  s/PR_ISOL_SET/PR_ISOL_PARAM_SET
>   s/PR_ISOL_GET/PR_ISOL_CFG_GET  or  s/PR_ISOL_GET/PR_ISOL_PARAM_GET
> 
> because SET or GET alone are too general. I first thought they were the
> activation interface whereas they are only the configuration stage.

Sounds good.

> And then PR_ISOL_CTRL_GET/SET look good. Although perhaps
> PR_ISOL_ACTIVATE_SET/GET would probably be clearer. Or even this is where
> the trimmed name PR_ISOL_SET/GET would make sense.

PR_ISOL_ACTIVATE_SET/GET indeed is quite clear. Will change.

> > > > +---------------------
> > > > +Interface description
> > > > +---------------------
> > > > +
> > > > +**PR_ISOL_FEAT**:
> > > > +
> > > > +        Returns the supported features and feature
> > > > +        capabilities, as a bitmask::
> > > > +
> > > > +                prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> > > > +
> > > > +        The 'feat' argument specifies whether to return
> > > > +        supported features (if zero), or feature capabilities
> > > > +        (if not zero). Possible non-zero values for 'feat' are:
> > > > +
> > > > +        - ``ISOL_F_QUIESCE``:
> > > > +
> > > > +                Returns a bitmask containing which kernel
> > > > +                activities are supported for quiescing.
> > > > +
> > > > +        Features and its capabilities are defined at
> > > > include/uapi/linux/task_isolation.h.
> > > 
> > > Preferably have feat a parameter name. We never know if we want
> > > to extend it in the future.
> > 
> > It is a parameter name:
> > 
> > prctl(PR_ISOL_FEAT, feat-A, arg3, arg4, arg5);
> > 
> > prctl(PR_ISOL_FEAT, feat-B, arg3, arg4, arg5);
> > 
> > And yes, the idea is that new features can be added.
> > 
> > So unless i misunderstood you, there are no changes necessary here.
> 
> Ok, indeed that was my misunderstanding.
> 
> > > > +**PR_ISOL_GET**:
> > > > +
> > > > +        Retrieve task isolation feature configuration.
> > > > +        The general format is::
> > > > +
> > > > +                prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
> > > > +
> > > > +        The 'feat' argument specifies whether to return
> > > > +        configured features (if zero), or individual feature
> > > > +        configuration (if not zero).
> > > 
> > > You might need to elaborate a bit on the "feat" behaviour difference.
> > 
> > Not sure what you mean? There is only one "feat" yet, which is
> > ISOL_F_QUIESCE:
> 
> Sorry my misunderstanding again. So if I understand correctly prctl(PR_ISOL_GET,
> 0, ...) returns a mask of all features that have been configured through
> PR_ISOL_SET(), right?

Yes.

> > > > +**PR_ISOL_SET**:
> > > > +
> > > > +        Configures task isolation features. The general format is::
> > > > +
> > > > +                prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
> > > > +
> > > > +        The 'feat' argument specifies which feature to configure.
> > > > +        Possible values for feat are:
> > > > +
> > > > +        - ``ISOL_F_QUIESCE``:
> > > > +
> > > > +                The 'arg3' argument is a bitmask specifying which
> > > > +                kernel activities to quiesce. Possible bit sets are:
> > > > +
> > > > +                - ``ISOL_F_QUIESCE_VMSTATS``
> > > > +
> > > > +                  VM statistics are maintained in per-CPU counters to
> > > > +                  improve performance. When a CPU modifies a VM statistic,
> > > > +                  this modification is kept in the per-CPU counter.
> > > > +                  Certain activities require a global count, which
> > > > +                  involves requesting each CPU to flush its local counters
> > > > +                  to the global VM counters.
> > > > +
> > > > +                  This flush is implemented via a workqueue item, which
> > > > +                  might schedule a workqueue on isolated CPUs.
> > > > +
> > > > +                  To avoid this interruption, task isolation can be
> > > > +                  configured to, upon return from system calls, synchronize
> > > > +                  the per-CPU counters to global counters, thus avoiding
> > > > +                  the interruption.
> > > > +
> > > > +                  To ensure the application returns to userspace
> > > > +                  with no modified per-CPU counters, its necessary to
> > > > +                  use mlockall() in addition to this isolcpus flag.
> > > 
> > > So prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, ...) will quiesce
> > > on all subsequent return to userspace, right?
> > 
> > Yes. Hum, i think i dropped that clarification (by mistake). Will re-add
> > it.
> 
> So how are we going to implement oneshot quiescing? As in quiescing only upon
> the return of a given prctl().
> 
> Maybe using a feature something like ISOL_F_QUIESCE_ONCE?
> 
> But then suppose I do this:
> 
> prctl(PR_ISOL_SET, ISOL_F_QUIESCE_ONCE, ISOL_F_QUIESCE_VMSTATS, ...)
> prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE_ONCE, ...) //will quiesce on this return only
> prctl(PR_ISOL_CTRL_GET,  ...)
> 
> What should PR_ISOL_CTRL_GET return above? Probably nothing.

Yeah, nothing.

So the "quiesce once" feature, as i understand, was suggested by
Christoph for the following type of application:

lat_loop:

	do {
		events = pending_events();
		if (events & DATAPATH_EVENT)
			process_data()
	} while (!(events & UNFREQUENT_ERROR_EVENT))

	syscall1()
	syscall2()
	...
	syscallN()
	goto lat_loop;

With the V3 patchset, one would have to:

	prctl(PR_ISOL_SET, ISOL_F_OTHER, ...);
	prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, ...);
	...
	prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE|ISOL_F_OTHER);
lat_loop:
	do {
		events = pending_events();
		if (events & DATAPATH_EVENT)
			process_data()
	} while (!(events & UNFREQUENT_ERROR_EVENT))

	/* disables quiescing while executing system calls */
	prctl(PR_ISOL_CTRL_SET, ISOL_F_OTHER);
	syscall1()
	syscall2()
	...
	syscallN()

	/* no more system calls, enables quiescing */
	prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE|ISOL_F_OTHER);
	goto lat_loop;

But for an interface with less system calls (true "quiesce once") one could do:

	prctl(PR_ISOL_SET, ISOL_F_OTHER, ...);
	/* rather than do it at _CTRL_SET as you suggest, enable it at
	 * configuration time.
	 */
	prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS|ISOL_F_QUIESCE_ONCE, ...);
	...

lat_loop:
	prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE|ISOL_F_OTHER);
	do {
		events = pending_events();
		if (events & DATAPATH_EVENT)
			process_data()
	} while (!(events & UNFREQUENT_ERROR_EVENT))

	/* disables quiescing while executing system calls */
	syscall1()
	syscall2()
	...
	syscallN()

	goto lat_loop;

But see how it starts to get weird: both versions (new feature, 
ISOL_F_QUIESCE_ONCE, or new "quiesce feature', ISOL_F_QUIESCE_ONCE)
are using space reserved to

"a list of different features" 
or
"a list of different quiesce features".

To add something which is not either a new task isolation 
feature or quiesce feature, but a separate control
(which could apply to all of features, or which one might want
to apply only to certain features, and in that case a bitmap
might be specified).

So i think adding a new parameter such as:

prctl(PR_ISOL_SET, ISOL_F_QUIESCE, CMD, arg, ...);

is a good idea. So one can have (with two commands, SET_QUIESCE 
and SET_ONESHOT).

prctl(PR_ISOL_SET, ISOL_F_QUIESCE, SET_QUIESCE, ISOL_F_QUIESCE_VMSTAT);
prctl(PR_ISOL_SET, ISOL_F_QUIESCE, SET_ONESHOT, ISOL_F_QUIESCE_VMSTAT);

Then its possible to add random commands with random parameters
(rather than be limited by a single bitmask to control quiescing).

Does that make sense?

> > > > +
> > > > +**PR_ISOL_CTRL_GET**:
> > > > +
> > > > +        Retrieve task isolation control.
> > > > +
> > > > +                prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
> > > > +
> > > > +        Returns which isolation features are active.
> > > > +
> > > > +**PR_ISOL_CTRL_SET**:
> > > > +
> > > > +        Activates/deactivates task isolation control.
> > > > +
> > > > +                prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > > > +
> > > > +        The 'mask' argument specifies which features
> > > > +        to activate (bit set) or deactivate (bit clear).
> > > > +
> > > > +        For ISOL_F_QUIESCE, quiescing of background activities
> > > > +        happens on return to userspace from the
> > > > +        prctl(PR_ISOL_CTRL_SET) call, and on return from
> > > > +        subsequent system calls.
> > > 
> > > Now I'm lost again on the difference with PR_ISOL_SET
> > 
> > PR_ISOL_SET configures the features parameters.
> > 
> > PR_ISOL_CTRL_SET _activates_ task isolation.
> > 
> > You earlier wrote:
> > 
> > "I would rather decouple the above with, for modes:
> > 
> >   PR_TASK_ISOLATION_SET
> >   PR_TASK_ISOLATION_GET
> > 
> > And for oneshot requests:
> > 
> >   PR_TASK_ISOLATION_REQUEST"
> > 
> > Now we have PR_ISOL_SET/PR_ISOL_GET (to configure the parameters of 
> > task isolation features), and PR_ISOL_CTRL_SET to activate that
> > isolation (and we pass a bitmask to PR_ISOL_CTRL_SET indicating what
> > features should be active). How the particular features behave 
> > is determined at PR_ISOL_SET time.
> 
> I guess that makes sense. This way we can quiesce everything in one go
> instead of issuing a prctl() for each features, which adds further noise.
> Sounds proper.
> 
> > 
> > This allows the administrator to, via chisol, configure task isolation:
> > 
> > +
> > +       if (argc - optind < 1) {
> > +	       warnx(_("bad usage"));
> > +               errtryhelp(EXIT_FAILURE);
> > + 	}
> > +
> > +       if (quiesce_act_mask) {
> > + 	        ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_act_mask, 0, 0);
> > +	       if (ret == -1) {
> > +	                perror("prctl PR_ISOL_SET");
> > +                       exit(EXIT_FAILURE);
> > +	       }
> > +       }
> > 
> > And the application, which has to be modified only once with:
> > 
> > +#ifdef PR_ISOL_GET
> > +       ret = prctl(PR_ISOL_GET, 0, 0, 0, 0);
> > +	if (ret != -1) {
> > +               unsigned long mask = ret;
> > +
> > +               TEST0(prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0));
> > +	}
> > +#endif
> > +
> >         frc(&ts2);
> >         do {
> >                 workload_fn(t->dst_buf, t->src_buf, g.workload_mem_size);
> > 
> > Makes sense?
> 
> Yes! Btw you might want to fetch the mask of PR_ISOL_GET into the
> second parameter instead of using the return value which is only
> 32 bits or prctl() and the highest significant bit is even reserved
> for the error.

Would be good to do this for all cases, so you can extend the
struct (or pad it).
> 
> > > > +**PR_ISOL_GET_INT**:
> > > > +
> > > > +        Retrieves task isolation internal parameters.
> > > > +
> > > > +        The general format is::
> > > > +
> > > > +                prctl(PR_ISOL_GET_INT, cmd, arg3, arg4, arg5);
> > > > +
> > > > +        The 'cmd' argument specifies which parameter to configure.
> > > > +        Possible values for cmd are:
> > > > +
> > > > +        - ``INHERIT_CFG``:
> > > > +
> > > > +                Retrieve inheritance configuration.
> > > > +
> > > > +                The 'arg3' argument is a pointer to a struct
> > > > +                inherit_control::
> > > > +
> > > > +                        struct task_isol_inherit_control {
> > > > +                                __u8    inherit_mask;
> > > > +                                __u8    pad[7];
> > > > +                        };
> > > > +
> > > > +                See PR_ISOL_SET_INT description below for meaning
> > > > +                of structure fields.
> > > > +
> > > > +**PR_ISOL_SET_INT**:
> > > > +
> > > > +        Sets task isolation internal parameters.
> > > > +
> > > > +        The general format is::
> > > > +
> > > > +                prctl(PR_ISOL_SET_INT, cmd, arg3, arg4, arg5);
> > > > +
> > > > +        The 'cmd' argument specifies which parameter to configure.
> > > > +        Possible values for cmd are:
> > > > +
> > > > +        - ``INHERIT_CFG``:
> > > > +
> > > > +                Set inheritance configuration when a new task
> > > > +                is created via fork and clone.
> > > > +
> > > > +                The 'arg3' argument is a pointer to a struct
> > > > +                inherit_control::
> > > > +
> > > > +                        struct task_isol_inherit_control {
> > > > +                                __u8    inherit_mask;
> > > > +                                __u8    pad[7];
> > > > +                        };
> > > > +
> > > > +                inherit_mask is a bitmask that specifies which part
> > > > +                of task isolation to be inherited:
> > > > +
> > > > +                - Bit ISOL_INHERIT_CONF: Inherit task isolation configuration.
> > > > +                  This is the stated written via prctl(PR_ISOL_SET, ...).
> > > > +
> > > > +                - Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
> > > > +                  (requires ISOL_INHERIT_CONF to be set). The new task
> > > > +                  should behave, right after fork/clone, in the same manner
> > > > +                  as the parent task _after_ it executed:
> > > > +
> > > > +                        prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > > > +
> > > > +                  with a valid mask.
> > > 
> > > I'm wondering if those things shouldn't be set on arg4 for PR_ISOL_SET instead?
> > > Arguably having a whole prctl for that makes it easier to extend. But then
> > > PR_ISOL_SET_INT must always be called before PR_ISOL_SET, otherwise we create
> > > noise, right?
> > 
> > It has to be called before PR_ISOL_CTRL_SET, yes.
> > 
> > Decided on a separate prctl because the inheritance control
> > is not a feature itself: it acts on all features (or how task isolation
> > features are inherited across fork/clone).
> > 
> > So yes, first idea was to "lets add this to PR_ISOL_SET", but then it
> > became weird to have something that controls the features as a feature
> > itself... It would be ISOL_F_INHERIT_CONTROL. Can change to that, if 
> > you prefer.
> 
> Funny but that would work. Either way, let's keep things that way for now.
> Just the naming is unfortunate.
> 
> Well that could be a clone flag after all... 

Yes, it could as well. But there are no more bits for older clone
interfaces, and clone3 seems to be problematic (moreover, this
interface would need backporting to older kernels).

> But what about exec()? Should we
> make its inheritance tunable? Well we can still extend the interface later if
> necessary for that.

Hum, not sure... Yes, can extend it later.

Thanks!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-27 14:44         ` Marcelo Tosatti
@ 2021-08-30 11:38           ` Frederic Weisbecker
  0 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2021-08-30 11:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nitesh Lal, Nicolas Saenz Julienne,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Fri, Aug 27, 2021 at 11:44:16AM -0300, Marcelo Tosatti wrote:
> On Fri, Aug 27, 2021 at 03:08:20PM +0200, Frederic Weisbecker wrote:
> > Ok so to make things clearer, may I suggest:
> > 
> >    s/PR_ISOL_FEAT/PR_ISOL_GET_FEAT

<nit>
In fact PR_ISOL_FEAT_GET so that it follows the same naming convention
than PR_ISOL_CFG_GET/PR_ISOL_PARAM_GET and PR_ISOL_ACTIVATE_GET


> > But then suppose I do this:
> > 
> > prctl(PR_ISOL_SET, ISOL_F_QUIESCE_ONCE, ISOL_F_QUIESCE_VMSTATS, ...)
> > prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE_ONCE, ...) //will quiesce on this return only
> > prctl(PR_ISOL_CTRL_GET,  ...)
> > 
> > What should PR_ISOL_CTRL_GET return above? Probably nothing.
> 
> Yeah, nothing.
> 
> So the "quiesce once" feature, as i understand, was suggested by
> Christoph for the following type of application:
> 
> lat_loop:
> 
> 	do {
> 		events = pending_events();
> 		if (events & DATAPATH_EVENT)
> 			process_data()
> 	} while (!(events & UNFREQUENT_ERROR_EVENT))
> 
> 	syscall1()
> 	syscall2()
> 	...
> 	syscallN()
> 	goto lat_loop;
> 
> With the V3 patchset, one would have to:
> 
> 	prctl(PR_ISOL_SET, ISOL_F_OTHER, ...);
> 	prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS, ...);
> 	...
> 	prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE|ISOL_F_OTHER);
> lat_loop:
> 	do {
> 		events = pending_events();
> 		if (events & DATAPATH_EVENT)
> 			process_data()
> 	} while (!(events & UNFREQUENT_ERROR_EVENT))
> 
> 	/* disables quiescing while executing system calls */
> 	prctl(PR_ISOL_CTRL_SET, ISOL_F_OTHER);
> 	syscall1()
> 	syscall2()
> 	...
> 	syscallN()
> 
> 	/* no more system calls, enables quiescing */
> 	prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE|ISOL_F_OTHER);
> 	goto lat_loop;
> 
> But for an interface with less system calls (true "quiesce once") one could do:
> 
> 	prctl(PR_ISOL_SET, ISOL_F_OTHER, ...);
> 	/* rather than do it at _CTRL_SET as you suggest, enable it at
> 	 * configuration time.
> 	 */
> 	prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ISOL_F_QUIESCE_VMSTATS|ISOL_F_QUIESCE_ONCE, ...);
> 	...
> 
> lat_loop:
> 	prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE|ISOL_F_OTHER);
> 	do {
> 		events = pending_events();
> 		if (events & DATAPATH_EVENT)
> 			process_data()
> 	} while (!(events & UNFREQUENT_ERROR_EVENT))
> 
> 	/* disables quiescing while executing system calls */
> 	syscall1()
> 	syscall2()
> 	...
> 	syscallN()
> 
> 	goto lat_loop;
> 
> But see how it starts to get weird: both versions (new feature, 
> ISOL_F_QUIESCE_ONCE, or new "quiesce feature', ISOL_F_QUIESCE_ONCE)
> are using space reserved to
> 
> "a list of different features" 
> or
> "a list of different quiesce features".
> 
> To add something which is not either a new task isolation 
> feature or quiesce feature, but a separate control
> (which could apply to all of features, or which one might want
> to apply only to certain features, and in that case a bitmap
> might be specified).
> 
> So i think adding a new parameter such as:
> 
> prctl(PR_ISOL_SET, ISOL_F_QUIESCE, CMD, arg, ...);
> 
> is a good idea. So one can have (with two commands, SET_QUIESCE 
> and SET_ONESHOT).
> 
> prctl(PR_ISOL_SET, ISOL_F_QUIESCE, SET_QUIESCE, ISOL_F_QUIESCE_VMSTAT);
> prctl(PR_ISOL_SET, ISOL_F_QUIESCE, SET_ONESHOT, ISOL_F_QUIESCE_VMSTAT);
> 
> Then its possible to add random commands with random parameters
> (rather than be limited by a single bitmask to control quiescing).
> 
> Does that make sense?

We can but it means that the ONESHOT property applies to all ISOL_F_QUIESCE
features. So you can't, for example, quiesce ISOL_F_QUIESCE_VMSTAT only once
and quiesce ISOL_F_QUIESCE_FOO all the time.

I have no idea if it matters or not but be aware of limitations.

> > > +#ifdef PR_ISOL_GET
> > > +       ret = prctl(PR_ISOL_GET, 0, 0, 0, 0);
> > > +	if (ret != -1) {
> > > +               unsigned long mask = ret;
> > > +
> > > +               TEST0(prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0));
> > > +	}
> > > +#endif
> > > +
> > >         frc(&ts2);
> > >         do {
> > >                 workload_fn(t->dst_buf, t->src_buf, g.workload_mem_size);
> > > 
> > > Makes sense?
> > 
> > Yes! Btw you might want to fetch the mask of PR_ISOL_GET into the
> > second parameter instead of using the return value which is only
> > 32 bits or prctl() and the highest significant bit is even reserved
> > for the error.
> 
> Would be good to do this for all cases, so you can extend the
> struct (or pad it).

Yep.

> > Funny but that would work. Either way, let's keep things that way for now.
> > Just the naming is unfortunate.
> > 
> > Well that could be a clone flag after all... 
> 
> Yes, it could as well. But there are no more bits for older clone
> interfaces, and clone3 seems to be problematic (moreover, this
> interface would need backporting to older kernels).

Another way to go is to use all the features as a mask in PR_ISOL_CFG_SET:

prctl(PR_ISOL_FEAT_GET, 0, &all_features, ...)
prctl(PR_ISOL_CFG_SET, &all_features, PR_ISOL_INHERIT...)

or simply:

prctl(PR_ISOL_CFG_SET, -1, PR_ISOL_INHERIT...)

or even:

prctl(PR_ISOL_CFG_SET, 0, PR_ISOL_INHERIT...)

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-08-24 15:24 ` [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  2021-08-25  9:30   ` Christoph Lameter
@ 2021-09-01 13:05   ` Nitesh Lal
  2021-09-01 17:32     ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Nitesh Lal @ 2021-09-01 13:05 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

Hi Marcelo,

On Tue, Aug 24, 2021 at 11:42 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> It is not necessary to queue work item to run refresh_vm_stats
> on a remote CPU if that CPU has no dirty stats and no per-CPU
> allocations for remote nodes.
>
> This fixes sosreport hang (which uses vmstat_refresh) with
> spinning SCHED_FIFO process.
>

I was still able to reproduce the sosreport hang with this patchset and I
am wondering if that is because right now we do vmstat_sync and then cancel
any pending jobs on a CPU in the context of one task.

However, while this task is running another process can come in and can
dirty the stats resulting in vmstat job getting placed on CPUs running
SCHED_FIFO tasks.
Am I missing something?

What we can probably do is to communicate that a CPU is running on task
isolation mode to any other process that is trying to run and schedule
jobs there.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-08-24 15:24 ` [patch V3 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
  2021-08-26  9:59   ` Frederic Weisbecker
@ 2021-09-01 13:11   ` Nitesh Lal
  2021-09-01 17:34     ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Nitesh Lal @ 2021-09-01 13:11 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Tue, Aug 24, 2021 at 11:42 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> Add documentation and userspace sample code for prctl
> task isolation interface.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> ---
>  Documentation/userspace-api/task_isolation.rst |  211 +++++++++++++++++++++++++
>  samples/Kconfig                                |    7
>  samples/Makefile                               |    1
>  samples/task_isolation/Makefile                |    9 +
>  samples/task_isolation/task_isol.c             |   83 +++++++++
>  samples/task_isolation/task_isol.h             |    9 +
>  samples/task_isolation/task_isol_userloop.c    |   56 ++++++
>  7 files changed, 376 insertions(+)

[...]

> +       if (ret) {
> +               perror("mlock");
> +               return EXIT_FAILURE;
> +       }
> +
> +       ret = task_isol_setup();
> +       if (ret)
> +               return EXIT_FAILURE;

The above check condition should be 'ret == -1', isn't it?

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-09-01 13:05   ` Nitesh Lal
@ 2021-09-01 17:32     ` Marcelo Tosatti
  2021-09-01 18:33       ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2021-09-01 17:32 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Wed, Sep 01, 2021 at 09:05:55AM -0400, Nitesh Lal wrote:
> Hi Marcelo,
> 
> On Tue, Aug 24, 2021 at 11:42 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > It is not necessary to queue work item to run refresh_vm_stats
> > on a remote CPU if that CPU has no dirty stats and no per-CPU
> > allocations for remote nodes.
> >
> > This fixes sosreport hang (which uses vmstat_refresh) with
> > spinning SCHED_FIFO process.
> >
> 
> I was still able to reproduce the sosreport hang with this patchset and I
> am wondering if that is because right now we do vmstat_sync and then cancel
> any pending jobs on a CPU in the context of one task.

Hi Nitesh,

Did you use chisol (with proper flags) and the modified oslat?

Tested with "echo 1 > /proc/sys/vmstat_refresh" and it was successful
(no hangs).

> However, while this task is running another process can come in and can
> dirty the stats resulting in vmstat job getting placed on CPUs running
> SCHED_FIFO tasks.
> Am I missing something?
> What we can probably do is to communicate that a CPU is running on task
> isolation mode to any other process that is trying to run and schedule
> jobs there.

No, that can happen. Can use sched notifiers to handle this problem.
Good point.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-09-01 13:11   ` Nitesh Lal
@ 2021-09-01 17:34     ` Marcelo Tosatti
  2021-09-01 17:49       ` Nitesh Lal
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2021-09-01 17:34 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Wed, Sep 01, 2021 at 09:11:56AM -0400, Nitesh Lal wrote:
> On Tue, Aug 24, 2021 at 11:42 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > Add documentation and userspace sample code for prctl
> > task isolation interface.
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > ---
> >  Documentation/userspace-api/task_isolation.rst |  211 +++++++++++++++++++++++++
> >  samples/Kconfig                                |    7
> >  samples/Makefile                               |    1
> >  samples/task_isolation/Makefile                |    9 +
> >  samples/task_isolation/task_isol.c             |   83 +++++++++
> >  samples/task_isolation/task_isol.h             |    9 +
> >  samples/task_isolation/task_isol_userloop.c    |   56 ++++++
> >  7 files changed, 376 insertions(+)
> 
> [...]
> 
> > +       if (ret) {
> > +               perror("mlock");
> > +               return EXIT_FAILURE;
> > +       }
> > +
> > +       ret = task_isol_setup();
> > +       if (ret)
> > +               return EXIT_FAILURE;
> 
> The above check condition should be 'ret == -1', isn't it?

task_isol_setup returns 0 on success, so fail to see the point 
of testing for ret == -1 rather than ret != 0 ?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 2/8] add prctl task isolation prctl docs and samples
  2021-09-01 17:34     ` Marcelo Tosatti
@ 2021-09-01 17:49       ` Nitesh Lal
  0 siblings, 0 replies; 27+ messages in thread
From: Nitesh Lal @ 2021-09-01 17:49 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Wed, Sep 1, 2021 at 1:38 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Wed, Sep 01, 2021 at 09:11:56AM -0400, Nitesh Lal wrote:
> > On Tue, Aug 24, 2021 at 11:42 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > Add documentation and userspace sample code for prctl
> > > task isolation interface.
> > >
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > >
> > > ---
> > >  Documentation/userspace-api/task_isolation.rst |  211 +++++++++++++++++++++++++
> > >  samples/Kconfig                                |    7
> > >  samples/Makefile                               |    1
> > >  samples/task_isolation/Makefile                |    9 +
> > >  samples/task_isolation/task_isol.c             |   83 +++++++++
> > >  samples/task_isolation/task_isol.h             |    9 +
> > >  samples/task_isolation/task_isol_userloop.c    |   56 ++++++
> > >  7 files changed, 376 insertions(+)
> >
> > [...]
> >
> > > +       if (ret) {
> > > +               perror("mlock");
> > > +               return EXIT_FAILURE;
> > > +       }
> > > +
> > > +       ret = task_isol_setup();
> > > +       if (ret)
> > > +               return EXIT_FAILURE;
> >
> > The above check condition should be 'ret == -1', isn't it?
>
> task_isol_setup returns 0 on success, so fail to see the point
> of testing for ret == -1 rather than ret != 0 ?
>

Hmm, could be something that I misinterpreted.
I will double-check.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-09-01 17:32     ` Marcelo Tosatti
@ 2021-09-01 18:33       ` Marcelo Tosatti
  2021-09-03 17:38         ` Nitesh Lal
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2021-09-01 18:33 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Wed, Sep 01, 2021 at 02:32:04PM -0300, Marcelo Tosatti wrote:
> On Wed, Sep 01, 2021 at 09:05:55AM -0400, Nitesh Lal wrote:
> > Hi Marcelo,
> > 
> > On Tue, Aug 24, 2021 at 11:42 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > It is not necessary to queue work item to run refresh_vm_stats
> > > on a remote CPU if that CPU has no dirty stats and no per-CPU
> > > allocations for remote nodes.
> > >
> > > This fixes sosreport hang (which uses vmstat_refresh) with
> > > spinning SCHED_FIFO process.
> > >
> > 
> > I was still able to reproduce the sosreport hang with this patchset and I
> > am wondering if that is because right now we do vmstat_sync and then cancel
> > any pending jobs on a CPU in the context of one task.
> 
> Hi Nitesh,
> 
> Did you use chisol (with proper flags) and the modified oslat?
> 
> Tested with "echo 1 > /proc/sys/vmstat_refresh" and it was successful
> (no hangs).
> 
> > However, while this task is running another process can come in and can
> > dirty the stats resulting in vmstat job getting placed on CPUs running
> > SCHED_FIFO tasks.
> > Am I missing something?
> > What we can probably do is to communicate that a CPU is running on task
> > isolation mode to any other process that is trying to run and schedule
> > jobs there.
> 
> No, that can happen. Can use sched notifiers to handle this problem.
> Good point.

Well, actually: the goal is an isolated CPU.

So task A (latency sensitive, in task isolated mode, polling) executing on
an isolated CPU will require something like stalld to execute this other
task that dirties the vmstats.

Therefore stalld can also sched-in/sched-out vmstat_worker, with similar
impact to latency.








^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-09-01 18:33       ` Marcelo Tosatti
@ 2021-09-03 17:38         ` Nitesh Lal
  0 siblings, 0 replies; 27+ messages in thread
From: Nitesh Lal @ 2021-09-03 17:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
	Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
	Peter Xu

On Wed, Sep 1, 2021 at 2:34 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Wed, Sep 01, 2021 at 02:32:04PM -0300, Marcelo Tosatti wrote:
> > On Wed, Sep 01, 2021 at 09:05:55AM -0400, Nitesh Lal wrote:
> > > Hi Marcelo,
> > >
> > > On Tue, Aug 24, 2021 at 11:42 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > >
> > > > It is not necessary to queue work item to run refresh_vm_stats
> > > > on a remote CPU if that CPU has no dirty stats and no per-CPU
> > > > allocations for remote nodes.
> > > >
> > > > This fixes sosreport hang (which uses vmstat_refresh) with
> > > > spinning SCHED_FIFO process.
> > > >
> > >
> > > I was still able to reproduce the sosreport hang with this patchset and I
> > > am wondering if that is because right now we do vmstat_sync and then cancel
> > > any pending jobs on a CPU in the context of one task.
> >
> > Hi Nitesh,
> >
> > Did you use chisol (with proper flags) and the modified oslat?
> >

Yes, I used your patches.
This is the command that I used:
chisol -q vmstat_sync -I conf ./oslat -f 1 -c 5,6,7,8,9,10,11,12,13,14,15 -D 15m

> > Tested with "echo 1 > /proc/sys/vmstat_refresh" and it was successful
> > (no hangs).

I see, I tried with "sos report --batch", which should have a similar effect.


--
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch V3 3/8] task isolation: sync vmstats on return to userspace
  2021-08-24 15:24 ` [patch V3 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2021-09-10 13:49   ` nsaenzju
  0 siblings, 0 replies; 27+ messages in thread
From: nsaenzju @ 2021-09-10 13:49 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel
  Cc: Nitesh Lal, Frederic Weisbecker, Christoph Lameter, Juri Lelli,
	Peter Zijlstra, Alex Belits, Peter Xu

Hi Marcelo,

On Tue, 2021-08-24 at 12:24 -0300, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
> 
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
> 
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
> 
> To fix this, use the task isolation prctl interface to quiesce 
> deferred actions when returning to userspace.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---

[...]

> Index: linux-2.6/kernel/entry/common.c
> ===================================================================
> --- linux-2.6.orig/kernel/entry/common.c
> +++ linux-2.6/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/task_isolation.h>
>  
>  #include "common.h"
>  
> @@ -287,6 +288,7 @@ static void syscall_exit_to_user_mode_pr
>  static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
>  {
>  	syscall_exit_to_user_mode_prepare(regs);
> +	isolation_exit_to_user_mode_prepare();

Are safe from migration at this stage?

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-09-10 13:49 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-24 15:24 [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
2021-08-24 15:24 ` [patch V3 1/8] add basic task isolation prctl interface Marcelo Tosatti
2021-08-24 15:24 ` [patch V3 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
2021-08-26  9:59   ` Frederic Weisbecker
2021-08-26 12:11     ` Marcelo Tosatti
2021-08-26 19:15       ` Christoph Lameter
2021-08-26 20:37         ` Marcelo Tosatti
2021-08-27 13:08       ` Frederic Weisbecker
2021-08-27 14:44         ` Marcelo Tosatti
2021-08-30 11:38           ` Frederic Weisbecker
2021-09-01 13:11   ` Nitesh Lal
2021-09-01 17:34     ` Marcelo Tosatti
2021-09-01 17:49       ` Nitesh Lal
2021-08-24 15:24 ` [patch V3 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2021-09-10 13:49   ` nsaenzju
2021-08-24 15:24 ` [patch V3 4/8] procfs: add per-pid task isolation state Marcelo Tosatti
2021-08-24 15:24 ` [patch V3 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
2021-08-25  9:46   ` Christoph Lameter
2021-08-24 15:24 ` [patch V3 6/8] KVM: x86: call isolation prepare from VM-entry code path Marcelo Tosatti
2021-08-24 15:24 ` [patch V3 7/8] mm: vmstat: move need_update Marcelo Tosatti
2021-08-24 15:24 ` [patch V3 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
2021-08-25  9:30   ` Christoph Lameter
2021-09-01 13:05   ` Nitesh Lal
2021-09-01 17:32     ` Marcelo Tosatti
2021-09-01 18:33       ` Marcelo Tosatti
2021-09-03 17:38         ` Nitesh Lal
2021-08-25 10:02 ` [patch V3 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).